# Graphs and Modeling Issues

# Introduction

In a previous post, we discussed how graphs can be used to represent the causal and statistical dependency relations inherent in any statistical model. In this post we extend that discussion to show how graphs, in particular directed acyclic graphs (DAGs), can make clear certain problematic situations applied statisticians face when modeling real-world phenomena. To that end, we will explore how graphs represent five issues that crop up in statistical modeling:

- Spurious correlation
- Endogeneity/exogeneity
- Confounders
- Simpson’s paradox
- Causation

We treated spurious correlation in the previous post, so we use it here to remind ourselves of the relationship between graphs and the probabilistic interpretation of linear models. If the previous post is fresh in your mind, feel free to skip to endogeneity.

# The Spurious Correlation

The spurious correlation is a situation in which two variables are correlated, but this correlation is due to a third variable that is the cause of the other two. A famous example is the relationship between chocolate consumption per capita and Nobel Prizes won by country. Many readers will, perhaps, have seen the graph below. It was published in The New England Journal of Medicine in a paper by Franz H. Messerli. The figure represents the relationship between chocolate consumption and Nobel Prizes won. It’s fairly obvious that there is a strong positive relationship between the two; the correlation (r=0.791) is relatively high. One might argue, as Dr. Messerli did tongue-in-cheek, that chocolate may improve cognitive function. However, more likely, there is a third variable that is the cause of both of these, namely wealth. Citizens in richer countries tend to consume more chocolate and richer countries tend to invest more in research and development, especially of basic science, where Nobel Prizes tend to be awarded. Thus, the correlation between chocolate consumption and Nobel Prizes is a spurious one.

The directed acyclic graph (DAG) in Figure 1 easily represents the spurious correlation. Wealth is the parent of both chocolate consumption and Nobel prizes, but there is no direct connection between the latter two. The graph can also be translated into the language of probability theory. Let W be a country’s wealth, C the chocolate consumption per capita, and N the number of Nobel prizes won by a country’s residents. Then, we can factor the joint distribution:

$$P(W, C, N) = P(C \mid W)\,P(N \mid W)\,P(W)$$

Each conditional distribution represents the dependency structure captured by the edges in the DAG, which in this case captures a causal relation. The conditional independence of chocolate consumption and Nobel prizes given wealth is also captured by both representations: the DAG has no direct connection between the two, and the joint conditional density of chocolate consumption and Nobel prizes factors into two univariate conditional distributions.
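The conditional independence described above can be checked on simulated data. The following is a minimal sketch, assuming a hypothetical linear data-generating process in which wealth drives both other variables; the coefficients, noise scales, and the helper `residualize` are illustrative choices, not estimates from the chocolate and Nobel data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical process matching the DAG: W -> C and W -> N, no C -> N edge.
wealth = rng.normal(size=n)
chocolate = 0.8 * wealth + rng.normal(scale=0.5, size=n)
nobel = 0.8 * wealth + rng.normal(scale=0.5, size=n)

def residualize(y, x):
    """Return the part of y not linearly explained by x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Marginal correlation: strong, even though C does not cause N.
marginal_corr = np.corrcoef(chocolate, nobel)[0, 1]

# Partial correlation given W: correlate what is left after removing W.
partial_corr = np.corrcoef(
    residualize(chocolate, wealth), residualize(nobel, wealth)
)[0, 1]

print(f"corr(C, N)     = {marginal_corr:.3f}")
print(f"corr(C, N | W) = {partial_corr:.3f}")
```

In this simulated setting the marginal correlation is large while the partial correlation given wealth is close to zero, which is the pattern the factorization implies.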

# Exogeneity and Endogeneity

Endogeneity is usually taught as a situation in which a predictor is correlated with the error term. Other ways of explicating endogeneity refer to the simultaneous determination of the response and a predictor, to a variable being determined within the model of interest, and so on. A DAG has a much easier time representing endogeneity and exogeneity.

A common example in economics is the effect of education on future wages. We would like to know how education affects future wages. However, wages and educational attainment, measured by years in school, are both affected by natural ability (perhaps genetic endowment or upbringing): certain people enjoy school more, or have more affinity for certain topics, and thus stay in school longer. These affinities also push them toward fields where they might make more money. If we fit a linear model to the observable features of this situation, the coefficient on educational attainment would be biased. The DAG below represents this situation. Ability is unobserved (by convention, I represent unobserved variables with dotted borders), but we postulate an edge between ability and educational attainment. This entails that educational attainment is endogenous in this DAG.

The DAG also allows us to see why educational attainment is endogenous; it has an indirect path to future wage. Consider a structural model that represents this situation:

By solving the second equation for ability and plugging the result into the first equation:
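One linear specification consistent with this description can be sketched as follows. The symbols $\beta_a$, $\gamma$, and $\varepsilon$ are assumptions introduced here for illustration; only $\beta_e$ and $\nu$ are named in the text.

$$
\begin{aligned}
\text{wage} &= \beta_e\,\text{educ} + \beta_a\,\text{ability} + \varepsilon \\
\text{educ} &= \gamma\,\text{ability} + \nu
\end{aligned}
$$

Under these assumptions, solving the second equation gives $\text{ability} = (\text{educ} - \nu)/\gamma$, and substituting into the first yields

$$
\text{wage} = \left(\beta_e + \frac{\beta_a}{\gamma}\right)\text{educ} + \left(\varepsilon - \frac{\beta_a}{\gamma}\,\nu\right)
$$

The coefficient on education mixes $\beta_e$ with $\beta_a/\gamma$, and the composite error contains $\nu$, which is also a component of education.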

We can see two problems from the structural equation: 1) there is no way to identify the direct effect of education on wages ($\beta_e$); 2) education is now correlated with the error term through ν. Both are captured by the DAG's backdoor path from education to wages through ability. Endogeneity occurs when a variable has a backdoor path to the outcome through a sequence of variables that are not controlled in the estimated model.
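The bias can be seen in a small simulation. This is a sketch under assumed values: the parameters `beta_e`, `beta_a`, and `gamma`, the unit-variance normal noise, and the helper `ols` are all hypothetical choices made for illustration, and ability is observable here only because we generated it ourselves.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical parameter values, chosen only for illustration.
beta_e, beta_a, gamma = 1.0, 2.0, 1.5

ability = rng.normal(size=n)                  # unobserved in practice
educ = gamma * ability + rng.normal(size=n)   # ability -> education
wage = beta_e * educ + beta_a * ability + rng.normal(size=n)

def ols(y, *regressors):
    """Least-squares coefficients of y on an intercept plus the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Ability omitted: the backdoor path educ <- ability -> wage stays open.
naive = ols(wage, educ)[1]

# Ability included: the backdoor path is closed.
oracle = ols(wage, educ, ability)[1]

print(f"true beta_e          = {beta_e:.3f}")
print(f"educ coef, no control = {naive:.3f}")
print(f"educ coef, ability in = {oracle:.3f}")
```

With ability left out, the education coefficient absorbs part of the ability effect and overstates `beta_e`; including ability recovers it, which is only possible in a simulation where the unobserved variable is available.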

With standard notation, we cannot represent exogeneity/endogeneity in the probability calculus. However, if we introduce a simple notation, do(X=x), to represent a researcher's intervention in a system to set the random variable X equal to the value x, then the variable X is endogenous if:

$$P(W \mid do(X=x)) \neq P(W \mid X=x)$$

This notation tells us that the distribution of W is not the same when we intervene to set X=x as when we merely observe that X=x. The causal story this captures is that when we intervene in the system, all of the upstream causes of X cease to affect it; in the DAG, all edges into X are broken when we intervene. In the education/wage situation, if we intervened to set the years of school for all students, or for a random sample of them, then the link between ability and years of schooling would be broken. This would close the backdoor path from education to wages through ability. Thus, we would be able to identify the direct effect of education on wages.
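The difference between observing and intervening can be illustrated by simulating both regimes. The sketch below assumes a hypothetical linear system; the parameter values and the functions `simulate` and `slope` are illustrative names, and the intervention is modeled by assigning education at random so that it no longer depends on ability.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
beta_e, beta_a, gamma = 1.0, 2.0, 1.5  # hypothetical values

def simulate(intervene):
    """Draw (educ, wage) with or without an intervention on education."""
    ability = rng.normal(size=n)
    if intervene:
        # do(educ): education is assigned at random, cutting ability -> educ.
        educ = rng.normal(size=n)
    else:
        # Observational regime: ability still influences education.
        educ = gamma * ability + rng.normal(size=n)
    wage = beta_e * educ + beta_a * ability + rng.normal(size=n)
    return educ, wage

def slope(x, y):
    """OLS slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_observed = slope(*simulate(intervene=False))
slope_intervened = slope(*simulate(intervene=True))

print(f"slope when observing educ:     {slope_observed:.3f}")
print(f"slope when intervening on educ: {slope_intervened:.3f}")
```

In the interventional regime the slope lands near the assumed `beta_e`, while the observational slope does not, which is the inequality between the two distributions expressed in terms of regression slopes.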

# Confounders

Confounding is often taught as a situation in which several nested models are fit and the coefficient values for certain predictors change between the models. To explore this, consider a graph I found on Twitter on March 8th:

The case being made by the original poster was that restrictions do not work: the seven-day average of cases per 100,000 people was basically the same whether the state had restrictions in place (red) or not (grey). We can explore this by running a simple regression of log cases against a dummy variable for restrictions (the data is the same that was used to create the graph above, which can be found here). Table 2, column (1), shows that the coefficient on the restrictions dummy is positive. Thus, states that have restrictions in place can expect their seven-day average cases per 100,000 people to be about 1.15 higher than states without restrictions. However, comparing the states that have restrictions (Illinois, Wisconsin, Minnesota, Michigan, Ohio, and Indiana) with those that do not (the Dakotas, Nebraska, Iowa, Missouri), the former are significantly larger and have a larger weighted density. We also have serially correlated observations: yesterday’s new cases and today’s new cases are correlated, with a correlation above .99. Any of these facts could be confounding the relationship between restrictions and new cases. In columns (2) and (3) of Table 2 we report the results of adding population and the lag of new cases (yesterday’s new cases) to our regressions. In both instances the coefficient on restrictions decreases, even becoming negative, though not significant, in column (3).

Confounding is an insidious problem in applied statistics. The proper way to handle confounding is to control for possible confounding variables. In the case above we added the lagged value and the population of the states to our regression and saw drastic changes in the estimate of the effect of restrictions. In the DAG representation of this situation, the problem can be expressed parsimoniously. There is an edge between Population and Restrictions. Thus, there is a backdoor for the information in Restrictions to get to New Cases. By adding Population to the regression, we, in a metaphorical sense, stop information flowing through that path. We also added an edge connecting New Cases with itself to account for the serial correlation. We also need to account for this to close a backdoor from Restrictions to today’s New Cases. The analytic representation of the confounding relationship is actually identical to that of endogeneity, though for this model we can control for the now observed confounders to dispel the endogeneity:

The moral is that exogeneity can be thought of as the absence of confounding, whether by observed or unobserved variables.
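The way a coefficient can change sign once a confounder is controlled can be reproduced on simulated data. The sketch below does not use the state-level data discussed above; it assumes a hypothetical process in which larger units are more likely to impose restrictions and also have more cases, with an assumed `true_effect` of restrictions on log cases. All variable names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Hypothetical process: population drives both restrictions and cases.
log_pop = rng.normal(size=n)
restrict = (log_pop + rng.normal(size=n) > 0).astype(float)

true_effect = -0.5  # assumed effect of restrictions on log cases
log_cases = (
    true_effect * restrict + 1.0 * log_pop + rng.normal(scale=0.5, size=n)
)

def ols(y, *regressors):
    """Least-squares coefficients of y on an intercept plus the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Backdoor path restrict <- log_pop -> log_cases left open.
unadjusted = ols(log_cases, restrict)[1]

# Backdoor path closed by conditioning on the observed confounder.
adjusted = ols(log_cases, restrict, log_pop)[1]

print(f"restrictions coef, unadjusted: {unadjusted:.3f}")
print(f"restrictions coef, adjusted:   {adjusted:.3f}")
```

In this simulated setting the unadjusted coefficient is positive even though the assumed effect is negative, and adding the confounder to the regression brings the estimate back toward `true_effect`.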

# Simpson’s Paradox

Simpson’s paradox is the situation in which the estimated effect for the entire population differs from the estimated effect within every sub-population. Consider a treatment for a disease given to both men and women. We would have an instance of Simpson’s paradox if, when considering the population as a whole, the treatment does not cure the disease, but for each sub-population, men and women, the treatment does cure the disease. What may be going on in this situation is that the treatment is correlated with the subgroup. Perhaps the treatment is taken voluntarily: some patients choose to take it but others do not. If there are more men in the population, or the disease is more severe among men, we might see that the treatment improves the outcomes of both men and women, but when the groups are aggregated the total effect is nil or negative. The graphical representation of this would show the treatment to be correlated in some way with the group membership variable:

The analytic representation of Simpson’s paradox is difficult without the do-notation. In this case, intervening to hold Sex constant does not change the situation, but intervening on Treatment does. If we held Sex at male, we could estimate the effect of the treatment on the disease state for males; likewise for females. However, if we randomized the treatment between sexes, this would effectively break the edge between the treatment and the sex. The analytic expression that indicates a case of Simpson’s paradox has to look at the factorization of the entire joint distribution (let D = disease state, S = sex, and T = treatment):

$$P(D, S, T) = P(D \mid T, S)\,P(T \mid S)\,P(S)$$

Randomizing treatment replaces P(T | S) with P(T), which removes the dependence of treatment on sex.
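A small numerical sketch can make the reversal concrete. The counts below are hypothetical, chosen only so that the paradox appears: men are assumed to have the more severe disease and to take the treatment more often. The function `adjusted_rate` applies the standard adjustment over sex, summing P(D | T, S=s) P(S=s) across strata, which is what randomizing treatment would estimate if sex were the only confounder.

```python
# Hypothetical counts: (recovered, total) by sex and treatment status.
counts = {
    ("men", "treated"): (60, 100),
    ("men", "untreated"): (10, 20),
    ("women", "treated"): (19, 20),
    ("women", "untreated"): (90, 100),
}
sexes = ("men", "women")

def rate(sex, arm):
    """Recovery rate within one sex and one treatment arm."""
    recovered, total = counts[(sex, arm)]
    return recovered / total

def pooled_rate(arm):
    """Recovery rate in one arm, ignoring sex."""
    recovered = sum(counts[(s, arm)][0] for s in sexes)
    total = sum(counts[(s, arm)][1] for s in sexes)
    return recovered / total

# Share of each sex in the whole population, P(S = s).
n_all = sum(total for _, total in counts.values())
p_sex = {
    s: sum(counts[(s, arm)][1] for arm in ("treated", "untreated")) / n_all
    for s in sexes
}

def adjusted_rate(arm):
    """Stratum rates weighted by P(S), not by who chose treatment."""
    return sum(rate(s, arm) * p_sex[s] for s in sexes)

for s in sexes:
    print(s, rate(s, "treated"), rate(s, "untreated"))
print("pooled  ", pooled_rate("treated"), pooled_rate("untreated"))
print("adjusted", adjusted_rate("treated"), adjusted_rate("untreated"))
```

With these counts the treated do better within each sex, worse in the pooled comparison, and better again once the strata are reweighted by the population shares.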

# Causation

The graph structure of a DAG was initially recruited into statistics to represent the causal relationship between two variables. An edge between two variables indicates that the origin variable is a direct cause of the destination variable. More generally, a directed path from one variable to another entails that the first is a cause of the second, even if there are intermediate variables, though the more intermediates there are, the more attenuated the causal effect can be expected to be.
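The reading of edges and directed paths as causal claims can be expressed in a few lines of code. The following is a minimal sketch that stores the education/wage DAG as a plain dictionary from each cause to its direct effects; the names `dag`, `descendants`, and `is_cause` are illustrative and do not come from any particular graph library.

```python
# Edges point from cause to effect; this mirrors the education/wage DAG.
dag = {
    "ability": ["education", "wage"],
    "education": ["wage"],
    "wage": [],
}

def descendants(graph, node):
    """All variables reachable from `node` along directed edges."""
    seen, stack = set(), list(graph[node])
    while stack:
        child = stack.pop()
        if child not in seen:
            seen.add(child)
            stack.extend(graph[child])
    return seen

def is_cause(graph, x, y):
    """True when there is a directed path from x to y."""
    return y in descendants(graph, x)

print(is_cause(dag, "ability", "wage"))
print(is_cause(dag, "wage", "education"))
```

Under this representation, a variable counts as a cause of another exactly when the second is among its descendants, whether the path is a single edge or passes through intermediates.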

The analytic representation requires the do-notation. In fact, the do-notation was originally adopted to represent causal relations in the probability calculus. The causal effect of X on Y is just the conditional probability under intervention:

$$P(Y \mid do(X=x))$$

As we have seen, if X is exogenous, then the above equals the conditional probability without intervention. So, for an exogenous X, the causal effect is

$$P(Y \mid do(X=x)) = P(Y \mid X=x)$$

# Conclusion

In this post, we explored ways that graphical structures can be used to represent problem areas in applied statistics. Some of these issues have plagued applied statisticians for many years, with much ink spilled. The graph structures do not solve the applied issues surrounding them. However, as we will explore in our next blog post, the graph structures, treated as hypotheses about the interrelations between variables, entail some local tests that can be performed to identify some of these relations.

Originally published to Rpubs: