Differences between association and prediction studies

Key points

  • Association and prediction studies have different goals. Machine learning excels in prediction studies.
  • Association studies focus on understanding a phenomenon. They look for relationships between variables and outcomes, but those relationships might not have predictive power.
  • Prediction studies use many variables to build predictors. They learn patterns from the training data to make predictions on new data. They might be very accurate but hard to interpret, and that is fine.


Introduction

In the computational psychiatry group at the University of Alberta we want to create tools that help with the diagnosis and prognosis of mental illnesses. Along that line, my current research focuses on applying machine learning algorithms to identify psychiatric problems.

One of the questions that commonly arises is: What are the “main” features that machine learning classifiers use to make their predictions? For example, in the paper “Learning stable and predictive network-based patterns of schizophrenia and its clinical symptoms”, Mina Gheiratmand created a tool that can identify people with schizophrenia with 74% accuracy using brain images. The immediate question was: Which parts of the brain contribute to the prediction? More often than not, the machine learning answer is: “It’s actually a combination of many of them”. This is one of the reasons why machine learning solutions are considered a “black box”. We usually cannot point to a single feature, or a small group of features, that explains the prediction.

This is where the difference between association and prediction studies comes into play. Both of them are very important, but they have different goals. Association studies attempt to gain a better understanding of a phenomenon, so they focus on finding group differences. Prediction studies attempt to build accurate classifiers that can make predictions at the subject level. The head of our lab, Russ Greiner, has an excellent talk describing the differences between the two approaches. Machine learning is a great tool that can help with prediction studies, but it provides very limited help in association studies.

To gain further insight, let’s consider a simple scenario. We want to study schizophrenia. How would each type of study approach it?

Association studies

Association studies are great for gaining an understanding of the phenomenon under study. They also give hints about what to study next.

Association studies focus on explanatory power: How is a single feature related to the outcome? In our simple example: Is the connectivity in the brain’s frontal cortex different in people with schizophrenia? A possible answer might be: People with schizophrenia have, on average, less connectivity in the frontal cortex. (This is just a toy example. I don’t know whether people with schizophrenia have different connectivity in the frontal cortex.) These are also the studies that produce pretty pictures, such as this one.

fMRI resting state image
By Malaak N. Moussa, Matthew R. Steen, Paul J. Laurienti, and Satoru Hayasaka [CC BY 2.5 (https://creativecommons.org/licenses/by/2.5)], via Wikimedia Commons
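As a rough sketch of what such a single-feature comparison looks like in code, the snippet below runs a standard two-sample t-test. All the numbers (group sizes, means, spreads) are invented purely for illustration.

```python
# Hypothetical association-style analysis: compare the mean of a single
# feature between two groups. All numbers are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
controls = rng.normal(loc=0.50, scale=0.10, size=100)   # healthy group
patients = rng.normal(loc=0.45, scale=0.10, size=100)   # schizophrenia group

t_stat, p_value = stats.ttest_ind(controls, patients)
print(f"mean difference: {controls.mean() - patients.mean():.3f}")
print(f"p-value: {p_value:.4f}")
# A small p-value supports a group-level claim ("connectivity differs on
# average") but says nothing about how accurately we could label one person.
```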
These are the kind of studies that you see in the news most of the time. Note that these studies make claims at the group level. This is part of the reason why people often find counterexamples to those claims. For example, you read: “People who smoke increase their chance of getting lung cancer”. Then you say: “My grandma smoked from the time she was 15, and she lived past 90”. The study was talking about an effect that happens on average. It was not designed to make predictions.

When dealing with medical problems, association studies look for biomarkers. A biomarker is a loosely defined term, but in general it refers to a single element (or a few elements) that we can identify to diagnose a disease. Once a biomarker is identified, further research can determine its precise role in the disease.

Prediction studies

These studies focus on predictive power, rather than explanatory power. They look for patterns in the data that can make accurate predictions on single subjects. Most of the time, these patterns involve many variables.

Going back to our example, a prediction study would measure the connectivity in several regions of a specific patient’s brain and then, using that information, predict whether that patient has schizophrenia or not.
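A minimal sketch of that workflow might look like the following. The feature counts, labels, and model choice here are assumptions made up for illustration; the point is simply that the classifier sees many variables at once and is evaluated on subject-level predictions.

```python
# Prediction-style analysis: train a classifier on many features at once and
# estimate how well it labels unseen subjects. Data is invented.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_subjects, n_regions = 200, 50
X = rng.normal(size=(n_subjects, n_regions))   # connectivity features
y = rng.integers(0, 2, size=n_subjects)        # 0 = healthy, 1 = patient

# Cross-validated accuracy estimates subject-level predictive performance.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
# With purely random labels this hovers around 0.5; a real study hopes to do
# substantially better by combining information from all regions at once.
```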

Prediction studies are very useful when we want tools that help us make decisions. However, most of the time there is a trade-off between performance and interpretability. We live in a complex world, so we need tools that can learn complex patterns. On the other hand, these complex patterns are likely to be very hard to understand. They might simply involve many variables.

A toy example

* For this example I’m creating fake data to illustrate a point. This is not based on any data involving people with schizophrenia.

Let’s assume that we want to diagnose schizophrenia, and we measure the activation of two regions of the brain in two different groups: healthy people (blue) and people with schizophrenia (red). The following figure shows the distribution of the activation level in both groups. Below each graph is the p-value of a test for a statistically significant difference between the means of the red and blue groups.

Distribution of the activation level in two different parts of the brain. The graph on the left shows the activation in the motor cortex. The graph on the right shows the activation in the frontal cortex. The blue distribution represents the healthy people. The red distribution represents the people with schizophrenia. (Fake data)

The difference in the activation level of the motor cortex (graph on the left) is statistically significant between the blue and red groups. However, there appears to be no difference in the activation of the frontal cortex. We might consider the activation in the motor cortex as a possible biomarker; however, there is a lot of overlap between the distributions. It would be very difficult to make predictions based on it alone. The frontal cortex appears to be completely irrelevant.
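To make this concrete, here is a rough sketch of how fake data with this behaviour could be generated and tested feature by feature. The region names, effect sizes, and noise levels are invented; they only reproduce the qualitative picture above.

```python
# Simulated toy data: feature-by-feature tests miss most of the structure.
# Region names and effect sizes are invented purely for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200

# Healthy group: frontal activation closely tracks motor activation.
motor_h = rng.normal(0.0, 1.0, n)
frontal_h = motor_h + rng.normal(0.0, 0.15, n)

# Schizophrenia group: same coupling, but shifted along the motor axis only.
motor_s = rng.normal(0.6, 1.0, n)
frontal_s = motor_s - 0.6 + rng.normal(0.0, 0.15, n)

print("motor cortex p-value:  ", stats.ttest_ind(motor_h, motor_s).pvalue)
print("frontal cortex p-value:", stats.ttest_ind(frontal_h, frontal_s).pvalue)
# The motor-cortex difference is statistically significant, yet the two
# distributions overlap heavily; the frontal cortex alone shows no difference.
```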

If we want to predict schizophrenia based only on this information, we might want to analyze all the variables together. This is illustrated in the next figure.

Scatterplot showing the relationship between the activation in the motor and frontal cortex. (Fake data)

After analyzing all the variables together, we can see a very clear distinction between the red and blue groups. Now it seems very easy to classify a new subject as belonging to the blue or red group. Note that the information is exactly the same as in the first figure, but now we allow our method to look at all of it at the same time.
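Using the same made-up data as in the earlier snippet, a simple linear classifier that sees both regions at once separates the groups far better than either region does on its own. This is only a sketch; the accuracy you get depends entirely on the invented effect sizes.

```python
# Continuing the toy example: classify subjects using both regions at once.
# The simulated data mirrors the previous snippet (invented numbers).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 200
motor_h = rng.normal(0.0, 1.0, n)
frontal_h = motor_h + rng.normal(0.0, 0.15, n)
motor_s = rng.normal(0.6, 1.0, n)
frontal_s = motor_s - 0.6 + rng.normal(0.0, 0.15, n)

X = np.column_stack([np.concatenate([motor_h, motor_s]),
                     np.concatenate([frontal_h, frontal_s])])
y = np.array([0] * n + [1] * n)   # 0 = healthy, 1 = schizophrenia

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"cross-validated accuracy using both regions: {scores.mean():.2f}")
# Neither region is a clean biomarker on its own, but the combination of the
# two makes the groups nearly separable at the subject level.
```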

Final thoughts

I presented a toy example that includes only 2 variables. Unfortunately, real-world datasets might contain thousands or millions of variables. Prediction studies might still be able to find complex patterns in this case, but their predictions will most likely not depend on only a few of them.

Machine learning is a very exciting field, and it can help us solve many important prediction problems. Unfortunately, most of the time it won’t be able to give us an interpretable model, even if it excels at making predictions. There are some approaches that make a trade-off between interpretability and performance. For example, decision trees are very easy to interpret, but their performance is usually lower than that of more complex algorithms. Should we sacrifice accuracy for “interpretability”?
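Before answering, here is a tiny sketch of what the interpretable end of that trade-off looks like: a shallow decision tree can literally be printed as a handful of rules. The data and feature names are synthetic and only for illustration.

```python
# A shallow decision tree trained on synthetic data, printed as readable rules.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"region_{i}" for i in range(4)]))
# Each branch reads like "region_2 <= 0.13", which is easy to inspect, but a
# two-level tree usually cannot match the accuracy of a more complex model.
```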

In general, I believe that the answer should be no. In my opinion, we need to be very clear about what our objective is, and then use the tool that excels at fulfilling that objective. My supervisor often says: “Do you want a sturdy car, or do you want a fast one?” Of course we would love to have both, but that is often not possible. If we attempt to build a car that is both sturdy and fast, we will likely end up with a car that is suboptimal in both aspects.

Of course, this is a very complex topic, especially in the medical domain. Should we trust an algorithm that makes very accurate predictions without our understanding how it makes them? I believe that the answer is yes, although we are still far from having an algorithm that reaches such levels of performance.

Want to learn more about machine learning?

For people interested in getting started with machine learning, I always recommend Andrew Ng’s machine learning course on Coursera. It provides a very comprehensive overview of the field and gives you the intuition needed to start studying the area in more detail.