*Why do we use multivariable models?*

The purpose of statistical models in research, to put it very simply, is to help identify possible causal relationships between variables. Models can’t always achieve this aim – they might incorrectly identify a relationship that doesn’t really exist (a false positive), or alternatively their results might fail to find a relationship that really does exist (a false negative). This might happen for a variety of reasons, some of which occur at the study design or data collection stages, but sometimes it will occur because the variables included in the model weren’t chosen correctly. In this post I’m going to try to explain some basic principles of variable selection, and point out some common mistakes. My examples are all medical but these concepts are applicable to any type of research using statistical models to identify causal relationships.

There are two broad types of statistical models. There are univariable models that include one dependent variable (the outcome) and one independent variable (the exposure), and multivariable models that include the outcome variable and a number of other variables, either multiple exposures, or possible confounders.

A confounder is a variable that affects the appearance of the relationship between two other variables in a certain way. A confounder might make it look like two variables have a relationship when really they don’t, it might make the relationship look stronger or weaker than it really is, or it might make a real relationship disappear entirely. The main purpose of using multivariable rather than univariable models is to mitigate the impact of confounders, in order to get better estimates of the true relationships.

Here’s an example of a confounder: although men can get breast cancer, it is much more common in women. Women are also much more likely than men to be nurses, and much less likely to be engineers. If we were to do a study on occupation and the risk of breast cancer, then because gender affects both breast cancer risk and occupation, we might see a relationship between occupation and breast cancer where none really exists. Gender will confound the relationship between occupation and breast cancer risk. In a univariable model, it would look like being a nurse dramatically increased your risk of breast cancer compared to being an engineer (because most of the nurses would be women, and therefore most of the breast cancer patients would be nurses, not engineers). If we included both occupation and gender in a multivariable model, then the true relationship would become apparent – the model would identify that it was really gender, not occupation, which determines breast cancer risk.
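To make the nurse/engineer example concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not real data), and the function names are hypothetical. Instead of fitting a regression model, it uses the simpler idea of comparing the risk ratio for occupation with and without splitting the data by gender, which is one rough way of seeing what adjustment for a confounder does.

```python
# Hypothetical counts, invented for illustration only (not real data).
# Key: (gender, occupation) -> (number of people, breast cancer cases)
counts = {
    ("women", "nurse"): (9000, 900),
    ("women", "engineer"): (1000, 100),
    ("men", "nurse"): (1000, 1),
    ("men", "engineer"): (9000, 9),
}


def crude_risk_ratio(counts, exposed="nurse", unexposed="engineer"):
    """Risk ratio for occupation, ignoring gender (the univariable view)."""
    people = {exposed: 0, unexposed: 0}
    cases = {exposed: 0, unexposed: 0}
    for (_, occupation), (n, k) in counts.items():
        people[occupation] += n
        cases[occupation] += k
    return (cases[exposed] / people[exposed]) / (cases[unexposed] / people[unexposed])


def risk_ratio_by_gender(counts, exposed="nurse", unexposed="engineer"):
    """Risk ratio for occupation, calculated separately within each gender."""
    ratios = {}
    for gender in sorted({g for g, _ in counts}):
        n1, k1 = counts[(gender, exposed)]
        n0, k0 = counts[(gender, unexposed)]
        ratios[gender] = (k1 / n1) / (k0 / n0)
    return ratios


crude = crude_risk_ratio(counts)
by_gender = risk_ratio_by_gender(counts)

print(f"Crude risk ratio, nurses vs engineers: {crude:.2f}")
for gender, ratio in by_gender.items():
    print(f"Risk ratio among {gender}: {ratio:.2f}")
```

With these made-up numbers, nurses appear to have roughly eight times the risk of engineers when gender is ignored, yet within each gender the risk ratio is 1. The apparent effect of occupation comes entirely from the different gender mix of the two jobs.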

Selecting the right variables for a multivariable model depends in large part on understanding, at least to some degree, the probable causal associations between different variables. In some cases this will be based on previous research (for example it is now well known that smoking causes lung cancer), and sometimes based on logic (for example it is obvious that having had a hysterectomy will reduce a woman’s risk of uterine cancer). Understanding these relationships allows the researcher to identify potential confounders of the relationship of interest, in order to include these in the model.

This approach is referred to as specifying the variables for inclusion in the model *a priori* (based on theoretical reasoning, rather than empirical observation of the data). However, in many branches of science, a variety of other methods for choosing variables have become common.

*The problem with forward selection*

In many studies, the authors will say that they tested all univariable associations between each exposure and their outcome of interest, and then included only significant variables in their final model. The problem with this approach is that it will result in the exclusion of exposures which have a true relationship with the outcome, but for which that relationship was hidden by a confounder.

For example, there is a true relationship between alcohol consumption and breast cancer, but men tend to drink alcohol more often and in larger amounts than women. If I were to look at alcohol consumption and the risk of breast cancer in a univariable analysis, it’s possible that I would not see any association – most of the heavy drinkers would be men, whose risk of breast cancer is very low, and this would hide the association. But if I included both gender and alcohol consumption, the true relationship would be uncovered – my model would show, correctly, that women who are heavier drinkers are at higher risk of breast cancer than women who don’t drink.

If I had used forward selection in this study, I would omit alcohol consumption from my final multivariable model of predictors of breast cancer, and I would never discover this association.
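The alcohol example can be sketched the same way. The counts below are invented for illustration (not real data) and the function names are hypothetical. The sketch compares the crude risk ratio, which is what a univariable screen would see, with a Mantel-Haenszel risk ratio pooled across gender. Mantel-Haenszel pooling is a stand-in here for a multivariable model, chosen only because it needs no statistical library.

```python
# Hypothetical counts, invented for illustration only (not real data).
# Per gender: (heavy drinkers, cases among them, non-drinkers, cases among them)
strata = {
    "women": (4000, 600, 6000, 600),  # risk 15% vs 10%
    "men": (6000, 9, 4000, 4),        # risk 0.15% vs 0.10%
}


def crude_risk_ratio(strata):
    """Risk ratio for drinking with gender ignored (the univariable screen)."""
    n1 = sum(s[0] for s in strata.values())
    a = sum(s[1] for s in strata.values())
    n0 = sum(s[2] for s in strata.values())
    c = sum(s[3] for s in strata.values())
    return (a / n1) / (c / n0)


def mantel_haenszel_risk_ratio(strata):
    """Risk ratio for drinking, pooled across gender strata."""
    numerator = 0.0
    denominator = 0.0
    for n1, a, n0, c in strata.values():
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


crude = crude_risk_ratio(strata)
adjusted = mantel_haenszel_risk_ratio(strata)

print(f"Crude risk ratio (gender ignored): {crude:.2f}")
print(f"Gender-adjusted risk ratio:        {adjusted:.2f}")
```

With these made-up counts the crude risk ratio is about 1.01, so a univariable screen would most likely discard alcohol, while the gender-adjusted risk ratio is 1.5.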

*The problem with p-values generally*

There is increasing recognition of the misinterpretation of p-values in the scientific literature. A p-value is a statistical expression of the probability that an association at least as strong as the one you saw would have been observed if there were no true relationship between the two variables, given the size of the sample that you used in your study. The reason that p-values are useful is that all studies are subject to random sampling variation – consequently there is always a possibility that a finding is simply a fluke (a false positive).

So for example, as far as we know, tall people are no more likely than short people to be good at maths. But let’s say that I recruited a small number of tall people and short people, and gave them a maths test. There’s a possibility that just by chance, I might select a few tall people who are very good at maths and a few short people who were very bad at maths, and therefore see an association – perhaps the tall people in my study would be twice as likely as the short people to score above 80% on the test.
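One way to see what "just by chance" means is to simulate it. The sketch below assumes a hypothetical study of ten tall and ten short people in which 6 tall and 3 short people scored above 80%. These numbers, the function name, and the choice of a simple simulation are all assumptions made for illustration. It repeatedly generates studies in which height has no effect at all, and counts how often the gap between the groups is at least as large as the one observed. That proportion is a rough, simulation-based p-value.

```python
import random


def simulated_p_value(n_per_group, observed_gap, pass_rate, n_sims=20000, seed=1):
    """Share of no-effect studies with a gap in passes >= the observed gap.

    observed_gap is the absolute difference in the NUMBER of people passing.
    pass_rate is the chance of passing, assumed identical for both groups.
    """
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        tall = sum(rng.random() < pass_rate for _ in range(n_per_group))
        short = sum(rng.random() < pass_rate for _ in range(n_per_group))
        if abs(tall - short) >= observed_gap:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims


# Hypothetical study: 6 of 10 tall and 3 of 10 short people scored above 80%.
# Under "no effect", everyone passes with the pooled rate of 9/20.
p_value = simulated_p_value(n_per_group=10, observed_gap=3, pass_rate=9 / 20)
print(f"Simulated p-value: {p_value:.3f}")
```

In a study this small, a two-fold difference turns up in a sizeable share of the simulated no-effect studies, which is why such a result on its own is weak evidence of a real relationship.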

The likelihood that I will accidentally recruit tall mathsy people and short non-mathsy people gets smaller as my sample size gets larger. If I choose five tall people and five short people, I would only need one or two tall maths whizzes to throw my results out. But if I recruited a thousand tall people and a thousand short people, I would need a lot more tall maths whizzes to affect the results, and the chances of this happening by accident are much smaller. (Think of how much less likely it is that you would flip a fair coin and get ten heads in a row, compared to three heads in a row.) For this reason, p-values get smaller as studies get larger, even when the effect size (e.g. a two-fold increase in scores over 80%) remains the same.
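The effect of sample size on the p-value can be checked with a few lines of Python. This sketch assumes a two-proportion z-test (a normal approximation, which is rough for the smallest group size), and the 40% versus 20% pass rates are invented so that tall people are always exactly twice as likely to score above 80%. The function name is hypothetical.

```python
import math


def two_proportion_p_value(passes_a, n_a, passes_b, n_b):
    """Two-sided p-value from a two-proportion z-test (normal approximation)."""
    p_a = passes_a / n_a
    p_b = passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    # erfc(|z| / sqrt(2)) is the two-sided tail area of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


# Same two-fold effect (40% vs 20% scoring above 80%) at three study sizes.
p_values = {}
for n in (20, 200, 2000):
    tall_passes = 2 * n // 5   # 40% of the tall group
    short_passes = n // 5      # 20% of the short group
    p_values[n] = two_proportion_p_value(tall_passes, n, short_passes, n)
    print(f"{n:>4} per group: p = {p_values[n]:.3g}")
```

The effect size is identical in all three scenarios. Only the number of people changes, yet the p-value moves from well above 0.05 with 20 people per group to far below 0.001 with 200 or more.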

As a consequence, a p-value is always to some degree simply a reflection of the sample size of the study. With extremely large studies, even a tiny difference between two groups can attain a very small p-value (e.g. p=0.001), even if it has no genuine importance. With small studies, a very large difference between two groups may only attain a large p-value (e.g. p=0.20), even if the difference is very important (e.g. a ten-fold increase in the risk of cancer). This issue is often not acknowledged when people interpret p-values, and it is especially a problem if people are using p-values to help design their statistical analysis (I’m looking at you, Big Data).

A much bigger problem with p-values is everything they can’t tell you. P-values are often mistakenly interpreted as the probability that the results of the study are “correct”, but in fact their meaning is much more specific, as I described above. The p-value only tells you the chance that you would see an effect at least the size that you did if there were truly no relationship between two variables. That’s it! And even this meaning is subject to a large number of assumptions – a given p-value will only be correct if your sample was truly representative of the population you are studying; if you measured all your variables without certain types of error; if you included all of the important variables in your model the correct way, and several other considerations. Ken Rothman and colleagues have an excellent and comprehensive paper on the various limitations of p-values and their correct interpretations. The language is a little technical, but the list of misconceptions should be useful to most people.

The take home messages are these: building an accurate statistical model requires good content knowledge and an appropriate statistical approach, and even then, the results must be interpreted conservatively. Highly statistically significant findings might be the result of incorrect sample selection, incorrect model specification, or mere chance, while non-significant findings may be the result of each of these or inadequate sample size.

*So how do I build my model correctly?*

The pressure to publish means that many people are expected to participate in components of research without formal training, and this seems to be especially true in statistics. We don’t imagine that a statistician would be skilled at conducting qualitative interviews on delicate topics, or that they could be safely trusted to administer a medical treatment as part of a clinical trial. Model specification requires an in-depth understanding of epidemiology and biostatistics *and* a detailed knowledge of the content area – as such, it’s best done either in collaboration between a content person and a statistics person, or by someone with knowledge and training on both fronts.

The most straightforward solution for non-statistical researchers will often be collaboration with a statistician or an epidemiologist. Often we can help you specify a model correctly in an hour or two, with your help to understand the topic and the known causal relationships that are relevant. Many universities offer free or low-cost statistical consulting services for this purpose – use them! In addition, many early career statisticians and epidemiologists have more limited demands on our time than senior statisticians, and might be happy to help you in exchange for inclusion as co-authors. Building collaborations with people in these fields will improve the quality of your research hugely, and often save you a great deal of time in the long run (as well as improving your chances on grants that undergo methodological review).

Statisticians and epis are not always available or accessible, but thankfully the training required to correctly specify multivariable models is not especially arduous and is not beyond most people conducting research. If you are going to spend a long time working on projects with statistical components, it is definitely worth investing the time to learn how to specify models appropriately. I often encounter PhD students who have spent three or four years on a research project without spending a month or two gaining the statistical skills to correctly analyse and interpret their data. This strikes me as a really unfortunate missed opportunity. Many universities offer short courses in statistics that assume minimal previous knowledge, often at discounted rates for staff and especially post-grad students. Statistics is not everybody’s cup of tea, but these courses will help you immeasurably in conducting your own research and evaluating that published by others in your field.

Anybody can pick up a hammer, but building a house requires collaboration between a number of people with different skill sets. You wouldn’t ask a plumber to re-tile your roof, and you wouldn’t expect an architect to replace your dishwasher. Non-professionals can learn to do both those things, but consequences await those who attempt them without appropriate training. Statistics is no different, even though the consequences are often invisible. Your research is important, so it’s worth analysing your data correctly. You wouldn’t let a statistician with no clinical training go at one of your participants with a syringe – your data deserves the same respect!