What is misclassification? (Epidemiology)

Differential misclassification is probably the most complex topic we teach in introductory epidemiology, and it seems to be the one that students struggle with most. Most of the resources that we encounter don’t explain it very clearly for beginning students, so I thought I would make an attempt. This post is aimed at public health students starting out in epi, rather than laymen – there is a bit of terminology which may be hard for people with no public health experience to follow.

Misclassification is one type of measurement error. It refers to the incorrect identification of a binary variable (e.g. exposed / unexposed or outcome / no outcome). You can also have measurement error with regard to continuous variables such as weight, age, height, or blood pressure, but I will not be discussing that here.

In this post I’ll often be using the terms “case” to refer to someone with the outcome, and “control” to refer to someone without the outcome, in a reference to case-control studies. These studies have particular risks of differential misclassification, and this will also help me simplify the language in the post a bit. When I say “first variable” I mean the variable being misclassified (which could be either the exposure or the outcome). If we are talking about misclassification of the outcome, then the “other” variable is the exposure, and vice versa.

What is differential misclassification?

Misclassification comes in two types – differential misclassification and non-differential misclassification. “Differential” in this case refers to differential with regard to the other variable, that is

  • misclassification of the exposure that is different for people with and without the outcome, or
  • misclassification of the outcome that is different for people with and without the exposure

By “different” in this context we mean more or less likely to occur, since we are talking about binary variables. If you’re like most of my students, those sentences made your eyes glaze over a bit. So let’s look at some examples:

  • Doctors might investigate overweight (exposed) patients more thoroughly for cardiovascular disease (outcome) than non-overweight patients, leading to under-detection of heart disease in the non-overweight. In this case the measurement of the outcome (cardiovascular disease) depends on the exposure (being overweight).
  • People who know that they have lung cancer (outcome) might report their past smoking habits (exposure) differently to healthy people. See the 2×2 table below – 100 smokers without lung cancer have falsely claimed to be non-smokers, but all of the smokers with lung cancer have been honest.

[2×2 table: reported smoking by lung cancer status – 100 smokers without lung cancer misreport as non-smokers, while smokers with lung cancer all report accurately]

There’s an important thing to note here – some textbooks say that non-differential misclassification occurs when “all groups” are equally likely to be misclassified, and this can be confusing for students because there are four subgroups in any study (unexposed controls, exposed controls, unexposed cases and exposed cases). So some students think that equal proportions of people from each of these four subgroups have to be misclassified – this isn’t correct. What has to be equal is the proportion of cases and controls, or of exposed and unexposed people, who are misclassified. For example, people with and without lung cancer (outcome) must be equally likely to falsely tell you that they do not smoke (exposure). By “groups”, these authors mean the two large groups according to the other variable, not all four of the subgroups.

Non-differential misclassification occurs when the likelihood of incorrectly measuring the first variable is the same with regard to the other variable. For example:

  • If people are equally likely to have their lung cancer (outcome) detected, regardless of whether or not they smoke. Doctors investigate patients for the outcome (lung cancer) equally thoroughly, regardless of their exposure status (smoking or not smoking)
  • If everybody under-reports their alcohol consumption (exposure), regardless of whether or not they have colon cancer (outcome). See the 2×2 table below – half of the heavy drinkers have claimed to be low consumers of alcohol, and the proportion is the same in people with and without colon cancer.

[2×2 table: reported alcohol consumption by colon cancer status – half of the heavy drinkers report low consumption, in the same proportion for people with and without colon cancer]

Differential misclassification is most likely when the outcome and the exposure have both already occurred, and when the value of one is known to the person measuring the other. For example, I smoked for years, but I do not yet have any smoking-related illnesses – so how I report my smoking history to you today can’t depend on whether or not I will develop lung cancer in future. But if you were to ask me about my smoking after I got sick, I might report it differently – because I would know that I was ill, and this might make me more honest about my smoking history than I would have been when I was healthy.

Some students also get confused on this point – they think because the exposure is associated with the outcome, that people who get the outcome in future will be more likely to over- or under-report their exposure, even though it hasn’t happened yet. But the issue here isn’t whether smokers (who are all at high risk of lung cancer) are likely to report differently to non-smokers (who are mostly at low risk of lung cancer) – it’s whether smokers with lung cancer report differently to smokers without lung cancer. And they can’t possibly, if they don’t have lung cancer at the time that they report their smoking status. At that point in time, before they get sick, they’re all just smokers.

Why is differential misclassification especially bad?

Epidemiologists worry about differential misclassification more than non-differential misclassification because it can cause bias in either direction. Non-differential misclassification can (in almost all circumstances) only ever make an association look weaker than it truly is. This is because non-differential misclassification is essentially random mixing of the two study groups (exposed and unexposed, or cases and controls), which will always make them look more similar than they really are, bringing measures of association down. You can see this in the table below.

[2×2 tables: true data alongside a non-differential misclassification scenario and a differential misclassification scenario, with the resulting odds ratios]

In the scenario above, the true odds ratio is 4. In the non-differential example, 10% of both the exposed cases and the exposed controls have said that they were unexposed, bringing the observed odds ratio down to 3.5. In the differential scenario, 50% of the exposed controls have falsely said that they were unexposed, while the cases have all reported their exposure status accurately, bringing the odds ratio up to 10! (Note that you could also have a scenario where differential misclassification resulted in an odds ratio that was nearer to the null value of 1.0 – it just depends which category of participants incorrectly report their status.)
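The arithmetic behind these scenarios is easy to check yourself. The counts below are hypothetical (chosen to give a true odds ratio of 4, not taken from any real study), but they reproduce the same pattern: non-differential misclassification drags the odds ratio towards 1, while differential misclassification can push it away from 1.

```python
def odds_ratio(exp_cases, unexp_cases, exp_controls, unexp_controls):
    """Odds ratio from a 2x2 table: (a*d) / (b*c)."""
    return (exp_cases * unexp_controls) / (unexp_cases * exp_controls)

# Hypothetical counts chosen to give a true odds ratio of 4
a, b, c, d = 100, 50, 50, 100      # exposed/unexposed cases, exposed/unexposed controls
true_or = odds_ratio(a, b, c, d)   # 4.0

# Non-differential: 10% of exposed cases AND 10% of exposed controls
# falsely report being unexposed -- the bias is towards the null
nd_or = odds_ratio(a - 10, b + 10, c - 5, d + 5)   # 3.5

# Differential: 50% of exposed controls falsely report being unexposed,
# while all cases report accurately -- the bias is away from the null
diff_or = odds_ratio(a, b, c - 25, d + 25)         # 10.0
```

Try shifting counts between the other cells and you'll see that which direction the bias goes depends entirely on which subgroup misreports.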

Common misunderstandings

Misunderstanding 1: Mistaking systematic error for differential error

Some students get confused between error that is systematic (that is, always in the same direction) and error that is differential. For example, some students see a scenario where all participants under-report their weight or their alcohol consumption, and they think that this is differential because everybody is under-reporting rather than over-reporting. The important thing for distinguishing differential and non-differential misclassification isn’t the direction of the error, but whether it occurs equally for one variable regardless of the other.


Misunderstanding 2: Non-differential misclassification has no impact on study results because “everything balances out”

I think we often give students this impression since we are so much more worried about differential misclassification. I also suspect that some people get this idea from introductory statistics, if they know that random error does not affect means or proportions in single groups. But here we are discussing comparisons between groups, and that changes things. As can be seen in the 2×2 table above, non-differential misclassification does have an impact – it biases ratio measures towards the null value, because you’re mixing the two groups together and making them look more similar.


Misunderstanding 3: “Bias towards the null” means “The odds ratio gets smaller”

The null value for ratios is 1, not 0, so bias towards the null for ratios means that the observed ratio is closer to 1 than the true value is. This distinction doesn’t matter for ORs above 1 (provided the bias doesn’t carry them below 1), but if the true odds ratio were 0.4, bias towards the null would actually result in a larger number (e.g. 0.6 or 0.8), closer to 1.


How to tell the difference between differential and non-differential error 

Many students eventually feel that they understand the two types of misclassification in theory, only to get stumped by an example on an assignment or an exam. When assessing an example, step back and ask the following questions:

  • What is the exposure, when was it measured, and by whom?
  • What is the outcome, when was it measured, and by whom?
  • Given these facts, could the accuracy of the measurement of the exposure be affected by the outcome (or vice versa)?

If the exposure was measured before the outcome, then differential misclassification of the exposure is very unlikely, since the outcome isn’t known when the exposure is measured – whether I develop lung cancer in 2030 can’t affect what I tell you today about my smoking history. However, differential misclassification of the outcome is still possible, since the exposure happened first and was known when the outcome was measured, e.g. if my doctor investigates me more carefully for lung disease in future because she knows that I used to smoke.

If the people who measured the outcome don’t know the exposure status of the participant (or vice versa), differential misclassification is less likely. This is why we use blinding when possible – to stop researchers or patients reporting better outcomes if they know they are receiving a real drug (the exposure) rather than a placebo, for example.

If the exposure and the outcome are measured at the same time and the people doing the measurements are aware of them both, then differential misclassification is a risk. The next step is to assess what impact it would have on the results of the study.

You can do this by thinking about whether the misclassification would make the exposure look more or less common than it really is among people with the outcome (or vice versa), and whether this would exaggerate or reduce the apparent difference between the two groups (with and without the outcome). This would then send the odds ratio away from one (more different) or towards one (more similar). Most students find this very hard at first, but it gets easier with practice. Use practice exam questions if you can, and try to think about this issue whenever you are reading a study on an area that interests you.


Starting out with a dSLR

Recently I tried to show a friend how to use a dSLR, and realised that shooting on full manual is a bit like driving – after a while you forget how many things you’re actually doing at once. It turns out it’s too much to explain in a twenty minute sitting, so I thought I’d write it down. I’m going to stick to the real key stuff – if there are some unclear parts, hopefully Google will lead you to more comprehensive explanations.

Settings – shutter speed and aperture

The light meter shows you whether a photograph is correctly exposed. There are two major settings you need to play with to expose a photograph correctly – aperture and shutter speed. Faster shutter speeds require larger apertures (lower F numbers) to get enough light into the camera for a correct exposure. The aperture is a feature of the lens, while the shutter speed is a function of the machinery in the camera body – not all lenses can open to a large enough aperture to support high shutter speeds in all conditions.
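This trade-off can be made concrete with the standard exposure value formula, EV = log2(N²/t), where N is the F number and t is the shutter speed in seconds: opening the aperture one stop while halving the shutter time leaves the exposure unchanged. (The settings below are just illustrative.)

```python
from math import log2

def exposure_value(f_number, shutter_seconds):
    # Standard exposure value formula: EV = log2(N^2 / t)
    return log2(f_number ** 2 / shutter_seconds)

# f/4 at 1/125s and f/2.8 at 1/250s admit the same amount of light:
# one stop faster shutter, one stop wider aperture
ev_a = exposure_value(4.0, 1 / 125)     # ~11.0
ev_b = exposure_value(2.8284, 1 / 250)  # ~11.0 ("f/2.8" is rounded from 2*sqrt(2))
```

This is why a fast lens matters indoors: if the light only supports EV 6 or 7, the F number has to come down for the shutter speed to stay hand-holdable.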

Shutter speeds slower than about 1/125 require a tripod, as this is slow enough for the shaking of your hands to blur the image. If you’re shooting outside, most apertures will be fine for shutter speeds faster than this. Inside, you will quickly run into trouble unless you have a lens with a lower limit of, say, F1.8 (compared to, say, F4 for a cheaper lens or a lens intended for outside use).

For moving subjects (sports, kids, animals), you will want a high shutter speed (say 1/1000) and then whatever aperture will correctly expose the photo in the light conditions. For moving subjects in low light (live music), you will need a good lens capable of a wide aperture, since you would need to use a very long exposure with a worse lens, which would result in a blurry photo. For portraits (outside), you will often want to use a large aperture (a low F number), since this will give selective focus – blurring the background behind the subject.

In these scenarios, instead of shooting on full manual you can select the shutter priority (moving subjects) or aperture priority (portraits) setting on a dSLR, and let the camera do the work of determining what the other setting should be.

White balance

Lighting conditions interfere with photography vastly more than they interfere with eyesight – this is why photographs taken indoors often come out strange colours, for example with either an orange or blue cast. The automatic white balance functions on a camera are designed to compensate for commonly encountered lighting situations – fluorescent, incandescent, cloud cover and so on – and address the colour problems these create.

Have you ever noticed that when you take a smartphone photo of a sunset, all the colours disappear? This is your phone attempting to “fix” the intense colours you’re trying to photograph. (The solution to this is to ask the phone to meter off the brightest part of the sky – this way you can trick it into making the rest of the image darker, which will preserve some of the colours.)

If the pre-set white balance modes on your camera aren’t cutting it, you can also use custom white balance settings, but this is a bit of a pain.


Focus

One important difference between point and shoot cameras and dSLRs is the ability to focus manually. This can be useful, but to be honest the autofocus is often pretty good. Part of what you are paying for with a more expensive camera is more nuanced and faster autofocus processing – on a cheaper camera, autofocus often won’t be fast enough for a moving subject, for instance. This is less of an issue with a high F stop, since the field of focus is so deep – a child taking a step forwards or back in a park won’t move out of focus. It can be a problem with live music photography, though, where you are stuck with a low F number to compensate for a slow shutter speed while photographing a fast-moving subject. I tend to use autofocus in most situations, with the exception of live music and macro photography, where I don’t expect the camera to be able to work out what part of an image I want to be in focus.


Composition

This is the least technical component of photography and the most artistic, but one useful bit of advice I can give people starting out is to look out for things cluttering the edges of your images – this is almost never good. You don’t want a portrait of a friend with half of someone else’s arm in the frame if you can help it. Likewise, if you’re photographing a cool building, you don’t want half a car sticking into one side of the photograph. Keep objects either all in, or all out, when possible.

Finally, prepare for some disappointment. Your brain does a tremendous amount of processing for you, and a camera can’t do that. You will see a lot of things that are striking, but that don’t photograph well because there is a bunch of stuff cluttering up the space around them. I often notice a particular building or tree, and take my camera out only to notice power lines, cars, traffic lights, or plants between me and it. My brain was helpfully ignoring them, but the camera won’t. On occasion this results in pleasant surprises – one friend of mine was busy photographing a shark once, and got home to discover she hadn’t noticed the manta ray above it.

What’s the difference between a $300 smartphone, a $500 dSLR, and a $1,000 dSLR?

This seems like a relevant question for a lot of people. Realistically, your iPhone is not going to be much worse than a point and shoot around the same price point. Both cameras rely mostly on software, and Apple and Samsung have invested heavily in making the camera software in their phones really good. As you move up the price brackets with dSLRs, you are buying a bigger sensor (which can capture more detail) and more processing power (which helps with autofocus, metering, white balancing, etc.). The difference between my dSLR and my iPhone in the park on a sunny day isn’t that large – it doesn’t really matter for Facebook happy snaps. The difference between the two at a gig in a dark bar is vast – my dSLR can take a detailed photo of a moving subject in low light, which my iPhone absolutely cannot. You’ve tried taking a picture of your friends in a dark bar with a smartphone, right? You get a red grainy mess. If I do that with my dSLR, I get a portrait that I can edit until it looks almost like it was taken in daylight. The sensor is capturing detail that my iPhone can’t see, and the software is correcting out the worst lighting and colour problems before the photo gets to my laptop, where I can finish fixing it up. Likewise, my smartphone is not up to the task of taking a sharp, in-focus photograph of someone skateboarding or playing basketball, but my dSLR is.

Now, where the point of diminishing returns is for you price-wise depends on what you want to do with your camera. If you want to take good portraits in low light, or do sports photography or something else that places a lot of mechanical or processing demands on the camera, you have to cough up. If you just want to take nicer pictures outdoors, then a lower end camera will do you fine. If you do want a mid-range camera, I recommend buying second hand – lots of people upgrade semi-regularly and the resale value isn’t great, even though the cameras are usually fine. Even an entry-level dSLR is the kind of thing people buy, use for a few years, take good care of because it cost them $500, then sell when they upgrade to something fancier. eBay is your friend.


Why isn’t cancer treatment X covered by Medicare?

Every so often I see a crowd funding campaign for medical care that isn’t covered by Medicare (in Australia) or the NHS (in England). Sometimes it’s for an experimental treatment, other times it’s for alternative therapy, and occasionally it’s something that is established but not covered in the country in question. I suspect it’s not clear to most people how it’s decided what does and doesn’t get covered by Medicare, and maybe an explanation will help people decide if and when to support campaigns like these.

There are three main reasons that a particular treatment or medical service wouldn’t be on Medicare – 1) we don’t know if it works yet, 2) we’re quite sure that it doesn’t work, 3) it works but the government considers it too expensive for the benefits it has.

Category 1 contains things like experimental drugs that haven’t yet been approved by the TGA (Australia’s version of the FDA). These are drugs that might be in early clinical trials, but maybe the person running the funding campaign hasn’t been able to participate in a trial in Australia for whatever reason. A lot of drugs don’t make it past early clinical trials because they turn out not to work or because they have unacceptable harms, but some do. I can see why someone with a life threatening illness would be willing to try experimental therapies if they had exhausted all of their other options. At the same time, there are (sometimes unknown and unquantified) risks involved in taking experimental medicines, and the costs can be substantial. I guess this is one of those difficult case-by-case kinds of decisions.

Category 2 is things like homeopathic “treatment” for breast cancer. There’s no evidence that these therapies work, which is why they are not subsidised by Medicare. People who take these therapies instead of medical treatment that is known to work are putting themselves at high risk of progressing to invasive forms of cancer that may kill them. I personally wouldn’t support a campaign like this, and if it were being run by someone I knew I would try very hard to convince them to talk to their doctors about their genuine treatment options. Some forms of cancer are amenable to surgery, radiotherapy and/or hormone treatment if a patient doesn’t want chemotherapy. Some people with types of cancer that are very likely to be fatal even with treatment might choose not to have any treatment, and that’s their right – the treatment isn’t always worth it. But someone who does want treatment shouldn’t be turning to homeopathy.

Category 3 is things that are known to work but are beyond what’s called the government’s “willingness to pay threshold” for the benefits they provide. The government has a cut off point where it won’t pay beyond a certain amount for a given improvement in quality or quantity of life – for instance a chemo drug that gave you an average extra two weeks to live at a cost of $40,000 wouldn’t be approved. There are some more borderline cases though, where some people might have good reason to pay for treatments privately – some people are currently doing this with HIV preventative therapy, which hasn’t been added to the PBS yet (but hopefully will be at some point – it works brilliantly, it’s just pricey). I saw one gofundme for a specific type of prosthetic which is unusual and might have limited utility for the average (old) amputee, but would have given the young guy raising funds for it big improvements in his quality of life, and I contributed to that.

The government’s willingness to pay is based on the average patient – for younger people in particular, a 1 in 1,000 chance of remission might be worth $100,000, even if it probably isn’t worth it for a seventy year old. For a young parent with cancer, an extra couple of years of life on chemo that they could spend with their kids might be worth almost any amount of money, but for an eighty year old with dementia, the same might not be true. These calculations take into account reductions in quality of life due to the treatment itself, not just increases in the length of life. So, for example, a drug that gives you a good chance of living a lot longer with a high quality of life is more likely to be subsidised than a drug that does the same thing, but with a much lower quality of life due to side effects.
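The shape of this calculation can be sketched with cost-per-QALY (quality-adjusted life year) arithmetic. All the numbers below are illustrative – the actual figures used in Australian subsidy decisions are not published as a single threshold.

```python
# Illustrative cost-effectiveness arithmetic (all numbers hypothetical)
cost = 40_000                    # cost of the treatment, in dollars
extra_life_years = 14 / 365.25   # an average extra two weeks of life
quality_weight = 1.0             # assume full quality of life (often it's lower)

# Cost per quality-adjusted life year gained: over a million dollars
cost_per_qaly = cost / (extra_life_years * quality_weight)

# Willingness-to-pay thresholds commonly discussed are in the tens of
# thousands of dollars per QALY, so this treatment would not be approved
threshold = 50_000               # hypothetical threshold
approved = cost_per_qaly <= threshold   # False
```

Note how the quality weight works: a treatment with severe side effects gets a weight well below 1.0, which shrinks the QALYs gained and pushes the cost per QALY even higher.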

There are some therapies that the government won’t pay for but an individual might, and in some cases I can fully support that. However, particularly in the case of life-threatening illnesses, there’s a risk that desperation will drive people into financial ruin for a moon shot. Even people whose treatments are on Medicare can end up having chemo they really shouldn’t, because their doctors dodge the hard conversations about time frames, harms, and benefits. Some forms of cancer are almost universally fatal, and it’s only ever a question of six months without treatment or two years with. That is a hard conversation for doctors to have, and a much harder one for patients – so the conversation happens in vague terms instead of specific ones, and people often walk away with a very optimistic idea of the best-case scenario with their treatment. This is awful enough even when the government pays, if it means that people accept terrible side effects without realising it’s only buying them an extra few months instead of a chance at remission and living to be eighty. But it’s a catastrophe if, as well as that, it drives a family into financial ruin for a promise of something that was never going to be possible. (I urge you, if you are ever making these kinds of decisions, to ask your doctor for specific details about probable survival times with and without treatment. Being Mortal by Atul Gawande is a good book on these issues that everybody should read *before* they come face to face with them.)

These are hard decisions to make. The person running the funding campaign often won’t know which of these categories the therapy they want falls into – for example people who believe in homeopathy will think it’s not on the PBS because the government are mistaken about its benefits, rather than because it doesn’t work. Someone who wants a therapy in the first or the third category might not realise the risks or the harms involved in taking it, in addition to the cost. I very rarely contribute to these kinds of campaigns unless I can see, like the case of the new prosthetic, that there’s very likely to be a substantial gain for the money put in. Whether you do is your choice, though I’d urge you not to encourage people seeking homeopathic therapy for highly treatable but potentially fatal conditions like breast cancer.

I should note that none of this is relevant when the campaign is to support costs of living, rather than treatment itself. A severe illness that results in an inability to work can impose severe financial hardship on people, and we don’t have particularly good systems for dealing with that in Australia. It also doesn’t apply to people who live in Australia or England but are not eligible to access the public health system, or to people who live in countries with more limited public health systems. These issues are specific to treatments for Australian citizens or permanent residents. I hope this was helpful.


How (not) to build multivariable models

Why do we use multivariable models?

The purpose of statistical models in research, to put it very simply, is to help identify possible causal relationships between variables. Models can’t always achieve this aim – they might incorrectly identify a relationship that doesn’t really exist (a false positive), or alternatively their results might fail to find a relationship that really does exist (a false negative). This might happen for a variety of reasons, some of which occur at the study design or data collection stages, but sometimes it will occur because the variables included in the model weren’t chosen correctly. In this post I’m going to try to explain some basic principles of variable selection, and point out some common mistakes. My examples are all medical but these concepts are applicable to any type of research using statistical models to identify causal relationships.

There are two broad types of statistical models. There are univariable models that include one dependent variable (the outcome) and one independent variable (the exposure), and multivariable models that include the outcome variable and a number of other variables, either multiple exposures, or possible confounders.

A confounder is a variable that affects the appearance of the relationship between two other variables in a certain way. A confounder might make it look like two variables have a relationship when really they don’t, it might make the relationship look stronger or weaker than it really is, or it might make a real relationship disappear entirely. The main purpose of using multivariable rather than univariable models is to mitigate the impact of confounders, in order to get better estimates of the true relationships.

Here’s an example of a confounder: although men can get breast cancer, it is much more common in women. Women are also much more likely than men to be nurses, and much less likely to be engineers. If we were to do a study on occupation and the risk of breast cancer, then because gender affects both breast cancer risk and occupation, we might see a relationship between occupation and breast cancer where none really exists. Gender will confound the relationship between occupation and breast cancer risk. In a univariable model, it would look like being a nurse dramatically increased your risk of breast cancer compared to being an engineer (because most of the nurses would be women, and therefore most of the breast cancer patients would be nurses, not engineers). If we included both occupation and gender in a multivariable model, then the true relationship would become apparent – the model would identify that it was really gender, not occupation, which determines breast cancer risk.
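The nurse/engineer example can be checked with a quick stratified analysis. The counts below are invented for illustration: within each gender, nurses and engineers have identical breast cancer risks, yet the crude (unadjusted) comparison suggests nurses are at far higher risk.

```python
# Invented counts: occupation is strongly associated with gender,
# and gender is strongly associated with breast cancer risk
#                  (cases, total)
women_nurses    = (90, 900)   # risk 0.10
women_engineers = (10, 100)   # risk 0.10
men_nurses      = (1, 100)    # risk 0.01
men_engineers   = (9, 900)    # risk 0.01

def risk(cases, total):
    return cases / total

# Crude comparison: looks like a ~4.8-fold risk of breast cancer in nurses
crude_rr = risk(90 + 1, 1000) / risk(10 + 9, 1000)       # 0.091 / 0.019

# Stratified by gender: no association with occupation at all
rr_women = risk(*women_nurses) / risk(*women_engineers)  # 1.0
rr_men   = risk(*men_nurses) / risk(*men_engineers)      # 1.0
```

Stratification is the simplest way to see what a multivariable model does: adjusting for gender is, loosely, a weighted combination of the two stratum-specific comparisons.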

Selecting the right variables for a multivariable model depends in large part on understanding, at least to some degree, the probable causal associations between different variables. In some cases this will be based on previous research (for example it is now well known that smoking causes lung cancer), and sometimes based on logic (for example it is obvious that having had a hysterectomy will reduce a woman’s risk of uterine cancer). Understanding these relationships allows the researcher to identify potential confounders of the relationship of interest, in order to include these in the model.

This approach is referred to as specifying the variables for inclusion in the model a priori (based on theoretical reasoning, rather than empirical observation of the data). However in many branches of science, a variety of other methods for choosing variables have become common.

The problem with forward selection

In many studies, the authors will say that they tested all univariable associations between each exposure and their outcome of interest, and then included only significant variables in their final model. The problem with this approach is that it will result in the exclusion of exposures which have a true relationship with the outcome, but whose relationship was hidden by a confounder.

For example, there is a true relationship between alcohol consumption and breast cancer, but men tend to drink alcohol more often and in larger amounts than women. If I were to look at alcohol consumption and the risk of breast cancer in a univariable analysis, it’s possible that I would not see any association – most of the heavy drinkers would be men, whose risk of breast cancer is very low, and this would hide the association. But if I included both gender and alcohol consumption, the true relationship would be uncovered – my model would show, correctly, that women who are heavier drinkers are at higher risk of breast cancer than women who don’t drink.

If I had used forward selection in this study, I would omit alcohol consumption from my final multivariable model of predictors of breast cancer, and I would never discover this association.
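Here is that masking effect in miniature, with invented counts: within women, heavy drinking doubles the risk, but because most heavy drinkers are men (whose baseline risk is tiny), the crude comparison actually makes drinking look protective – exactly the kind of univariable result that forward selection would throw away.

```python
def risk(cases, total):
    return cases / total

# Invented counts: heavy drinking doubles breast cancer risk in women,
# but most heavy drinkers are men, whose baseline risk is very low
women_heavy = (40, 200)    # risk 0.20
women_light = (80, 800)    # risk 0.10
men_heavy   = (8, 800)     # risk 0.01
men_light   = (2, 200)     # risk 0.01

# Crude comparison: heavy drinkers appear to be at LOWER risk
crude_rr = risk(40 + 8, 1000) / risk(80 + 2, 1000)   # 0.048 / 0.082, about 0.59

# Stratified by gender: the true doubling of risk is visible within women
rr_women = risk(*women_heavy) / risk(*women_light)   # 2.0
```

A univariable screen sees only `crude_rr`; a model that includes gender recovers the stratum-specific effect.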

The problem with p-values generally

There is increasing recognition of the misinterpretation of p-values in the scientific literature. A p-value is a statistical expression of the probability that an association at least as strong as the one you saw would have been observed if there were no true relationship between the two variables, given the size of the sample used in your study. The reason that p-values are useful is that all studies are subject to random sampling variation – consequently there is always a possibility that a finding is simply a fluke (a false positive).

So for example, as far as we know, tall people are no more likely than short people to be good at maths. But let’s say that I recruited a small number of tall people and short people, and gave them a maths test. There’s a possibility that just by chance, I might select a few tall people who are very good at maths and a few short people who were very bad at maths, and therefore see an association – perhaps the tall people in my study would be twice as likely as the short people to score above 80% on the test.

The likelihood that I will accidentally recruit tall mathsy people and short non-mathsy people gets smaller as my sample size gets larger. If I chose five tall people and five short people, one or two tall maths whizzes would be enough to throw my results out. But if I recruited a thousand tall people and a thousand short people, I would need a lot more tall maths whizzes to affect the results, and the chances of this happening by accident are much smaller. (Think of how much less likely it is that you would flip a fair coin and get ten heads in a row, compared to three heads in a row.) For this reason, p-values get smaller as studies get larger, even when the effect size (e.g. a two-fold increase in scores over 80%) remains the same.

As a consequence, a p-value is always to some degree simply a reflection of the sample size of the study. With extremely large studies, even a tiny difference between two groups will attain a very small p-value (e.g. p=0.001), even if it has no genuine importance. With small studies, a very large difference between two groups may attain only a large p-value (e.g. p=0.20), even if the difference is very important (e.g. a ten-fold increase in the risk of cancer). This issue is often not acknowledged when people interpret p-values, and it is especially a problem if people are using p-values to help design their statistical analysis (I’m looking at you, Big Data).
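The dependence of the p-value on sample size is easy to see with a small calculation. This sketch uses a simple two-proportion z-test built from the standard library (my choice of test, and the counts are invented to match the tall/short example above – a two-fold difference, 40% vs 20% scoring above 80%):

```python
# How p-values shrink with sample size while the effect size stays fixed.
# Illustrative numbers only: 40% vs 20% scoring above 80% on the maths test.
from math import sqrt, erfc

def two_prop_p(x1, n1, x2, n2):
    """Two-sided p-value for a difference between two proportions (z-test)."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return erfc(abs(z) / sqrt(2))

# Same two-fold effect, two very different sample sizes:
p_small = two_prop_p(8, 20, 4, 20)          # 20 people per group
p_large = two_prop_p(400, 1000, 200, 1000)  # 1000 people per group

print(f"n=20 per group:   p = {p_small:.3f}")   # well above 0.05
print(f"n=1000 per group: p = {p_large:.2e}")   # vanishingly small
```

The effect size (a doubling) is identical in both cases; only the sample size changes, and the p-value moves from “not significant” to overwhelming.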

A much bigger problem with p-values is everything they can’t tell you. P-values are often mistakenly interpreted as the probability that the results of the study are “correct”, but in fact their meaning is much more specific, as I described above. The p-value only tells you the chance that you would see an effect at least as large as the one you did if there were truly no relationship between two variables. That’s it! And even this meaning is subject to a large number of assumptions – a given p-value will only be correct if your sample was truly representative of the population you are studying; if you measured all your variables without certain types of error; if you included all of the important variables in your model the correct way; and several other considerations. Ken Rothman and colleagues have an excellent and comprehensive paper on the various limitations of p-values and their correct interpretations. The language is a little technical, but the list of misconceptions should be useful to most people.

The take home messages are these: building an accurate statistical model requires good content knowledge and an appropriate statistical approach, and even then, the results must be interpreted conservatively. Highly statistically significant findings might be the result of incorrect sample selection, incorrect model specification, or mere chance, while non-significant findings may be the result of any of these, or of an inadequate sample size.

So how do I build my model correctly?

The pressure to publish means that many people are expected to participate in components of research without formal training, and this seems to be especially true in statistics. We don’t imagine that a statistician would be skilled at conducting qualitative interviews on delicate topics, or that they could be safely trusted to administer a medical treatment as part of a clinical trial. Model specification requires an in depth understanding of epidemiology and biostatistics and a detailed knowledge of the content area – as such, it’s best done either in collaboration between a content person and a statistics person, or by someone with knowledge and training on both fronts.

The most straightforward solution for non-statistical researchers will often be collaboration with a statistician or an epidemiologist. Often we can help you specify a model correctly in an hour or two, with your help to understand the topic and the known causal relationships that are relevant. Many universities offer free or low-cost statistical consulting services for this purpose – use them! In addition, many early career statisticians and epidemiologists have more limited demands on their time than senior statisticians, and might be happy to help you in exchange for inclusion as co-authors. Building collaborations with people in these fields will improve the quality of your research hugely, and often save you a great deal of time in the long run (as well as improving your chances on grants that undergo methodological review).

Statisticians and epis are not always available or accessible, but thankfully the training required to correctly specify multivariable models is not especially arduous and is not beyond most people conducting research. If you are going to spend a long time working on projects with statistical components, it is definitely worth investing the time to learn how to specify models appropriately. I often encounter PhD students who have spent three or four years on a research project without spending a month or two gaining the statistical skills to correctly analyse and interpret their data. This strikes me as a really unfortunate missed opportunity. Many universities offer short courses in statistics that assume minimal previous knowledge, often at discounted rates for staff and especially post-grad students. Statistics is not everybody’s cup of tea, but these courses will help you immeasurably in conducting your own research and evaluating that published by others in your field.

Anybody can pick up a hammer, but building a house requires collaboration between a number of people with different skill sets. You wouldn’t ask a plumber to re-tile your roof, and you wouldn’t expect an architect to replace your dishwasher. Non-professionals can learn to do both those things, but consequences await those who attempt them without appropriate training. Statistics is no different, even though the consequences are often invisible. Your research is important, so it’s worth analysing your data correctly. You wouldn’t let a statistician with no clinical training go at one of your participants with a syringe – your data deserves the same respect!


To screen or not to screen

Something I discuss with friends and family occasionally is the risk / benefit balance when it comes to making decisions about things like mammograms or prostate cancer screening. This post is about screening for people with no known risk factors – people with serious family histories are a different kettle of fish.

The fact that screening might sometimes do more harm than good is a concept that’s only just getting into public discourse in a major way. In my own experience doctors don’t tend to present this side of things to patients – I’ve never had cervical cancer screening presented to me as being optional, for example, simply as something that I’m expected to have. I’ve never had a discussion with a doctor about the potential risks or harms of screening, or even had a doctor acknowledge that there are any. Ditto with sexual health screening. I suspect conversations about mammography, prostate, and colon cancer screening are similar.

Conversely, research shows that people dramatically over-estimate the benefits of screening in terms of their risk of advanced cancer, and their risk of death. Even doctors are often unaware of how high the risks of false positives can be with some tests, and might treat all patients with a positive test as though they’re definitely sick, which causes emotional distress and may require unnecessary treatment. At the extreme end of this is the bad side of prostate cancer screening, where men undergo invasive surgery with potentially serious, long-term effects on their sexual function and their urinary continence for cancer that might never have caused them any serious harm.

Recommendations around prostate cancer screening are changing, in recognition of these harms. I suspect that mammography guidelines will change in the near future as well, and Australia is moving to five-yearly rather than two-yearly cervical cancer screening. My hope is that eventually, cancer screening will be approached more in the way that genetic testing is – as a complex decision with risks and benefits, which individual patients need to weigh up for themselves with the help of a doctor, rather than under the instruction of one.

If you participate in cancer screening, this is something you could discuss with your GP. Questions to ask run along the following lines:

  1. What’s the risk of me getting this type of cancer, at my age? How serious is that type of cancer – what proportion of people who get it die within five years?
  2. What does the test involve? Is it uncomfortable or painful?
  3. What’s the likelihood that I will test positive? If I do, what would that mean?
  4. If I tested positive, what would the next steps be? What other tests or procedures would I have to have?
  5. What are the risks and the down sides of those procedures?

So for example, for a woman in her thirties, the risk of cervical cancer is about 1 in 10,000 in any given year, and 1 in 60 over her entire life. About 20% of women who do get cervical cancer will die of it. The risks for women who have been vaccinated will be lower. Abnormal pap smears indicating possible pre-cancerous changes are much more common, and if this finding is confirmed by a biopsy, require a surgical procedure to remove the abnormal tissue from the cervix. Cervical biopsies and surgery are about as great as they sound, although the surgery is normally done under general anaesthetic.
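The figures above can be combined into a rough back-of-envelope number. This is my own arithmetic using the risks quoted in the paragraph, not a clinical estimate – it ignores age, vaccination status, and the effect of screening itself:

```python
# Rough lifetime risk of dying of cervical cancer, from the figures above.
# Illustrative arithmetic only -- real risks depend on age, vaccination
# status, and screening history.
lifetime_risk = 1 / 60   # lifetime risk of developing cervical cancer
case_fatality = 0.20     # ~20% of women who develop it die of it

lifetime_death_risk = lifetime_risk * case_fatality
print(f"Roughly 1 in {round(1 / lifetime_death_risk)} women")  # 1 in 300
```

That 1-in-300 figure is the kind of baseline number worth having in mind when you weigh it against the frequency of abnormal smears, biopsies, and surgery.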

These questions might throw your GP, but this is a conversation that your GP should be capable of having, and it’s reasonable to expect to have it before you take a test. Screening is a procedure you’re offered for your own good – it shouldn’t be something your doctor twists your arm into accepting. The evidence for mammography and especially prostate cancer screening is quite weak – if you’re considering having these tests, you should have a detailed conversation with your doctor so you both understand the risks and benefits, and how these gel with your values and preferences.

These are also questions it doesn’t hurt to ask about sexual health screening. A lot of GPs take a “test for everything” approach which isn’t generally a very good idea, unless your sex life is pretty radical. For most people, risk of gonorrhea, syphilis, or hepatitis B or C is very low. Unless you’re at risk for specific reasons, a positive test result for one of those infections is likely to be a false positive unless it’s confirmed by a second test. This is also true of HIV – however confirmation for HIV is routine, since it’s such a serious diagnosis. If your GP is routinely testing you for everything, you can start requesting just specific tests. The Melbourne Sexual Health Centre has a web service that lets you check what tests are recommended for someone with your risk profile. If you feel better being tested for everything, that’s fine too – just be aware that you may get a false positive or two, so don’t hit the roof the first time it happens.
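The “likely a false positive” point comes straight from Bayes’ theorem: when the underlying risk is low, even a very accurate test produces mostly false positives. Here is a sketch with invented but plausible numbers – a 1-in-1000 prevalence and a test that is 99% sensitive and 99% specific are my assumptions, not the characteristics of any particular STI test:

```python
# Positive predictive value (PPV): the probability you actually have the
# infection, given a positive test. All three inputs are illustrative
# assumptions, not real test characteristics.
prevalence  = 0.001  # 1 in 1000 people actually infected
sensitivity = 0.99   # P(test positive | infected)
specificity = 0.99   # P(test negative | not infected)

true_pos  = sensitivity * prevalence                 # infected and positive
false_pos = (1 - specificity) * (1 - prevalence)     # healthy but positive
ppv = true_pos / (true_pos + false_pos)

print(f"PPV = {ppv:.1%}")  # ~9% -- most positives are false positives
```

In other words, with these numbers roughly nine out of ten positive results come from people who don’t have the infection – which is exactly why unexpected positives get confirmed with a second test.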

Usual disclaimers attach to this post: I am not a doctor, and I am not telling you whether you should or shouldn’t have any particular test. I am suggesting you have a discussion, and make an informed decision about each individual test you’re offered.



What is going on with the 2016 census?

So if you’ve been an Australian on the internet in the past month, you will know there is a furore underway about changes to the way identifying information (names and birth dates) will be used and stored after the 2016 census.

This post is an attempt to explain what’s going on in laymen’s terms, based on my understanding as a researcher who uses this kind of linked data from the Australian federal government. This post reflects my understanding; as far as I know it is free of errors, but please let me know if you believe I’m mistaken about any of the facts. It’s also not exhaustive; I’m focusing on the parts I feel strongly about as a public health researcher.

What the ABS appears to be proposing is that, as part of completing the census this year, people will be expected to give their names, dates of birth, genders, and addresses. This data will be stored for up to four years, an increase on the previous time period, and will be used to create a linkage key to allow the ABS to do additional data analysis with other datasets (pdf link, see page 7).

What’s a linkage key?

A linkage key is an ID number that allows the linking together of multiple datasets, via identifying information. For example, one of the data sources that the ABS has mentioned is health data, i.e. the Medicare database. Medicare knows your name, your date of birth, and your current address. Once the ABS has this information on your census record, computer folk can use the identifying information to match the two records together.

This is done by assigning an ID number (the linkage key) in one dataset, and then matching the same number to the name, date of birth and address in the second dataset, in order to subsequently link the two datasets together. So Jennifer Smith, born 5th April 1979, living on Canterbury Rd in Malvern, gets ID number 300002 assigned to her in her Medicare record, and then this number gets matched to the census record with the same name, date of birth, and address. Then Jennifer’s name, date of birth and address can be deleted from both datasets, and the datasets can be linked together just using her ID number. This process is repeated for everybody. Eventually a researcher or a statistician ends up with a dataset containing everybody’s Medicare data, everybody’s Census data, and everybody’s ID numbers, but nobody’s name, address, or date of birth. The names, dates of birth, and addresses can then be discarded or destroyed. (This is what I assume the ABS is referring to when they say they will never share identified information about you – at the point it’s shared, your name has been removed from it).
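The process above can be sketched in a few lines of code. The records, field names, and ID number below are invented for illustration – real linkage is far more careful, involving probabilistic matching, clerical review, and strict separation of roles – but the basic shape is this:

```python
# Toy sketch of the linkage process described above. Records and field
# names are invented; real linkage is far more careful than this.
medicare = [{"name": "Jennifer Smith", "dob": "1979-04-05",
             "address": "Canterbury Rd, Malvern", "visits": 12}]
census   = [{"name": "Jennifer Smith", "dob": "1979-04-05",
             "address": "Canterbury Rd, Malvern", "indigenous": False}]

def identity(rec):
    """The identifying fields used to match records across datasets."""
    return (rec["name"], rec["dob"], rec["address"])

# Step 1: assign a linkage key (ID number) to each identity in one dataset.
keys = {identity(rec): 300002 + i for i, rec in enumerate(medicare)}

# Step 2: attach the key to matching records in both datasets, then
# strip out the identifying fields.
def deidentify(rec):
    out = {k: v for k, v in rec.items()
           if k not in ("name", "dob", "address")}
    out["id"] = keys[identity(rec)]
    return out

# Step 3: merge the de-identified records on the linkage key.
linked = {}
for rec in medicare + census:
    d = deidentify(rec)
    linked.setdefault(d["id"], {}).update(d)

print(linked)  # ID 300002 now carries Medicare and census data, no name
```

The end product is what the researcher sees: Medicare variables and census variables joined on an ID number, with the names, dates of birth, and addresses gone.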

Why do this kind of research?

There is a lot of information about people that is relevant to their health, but isn’t routinely recorded in their medical records. A key example in Australia is whether somebody identifies as Aboriginal and Torres Strait Islander, which is normally not recorded in their Medicare data, but is recorded in the census. Linking Medicare and census therefore allows detailed research into issues affecting the health of Aboriginal and Torres Strait Islander people, at the population level. This kind of research is immensely valuable, and whole-of-population linked datasets are the holy grail in terms of data sources. This is just one example; you can do data linkage research on anything if the information you need is contained in two or more separate databases; the ABS has also mentioned employment and unemployment as a topic of interest.

So what’s the problem?

The problem, at least for me as a researcher, is that people aren’t being given the option to withhold consent and opt out of this research. I don’t think people are morally obligated to allow their personal data to be used for research if they don’t want it to be. I certainly don’t think they should be legally obligated to provide their names and dates of birth so that this kind of research can take place, if they would rather complete the census anonymously. The primary purpose of the census is to count and describe the Australian population, and names and dates of birth are not necessary for that; the data linkage research opportunity is a bonus, and I don’t believe it should be compulsory for people to participate in that.

Other people seem to object to the data being collected at all, regardless of what it’s used for later, and to my mind that’s potentially reasonable as well. But I’ll leave the debate about that for data security people, since it’s not really my area.

Don’t they already know all this stuff anyway?

My first reaction when I saw people starting to get upset about the census was, “Oh, wow, people have no idea how much of their personal data is already stored and can potentially be linked”. The government does indeed already have a great deal of data about us; they have our tax records, our Medicare and prescription medication data, our drivers licenses and traffic infraction histories, our criminal records, and on and on. But different government departments hold each of these data sets, and they’re not routinely allowed to link your data together without your consent.

Research involving data linkage has been going on in health for quite a long time, but unconsented linkage projects are very rare, and require a very compelling case for public benefit before they can proceed. Whether they should ever be approved at all is an important ethical question, and in my opinion, one which deserves a public debate. We’re starting to have that debate now. Is it strange that the census is what prompted it, given that our Medicare or Centrelink data might be much more sensitive? Perhaps, but that doesn’t mean the debate isn’t worth having.

People are entitled to know how their data are collected, stored, and used. They’re entitled to have opinions about that, and to have strong feelings about which data they’re happy to share and what they prefer to withhold. If the research community want the public to support data linkage research, then we have to convince people that our work is sufficiently valuable that they should donate their data to assist us. I don’t think we’re entitled to people’s medical records and employment histories just because we want them; I think it should be up to individuals whether to participate in the kind of research we conduct. And if people don’t want to participate, I think they should be able to opt out.


Thoughts on getting fitter / healthier / losing weight

I thought I would put all my general advice on this in one place, since it seems like something people are interested in sometimes. This is a hodgepodge of things I learned during my physiology degree, from watching other people lose weight and experiment with different types of exercise, from having a bunch of gym friends who are also big exercise nerds, and from personal experience.

I think my previous post on healthy eating was too long, so this will be more dot points than endless exposition. There’s no justification for anything but I’m happy to provide my rationale if you’re interested in it.

General advice on exercise:

The best exercise regime is the one you’ll do. If you can’t fit more than once a week into your schedule, or you need to work up to doing more than that really slowly, don’t worry about it. Do what you can and see how you go. If you need to build it into every day or make it a habit or you’ll never do it at all, do that, even if it’s just walking to and from work.

Find the kind of exercise that you will do – everybody is different. Try a few different things and see what works for you. Exercise comes in a wide variety of forms – solo or with a friend or in a group; self directed (e.g. gym), vaguely structured (e.g. club / team sports), or highly structured (CrossFit, aerobics); low (walking), medium (cycling), or high intensity (running); at home, in a gym, outside, or somewhere else.

I find personally that I am useless at exercising at home – I am full of good intentions but I just never do it. I need a specified time to go to another place and exercise, whether that’s to the gym, or to swing dancing, or by walking to and from work. But maybe you are good at self-motivating and home exercise will work for you! Try both and see which is easier.

I also find that running makes me feel like I am definitely going to die after less than two minutes, but that lifting weights and swing dancing are fun, even though I also feel like I will die a little bit sometimes. Maybe you will be better at cardio than me! Most people are.

I fucking hate people telling me what to do and CrossFit would make me stab someone, but I know heaps of people who need / love encouragement or instruction from other people. Try lifting weights on your own, try it at a club or with a personal trainer, and see if you like any of them. (Do get at least some instruction from someone knowledgeable before lifting heavy weights, though, if you’ve never done it before, or your spine will explode and that will be really sad. I personally think CrossFit is a great way to get hurt unless you’re already very fit.)

If you are trying to lose weight from square one, walking a lot is a really good place to start. You don’t need to kill yourself with any crazy shit, just walking twenty minutes or half an hour once or twice a day will do wonders if you’re starting from zero. Otherwise cycling, swimming, dancing or other cardio – most weight lifting isn’t likely to help you lose weight, unless you are doing a *lot* of it, and this can be pretty tough on the body if you’re not used to it.

General thoughts on food:

Try eating healthier rather than simply eating less – I think that adding more healthy food to your diet is much easier than giving up everything you love forever (unless you are a hardcore all or nothing type person, in which case, do that!). I find it a lot easier to eat a mix of the somewhat healthy and somewhat unhealthy than to live exclusively on salads. Personally I exercise so that I can drink beer and eat snickers bars, so, y’know – I am not advocating asceticism here.

Having said that… More vegetables. More. More of them. Still more. Mooooooore.

Seriously though. There is no meal that is not made healthier by the addition of veggies – they add fibre, slow digestion, contain vitamins, and don’t contain (any meaningful amount of) fat or sugar. They tick all the boxes! Find whatever vegetables that aren’t potatoes that you can stand, and just go as crazy as you can go, at least one meal a day. I’ve found the one decent veggie stir fry within walking distance at work and because I am incredibly boring, I’ve been eating it for lunch basically every day for three years. You too could be like me! Live dangerously!

Cut down on alcohol, especially beer. Sorry. It’s not that different, nutritionally, to soft drink. (Cut down on that too.) Try to drink fewer days of the week and / or less when you do drink. Alternate alcoholic drinks with water or soda water when you’re out, and / or switch to a boring drink like vodka-soda instead of beer or things mixed with coke.

Cut down on obvious junk food (duh), and try not to fill up on empty carbs (rice / pasta / bread). Replace carbs with veggies for bulk wherever possible, since you’ll be hungry if you’re used to eating very carby meals. If you must have some kind of carby staple, barley is a good substitute for rice (and you cook it the same way).

Learn to read the bare basics of the nutritional information – read the per 100g column, and see how much sugar what you’re eating contains per 100g. If it contains more than 10g sugar per 100g, it’s junk food. If you’re trying to choose between products, choose the one with fewer calories / kilojoules per 100g.

Don’t worry about fat too much, provided you’re not living on ice cream or just eating cream with a spoon out of a jar in the fridge. (Don’t do that.)

Stop eating when you’re full and take the rest to go, or leave it – don’t clear your plate out of habit when you’re given a huge serve at a pub or whatever.

The food stuff requires organisation – if you’re used to eating out a lot, you need to either eat differently (Asian food is good for veggie heavy stuff, think stir fries etc), or cook at home more. Learn either a couple of healthy things that you can cook quickly (e.g. stir fries), or stuff you can cook once or twice a week and refrigerate / freeze (e.g. minestrone, curries / stews / pies with lots of vegetables, roast veggie salads). I’m terrible at doing enough shopping to cook every day, so I go the once a week route – but like with exercise, work out what you’re likely to actually do. Maybe walking to the store after work to buy vegetables to make a stir fry would be two birds with one stone for some people.

I find tracking what you eat, even just for a week, can be revealing. I had no idea how much beer I was drinking until I did that, and it was kinda horrifying. I’m glad I know now, though.

Finally, if you are trying to lose weight, don’t get disheartened and don’t have crazy expectations. Most people don’t lose more than a couple of kilos a month, and even though it’s possible to get hectic and lose weight faster than that, most people can’t sustain that or keep the weight off. Plus flying full tilt into crazy exercise from zero is a great way to injure yourself.

Obviously doing all this stuff at once would be major life upheaval for some people and it’s not mandatory. Any of it will make you healthier regardless of whether or not you’re trying to lose weight. Eat less unhealthy stuff, move your body more, that’s about all there is to it really.
