Illustration and animation by Abigail Goh

Quest for autism biomarkers faces steep statistical challenges

by Jon Brock and Tim Brock  /  23 August 2016
The Experts:

Jon Brock

Research Fellow, Macquarie University

Tim Brock

Analyst, Data to Display

Every week, scientists publish new studies reporting differences between people with and without autism. Many of these studies involve some form of biological test — for example, a brain scan, a blood test or a measure of eye movements.

Coverage of these studies in academia and the popular media often suggests that the test in question could serve as a ‘biomarker’ for autism — an objective way of determining whether someone has the condition. The hope is that biomarkers could one day allow clinicians to identify people with autism earlier, more accurately and more efficiently than is currently possible.

These are noble objectives. But with any potential biomarker, finding a difference, on average, between people with and without autism is only the first small step toward clinical utility.

When it comes to diagnosis, clinicians don’t care about group averages. No test is perfect, but for a biomarker to be useful, clinicians need to be able to look at the test results of an individual and determine with some degree of confidence whether that person has autism.

Our aim in this article is to explain some of the key statistics that scientists use to determine just how useful and accurate a test is — in both the lab and the clinic.

Making the cut:

The first point to note is that most tests produce a range or distribution of scores across the population (see graphic below). Even if the distributions differ for people with and without autism, there is almost always some overlap between the two groups. The best we can do is set a cutoff and say that anyone scoring above the cutoff has tested positive.

Colliding curves: The distributions of scores for people with autism (purple) and those without (red) overlap on a hypothetical test.


Having set a cutoff, researchers can quantify the accuracy of the test in terms of its ‘sensitivity’ and its rate of false positives. Sensitivity is the proportion of people with autism whom the test correctly identifies as having autism. The false-positive rate is the proportion of people without autism whom the test incorrectly identifies as having the condition. (Sometimes researchers instead report the test’s ‘specificity’, the true-negative rate, which is simply 1 minus the false-positive rate.)

The sensitivity and false-positive rate both depend on the chosen cutoff. If we lower the cutoff score for autism, more people with the condition test positive and the sensitivity increases (see graphic below). But this also means capturing more people who don’t have autism, thereby increasing the rate of false positives.
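Under the equal-variance normal model described in the technical note at the end of this article, this trade-off can be sketched in a few lines of Python. The effect size of 1.5 and the cutoff values below are illustrative choices, not parameters from any real biomarker:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def sensitivity(cutoff, effect_size):
    # Proportion of the autism group (mean = effect_size, SD = 1)
    # scoring above the cutoff.
    return 1.0 - norm_cdf(cutoff - effect_size)

def false_positive_rate(cutoff):
    # Proportion of the control group (mean = 0, SD = 1)
    # scoring above the cutoff.
    return 1.0 - norm_cdf(cutoff)

# Lowering the cutoff raises sensitivity and the false-positive rate together.
for cutoff in (2.0, 1.5, 1.0, 0.5):
    print(cutoff,
          round(sensitivity(cutoff, 1.5), 2),
          round(false_positive_rate(cutoff), 2))
```

With an effect size of 1.5, dropping the cutoff from 2.0 to 0.5 lifts sensitivity from roughly 0.31 to 0.84, but the false-positive rate climbs from about 0.02 to 0.31.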

Research papers typically represent this trade-off by plotting the test’s ‘receiver operating characteristic’ (ROC) curve, which shows the sensitivity of the test plotted against its false-positive rate at each possible cutoff score.

Test trade-offs: The curve (green) shows how a biomarker’s sensitivity and false-positive rate vary with different cutoffs.


ROC curves tell us how well a test discriminates between people with and without autism. The less overlap there is between the two distributions, the better the test discriminates between the two groups, and the more the ROC arches above the diagonal (see graphic below).

Archery indicator: The more drawn back the ‘bow’ of an ROC curve, the better the test discriminates between people with and without autism.

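Under the same equal-variance normal model as in the technical note, an ROC curve is just the set of (false-positive rate, sensitivity) pairs obtained by sweeping the cutoff, and the area under the curve has a simple closed form. The effect sizes below are illustrative:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def roc_points(effect_size, n=201):
    # Sweep cutoffs from -4 to 4; each cutoff yields one
    # (false-positive rate, sensitivity) point on the ROC curve.
    cutoffs = [-4.0 + 8.0 * i / (n - 1) for i in range(n)]
    return [(1.0 - norm_cdf(c), 1.0 - norm_cdf(c - effect_size))
            for c in cutoffs]

def auc(effect_size):
    # For two equal-variance normals separated by d standard deviations,
    # the area under the ROC curve is Phi(d / sqrt(2)).
    return norm_cdf(effect_size / sqrt(2.0))

print(round(auc(0.0), 2))   # complete overlap: 0.5, the diagonal
print(round(auc(1.5), 2))   # partial overlap: about 0.86
```

An area of 0.5 corresponds to the diagonal (a test no better than chance); the less the two distributions overlap, the closer the area gets to 1 and the more the curve arches above the diagonal.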

True diagnosis:

Many scientific reports stop at this point. It’s tempting to assume that a strongly arched ROC means that the test could serve as a helpful diagnostic.

The trouble, however, is that the sensitivity and false-positive rate really only make sense in a context where we know from the outset who has autism and who doesn’t. Sensitivity, for example, tells us how well the test identifies people we already know to have autism.

In real-life scenarios, we don’t usually know the person’s true diagnosis beforehand — that’s the whole reason for administering the test.

Consider, for example, a parent whose child has just tested positive for autism. What they really need to know is the likelihood that their child actually has autism. We refer to this as the test’s ‘positive predictive value’ — the likelihood that a positive test result is accurate (see bar graph).

Positive prediction: In this sample (upper bar), half of the people have autism (purple). Of those who have tested positive for autism (dark sections), 81 percent actually have autism (bottom bar). This is the test’s positive predictive value.

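The positive predictive value is simply Bayes’ rule applied to the two error rates and the sample composition. A minimal sketch; the sensitivity of 0.69 and false-positive rate of 0.16 are illustrative values that happen to reproduce the 81 percent figure at a 50 percent base rate:

```python
def positive_predictive_value(sensitivity, false_positive_rate, base_rate):
    """P(has autism | tested positive), by Bayes' rule."""
    true_positives = sensitivity * base_rate
    false_positives = false_positive_rate * (1.0 - base_rate)
    return true_positives / (true_positives + false_positives)

# Illustrative test: sensitivity 0.69, false-positive rate 0.16,
# given to a sample in which half the people have autism.
print(round(positive_predictive_value(0.69, 0.16, 0.5), 2))  # 0.81
```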

Base rates:

There’s one final complication. Unlike the sensitivity or false-positive rate, the positive predictive value depends on the proportion of people being tested who truly have autism. We refer to this as the sample’s autism ‘base rate.’

In a typical study, the base rate is around 50 percent: People with autism make up half the sample. But in many contexts outside the lab, the base rate is much lower: Most people taking the test won’t have autism.

Say that our hypothetical biomarker is being used to screen all children in a particular age range for autism. According to the latest estimate from the U.S. Centers for Disease Control and Prevention, the prevalence of autism in the United States among school-age children is about 1 in 68. Clinicians could therefore expect to test roughly 67 children who don’t have autism for every one child who does.

Changing the base rate from 1 in 2 (50 percent) to 1 in 68 (about 1.5 percent) has a dramatic effect on the positive predictive value. In our fictitious example (see graphic below), it falls from 81 percent to a much less helpful 6 percent. In other words, of every 100 children who test positive, only 6 would actually have autism; the other 94 would be false positives.

Rising uncertainty: When the autism base rate drops from 1 in 2 to 1 in 68 (represented by the relative sizes of the distributions), the positive predictive value for our hypothetical test plummets from 81 percent to 6 percent (lower bar).

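A short calculation makes the collapse concrete. The effect size of 1.5 and cutoff of 1.0 below are assumptions on our part; they are simply values consistent with the 81 percent and 6 percent figures in the example:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Assumed hypothetical test: effect size 1.5, cutoff 1.0.
sens = 1.0 - norm_cdf(1.0 - 1.5)   # sensitivity, about 0.69
fpr = 1.0 - norm_cdf(1.0)          # false-positive rate, about 0.16

def ppv(base_rate):
    # Positive predictive value at a given autism base rate.
    true_pos = sens * base_rate
    false_pos = fpr * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)

print(round(ppv(1 / 2), 2))    # typical study sample: 0.81
print(round(ppv(1 / 68), 2))   # population screening: 0.06
```

The test itself has not changed at all between the two lines; only the mix of people taking it has.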

In the context of population screening, a positive result would only indicate an increased risk of having autism. Clinicians would need to follow up with further tests to confirm a diagnosis.

Until such assessments are completed, ‘autism risk’ is nothing more than the test’s positive predictive value. The base rate issue means that most children flagged as ‘at risk’ will not actually have autism.

So although early identification has potential benefits, an ‘at risk’ label may also cause unnecessary stress to large numbers of families. In some cases, children may receive intervention for autism that isn’t present, diverting resources away from children who genuinely have the condition.

Difficult distinctions:

An alternative to population screening is to target groups in which the autism base rate is already higher than that of the general population — for example, children who have an older sibling with autism, or those whose parents or doctors are concerned about their development. This approach makes a lot of sense. But we can’t simply adjust the base rate in our calculations and assume that the test retains the same sensitivity and false-positive rate.

Consider, for example, a hypothetical test that identifies genetic variations associated with autism. These variations may also be relatively common in the siblings of individuals with autism. So even if the genetic test discriminates well between people with autism and unrelated individuals, it may do a much poorer job of differentiating between affected and unaffected members of the same family.

Researchers would need to conduct a new study to determine how well the test performs in this high-risk population.

At present, autism diagnosis is a difficult, time-consuming and resource-intensive affair. Early signs are often missed and, even when recognized, children can wait years for a formal diagnosis. Adults may struggle their entire lives without anyone recognizing the difficulties they face.

It is important then for researchers to continue developing better ways of identifying and diagnosing people with autism. But when considering potential autism biomarkers, it’s also important to be aware of the challenges inherent in translating an exciting research finding into a test that is clinically useful.

Moving the mark:

[Interactive graphic: sliders adjust the hypothetical test’s effect size (a measure of how well it discriminates between groups), the prevalence of autism in the population (base rate) and the cutoff for a positive test. Defaults: effect size 1.50, cutoff 2.00.]

*Technical note: For illustrative purposes, our figures assume that test scores are normally distributed with equal variance in the two groups (people with and without autism). Test score is expressed as the number of standard deviations from the average of the control group. The effect size is then simply the difference between the two group averages.

12 responses to “Quest for autism biomarkers faces steep statistical challenges”

  1. Planet Autism says:

    Isn’t the whole point that any such test for autism biomarkers would only be PART of an autism assessment, thereby making this article null and void? There would be clinical and perhaps even genetic testing in addition.

    • Jon Brock says:

      Hi PA. Thanks for the question.

      Actually, I think sometimes the idea really is that biomarkers will be used in isolation. For example, any time someone talks about using biomarkers (e.g., genetic tests, placenta examinations or eye-tracking measures, to name a few recent examples) to identify infants early enough to prevent autism symptoms from developing, they’re implying that intervention would proceed before traditional follow-up assessments were completed (because these rely on behavioural symptoms that haven’t yet emerged).

      Even if a positive biomarker result is followed up with a thorough clinical assessment, it’s still important to understand the biomarker’s statistical properties. If the positive predictive value is low then large numbers of individuals will be sent for what turn out to be unnecessary follow-up assessments, potentially at great financial and emotional cost to families.

      However the biomarker is to be employed, the questions we always need to be asking are: what information is the biomarker providing about the individual? Is that new information? In other words, does it add to the information we already have about the individual that led to them being tested? And how will that information be acted upon by families and professionals?

      Answering all those questions relies on an understanding of sensitivity, false positive rates, positive predictive values, and base rates.

      • Elisabeth Whyte says:

        It’s also important to talk about the potential harm of false positives in screening tests. Let’s take mammogram screening for cancer as an example (since the follow-up testing is fairly well mapped out, and the screening is already commonly done in the population). Given the low rate of detecting a true cancer, you want that first-step screening to be really, really accurate. Imagine being told you probably have cancer, only for another test two or more weeks later to tell you that you don’t in fact have cancer. False positives on mammograms also sometimes lead people to have biopsy procedures, surgery or chemotherapy they didn’t actually need. The real harm of false positives on mammogram screening tests has recently led some people to suggest they not be done in groups at lower risk for cancer. As a real estimate, if 10,000 women are screened yearly with mammograms, 9 women will have their lives saved, but a thousand of them will have false positives in which they are told they probably have cancer when they don’t. Those thousand false positives are a huge deal, with a real human cost, and are something the cancer organizations take very seriously.

        The same principle holds for screening tests for autism, ADHD, depression, or any other potential diagnosis with a low base rate. Keep in mind that all diagnostic tests have the same problem as the screening tests described in the first post.

        Even if you can’t make a diagnosis based on the screening test alone, you cause a host of other problems that pose a true risk worth worrying about. This is especially so given that you may have to wait weeks, months or years with a ‘positive’ autism screener before you know for sure whether you (or your child) actually have autism. Prenatal genetic tests might also lead to unnecessary abortions if you have a false positive in your genetic screening. For other tests, a few hundred or a few thousand people may go through needless follow-up testing or interventions. All of this imposes huge emotional and financial burdens on families as the result of false-positive screening tests. There is a real human cost for getting it wrong, so it is our ethical responsibility to get it as right as possible.

        • Jon Brock says:

          Thanks Liz. And of course, this issue doesn’t just apply to biomarkers. There’s currently a big push for universal screening for autism (based on behavioural checklists rather than biomarkers) but it’s not clear to me that advocates have really thought through the implications in terms of the large numbers of false positives that will inevitably occur.

          • swimfree says:

            Agreed; in my view, the only settings where a single biomarker could work as the sole diagnostic tool are infectious diseases and, of course, Mendelian genetics. For most other diseases, several different instruments must be combined to reach a complete and accurate diagnosis, and that is how it will be with biological markers for ASD as well. Given its great heterogeneity, ASD has a long way to go before reaching a reasonable biomarker world; before even venturing there, it has to resolve its phenotypic-variation issue. To me, approaching from a shared-symptoms perspective (like NIMH’s RDoC) seems a more plausible and faster way to find some markers.

  2. Bernard Carroll says:

    Nice presentation. Try thinking in Bayesian terms: prior probability matters. Also, diagnosing a case is not the same as defining a disorder.

  3. Elizabeth B Torres says:

    Great article! I’d like to start with its very last sentence (i.e., the technical note). Although most current methods in autism research, and in other research concerning mental health, assume normality in the data, we have discovered in my lab at Rutgers University that the biophysical rhythms generated by the nervous system give rise to non-Gaussian distributions in general. All analyses assuming normality, linearity and stationarity of behavioral data generated by the nervous system need to be re-examined under a radically different statistical framework. Check out our work: we started with autism, but as it turns out, non-normality, non-stationarity, non-linearity and stochasticity are general properties of the biophysical rhythms generated by the primate nervous system. Time to go back to the blackboard and start afresh!

    • swimfree says:

      Hence, using generalized linear models (e.g., logistic regression with a binomial distribution, or Poisson regression) is one way of avoiding normal-distribution assumptions. I am new to the neurodevelopmental world, but I have noticed that ANOVA is a favorite statistical method for comparisons.

      • Elizabeth B Torres says:

        Indeed, one problem is that students are not trained to think of empirical estimation vs. theoretical assumptions. This transfers to researchers as well and the problem is perpetuated in research that makes a “one size fits all” assumption without trying to personalize the approach to account for the person’s neurodevelopmental changes. We recently tracked 36 neonates for 6 months and assessed their neuromotor control development in relation to their physical growth. Just in those 6 months alone the changes in the probability distributions characterizing their motions and how they evolved over time would not fit with the parametric (e.g. ANOVA) assumptions the field makes in general. Then a systematic cross-sectional study from 3 to 60 years old involving somatic-motor based data revealed similar trends. Current statistical approaches to developmental (physiological) data will have to change if we want to make progress. But even ordinal (discrete) data from subjective inventories (so commonly used in neurodevelopmental research to correlate with other continuous physiological data) does not distribute normally when one factors in chronological age and uses incremental (derivative) values rather than absolute scores. I hope people pay attention to these issues.
