Sunday, August 23, 2009

How not to compare data

People throw around a lot of figures and charts when talking about autism in an attempt to prove their point. Some of the time the information is valid and other times the comparison is as valid as comparing apples and ducks.

Take for example a recent post written by Sullivan at Left Brain Right Brain -

The basic point of his post was to compare this chart from Thoughtful House (TH)

to one that was his own creation that is similar to the one below and based on data from the National Survey of Children's Health (I didn't want to rip off his chart so I made one of my own, feel free to compare it to the one on Sullivan's original post).

He then talked about how the charts don't look anything alike, that the magnitudes of the numbers are completely different, and, furthermore, that the NSCH data is "basically flat" and shows no evidence of an epidemic of autism.

Well, I have to agree with him, the charts don't look anything alike. But there is a good reason for that - the data on the charts aren't directly comparable.

Let's start with some definitions.

The data on the TH chart is labeled a "crude" incidence. It is basically the yearly change in the number of children with a label of autism served under the Individuals with Disabilities Education Act (IDEA). This change is then assumed to represent the number of new cases, and all of those cases are assumed to fall in the youngest age group. There are a number of problems with this approach, but like the title says, it is very crude. There are some comments on the LBRB post here and here that describe the data in more detail.
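To make that concrete, here is a minimal sketch of the arithmetic behind a "crude" incidence. The IDEA counts below are made-up placeholders, not real figures:

```python
# Hypothetical cumulative IDEA autism counts by reporting year
# (made-up numbers, not real IDEA figures).
idea_counts = {2003: 120000, 2004: 140000, 2005: 163000, 2006: 190000}

# The "crude" incidence is the year-over-year change in the count; the TH
# chart then treats that change as new cases, all assigned to the
# youngest age group.
years = sorted(idea_counts)
crude_incidence = {y: idea_counts[y] - idea_counts[y - 1] for y in years[1:]}
# e.g. crude_incidence[2004] == 20000
```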

The data from the NSCH is the number of children found during the 2007 survey who currently have autism or another ASD. This number is broken out by the age of the child and then normalized to a rate per 10,000 children in each age group. This is basically a prevalence figure.
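That normalization step is simple enough to sketch. The tallies below are invented for illustration, not actual NSCH values:

```python
# Hypothetical tallies for one age group in the survey (made-up numbers).
children_with_asd = 104        # count of children reported to have an ASD
children_in_age_group = 10400  # count of all surveyed children that age

# Normalize to a rate per 10,000 children in the age group.
prevalence_per_10k = children_with_asd / children_in_age_group * 10000
# roughly 100 per 10,000, i.e. about 1 in 100
```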

The CDC tells us the difference between "incidence" and "prevalence" is
Incidence is the number of new cases of disease in a defined group of people over a specific time. Prevalence is the number of existing disease cases in a defined group of people during a specific time period. Public health professionals use prevalence measures to track a condition over time and to plan responses at local, regional, and national levels. Incidence is very difficult to establish because the exact time a person develops an ASD is not known.
And what is an epidemic?
In epidemiology, an epidemic occurs when new cases of a certain disease, in a given human population, and during a given period, substantially exceed what is "expected," based on recent experience.
So, what are the problems with the comparisons that Sullivan did?

First and foremost, the data being compared don't even represent the same thing. The TH figures are meant to represent incidence while the NSCH figures are prevalence. You can't directly compare them because they don't measure the same thing.

The second problem is that the groups being represented aren't the same. The TH data covers children being served under IDEA and is based on how a child is classified by a public school system. A particular child with autism may be classified as having autism, as having some other problem, as not having a problem, or may not even be included in the data if they don't attend public school. The NSCH data is a survey meant to be a representative sample of the entire population of children. So the two charts don't even represent the same underlying population.

The third problem is that the time periods being compared aren't at all equivalent. The numbers on the TH graph were derived from historical IDEA prevalence figures published over a number of years. The NSCH numbers are a snapshot of prevalence in 2007.

The last problem is that an epidemic isn't defined by numbers on a chart, it is defined by something being substantially more common than was expected. The NSCH data from 2007 indicates that the current prevalence is about 1 in 100 (although it does vary by age and substantially by state). Prior to this data being released the accepted figure from the CDC was 1 in 150, and that estimate was from 2002.
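For concreteness, the size of that change can be computed directly from the two prevalence figures:

```python
# CDC's 2002 estimate vs. the 2007 NSCH figure.
old_prevalence = 1 / 150
new_prevalence = 1 / 100

# Relative change in prevalence between the two estimates.
relative_increase = (new_prevalence - old_prevalence) / old_prevalence
# works out to 0.5, i.e. prevalence is 50% higher than the earlier estimate
```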

If the data from the NSCH is accurate and autism is 50% more common now than 5 years ago (1 in 100 versus 1 in 150), I think that would count as a substantial increase. The CDC is expected to publish similar findings in the near future and, if they confirm that number, they are going to have to either declare an epidemic or provide some other explanation for the increase. Or they might just weasel out of it again, who knows. (My money is on the weasel.)

But the point is that you don't look at a chart of prevalence figures and declare there is no epidemic without comparing it to what the expected prevalence is. The CDC is "the" authority in the US for this sort of thing, so it is their estimate that counts.

So, how could the comparison have been done better?

Well, it would have helped to be able to compare like to like, but given the data available from IDEA this isn't possible. But, I am going to try to do the next best thing. I am going to use the data from the 2007 and 2003 NSCH surveys and attempt to back into a "crude" incidence.

I will freely admit that this isn't the most scientific approach and does suffer from some of the same flaws that I mentioned above. However, I think the results are closer to how the chart from TH is being generated and will demonstrate what I am saying about comparing apples to apples (well, maybe apples to oranges in this case, but certainly not apples to ducks).

To produce the following chart I took the 2007 and 2003 NSCH data sets and broke out the rough prevalence grouped by the child's age. I then shifted the 2003 data so that the ages of the children in the two data sets would line up properly (i.e., a child who was 5 in 2003 would be 9 in 2007) and be directly comparable. The result is the following.
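The shifting step can be sketched as follows. The per-age prevalences here are made-up numbers, just to show the alignment:

```python
# Hypothetical prevalence per 10,000 by age from each survey (made-up numbers).
prev_2007 = {2: 40, 3: 55, 4: 70, 5: 80, 6: 85, 7: 90, 8: 95, 9: 100}
prev_2003 = {2: 30, 3: 42, 4: 50, 5: 60}

# Shift the 2003 ages forward four years so the same birth cohorts line up:
# a child who was 5 in 2003 is 9 in 2007.
prev_2003_shifted = {age + 4: rate for age, rate in prev_2003.items()}
# prev_2003_shifted == {6: 30, 7: 42, 8: 50, 9: 60}
```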

There are a few things of note here.

First, while there has been a push in recent years to diagnose autism at a younger age, the figures for the youngest ages are very likely understated. So the first two ages (2 and 3) for 2007 are likely low, as is the 2003 data shown at ages 6 and 7.

Second, there is something off at age 12 in the 2007 data - the number is substantially higher than its surrounding points. If you look at the original 2003 data there was also something off at the unadjusted age 12 - it was too low compared to the surrounding ages. I don't know what is going on there, but it is strange.

If you look at the chart above there is a very clear upward trend as you go from older to younger children. This says to me that more younger children are being diagnosed than older ones and that the number is growing. Unfortunately, this chart cannot tell you why that is happening.

As a final comparison I took the difference between the two prevalence curves on the above chart to show a (very) crude incidence between 2003 and 2007. You should disregard the points at ages 6 and 7, but the overall increasing trend is clear.
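The differencing itself is just a subtraction between the cohort-aligned series. Again, the numbers below are invented for illustration:

```python
# Hypothetical cohort-aligned prevalence per 10,000 (made-up numbers):
# the 2003 values are already shifted four years, so each age pairs
# the same birth cohort across the two surveys.
prev_2007 = {6: 85, 7: 90, 8: 95, 9: 100}
prev_2003_shifted = {6: 30, 7: 42, 8: 50, 9: 60}

# (Very) crude incidence: how much a cohort's prevalence rose between surveys.
crude_incidence = {age: prev_2007[age] - prev_2003_shifted[age]
                   for age in prev_2007}
```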

Doesn't this look more like the chart from Thoughtful House? It is amazing what happens when you try to compare data properly.


  1. I don't think any of the folks at LBRB are interested in doing an objective analysis of the data. They believe as an article of faith that autism is entirely genetic, with no environmental factors and therefore no real increase.

  2. Well, they do like to talk about being evidence based, so maybe they will pay attention when they make mistakes of their own. I'm not going to hold my breath waiting for them to admit it but stranger things have happened.