Wednesday, July 31, 2013

Data Suck (A Ranting Tour Through the Life of a Statistician)

You're probably complaining about empty promises.  I told you a few weeks ago that I was going to deliver commentary on the abortion debate and illustrate the reasons why I vote pro-choice.  So far, I've failed to make good on that promise.  I assure you, it isn't for lack of trying.

The next phase I've been trying to present is the impact that a blanket ban on abortion would have on US population growth.  I'm getting close, but I keep running into problems compiling the data.  And so, today, I'm going to blow off some steam and you get to see why it is so difficult to make informed decisions about controversial topics like abortion.  It's really quite simple.  Data suck [1].

Alright, maybe that isn't completely accurate.  When you have good data that were gathered in a systematic way with a specific goal in mind, they are usually wonderful to work with.  However, with a topic like abortion, no one is gathering data in a systematic manner.  The data to answer a question with any kind of complexity have to be gathered from multiple sources.

My current task is to forecast US population growth under two separate conditions: one in which abortion remains legal, and one in which abortion is banned.  The idea is to compare the populations between the two scenarios.  It is also a necessary step toward broader discussions about economic, welfare, and social policies.  So it's an important base of information for my hypotheses about what happens if we make abortion illegal.

Now, consider what we need to know to forecast population growth.  First, we need to know the current population.  Next, we need to know the annual birth rate.  Third, we need to know the annual death rate.  That part is easy enough.  But that tells us little beyond the total population.  What would be more informative, especially in light of the discussion to follow, would be an understanding of population growth by race.
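As a rough sketch of that bookkeeping (in Python, with made-up placeholder numbers rather than real US figures), next year's population is this year's population plus births minus deaths.  The function name `project_total` and the assumption of constant rates per person are mine, purely for illustration.

```python
def project_total(population, birth_rate, death_rate, years):
    """Project a total population forward with constant crude rates.

    birth_rate and death_rate are annual events per person (not per 1,000).
    Returns the starting population followed by one value per projected year.
    """
    series = [float(population)]
    for _ in range(years):
        current = series[-1]
        births = current * birth_rate
        deaths = current * death_rate
        series.append(current + births - deaths)
    return series


# Placeholder inputs for illustration only, not real US figures.
trajectory = project_total(population=1000.0, birth_rate=0.02, death_rate=0.01, years=3)
print(trajectory)
```

Running the same function once per racial group, each with its own starting population and rates, would give the by-race version described above.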

So now we need the populations, birth rates, and death rates by race.  A lot of that can be found through the CDC.  And for forecasting population growth under legal abortion (the current state of things), that would probably be enough.  What gets tricky is forecasting population growth if abortion were illegal.  Theoretically, this would have an effect on the birth rate, so we need to know how many abortions are performed within each race.  With this information, we can calculate the new birth rate for each race.  Unfortunately, we can't apply that birth rate to forecasted population numbers.

You see, we can increase the birth rate and calculate next year's population quite easily.  However, all those additional babies are less than 1 year old for the next year (thank you, Captain Obvious).  This means that the new birth rate doesn't apply to them because their birth rate is known to be 0.  In other words, the age of the population is important in forecasting population growth when we change anything about the current conditions.
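Here is a minimal sketch of why age matters, again with made-up numbers.  It makes several simplifying assumptions that are not from the real data: single-year ages instead of 5-year groups, an open-ended top age group, and birth rates expressed per person of a given age rather than per woman.  The function name `step_one_year` is hypothetical.

```python
def step_one_year(pop_by_age, birth_rate_by_age, death_rate_by_age):
    """Advance a population indexed by single year of age by one year.

    pop_by_age[a] is the number of people aged a; the last entry is an
    open-ended top group. Rates are annual events per person of that age.
    """
    n = len(pop_by_age)
    # Only ages with a nonzero birth rate contribute births.
    births = sum(p * b for p, b in zip(pop_by_age, birth_rate_by_age))
    survivors = [p * (1.0 - d) for p, d in zip(pop_by_age, death_rate_by_age)]

    next_pop = [0.0] * n
    next_pop[0] = births                 # newborns enter at age 0
    for age in range(1, n):
        next_pop[age] = survivors[age - 1]   # everyone else ages one year
    next_pop[-1] += survivors[-1]        # top group keeps its own survivors
    return next_pop


# Toy population: ages 0, 1, 2 and an open-ended "3 and over" group.
pop = [100.0, 100.0, 100.0, 100.0]
no_deaths = [0.0, 0.0, 0.0, 0.0]
fertile_at_2 = [0.0, 0.0, 0.1, 0.0]      # placeholder rates, births only at age 2

year1 = step_one_year(pop, fertile_at_2, no_deaths)
year2 = step_one_year(year1, fertile_at_2, no_deaths)
print(year1)
print(year2)
```

In this toy setup the babies born in the first year sit at age 0 and then age 1, so they add nothing to births until they reach the age with a nonzero rate.  That is exactly why a changed birth rate can't be applied to a single total.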

So now we're working in three dimensions.  US Population by race and age.  Still sound easy?  Well let me complicate it further.

I can find population data from 2011 by age and race.  The racial categories available to me are:

  • African American
  • American Indian or Alaskan Native
  • Asian or Pacific Islander
  • Caucasian
  • Hispanic
The age categories are divided into 5-year segments starting with 0-4, 5-9, 10-14, etc., and ending with > 100.

Birth rate data are available in the same racial categories from 2009 and in similar age categories starting at age 10 and ending at age 50.  But I had to pick them out carefully from a document over 100 pages long.  On top of that, the birth rates were calculated from the population of women (not the total population) and so the numbers won't translate exactly to the 2011 data.

And we can't get abortion data by age and race at all.  The best we can do is find the proportion of abortions by race and the proportion of abortions by age.  But the categories are different.  For race, I can find:
  • African American
  • Caucasian
  • Hispanic
  • Other
And for age I can find:
  • < 20
  • 20-24
  • 25 and older

To get abortions by age and race, I had to assume that age and race are independent with respect to abortions (probably not true).  And then I had to make assumptions about how many of those abortions occurred in 10-14 year olds, 15-19 year olds, 25-29 year olds, etc. [2]
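To make those two assumptions concrete, here is a small sketch.  Under independence, the share of abortions in each race-by-age cell is just the product of the two marginal shares.  A broad age cell can then be split into finer age groups in proportion to births, which is the rule described in footnote 2.  All of the numbers below are placeholders, not real data, and the function names are hypothetical.

```python
def joint_shares(race_share, age_share):
    """Share of the total in each (race, age) cell, assuming independence:
    the joint share is the product of the two marginal shares."""
    return {
        (race, age): r * a
        for race, r in race_share.items()
        for age, a in age_share.items()
    }


def split_by_births(total, births_by_group):
    """Split a total across finer age groups in proportion to births."""
    all_births = sum(births_by_group.values())
    return {group: total * b / all_births for group, b in births_by_group.items()}


# Placeholder marginal shares, NOT real data; each set sums to 1.
race_share = {"African American": 0.25, "Caucasian": 0.25, "Hispanic": 0.25, "Other": 0.25}
age_share = {"under 20": 0.2, "20-24": 0.3, "25 and over": 0.5}

cells = joint_shares(race_share, age_share)

# Split one broad cell into the finer age groups used by the birth data.
under_20 = split_by_births(
    total=cells[("Hispanic", "under 20")],
    births_by_group={"10-14": 1.0, "15-19": 9.0},  # placeholder birth counts
)
print(under_20)
```

The product rule guarantees that the cells still add up to the original marginals, but it cannot recover any real association between age and race, which is where the bias mentioned above would creep in.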

So far, I've succeeded in building the US population, the current US birth rates, and I'm very close to having projected birth rates under the condition of illegal abortion.  I even have the forecasting routine written and it produces a lovely graph that I'm really very pleased with.  And at this point, after many hours of searching for data, entering tables, writing code, and scratching notes on paper, I've just come to a crucial realization:

I haven't put together the death data yet!

Head, meet desk.

Okay, if you've made it through all of my rant so far, here are the take-home messages:

  1. When I finish this analysis, it will be flawed.  I will do my best to admit those flaws and explain my assumptions.  At the same time, while it won't be perfect, it will be a decent approximation.
  2. There's a reason that it's so hard to make informed decisions on controversial issues.  The data are hard to compile.  It's rare to find a data set on a controversial subject that allows you to see all the nuance and character of what you are trying to measure.
  3. My head hurts.
Happy Wednesday, everyone!

[1] If that sounds weird, keep in mind that data is a plural noun. The singular form is datum.
[2] I chose to assign the abortions proportionally to the number of births in each age group. It seems like a relatively safe assumption, but it probably introduces a little bit of bias.


  1. At the risk of sounding stupid, it seems to me you only need this forecast to explain why you're pro-choice if the only reason you're pro-choice has to do with abortion's effect on population growth. If that's the only reason, and you haven't done this analysis before, then your pro-choiceness has a poor basis. Do you have other reasons for being pro-choice? You could go ahead and discuss those without the forecast you're trying to make, and we could discuss those.

    1. Not stupid, George. Lack of clarity of my future intent was the risk I took when I decided to rant.

      Population growth isn't why I'm pro-choice. I'm pro-choice because I think making abortion illegal would be bad social policy that would have impacts on poverty rates, crime rates, welfare expenditures, and unemployment. My hypotheses on these impacts are theoretical and untested, and my current goal is to test those hypotheses. But before I can make forecasts and extrapolations on increases in poverty and crime, I need to develop some basis of the future population.

      So, that's where I'm going with this.

      And if anyone is interested, if we were to make abortion illegal today and assume that nobody dies ever again, the US population would break 1 billion in about 100 years.