Friday, May 25, 2007

Washington Scandals and Baby Names

Updated 5/26/07: Added Technical Appendix

The appearance this week of Monica Goodling before the House Judiciary Committee sparked a conversation in the Political Arithmetik household about a previous Monica-related Washington scandal. It perhaps says something about our household that this provoked a search for empirical evidence concerning the effect of the Clinton-Lewinsky scandal on the popularity of Monica as a name. Was it urban legend that the scandal had an effect? Was the effect large or small? Was it immediate? Let's run the numbers.

Monica was a reasonably popular name in the early 1970s, ranking between 39th and 56th in the decade of the 1970s. As it happens both Monica Lewinsky and Monica Goodling were born in the summer of 1973, two weeks apart, when the name was ranked 40th, its second highest ranking. (Monica ranked between 59 and 141 in the decade of the 1960s.) [My thanks to my colleagues at the coffee shop for suggesting I check tennis player Monica Seles, who turns out to also be a 1973 baby. Granted, she isn't connected to a DC scandal despite being born in 1973, and being born in the former Yugoslavia makes the relevance to our current investigation a tad suspect.]

If we were going to pick a name to go with a DC scandal from babies born in 1973, better bets would have been Jennifer, Amy, Michelle, Kimberly, Lisa, Melissa, Angela, Heather, Stephanie or Rebecca, the top 10 girls' names that year. But Monica at 40th wasn't rare by any means.

The 1970s were the peak years for Monicas. By the 1990s the name had slowly but steadily declined to rank between 76th and 88th during 1990-1997.

And then the events of 1998 intervened. The Clinton-Lewinsky scandal broke on January 21, 1998, reached its fevered peak by the end of 1998 with the impeachment of President Clinton and was resolved by the Senate's failure to convict on February 12, 1999. Of course that didn't prevent late night comics from continuing to milk the material for months, years, perhaps forever after.

The impact on parents was immediate, but not as drastic as I had expected. There were 11 months of 1998 in which the scandal's impact could be felt. And the ranking of Monica dropped from 79 in 1997 to 105 in 1998, a substantial but not precipitous drop. Of course events were unfolding during this year, so perhaps it is reasonable to focus on 1999, by which time surely every expectant parent in America would be aware of the Clinton scandal.

And in 1999 the ranking of Monica did fall dramatically, to 151, just a bit below where it stood in 1960.

So indeed, the impact of the scandal produced an immediate and substantial response, as one would surely expect. No urban legend this.

But what I find fascinating is the continued decline since 1999. I would expect the impact to be greatest in the immediate aftermath of the infamous episode and to level off or perhaps even abate thereafter. Instead, the data suggest a much slower response and a much longer diffusion of unpopularity through the population. Having dropped 72 places between 1997 and 1999, the popularity of Monica dropped ANOTHER 99 places from 1999 through 2006, the last year for which we have data, to now stand at the 250th name on the popularity list.

One interesting speculation is to consider the effect of the Clinton-Lewinsky scandal on the parents who are just now having baby girls. Many of them would have been in their teens or early 20s during the height of the scandal, compared to parents of 1999 or 2000 who would have been on average 7 or 8 years older. I wonder if the impact of the scandal was larger on teenage and college age parents to be. These are ages not noted for consumption of political news, but they are ages extremely well known for crude sexual humor, for which Kenneth Starr provided an abundant supply of raw material. So I wonder if this cohort that is now giving birth was somewhat more affected by the scandal than were even slightly older cohorts who were past the age of campus humor as well as early sexual development. That could explain the continued and steady decline in use of Monica as a girl's name. It would also predict a leveling off once cohorts start to dominate births who were too young to understand the Clinton-Lewinsky scandal at the time.

The alternative is a slow diffusion of unpopularity throughout the culture, which is having an increasing effect regardless of personal experience with the scandal. If so, there is little reason to expect a leveling off of ranking. But there is also a puzzle about why the cultural diffusion is as slow as it has been.

It seems unlikely that Monica Goodling's testimony will significantly reduce the already declining popularity of the name. But given the current standing of "Monica", it is much less likely that a DC scandal in 2035 or so will feature a Monica in the starring role.

Prospective parents may want to visit the source of these data, the Social Security Administration's Popular Baby Names site here.

A superb academic study of the sociology of naming babies is A Matter of Taste: How Names, Fashion and Culture Change, by Stanley Lieberson.

Technical Appendix (added 5/26/07)

Warning: This is the really geeky part. Unless you think log2(x) is really cool, you might want to turn back now!

"Professor M" posted a comment on the cross post of this article at Rather than "geek up" Pollster, I'm replying here. (This was supposed to be a "just for fun" post, after all.)

His/her comment is:
"Hmmm. Try graphing the percent of babies given the name Monica in each year instead of the popularity rank. I think your discussion might change."
The good Professor M makes an excellent point. Let's think why. The actual rate of name use is quite small, even for the most popular names. For example, in 2006 the most popular name for girls was Emily. That name was used for 1.0267% of girls born. The number 2 name was Emma, 0.9159%. This difference in percentages is actually rather large. When we get down to ranks 101 and 102 we find Mya at 0.1602% and Amanda at 0.1599%. When we get down to Monica at 250, the rate is 0.0650% and for Carly at 251 the rate is 0.0649%.

So the rates of name use get closer together for adjacent ranks as we go from more popular to less popular names. In my plots above, a change of one rank is the same vertical distance in the plot whether we are going from 1 to 2 or 100 to 101 or 250 to 251. But the percentage rates do not change by the same amount for each of those steps. Instead, the difference in percentage rate gets smaller as we go from more popular to less popular rankings. In techie terms, the relationship between rank and percentage use is non-linear. And that can produce a different look to the plot, as Professor M suggests. So let's take a look.
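Before plotting, the shrinking gaps can be checked directly from the 2006 percentages quoted above. This is a minimal Python sketch for illustration only, not the code behind the plots in this post; the variable names are made up for the example.

```python
# 2006 share of girls' births (percent) for adjacent ranks, as quoted above.
adjacent_pairs = [
    ((1, "Emily", 1.0267), (2, "Emma", 0.9159)),
    ((101, "Mya", 0.1602), (102, "Amanda", 0.1599)),
    ((250, "Monica", 0.0650), (251, "Carly", 0.0649)),
]

# Gap in percentage points between each pair of neighbouring ranks.
gaps = []
for (rank_a, name_a, pct_a), (rank_b, name_b, pct_b) in adjacent_pairs:
    gap = round(pct_a - pct_b, 4)
    gaps.append(gap)
    print(f"rank {rank_a} ({name_a}) to rank {rank_b} ({name_b}): {gap:.4f} points")
```

A one-rank step at the top of the list is worth about a tenth of a percentage point, while the same one-rank step near rank 250 is worth roughly a thousandth of that.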

I've converted the percentages into rate per 10,000 girls born, just to avoid the decimal points. That makes no difference for the look. So let's look at what Professor M suggests:

And behold! As Professor M suggested, the look is a bit different. What appears as a continued sharp drop after 1999 in my plot of rankings, now looks more like a continued decline but not so sharp, and much more of the decline came between 1997 and 1999. Also, the declining popularity of Monica between 1973 and 1997 appears more substantial, dropping from 41 per 10000 to 22 per 10000.

So Professor M's point is well taken. The changes in rates are significantly different from the changes in ranks. The popularity of Monica has continued to decline since 1999 but not nearly so dramatically as it appears in my ranking graph.


Is the raw percentage (or per 10,000) rate the right measure either? As the rate approaches zero, it becomes impossible to decrease by a constant amount. From 1973 to 1997, the rate of use of Monica fell from 41.0 to 22.1 per 10,000, a decline of 18.9. But in 1999 the rate was 10.96 per 10,000. It would be impossible for that to decline by another 18.9, lest we end up with a negative rate of name use! The point is, a constant change in the raw rate is impossible as we approach low incidence of the name. So perhaps linear change in the rate is also not a good way to model this.
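The arithmetic behind that floor effect is easy to check. A small Python sketch, using only the per-10,000 rates quoted in this paragraph (the variable names are illustrative):

```python
# Monica's rate per 10,000 girls born, as quoted in the text.
rate_1973 = 41.0
rate_1997 = 22.1
rate_1999 = 10.96

# Absolute decline over 1973-1997.
drop_1973_1997 = rate_1973 - rate_1997

# What the rate would be if the same absolute drop were applied again from 1999.
hypothetical_rate = rate_1999 - drop_1973_1997

print(f"1973-1997 drop: {drop_1973_1997:.2f} per 10,000")
print(f"1999 rate minus that drop: {hypothetical_rate:.2f} per 10,000")
```

The second number comes out negative, which no usage rate can be.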

An alternative is to think of the "half life" of the name use. This equates a fall of 1/2 from say 40 to 20 per 10000 with an equivalent proportionate change from 10 to 5 per 10000. This makes proportionate declines equal across the entire range of name rates. In effect, this says a fall of 1/2 in usage rate is the same wherever it occurs.

A simple way to measure this is to use the log base 2 of the rate per 10000. In base 2, each unit increase on the log2 scale is a doubling of the rate. So 1=log2(2), 2=log2(4), 3=log2(8), 4=log2(16), 5=log2(32) and 6=log2(64). Those values cover the range of Monica rates, and the critical point is that each 1 unit increase is a doubling and each 1 unit decrease is a halving of the rate of use.
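For anyone who wants to try the scale out, here is a short Python sketch using the standard library's math.log2. It reproduces the values listed above and checks that a halving from 40 to 20 moves the log2 scale by the same amount as a halving from 10 to 5. The variable names are just for illustration.

```python
import math

# log2 of the round numbers listed in the text: each doubling adds one unit.
log2_table = {rate: math.log2(rate) for rate in (2, 4, 8, 16, 32, 64)}
for rate, value in log2_table.items():
    print(f"log2({rate}) = {value:.0f}")

# A halving is a one-unit drop wherever it occurs on the scale.
halving_high = math.log2(40) - math.log2(20)
halving_low = math.log2(10) - math.log2(5)
print(f"40 -> 20: {halving_high:.3f} units")
print(f"10 -> 5:  {halving_low:.3f} units")
```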

Replotting the data on this log2(rate per 10000) scale produces the following:

Now we see that from 1973 to 1997 the log2 rate fell from 5.4 to 4.5, or almost a full unit, representing a halving of the rate. From 1997 to 1999 it fell from 4.5 to 3.5, another halving. And from 1999 to 2006 from 3.5 to 2.7, a bit less than half again.

On this scale of proportionate change then, the drop from 1997 to 1999 is huge, a full halving of the rate (from 22.1 per 10,000 to 10.96) in just 2 years. The subsequent decline from 10.96 to 6.50 is a 41% decrease in rate over 7 years.
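These figures can be reproduced from the four rates quoted in this appendix. The Python sketch below is an illustration only, and the helper name pct_decline is hypothetical rather than anything used to build the plots.

```python
import math

# Monica's rate per 10,000 girls born, as quoted in the text.
rates = {1973: 41.0, 1997: 22.1, 1999: 10.96, 2006: 6.50}

# Position of each year on the log2 scale.
log2_rates = {year: math.log2(rate) for year, rate in rates.items()}
for year, value in log2_rates.items():
    print(f"{year}: log2 rate = {value:.1f}")

def pct_decline(start, end):
    """Percentage decrease from start to end."""
    return 100 * (start - end) / start

decline_97_99 = pct_decline(rates[1997], rates[1999])
decline_99_06 = pct_decline(rates[1999], rates[2006])
print(f"1997-1999 decline: {decline_97_99:.0f}%")
print(f"1999-2006 decline: {decline_99_06:.0f}%")
```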

Now this plot is not identical to my ranking plot, but it is pretty close. The qualitative description in my original post applies about as well to this one as it did to the ranking plot. So I stand by my original comments.

I had not looked at these issues before Professor M's comment, so I am very grateful to him/her for pointing this out. And indeed, as we saw above, the raw rates do look somewhat different. But on reflection, prompted by that comment, I think the log2 rate is probably the most reasonable way to look at this. The ranks alone can be misleading because the equal intervals between ranks distort the changes in rate. But the raw rates are also misleading because changes cannot remain constant as usage approaches its lower limit of zero. Proportionate change seems more compelling in this case, and log2 is a convenient and easy to understand approach to this.

And one last technical point. The plot of rate against rank is strongly non-linear, as Professor M implies. The plot of log2(rate) against rank is much closer to linear, though with some continued bend. This is why my final log2 plot above more closely resembles the rank plot. Since log2 rate is close to linear with rank, the two plots must look quite similar.
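A very crude check of this is possible with just three of the 2006 figures quoted earlier (ranks 1, 101 and 250). The Python sketch below compares the slope over the first stretch of ranks with the slope over the second, once for the raw percentage and once for its log2. A ratio of 1 would mean a perfectly straight line. Three points are far too few to settle anything, so treat this as an illustration under that assumption, not as the analysis behind the plots; the function name slope_ratio is made up for the example.

```python
import math

# (rank, percent of girls' births) in 2006, from the figures quoted earlier.
points = [(1, 1.0267), (101, 0.1602), (250, 0.0650)]

def slope_ratio(values):
    """Ratio of the early slope to the late slope; 1.0 would be a straight line."""
    (r1, v1), (r2, v2), (r3, v3) = values
    early = (v2 - v1) / (r2 - r1)
    late = (v3 - v2) / (r3 - r2)
    return early / late

raw_ratio = slope_ratio(points)
log2_ratio = slope_ratio([(rank, math.log2(pct)) for rank, pct in points])

print(f"raw percentage: early slope is {raw_ratio:.1f} times the late slope")
print(f"log2 percentage: early slope is {log2_ratio:.1f} times the late slope")
```

On these three points the log2 ratio is much closer to 1 than the raw ratio, though still well above it, which fits the description of a line that is closer to straight but retains some bend.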