Hi friends,
Hope you all have been well these last two weeks.
The Long Read: The Surname Problem
I found this great dataset from the Census Bureau of American surnames that occurred at least 100 times in the 2010 Census. Ever since, I’ve been trying to find a good use for it.
Something that had always struck me as odd is how, at conferences and in my classes, we were always bucketed into two groups of 13 letters (last names starting with A-M and N-Z), even though some letters are obviously more popular than others.
So I decided to figure out what the optimal split would be for 2, 3, 4, and 5 buckets, down to the 2nd letter. I’d offer an accompanying graph, but it is illegible with all 676 combinations (AA, AB, etc.). So here is the distribution by just first letter:
And here are the optimal splits (for your next breakout activity, sign-in desk, whatever):
2 buckets: AA - LA | LB - ZZ
3 buckets: AA - GO | GP - OK | OL - ZZ
4 buckets: AA - ED | EE - LA | LB - RI | RJ - ZZ
5 buckets: AA - CU | CV - HI | HJ - MH | MI - SB | SC - ZZ
I was somewhat surprised by this! As it turns out, the typical A-M | N-Z split is not that bad: you’d only be one letter off and have a split of 45% | 55%.
Yet we do see a greater discrepancy as we split into more, smaller buckets.
At five buckets our list is split as 20.0% | 20.0% | 19.8% | 20.0% | 20.2% while the typical version (A - E | F - J | K - O | P - T | U - Z) gives 26.4% | 19.4% | 21.2% | 24.1% | 8.8%. That typical version is not great! The largest group would be 3 times as large as the smallest.
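For anyone who wants to check that "3 times" figure, here is a minimal sketch that uses only the percentages quoted above. The `imbalance` helper is a name I made up for illustration; it simply divides the largest bucket's share by the smallest.

```python
def imbalance(shares):
    """Ratio of the largest bucket's share to the smallest bucket's share."""
    return max(shares) / min(shares)


# Shares (in percent) as quoted in the text above.
typical = [26.4, 19.4, 21.2, 24.1, 8.8]      # A-E | F-J | K-O | P-T | U-Z
optimized = [20.0, 20.0, 19.8, 20.0, 20.2]   # the two-letter split

print(round(imbalance(typical), 2))    # 3.0
print(round(imbalance(optimized), 2))  # 1.02
```

A ratio near 1 means the buckets are close to even, so the optimized split is about as balanced as it can get.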
Will anyone take this advice and use this better version? Probably not! The split-by-number-of-letters heuristic is not really all that bad.
But now you know.
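If you want to try this kind of split yourself, here is one way it could be sketched. To be clear, this is an assumption about the approach and not necessarily how the splits above were computed: it walks the cumulative count of people by two-letter prefix and cuts wherever the running total lands closest to 1/k, 2/k, and so on. The function name `balanced_buckets` and the toy counts are made up for illustration and are not Census figures.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_buckets(counts, k):
    """Split alphabetically sorted (prefix, count) pairs into k contiguous buckets.

    Each cut is placed where the cumulative count is closest to i/k of the total.
    Returns a list of (first_prefix, last_prefix, share_of_total) tuples.
    """
    n = len(counts)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of prefixes")
    prefixes = [p for p, _ in counts]
    cumulative = list(accumulate(c for _, c in counts))
    total = cumulative[-1]

    buckets = []
    start = 0
    for i in range(1, k + 1):
        if i == k:
            end = n - 1  # the last bucket takes whatever is left
        else:
            target = total * i / k
            # first prefix whose cumulative count reaches the target
            end = bisect_left(cumulative, target)
            # step back one prefix if that lands closer to the target
            if end > 0 and target - cumulative[end - 1] < cumulative[end] - target:
                end -= 1
            # keep this bucket, and every later bucket, non-empty
            end = max(start, min(end, n - 1 - (k - i)))
        before = cumulative[start - 1] if start > 0 else 0
        buckets.append((prefixes[start], prefixes[end], (cumulative[end] - before) / total))
        start = end + 1
    return buckets


# Toy counts, invented purely for illustration (NOT Census data).
toy = [("AA", 10), ("AB", 30), ("BA", 15), ("CA", 45)]
for first, last, share in balanced_buckets(toy, 2):
    print(f"{first} - {last}: {share:.1%}")
# AA - BA: 55.0%
# CA - CA: 45.0%
```

Cutting at the nearest cumulative share is simple and fast, but it is not guaranteed to minimize the size of the largest bucket in every case; a dynamic-programming search over all cut positions would be the more careful option.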
Also, as luck would have it, while playing with this data I came across a few more fun pieces of American Surname Trivia(TM) that I have to share.
Lengths
The shortest names were, as expected, 2 characters long. There were 168 of them! That’s a fourth of all 676 possible two-letter combinations! The most popular were Li, Le, Wu, Yu, and Ho. Despite that impressive spread, only 0.33% of Americans have 2-letter surnames.
The Census Bureau capped the longest names at 15 characters. However, at this length there were 44 names. I’m guessing here, but they seemed primarily:
South Asian (Balasubramanian, Lakshminarayana, etc.),
German (Schwartzenberge, Schattschneider, Schwindenhammer, Gerstenschlager, etc.),
Greek (Panagiotopoulos, Anagnostopoulos, Konstantopoulos, Paraskevopoulos, …sensing a trend here…),
or something else (e.g. Desgroseilliers, Transfiguracion).
Only 0.00328% of Americans (9,641 to be exact) were in this group.
Race(s) / Ethnicity
The data also provided the percent of respondents that identified into the following buckets: (1) White, (2) Black, (3) Asian or Pacific Islander, (4) American Indian or Alaskan Native, (5) two or more races, or (6) Hispanic†. From this we can take a look at the most distinct surnames (i.e. highest proportion for a given group) and the most popular surnames (i.e. surname held by the most members of a given group).
Most distinct surnames (including number of Americans with surname, % of those with surname in group)
White: Burkemper | 513 | 99.81%
Black: Adeyeye | 288 | 99.56%
Asian or Pacific Islander: Behera | 288 | 99.65%
American Indian or Alaskan Native: Keyonnie | 144 | 99.31%
Two or more races: Osmani | 861 | 9.99%
Hispanic: Alejandres | 415 | 99.76%
Most distinct surnames held by at least 10,000 Americans
White: Stoltzfus | ~16,000 | 99.00%
Black: Smalls | ~12,000 | 90.49%
Asian or Pacific Islander: Xu | ~26,000 | 98.25%
American Indian or Alaskan Native: Yazzie | ~15,000 | 94.56%
Two or more races: Persaud | ~12,000 | 8.94%
Hispanic: Ruvalcaba | ~11,000 | 97.95%
Most popular surnames (including number of Americans of that group with surname):
White: Smith | ~1,700,000
Black: Williams | ~774,000
Asian or Pacific Islander: Nguyen | ~422,000
American Indian or Alaskan Native: Smith | ~22,000
Two or more races: Smith | ~535,000
Hispanic: Garcia | ~1,100,000
Those are not typos—there are just so many Smiths out there that it is the most popular name in three different groups.
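For the curious, the two rankings above differ only in what gets maximized: "most distinct" looks for the highest percentage within a group, while "most popular" looks for the highest head count within a group. Here is a minimal sketch of that difference. The `Row` layout, the function names, and the three rows are all hypothetical and made up for illustration; they are not the Census file's actual columns or values.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str     # surname
    count: int    # number of people with this surname
    pct: float    # percent of those people who are in the group of interest


def most_distinct(rows, min_count=0):
    """Surname with the highest share in the group, among names with at least min_count holders."""
    eligible = [r for r in rows if r.count >= min_count]
    return max(eligible, key=lambda r: r.pct)


def most_popular(rows):
    """Surname held by the most members of the group (count times share)."""
    return max(rows, key=lambda r: r.count * r.pct / 100)


# Invented rows, purely for illustration (NOT Census data).
rows = [
    Row("Aaa", 500, 99.0),
    Row("Bbb", 20_000, 95.0),
    Row("Ccc", 1_000_000, 70.0),
]

print(most_distinct(rows).name)                    # Aaa
print(most_distinct(rows, min_count=10_000).name)  # Bbb
print(most_popular(rows).name)                     # Ccc
```

The `min_count` cutoff is what separates the first list from the second: without it, the winners tend to be very rare names.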
The Smiths
So then let’s talk about just how popular it is! There were approximately 2.4 million Smiths in 2010. If the Smiths were a state, they would be the 36th largest in the union, right between Kansas and New Mexico.
Even more extreme, the top 10 names‡ together accounted for a whopping 14.4 million Americans, which would make them the 5th largest state, between New York and Pennsylvania.
—
That’s all I could get out of the data! If you decide to take a look, do let me know if you find anything else interesting in it.
† All races given by the Census data are non-Hispanic only (e.g. non-Hispanic White, non-Hispanic Black, etc.) and are mutually exclusive (e.g. “White” means respondent only identified as White, whereas someone who identified as both White and Black would be counted as “Two or more races”).
‡ For the interested: Smith, Johnson, Williams, Brown, Jones, Garcia, Miller, Davis, Rodriguez, and Martinez
The Links
Books are bad.* How might we make them better? Honestly, this piece is so good I considered just sending it as the only link this week.
Remote work is not that bad* and so we’re never going back to the old way (s/o Ben Porter).
Good news!* Voter education works.
Once again, the New York Times Agrees with Me* [$] You need a purpose in retirement if you want to avoid death.
Yale owns a 367-year-old bond.* Just in case you forgot how much I love consols.
Lagniappe
I had the pleasant surprise of getting a phone call from one of my best friends this past week. We hadn’t spoken in some time. It was wonderful to catch up. We spoke for 2 hours and 52 minutes, and I came away from the conversation feeling refreshed, energized, and grateful.
While it might seem counterintuitive, I think we all have grown a little less likely to reach out to our close, long-distance friends during the pandemic. We spend so much time on calls and chats that we just want to be away from it at the end of the day.
As such, I’d recommend that this week you call whoever is the closest friend you haven’t spoken to in at least a month. They’ll appreciate it and you’ll enjoy it.
Graph(s) of the week
[Pew] I don’t know about anyone else, but the most surprising number here for me is the 38% for White churches.
[Paper] A survey of Danish people found that most people think their income is closer to the middle of the distribution than it really is.
Keep the faith,
Harrison