Friday, October 31, 2014

Coding: Name Distributions for 12k Semi-random Names

I wanted to create a table of people's names so that I would have a set of data for testing. I wanted the names to be somewhat more random than those of people I know, and I needed to find possible exceptions to my assumptions about names. As noted in my previous post, I have run into quite a few exceptions, and some were quite difficult to work around (e.g. Muhammad ibn Musa al-Khwarizmi or Georges-Louis Leclerc, Comte de Buffon).

Before I get to some of the problems, I wanted to post some data, since this is one of those things that I like to do with my free time even though there may be no value to it. This is for my set of 12k names, which I pulled from any lists of names that I could find (US presidents, popular scientists, soccer team members, veterans, etc.) with some diversity.

The top 10 first names are:

  1. John - 40
  2. James - 29
  3. William - 21
  4. Thomas - 17
  5. Robert - 17
  6. Michael - 15
  7. David - 14
  8. George - 12
  9. Mark - 12
  10. Richard - 12

The top 4 last names are:

  1. Brown - 9
  2. Johnson - 8
  3. Smith - 5
  4. Stewart - 5

My random data search is heavily dominated by male names. The top female name is Susan (8). Just off the top of my head, I think at least 95% of the names are male. I will try to focus on more female-dominated industries. I stopped at 4 for last names because there were too few overlaps in last names.

I also did a direct aggregate of names, so names with different spellings or abbreviations are counted separately and each ends up with a lower count. I have thought about creating some sort of normalized table for names (e.g. John, Jonathan, Johnny, Johny, etc.).
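As a rough illustration of what such a normalized table might do, here is a minimal Python sketch that folds known variants into one canonical spelling before counting. The `VARIANTS` mapping and the function names are hypothetical, and the entries simply follow the example grouping above; a real table would need far more entries and some judgment about which names really belong together.

```python
from collections import Counter

# Hypothetical mapping from variant spellings to one canonical form.
# The entries follow the example grouping in the text and are illustrative only.
VARIANTS = {
    "jonathan": "John",
    "johnny": "John",
    "johny": "John",
}


def canonical_first_name(name: str) -> str:
    """Return the canonical form of a first name, or the name itself if unmapped."""
    cleaned = name.strip()
    return VARIANTS.get(cleaned.lower(), cleaned)


def count_first_names(names):
    """Count first names after folding known variants together."""
    return Counter(canonical_first_name(n) for n in names)


sample = ["John", "Johnny", "Jonathan", "James", "Johny", "Susan"]
counts = count_first_names(sample)
print(counts["John"])  # all four John variants are counted together
```

Without the mapping, each of those spellings would be its own row with a count of 1, which is what the direct aggregate does today.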

As for unique spellings:
First names - 655
Last names - 1055

Another small issue with the names is the use of accent marks. On traditional US keyboards, accents are not easily accessible, so they may be left out (e.g. Zoe vs Zoë). These are counted separately in the SQL aggregates.
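One possible way to make accented and unaccented spellings count together is to build a comparison key with the accents stripped, while still storing the original spelling. This is only a sketch in Python using the standard `unicodedata` module; the function names `strip_accents` and `name_key` are my own, and letters that have no decomposed form (such as ø) are left unchanged by this approach.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining accent marks, e.g. turn an e with a diaeresis into a plain e."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(text: str) -> str:
    """Build a case-insensitive, accent-insensitive key for grouping names."""
    return strip_accents(text).casefold()


print(name_key("Zoë") == name_key("Zoe"))
```

Grouping on `name_key` instead of the raw name would put both spellings in the same bucket, at the cost of also merging names that are genuinely different apart from the accent.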

One of the biggest problems with this count is Asian names, especially Chinese ones. Most Asian names put the family name first. Although almost all Chinese family names are a single character, there are rare exceptions to this rule. Unfortunately, I cannot read the names. Even if I assume that whatever I find is a single character, I am not sure how to handle surname-first names. One option is to enter the surname into the last name field, because that is traditionally where the family name goes for Americans (which most systems are based on) and because names are split so that formal titles can be added in the American tradition. The other option is to keep it literal: the surname is the first part of the name, so it goes into the first name field.

This makes a difference if I create metrics similar to the ones above. It becomes important to treat the lists as first names rather than given names, and as last names rather than family names. This gets even more complicated, as I read that some places like Iceland and India have other naming traditions where there are no "family" names similar to those in American or Asian cultures. Some names include a location or the names of one or both parents.
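One way to avoid choosing between the two options is to store the given and family names by meaning and record the written order separately. The Python sketch below is only an assumption about how that could look; the `PersonName` class, its fields, and the `positional` method are hypothetical and not part of my current tables.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PersonName:
    """Hypothetical record that separates meaning (given/family) from written order."""
    given: str
    family: str = ""            # empty for people with no family name
    family_first: bool = False  # True when the family name is written first

    def positional(self) -> Tuple[str, str]:
        """Return (first, last) in the order the name is actually written."""
        if not self.family:
            return self.given, ""
        if self.family_first:
            return self.family, self.given
        return self.given, self.family


yao = PersonName(given="Ming", family="Yao", family_first=True)
print(yao.positional())  # written order: family name first
print(yao.family)        # still available for family-name metrics
```

With a record like this, the "first name" and "last name" metrics could use `positional()`, while any given-name or family-name metrics could use the fields directly.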

Then there are ancient times, when people had only a single name. How would I enter Alexander the Great? Given that I keep 'von' and 'de' in the last names, the most logical method is to have 'the Great' as the last name.
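The rule described here could be sketched in Python roughly as follows. The `PARTICLES` set and the `split_name` function are hypothetical, and the fallback of treating the final word as the last name is my own assumption; it will certainly be wrong for many of the exceptions mentioned earlier.

```python
# Hypothetical set of words that start the last name; only the three
# mentioned in the text are included.
PARTICLES = {"von", "de", "the"}


def split_name(full_name: str):
    """Split a full name into (first, last), keeping particles with the last name."""
    parts = full_name.split()
    if not parts:
        return "", ""
    if len(parts) == 1:
        # A single name, as in ancient times: there is no last name at all.
        return parts[0], ""
    for i, part in enumerate(parts[1:], start=1):
        if part.lower() in PARTICLES:
            return " ".join(parts[:i]), " ".join(parts[i:])
    # Assumed fallback: everything but the final word is the first name.
    return " ".join(parts[:-1]), parts[-1]


print(split_name("Alexander the Great"))
```

The first word is never treated as a particle, so a name always keeps a non-empty first name.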

Also, some people changed their names or inherited new ones. I did not have a method for this except to keep the first name that they had (or at least what I think was the first one). In the future, I will likely have to create an alias table to track people with multiple names.
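A minimal sketch of what an alias table could look like, using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and the `find_person_ids` helper are all assumptions for illustration, and the way the Buffon example is split into first and last names is just one possible choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per person, plus any number of alias rows.
conn.executescript("""
CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL DEFAULT ''
);
CREATE TABLE person_alias (
    id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES person(id),
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL DEFAULT ''
);
""")

# The original name goes in person; a later name goes in person_alias.
person_id = conn.execute(
    "INSERT INTO person (first_name, last_name) VALUES (?, ?)",
    ("Georges-Louis", "Leclerc"),
).lastrowid
conn.execute(
    "INSERT INTO person_alias (person_id, first_name, last_name) VALUES (?, ?, ?)",
    (person_id, "Georges-Louis", "de Buffon"),
)


def find_person_ids(conn, first_name, last_name):
    """Look a name up in both the main table and the alias table."""
    rows = conn.execute(
        """
        SELECT id FROM person WHERE first_name = ? AND last_name = ?
        UNION
        SELECT person_id FROM person_alias WHERE first_name = ? AND last_name = ?
        """,
        (first_name, last_name, first_name, last_name),
    ).fetchall()
    return sorted(row[0] for row in rows)


print(find_person_ids(conn, "Georges-Louis", "de Buffon") == [person_id])
```

Either name then resolves to the same person, and the counts above could still be run against the main table alone so that nobody is counted twice.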

So dealing with names turned out to be a lot more work than I had originally planned. This shows how easily software planning can be thrown out the window. What most would probably estimate to be only a couple of hours could turn into days, because a change in the architecture can ripple into other changes.