My approach to guessing a gender from a first name.
Hi!
A short time ago, I decided to try and build an API that would try to guess the gender of a first name. I thought this might be useful for segmenting user lists for campaigning, analytics or similar.
My first approach was to use a dataset of approved names from a few European countries. This was in the belief that most countries had lists like this (which they don't), and I planned to add them as I went along. I got wiser, and the first feedback I got also told me that the API should be able to make probabilistic guesses and, if possible, offer some sort of localization filter to achieve more accurate guesses.
I decided to take the approach of using large, growing datasets of user profiles from social networks, each entry containing a first name, a gender, a country_id and a language_id. Finally, I exposed this data model through http://genderize.io
It responds in JSON. Simple example: http://api.genderize.io?name=robin
I am now looking to get some feedback on my new approach. What do you think of this way of doing guesses? What do you think of the API? Any feedback is welcome.
The API is completely free by the way.

> A short time ago, i decided to try and build an API that would try to guess the gender of a first name.
Obviously you need to run a test that uses a list of real people's names and genders to measure the method's accuracy. But remember the following points:
* People might resent any effort to pin down their gender in a commercial or advertising context.
* The negative outcome for a gender misidentification may be much greater than the positive outcome for a correct one.
* Gender-neutral names are becoming increasingly fashionable among well-educated parents, i.e. people who have money.
On that basis and in my opinion, unless you can get above 90% accuracy, it's not worth doing.
Some popular gender-neutral names:
http://www.babynames1000.com/gender-neutral/
http://thestir.cafemom.com/pregnancy/157282/25_best_genderne...
http://en.wikipedia.org/wiki/Unisex_name#English
A quote: "Unisex names have been enjoying a decent amount of popularity in English speaking countries in the past several decades."
Also a large number of Sikh names are gender neutral.

To anyone who is interested in implementing this in a product: don't.
To be fair: do it if you must. But don't let the user see the gender field as it changes. If someone has a name that's associated with the opposite gender (or they believe themselves to be of another gender), seeing the change to that gender in the gender field will make them sad, annoyed, or irritated. At best, they will chuckle at the failed attempt to predict their gender.
This is one of those things that, when they work as intended, users don't notice it and it doesn't improve their experience that much, but when it fails, they notice and the annoyance hurts your image.

I see that you are missing various Swedish names, like Gudrun.
I don't know if you can get the full list of names, but you can get the list of names which were given to at least 10 girls in the last decade or so at: http://www.scb.se/Pages/TableAndChart____31028.aspx and for boys at: http://www.scb.se/Pages/TableAndChart____31036.aspx
You can also go to http://www.scb.se/Pages/NameSearch.aspx?id=259432 and search for a name. For example, there are 990 people in Sweden with Strömgren as a last name. It seems that "Gudrun" isn't that popular these days, as fewer than 10 girls get that name.
A different set of names is available from http://en.wiktionary.org/wiki/Category:Swedish_given_names . I don't need this data myself, and I can't comment on the effectiveness of the API.
You can get the top 1000 US names for a given year by going to http://www.ssa.gov/OACT/babynames/#ht=1 , selecting a year, changing "Popularity" to "Top 1000" and submitting the form. (For example, your search doesn't have 'Lowell', which was #172 in the US in 1940.)
Good luck!

Thanks a lot. I'll look into this.
Regarding missing names: I'm adding around 10,000 profiles a day to the dataset, so it gets better by the minute :) It's not that huge a dataset yet.

Hey, nice job! I do a lot of work (both professionally and as a volunteer) coding stuff for genealogical and historical non-profit organizations, and I could totally see an API like this being useful to us. Do you accept donations of name data sets from the 19th century Austro-Hungarian Empire? :-) Also, I would love to learn more about how the service actually works on the back-end.

Hey, thanks!
The API doesn't actually use name sets like that. Though that was my first approach. I changed it to use lists of profiles from social networks. So when a name is requested it looks up every profile with that name and counts the number of times each gender is represented. If you use any localization parameters it will of course only look up profiles associated with the particular country or language.
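To make the lookup described above concrete, here is a minimal Python sketch of the counting idea. It is not the actual genderize.io code: the in-memory `PROFILES` list, the `guess_gender` function, the country and language codes, and the reading of "count" as the number of matching profiles are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical sample rows: (first_name, gender, country_id, language_id).
# The real dataset and its ID formats are not shown in this thread.
PROFILES = [
    ("robin", "male", "se", "sv"),
    ("robin", "male", "gb", "en"),
    ("robin", "male", "gb", "en"),
    ("robin", "female", "us", "en"),
    ("robin", "female", "us", "en"),
    ("dillon", "male", "us", "en"),
]


def guess_gender(name, profiles, country_id=None, language_id=None):
    """Count genders among profiles with this name, optionally filtered by locale."""
    key = name.lower()
    counts = Counter(
        gender
        for first_name, gender, country, language in profiles
        if first_name == key
        and (country_id is None or country == country_id)
        and (language_id is None or language == language_id)
    )
    total = sum(counts.values())
    if total == 0:
        # No matching profile: no guess can be made.
        return {"name": key, "gender": None}
    gender, hits = counts.most_common(1)[0]
    # Assumption: "probability" is the winning gender's share of the matches,
    # and "count" is the number of matching profiles.
    return {
        "name": key,
        "gender": gender,
        "probability": f"{hits / total:.2f}",
        "count": total,
    }
```

With this sample data, `guess_gender("robin", PROFILES)` guesses male with probability "0.60", while `guess_gender("robin", PROFILES, country_id="us")` guesses female, which is the kind of shift a localization filter is meant to capture.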
I quickly realized with the initial approach that my lists would never be sufficient, since most countries allow almost any name to be given, and when combining lists from the whole world a lot of names would end up as unisex. That's why I went for a probability factor instead. Also, I'm hoping that by using social profiles it might one day be able to tell the gender of Superman or Catwoman and things like that. People can, after all, call themselves what they want on the internet.
I've actually thought about adding a baseline of names from different lists, though, to back up the names that are not yet represented in the dataset. Do you have a link to the names you are mentioning? Could be interesting.

Check out http://search.geshergalicia.org/ . Many, but not all, of the people mentioned in the 87 data sets (and counting!) that make up this database have a gender explicitly declared. Locale is the former province of Galicia in the Austro-Hungarian Empire, which is today eastern Poland and southwestern Ukraine. Time period is mostly 19th century and some early 20th century. Ethnicity is strongly biased towards Ashkenazi Jewish, but we also have some data sets that have representation of all the people in the community at that time, such as tax lists or phonebooks or school lists. I can get you data in JSON or XML, let me know.
I also have access to another large given name database that could be useful to you -- but that one is entirely Ashkenazi Jewish from what used to be northeastern Hungary, from roughly 1850 to 1906.

Now this is interesting. I entered my name (Dillon), which is gender-neutral, and it returned:
{"name":"dillon","gender":"male","probability":"1.00","count":1}
I'm interested in how it decided there was a 100% probability that I'm male (it was correct, though).

..Dillon is gender neutral?
I understand that this could be said about any name (which makes predicting gender a non-trivial problem), but I am pretty sure I've never met a female Dillon.

Hey Dillon.
It only found one match for "Dillon" in the dataset, which was a male. If Dillon is gender neutral, the probability will probably change to some decimal as the dataset grows.

I searched http://api.genderize.io/?name=batman and received
{"name":"batman","gender":null}
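For anyone consuming responses like the two quoted in this thread, a client probably needs to handle both shapes: a full guess, and a null gender when no profile matches. Below is a rough Python sketch that only parses the JSON strings shown above. The `describe` helper is a made-up name, and treating a missing "probability" and "count" as "no guess" is an assumption based on the batman response.

```python
import json


def describe(raw):
    """Turn a raw JSON response string into a short human-readable summary."""
    data = json.loads(raw)
    if data.get("gender") is None:
        # Shape seen for "batman": only name and a null gender are present.
        return f"{data['name']}: no guess"
    # In the "dillon" response the probability arrives as a string ("1.00").
    probability = float(data["probability"])
    return f"{data['name']}: {data['gender']} (p={probability:.2f}, n={data['count']})"


# The two responses quoted in the thread.
print(describe('{"name":"dillon","gender":"male","probability":"1.00","count":1}'))
print(describe('{"name":"batman","gender":null}'))
```

A caller that cares about accuracy might also want to ignore guesses with a very small count, since a single profile (as with Dillon) already yields a probability of 1.00.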