A better approach to determining gender from a first name
genderize.io
Please just don't.
There is no point antagonising people by guessing information about them wrongly - particularly if it's something they've become sensitised to by it occurring frequently.
If you need to know someone's gender (and largely, you don't), then ask them.
It certainly doesn't have to be user facing. Gender may be used to make user behavior modeling or prediction more accurate.
Fair point. For aggregate statistical analysis this may well be useful.
I would think this would be by far its most likely use case. It was the first thing I thought of when I saw the link.
It also assumes two distinct genders which is a fallacy.
Sure, but that doesn't preclude it from being a useful metric for something like ad targeting.
It does exactly that. Don't look for natural metrics in an artificial universe.
Many would argue that "being a useful metric" for an inherently non-useful purpose is worse than a waste of time.
This.
It doesn't necessarily need to be 100% accurate to be useful. For example, you might use something like this to choose the gender of randomly generated NPCs in a game. In that case, it probably doesn't matter whether the gender for each name is always correct, but it would add to the realism if it were. I'm not sure why you would generate NPCs from a single list of names rather than one list of male and one of female names, but I'm sure there are other examples of problems where this kind of service could work well.
Think outside the box. This has many other use cases.
This is not thinking outside the box. This is painting the interior of the box with refractive ink and calling it innovative.
This. A thousand times this.
> {"name":"marijn","gender":"female","probability":"1.00","count":1}
Except, of course, that I am male. My name is used for both genders. The thing completely failed on a few other ambiguous names I tried. I'll second AndrewDucker's opinion—just don't.
This result is really saying that 1 out of 1 tested Marijns are female - since they have only tested one Marijn, you should consider the result in this light.
The numbers are honest enough to admit that the result is crap in this case - this type of statistical openness should be encouraged.
The same goes for the following, a name used for both genders in Italy:
{"name":"maria","gender":"female","probability":"1.00","count":700}
failed with my name, too.
{ "name": "kay", "gender": "female", "probability": "0.93", "count": 57, "country_id": "US", "language_id": "en" }
How is that a "fail"? The probability is listed as 93%.
100% of probability doesn't make it a certainty either.
Interesting from a machine-learning perspective - but this strikes me as a solution looking for a problem.
If any service needs to know gender (and I'm having a hard time thinking of times you NEED to know gender - dating sites?) - why not just ask? Surely in a situation where you're reliant on having accurate gender information, guessing from $firstname and getting it wrong is worse than asking.
An analogous problem is automatic language and country detection. It's convenient when it works transparently, but can be a huge hassle when it guesses wrong.
Here is a place where it's helpful and guessing wrong is ok. Consider a movie information site, which for some reason knows your name (but nothing else).
Male homepage: Die Hard, Star Wars, Bridget Jones.
Female homepage: Bridget Jones, Twilight, Star Wars.
Both males and females are shown primarily movies they are more likely to be interested in and your bounce rate goes down.
Why not just ask? More form fields mean fewer conversions. Using the service, one can ask for the gender later during the registration process, and only if confidence in the sex detection is lower than a defined threshold.
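A rough sketch of that deferred-question idea, in Python. The function name `should_ask_gender` and the 0.9 cutoff are illustrative assumptions, not part of the service; the `gender` and `probability` fields mirror the JSON responses quoted in this thread, where the probability arrives as a string.

```python
def should_ask_gender(response, min_probability=0.9):
    """Return True when the guess is missing or too uncertain to rely on."""
    gender = response.get("gender")
    if gender is None:
        # The service had no guess at all, so the form field is needed.
        return True
    # The probability is a string such as "0.93" in the quoted responses.
    probability = float(response.get("probability", 0.0))
    return probability < min_probability


# Responses shaped like the ones quoted in the thread.
kay = {"name": "kay", "gender": "female", "probability": "0.93", "count": 57}
unknown = {"name": "nord", "gender": None}
coin_flip = {"name": "pat", "gender": "male", "probability": "0.50", "count": 10}  # made-up values

ask_kay = should_ask_gender(kay)
ask_unknown = should_ask_gender(unknown)
ask_coin_flip = should_ask_gender(coin_flip)
```

Note that this looks only at `probability` and ignores `count`, so a "1.00" backed by a single sample would still skip the question.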
You don't need to know the data for most things. The times you do (dating sites etc.), you need it to be accurate and should ask explicitly.
I'm sure this is interesting from a statistical point of view, but does the tech scene really need yet more reinforcement of a binary view of gender?
You want an enlightened view of the complexity of sensitively handling transgender people, non-binary genders and other gender and sexual minorities?
This is Hacker News. Such enlightened thought is frowned on by our new brogrammer overlords. Here's your beer.
Wow, that's misplaced hostility.
If you wanted to deride the fact that many folks here won't spend multiples of effort on special, experimental, no-right-answers-and-likely-to-be-criticized-for-it-if-you-even-try cases that affect minuscule fractions of their potential user base, well...get in line behind the IE5 advocates, I guess.
Someone recommends using a free form entry for gender. No amount of normalization will fix the "ham sandwich" entries (except that we know they are nearly all male), so you'd trade the integrity of a small percentage of your data for the appearance of "making an effort" for the vanishingly small percentage. Net fail.
Just to be clear, my primary feeling here is that -- in the hypothetical case where gender matters -- you're best served by keeping it simple: (female | male | other/it's complicated | prefer not to answer). This should serve all cases equally.
The solution is already available by the very same folks who would just as easily think nothing of presenting that "female | male | other/it's complicated | prefer not to answer" field though.
In lieu of questioning who the user is, may I suggest see what the user does instead? Behavior, in the end, is the real key to unlocking demographic potential. What purchases are made, what items were clicked on, where they had browsed and what titles on pages attracted them on the site can tell far more than a simple field.
This, of course, takes time and testing plus accumulation of data. But since a fair number of HN users also like to exalt the status of "Big Data", there's a more productive use for it.
Don't try to finagle who the user is or isn't. Just find out what they want.
ah yes, because trying to be decent & inclusive to persecuted minorities is comparable to supporting an obsolete browser - yr analogy is a perfect illustration of the grotesque intersection between the Valley & bigotry.
also, "the hypothetical case where gender matters" is a glorious illustration of straight privilege
> a glorious illustration of straight privilege
Gender has nothing to do with orientation. Please don't propagate such normative misunderstandings.
If I may restate for clarity: in most cases of software implementation, a user's gender is not important data (obvious exceptions include medical and related fields).
Generally, gender should not be requested. Where requested, it should not be required. Where required, one should have no compunction against answering randomly.
That's prerogative, not privilege.
opted for "straight privilege" as an alternative to "cisprivilege" that I thought HN people were more likely to understand. I'd have thought it was obvious I wasn't actually talking about orientation
I agree, but it's interesting to think about how you would handle such information programmatically. Even for a case where you let users input their own gender, it's tricky.
Do you simply add some extra genders? Male-to-female transsexual, female-to-male transsexual, intersex? No matter how many categories you add, you'll always annoy someone for missing them out. Do 'genderqueer' and 'genderfluid' count as the same category, or different ones?
Maybe just add a free text form for people to input their gender? But then it's impossible to normalise if you want to do any analysis.
Maybe we should just be enlightened and ignore gender altogether? But sometimes knowing your user's gender is really important, and it seems weird to discard this data because some people don't fit. Maybe the best compromise is simply to have 3 categories - male/female/other - though even then you'll get complaints. "Who are you calling 'other'?"
Anyone have any other thoughts?
PS: I seem to see way more people in tech complaining about brogrammers than actual brogrammers.
Easy. Store in your database (or whatever) as free text.
In UI, provide a form with "Male, Female, Other". If they click other, reveal an optional text field where they can enter what they wish. Store.
Normalize the synonyms of male and female to lower case "male" and "female" when doing analysis. You don't have to get 100% normalisation. But you'll probably get 80-90%.
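A minimal sketch of that normalisation step, in Python. The synonym table and the name `normalize_gender` are assumptions for illustration; a real table would be built from whatever your users actually type. Anything that is not a recognised synonym is kept as the user's own text, only trimmed and lower-cased.

```python
# Hypothetical synonym table; extend it from the entries you actually see.
SYNONYMS = {
    "male": "male",
    "m": "male",
    "man": "male",
    "female": "female",
    "f": "female",
    "woman": "female",
}


def normalize_gender(entry):
    """Map common spellings to "male"/"female"; leave other answers as entered."""
    key = entry.strip().lower()
    return SYNONYMS.get(key, key)


stored = [" Male", "F", "woman", "genderqueer", "ham sandwich"]
normalized = [normalize_gender(value) for value in stored]
```

The stored free text is never overwritten; the mapping is applied only at analysis time, so the table can be revised later without losing what people wrote.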
Contemplate what you are actually using the data for, listen to users.
To be honest, if you manage to not put transgender as a sexual orientation, you'll be doing better than most people.
Good answer, that sounds like a reasonable compromise.
re: the brogrammer thing, that's little more than a sign of the breadth of their insidious reach...
The "probability" return value appears to be a straight average; it returns 1 for "Peter", which is almost guaranteed to be incorrect - all it takes is a single female Peter, anywhere on the planet.
A better approach, in the absence of more complex models, would be to use Laplace's sunrise formula.
My great-aunt is a nun, her name became Peter Claver.
She isn't the only one either, there are hundreds of them that took their name from a Catholic saint.
You're kidding right? Guessing gender for a "show hacker news" with a .io domain is a clear case of "done is better than perfect."
Adding 1 to the numerator and 2 to the denominator is a trivial improvement, not pie-in-the-sky whiteboarding that prevents you from shipping.
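Spelled out, that adjustment is Laplace's rule of succession: (s + 1) / (n + 2) for s matching samples out of n. A minimal Python sketch follows; recovering s as round(probability * count) is my assumption about how the two fields in the response relate, not something the service states.

```python
def laplace_estimate(successes, trials):
    """Rule of succession: add 1 to the numerator and 2 to the denominator."""
    return (successes + 1) / (trials + 2)


def smoothed_from_response(probability, count):
    """Re-estimate from the API's fields (assumes successes = probability * count)."""
    trials = int(count)
    successes = round(float(probability) * trials)
    return laplace_estimate(successes, trials)


# The single-sample "marijn" result quoted earlier: (1 + 1) / (1 + 2)
marijn = smoothed_from_response("1.00", 1)
# The 700-sample "maria" result: (700 + 1) / (700 + 2), close to but below 1
maria = smoothed_from_response("1.00", 700)
```

With no samples at all the estimate is 0.5, and it never reaches exactly 0 or 1, which is the point of the complaint about "Peter".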
Yes, to fix the "girls named Larry" bug.
In morphologically rich languages (like Russian) the most discriminative feature for detecting gender could be the word shape of the last name or middle name, not the first name. So in many languages there is no way to have meaningful gender prediction by analyzing just the first name. Relative gender frequency for the first name is useful information, but it is just not enough for reliable gender prediction.
I guess it needs a better training DB, it returns {"gender": null} for not-so-common names in languages other than English...
http://api.genderize.io/?name=eloi&language_id=ca
http://api.genderize.io/?name=tomeu&language_id=ca
http://api.genderize.io/?name=rigoberta&language_id=es
http://api.genderize.io/?name=presentaci%C3%B3n&language_id=...
Credit for distinguishing between names in languages, though! Joan returns female in English, but male in Catalan.
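For anyone reproducing those lookups, here is a small Python sketch that only builds the query URLs and makes no network call. The endpoint and the `name` and `language_id` parameters are taken from the URLs above; the helper name `build_query_url` is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.genderize.io/"


def build_query_url(name, language_id=None):
    """Build a lookup URL; urlencode percent-encodes non-ASCII names as UTF-8."""
    params = {"name": name}
    if language_id:
        params["language_id"] = language_id
    return BASE_URL + "?" + urlencode(params)


catalan_joan = build_query_url("joan", "ca")
accented = build_query_url("presentación")
```

Passing the name through `urlencode` is what turns the "ó" into `%C3%B3`, as in the last URL listed.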
This project is a fine example of the "Falsehoods Programmers Believe About Names" http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...
Bear in mind that in some languages this problem doesn't exist. In Polish for example, all female names end with an "a". There is not a single exception from that rule, so if you see a name ending with an "a" it is always a female name.
And in Iceland you can reliably determine gender from a person's second name, ending in either -son or -dottir.
Yes, but probably there are people in Iceland and Poland with foreign names. I know that you both wanted to explain that there is a rule in some language and it's nice to find out about that information, but as long as it doesn't apply to everybody, I don't consider this a solution.
In Poland you cannot give children foreign sounding names. So unless we are talking about foreigners visiting, then it does apply to everyone. Like I don't know a single person to whom that rule would not apply.
I thought Hacker News had more people speaking more/other languages than English.
A lot of complaints, excluding the binary gender complaints, totally forget about how languages like Portuguese / French have male / female differences for nouns and other language constructs.
Let's say I have to build a phrase that includes the user's profession, like engineer, and I don't know the gender up front: in Portuguese it would be "engenheiro" for male or "engenheira" for female. It does have a lot of practical uses. And with a big enough training set, the decision to use it for that user is in your hands.
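To make the Portuguese example concrete, here is a toy Python sketch. The lookup table, the function name and the fallback of showing both forms are all assumptions for illustration; only the two word forms come from the comment above.

```python
# Hypothetical table of gendered noun forms; only "engineer" is filled in.
PROFESSION_FORMS = {
    "engineer": {"male": "engenheiro", "female": "engenheira"},
}


def profession_in_portuguese(profession, gender=None):
    """Pick the gendered form, or show both forms when the gender is unknown."""
    forms = PROFESSION_FORMS[profession]
    if gender in forms:
        return forms[gender]
    # Fallback chosen for this sketch: list both forms rather than guess.
    return forms["male"] + "/" + forms["female"]


known = profession_in_portuguese("engineer", "female")
unknown = profession_in_portuguese("engineer")
```

The `gender` argument could come from the user or from a guess; the fallback is what you would show when neither is available or the guess is too uncertain.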
For Icelandic names, it's easy to identify the gender by looking at the last name. For example Bjarni Benediktsson is definitely male while Katrín Jakobsdóttir is definitely female.
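The suffix rule is short enough to sketch in Python. The function name is made up, and the rule only makes sense when you already know the name is an Icelandic patronymic or matronymic; an English surname such as "Johnson" would also match "-son", so treat this as an illustration rather than a reliable detector.

```python
def guess_from_icelandic_last_name(last_name):
    """Return "female", "male" or None based on the -dóttir / -son suffix."""
    lowered = last_name.strip().lower()
    # Accept the spelling without the accent too, as in "-dottir" above.
    if lowered.endswith(("dóttir", "dottir")):
        return "female"
    if lowered.endswith("son"):
        return "male"
    return None


bjarni = guess_from_icelandic_last_name("Benediktsson")
katrin = guess_from_icelandic_last_name("Jakobsdóttir")
other = guess_from_icelandic_last_name("Smith")
```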
Another strategy is to use gender-neutral terms until you find out the gender, as asking directly might be considered rude in some cultures.
Is the first name in Iceland the family name or is there something else going on here?
I'm just guessing, but..
"Benedikt-SSON is definitely male while Katrín Jakobs-DÓTTIR is female"
(Hey, I swear it was before taking a look http://en.wikipedia.org/wiki/Icelandic_name)
Doesn't know my name. http://api.genderize.io/?name=timofei
{ "name": "петя", "gender": "female", "probability": "1.00", "count": 1 }
Yeah, how about no.
I like this from a usability standpoint. Just as some forms auto-fill the city/state based on the zip (and might get it wrong), this enables something similar. And it might get it wrong, but if your mom gave you a girl's name* blame her.
It also seems accurate:
Pat = about 50/50
David = All man
Jessica = All woman
Also, wrt to "binary gender identity" complaints, are we all college freshmen here?
* my own name (Nord) sucks and gave a gender of null. Spent my whole life being called Nerd, Nora, etc. I'm not flipping out.
> wrt to "binary gender identity" complaints, are we all college freshmen here?
We aren't, which is exactly why it's a problem.
Nobody is saying a form can't have "other", etc. as an option. Just that someone's MVP that guesses gender is allowed to do that without political-correctness police.
I fail to see how this API needs to accommodate transpeople in its 0.1 release.
You have failed at empathy.
> I fail to see how this API needs to accommodate transpeople in its 0.1 release.
Speaking to the fluidity of human gender, "Other" is the majority of the spectrum and defaulting to binary is just as naïve as defaulting to ASCII as expected input in an application/API written in 2013.
Restricting yourself that early in the release cycle (and I'm still dubious of the merits of this project), doesn't bode well for its future.
Edit: I just read your comment history and, if I'm not mistaken, you're already biased. Or would you care to elaborate what you wrote here? https://news.ycombinator.com/item?id=6451454
>> Speaking to the fluidity of human gender, "Other" is the majority of the spectrum
I agree 100% about the fluidity of human gender, but rather than lecture people via a form/api etc. it is probably simplest to have words that most people use like male and female and something ("other", "trans", "enlightened", whatever) for the 3rd option.
>> I just read your comment history and, if I'm not mistaken, you're already biased. Or would you care to elaborate what you wrote here? https://news.ycombinator.com/item?id=6451454
First of all that was a joke in the spirit of the Hangover 2. Might not have been that funny, but was an attempt at humor based on what it was responding to.
Secondly, I believe you are "biased" :) because though it is chronologically juxtaposed to this comment that's probably a coincidence because if you go through my comment history it might be the only one touching on the subject (I think, no guarantees).
edit - I did make a prison rape joke a year+ ago (https://news.ycombinator.com/item?id=4148572) but I actually heard from people that it was hysterical* because of the play on "backbone".
* hysterical is a sexist word, I know.
The simplest is a text field with: "What prefix would you like us to use?" The End.
> it is probably simplest to have words that most people use like male and female and something...
There are no assumptions, no assignment of labels, not one bit of imposing your cultural norms on anyone else. The hardest part of getting over biases is acknowledging that you have them.
Learn some sensitivity, please.
I'm either being trolled or bullied here.
A text field adds time to type out (which can lose customers, alienate handicapped etc.), all to accommodate an exception, rather than a rule.
I don't care if you have a slider, dropdown, circle, whatever, but for usability, a gender option should have poles that require 0 or 1 clicks to get to (though a text-field for further elucidation is okay). Continuing down this path, the further step is saying a shoe-size option insults amputees and lymphedema victims and should be a text-field...
Edit - Another fact is that the person filling out this form might not be the person described by the form (say for a CRM tool) in which case it matters more to KISS.