A better approach to determining gender from a first name

genderize.io

27 points by Stromgren 12 years ago · 62 comments

AndrewDucker 12 years ago

Please just don't.

There is no point antagonising people by guessing information about them wrongly - particularly if it's something they've become sensitised to by it occurring frequently.

If you need to know someone's gender (and largely, you don't), then ask them.

  • vdaniuk 12 years ago

    It certainly doesn't have to be user facing. Gender may be used to make user behavior modeling or prediction more accurate.

    • AndrewDucker 12 years ago

      Fair point. For aggregate statistical analysis this may well be useful.

      • jonahx 12 years ago

        I would think this would be by far its most likely use case. It was the first thing I thought of when I saw the link.

  • sambeau 12 years ago

    It also assumes two distinct genders which is a fallacy.

  • spuz 12 years ago

    It doesn't necessarily need to be 100% accurate to be useful. For example, you might use something like this to choose the gender of randomly generated NPCs in a game. In that case, it probably doesn't matter whether the gender of each name is always correct, but it would add to the realism if it were. I'm not sure why you would generate NPCs from a single list of names rather than one list of male and one of female names, but I'm sure there are other examples of problems where this kind of service could work well.

  • Kiro 12 years ago

    Think outside the box. This has many other use cases.

    • eksith 12 years ago

      This is not thinking outside the box. This is painting the interior of the box with refractive ink and calling it innovative.

  • hartror 12 years ago

    This. A thousand times this.

marijn 12 years ago

> {"name":"marijn","gender":"female","probability":"1.00","count":1}

Except, of course, that I am male. My name is used for both genders. The thing completely failed on a few other ambiguous names I tried. I'll second AndrewDucker's opinion—just don't.

  • ronaldx 12 years ago

    This result is really saying that 1 out of 1 tested Marijns are female - since they have only tested one Marijn, you should consider the result in this light.

    The numbers are honest enough to admit that the result is crap in this case - this type of statistical openness should be encouraged.

  • sdoering 12 years ago

    The same goes for the following, a name used for both genders in Italy:

    {"name":"maria","gender":"female","probability":"1.00","count":700}

    • k__ 12 years ago

      failed with my name, too.

      { "name": "kay", "gender": "female", "probability": "0.93", "count": 57, "country_id": "US", "language_id": "en" }

brey 12 years ago

Interesting from a machine-learning perspective - but this strikes me as a solution looking for a problem.

If any service needs to know gender (and I'm having a hard time thinking of times you NEED to know gender - dating sites?) - why not just ask? Surely in a situation where you're reliant on having accurate gender information, guessing from $firstname and getting it wrong is worse than asking.

  • clarkm 12 years ago

    An analogous problem is automatic language and country detection. It's convenient when it works transparently, but can be a huge hassle when it guesses wrong.

  • yummyfajitas 12 years ago

    Here is a place where it's helpful and guessing wrong is ok. Consider a movie information site, which for some reason knows your name (but nothing else).

    Male homepage: Die Hard, Star Wars, Bridget Jones.

    Female homepage: Bridget Jones, Twilight, Star Wars.

    Both males and females are shown primarily movies they are more likely to be interested in and your bounce rate goes down.

  • vdaniuk 12 years ago

    Why not just ask? More form fields mean fewer conversions. Using the service, one can ask for the gender later in the registration process, and only if confidence in the detection is lower than a defined threshold.
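
A minimal sketch of that threshold idea, assuming the response is a dict shaped like the JSON quoted elsewhere in this thread (with `probability` as a string). The function name `should_ask_for_gender` and both threshold values are hypothetical, chosen only for illustration. It also checks `count`, since a probability of 1.00 from a single sample says very little.

```python
def should_ask_for_gender(response, min_probability=0.95, min_count=20):
    """Return True when the guess is too weak to rely on.

    `response` is a dict shaped like the JSON quoted in this thread, e.g.
    {"name": "kay", "gender": "female", "probability": "0.93", "count": 57}.
    The thresholds are illustrative assumptions, not recommended values.
    """
    if response.get("gender") is None:
        return True  # the service had no data for this name
    # The thread shows probability as a string ("1.00"), so convert it.
    probability = float(response.get("probability", 0))
    count = int(response.get("count", 0))
    return probability < min_probability or count < min_count


marijn = {"name": "marijn", "gender": "female", "probability": "1.00", "count": 1}
maria = {"name": "maria", "gender": "female", "probability": "1.00", "count": 700}

print(should_ask_for_gender(marijn))  # True: a single sample is not enough
print(should_ask_for_gender(maria))   # False: high probability, many samples
```

Where the thresholds should sit depends on how costly a wrong guess is for the product in question.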

    • tommorris 12 years ago

      You don't need to know the data for most things. The times you do (dating sites etc.), you need it to be accurate and should ask explicitly.

batemanesque 12 years ago

I'm sure this is interesting from a statistical point of view, but does the tech scene really need yet more reinforcement of a binary view of gender?

  • tommorris 12 years ago

    You want an enlightened view of the complexity of sensitively handling transgender people, non-binary genders and other gender and sexual minorities?

    This is Hacker News. Such enlightened thought is frowned on by our new brogrammer overlords. Here's your beer.

    • quesera 12 years ago

      Wow, that's misplaced hostility.

      If you wanted to deride the fact that many folks here won't spend multiples of effort on special, experimental, no-right-answers-and-likely-to-be-criticized-for-it-if-you-even-try cases that affect minuscule fractions of their potential user base, well...get in line behind the IE5 advocates, I guess.

      Someone recommends using a free form entry for gender. No amount of normalization will fix the "ham sandwich" entries (except that we know they are nearly all male), so you'd trade the integrity of a small percentage of your data for the appearance of "making an effort" for the vanishingly small percentage. Net fail.

      Just to be clear, my primary feeling here is that -- in the hypothetical case where gender matters -- you're best served by keeping it simple: (female | male | other/it's complicated | prefer not to answer). This should serve all cases equally.

      • eksith 12 years ago

        The solution is already available by the very same folks who would just as easily think nothing of presenting that "female | male | other/it's complicated | prefer not to answer" field though.

        In lieu of questioning who the user is, may I suggest looking at what the user does instead? Behavior, in the end, is the real key to unlocking demographic potential. What purchases are made, what items were clicked on, where they browsed and which page titles attracted them can tell far more than a simple field.

        This, of course, takes time and testing plus accumulation of data. But since a fair number of HN users also like to exalt the status of "Big Data", there's a more productive use for it.

        Don't try to finagle who the user is or isn't. Just find out what they want.

      • batemanesque 12 years ago

        ah yes, because trying to be decent & inclusive to persecuted minorities is comparable to supporting an obsolete browser - yr analogy is a perfect illustration of the grotesque intersection between the Valley & bigotry.

        also, "the hypothetical case where gender matters" is a glorious illustration of straight privilege

        • quesera 12 years ago

          > a glorious illustration of straight privilege

          Gender has nothing to do with orientation. Please don't propagate such normative misunderstandings.

          If I may restate for clarity: in most cases of software implementation, a user's gender is not important data (obvious exceptions include medical and related fields).

          Generally, gender should not be requested. Where requested, it should not be required. Where required, one should have no compunction against answering randomly.

          That's prerogative, not privilege.

          • batemanesque 12 years ago

            opted for "straight privilege" as an alternative to "cisprivilege" that I thought HN people were more likely to understand. I'd have thought it was obvious I wasn't actually talking about orientation

    • IsaacL 12 years ago

      I agree, but it's interesting to think about how you would handle such information programmatically. Even for a case where you let users input their own gender, it's tricky.

      Do you simply add some extra genders? Male-to-female transsexual, female-to-male transsexual, intersex? No matter how many categories you add, you'll always annoy someone for missing them out. Do 'genderqueer' and 'genderfluid' count as the same category, or different ones?

      Maybe just add a free text form for people to input their gender? But then it's impossible to normalise if you want to do any analysis.

      Maybe we should just be enlightened and ignore gender altogether? But sometimes knowing your user's gender is really important, and it seems weird to discard this data because some people don't fit. Maybe the best compromise is simply to have 3 categories - male/female/other - though even then you'll get complaints. "Who are you calling 'other'?"

      Anyone have any other thoughts?

      PS: I seem to see way more people in tech complaining about brogrammers than actual brogrammers.

      • tommorris 12 years ago

        Easy. Store in your database (or whatever) as free text.

        In UI, provide a form with "Male, Female, Other". If they click other, reveal an optional text field where they can enter what they wish. Store.

        Normalize the synonyms of male and female to lower case "male" and "female" when doing analysis. You don't have to get 100% normalisation. But you'll probably get 80-90%.

        Contemplate what you are actually using the data for, listen to users.

        To be honest, if you manage to not put transgender as a sexual orientation, you'll be doing better than most people.
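
The normalization step described above could be sketched roughly as follows. The `SYNONYMS` table and the `normalize_gender` name are assumptions made for this example. Only the obvious synonyms collapse to "male" and "female", and anything else is kept exactly as the user typed it.

```python
# Hypothetical synonym table; extend it with whatever your users actually type.
SYNONYMS = {
    "male": "male", "m": "male", "man": "male",
    "female": "female", "f": "female", "woman": "female",
}


def normalize_gender(raw):
    """Map common synonyms to "male"/"female"; keep other answers as typed."""
    if raw is None:
        return None
    cleaned = raw.strip()
    if not cleaned:
        return None  # the optional field was left empty
    return SYNONYMS.get(cleaned.lower(), cleaned)


print(normalize_gender("  Male "))      # male
print(normalize_gender("F"))            # female
print(normalize_gender("genderqueer"))  # genderqueer
```

Running this at analysis time rather than on write keeps the stored free text untouched, which matches the "store as free text" suggestion.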

      • batemanesque 12 years ago

        re: the brogrammer thing, that's little more than a sign of the breadth of their insidious reach...

Filligree 12 years ago

The "probability" return value appears to be a straight average; it returns 1 for "Peter", which is almost guaranteed to be incorrect - all it takes is a single female Peter, anywhere on the planet.

A better approach, in the absence of more complex models, would be to use Laplace's rule of succession (the "sunrise" formula).

  • huxley 12 years ago

    My great-aunt is a nun, her name became Peter Claver.

    She isn't the only one either, there are hundreds of them that took their name from a Catholic saint.

  • mjolk 12 years ago

    You're kidding right? Guessing gender for a "show hacker news" with a .io domain is a clear case of "done is better than perfect."

    • stephencanon 12 years ago

      Adding 1 to the numerator and 2 to the denominator is a trivial improvement, not pie-in-the-sky whiteboarding that prevents you from shipping.
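
For reference, the "add 1 to the numerator and 2 to the denominator" adjustment is Laplace's rule of succession, and a rough sketch of it fits in a few lines. The function names are hypothetical. The second helper assumes the reported `probability` is a plain fraction of `count`, which is how the thread reads the output rather than something the service confirms.

```python
def laplace_probability(successes, total):
    """Rule of succession: (successes + 1) / (total + 2)."""
    if successes < 0 or total < successes:
        raise ValueError("need 0 <= successes <= total")
    return (successes + 1) / (total + 2)


def smoothed_from_response(probability, count):
    """Recover the majority count from a raw average, then smooth it.

    Assumes `probability` is the plain fraction of `count` samples that had
    the reported gender, which is how the thread reads the API output.
    """
    successes = round(float(probability) * count)
    return laplace_probability(successes, count)


print(round(smoothed_from_response("1.00", 1), 2))    # 0.67 for the lone "marijn"
print(round(smoothed_from_response("1.00", 700), 3))  # 0.999 for "maria"
```

With one sample the estimate drops from 1.00 to about 0.67, while with 700 samples it barely moves, so the adjustment only matters when the data is thin.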

kmike84 12 years ago

In morphologically rich languages (like Russian) the most discriminative feature for detecting gender could be the word shape of the last name or middle name, not the first name. So in many languages there is no way to get meaningful gender prediction by analyzing just the first name. Relative gender frequency for the first name is useful information, but it is just not enough for reliable gender prediction.

bromagosa 12 years ago

I guess it needs a better training DB, it returns {"gender": null} for not-so-common names in languages other than English...

http://api.genderize.io/?name=eloi&language_id=ca

http://api.genderize.io/?name=tomeu&language_id=ca

http://api.genderize.io/?name=rigoberta&language_id=es

http://api.genderize.io/?name=presentaci%C3%B3n&language_id=...

Credit for distinguishing between names in languages, though! Joan returns female in English, but male in Catalan.
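
A small sketch of building those query URLs and handling the `{"gender": null}` case, using only the Python standard library. The endpoint and the `name` and `language_id` parameters are taken from the links above; the helper names are hypothetical, and the service's current API may differ from what this thread shows. No network request is made here.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://api.genderize.io/"  # endpoint as linked in the thread


def build_query_url(name, language_id=None):
    """Build the query URL; urlencode percent-encodes non-ASCII names."""
    params = {"name": name}
    if language_id is not None:
        params["language_id"] = language_id
    return BASE_URL + "?" + urlencode(params)


def parse_gender(body):
    """Return the guessed gender, or None when the service has no data."""
    return json.loads(body).get("gender")


print(build_query_url("eloi", "ca"))
# http://api.genderize.io/?name=eloi&language_id=ca
print(parse_gender('{"gender": null}'))  # None
```

Treating a null gender as "unknown" rather than as an error lets the caller fall back to asking the user or to neutral wording.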

eksith 12 years ago

This project is a fine example of the "Falsehoods Programmers Believe About Names" http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...

gambiting 12 years ago

Bear in mind that in some languages this problem doesn't exist. In Polish for example, all female names end with an "a". There is not a single exception to that rule, so if you see a name ending with an "a" it is always a female name.

  • TillE 12 years ago

    And in Iceland you can reliably determine gender from a person's second name, ending in either -son or -dottir.

    • zuppy 12 years ago

      Yes, but there are probably people in Iceland and Poland with foreign names. I know that you both wanted to explain that there is a rule in some languages, and it's nice to learn about it, but as long as it doesn't apply to everybody, I don't consider this a solution.

      • gambiting 12 years ago

        In Poland you cannot give children foreign-sounding names. So unless we are talking about visiting foreigners, it does apply to everyone. I don't know a single person to whom that rule would not apply.
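
The two suffix rules discussed in this subthread can be written down as a rough sketch, taking the rules exactly as the commenters state them rather than as verified linguistic facts. The function names are hypothetical. Names that do not fit a rule return `None` instead of a guess, which covers the foreign-name objection raised above.

```python
def guess_polish(first_name):
    """Rule as stated in the thread: a first name ending in "a" is female."""
    name = first_name.strip().lower()
    if not name:
        return None
    # Names not ending in "a" are left undecided here rather than assumed male,
    # a deliberately cautious choice for this sketch.
    return "female" if name.endswith("a") else None


def guess_icelandic(last_name):
    """Read the patronymic suffix: -son is male, -dóttir/-dottir is female."""
    name = last_name.strip().lower()
    if name.endswith(("dóttir", "dottir")):
        return "female"
    if name.endswith("son"):
        return "male"
    return None  # not a patronymic, e.g. a foreign or family name


print(guess_icelandic("Benediktsson"))  # male
print(guess_icelandic("Jakobsdóttir"))  # female
print(guess_icelandic("Smith"))         # None
print(guess_polish("Anna"))             # female
```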

nefasti 12 years ago

I thought Hacker News had more people speaking languages other than English.

A lot of the complaints, excluding the binary gender complaints, totally forget that languages like Portuguese and French have male/female forms for nouns and other language constructs.

Let's say I have to build a phrase that includes the user's profession, such as engineer, and I don't know the gender up front: in Portuguese it would be "engenheiro" for a male or "engenheira" for a female. It does have a lot of practical uses. And with a big enough training set, the decision about which form to use for that user is in your hands.

casca 12 years ago

For Icelandic names, it's easy to identify the gender by looking at the last name. For example Bjarni Benediktsson is definitely male while Katrín Jakobsdóttir is definitely female.

Another strategy is to use gender-neutral terms until you find out the gender, as asking directly might be considered rude in some cultures.

Grue3 12 years ago

Doesn't know my name. http://api.genderize.io/?name=timofei

anonemouscoward 12 years ago

{ "name": "петя", "gender": "female", "probability": "1.00", "count": 1 }

Yeah, how about no.

ludicast 12 years ago

I like this from a usability standpoint. Just as some forms auto-fill the city/state based on the zip (and might get it wrong), this enables something similar. And it might get it wrong, but if your mom gave you a girl's name* blame her.

It also seems accurate:

Pat = about 50/50
David = All man
Jessica = All woman

Also, wrt to "binary gender identity" complaints, are we all college freshmen here?

* my own name (Nord) sucks and gave a gender of null. Spent my whole life being called Nerd, Nora, etc. I'm not flipping out.

  • masklinn 12 years ago

    > wrt to "binary gender identity" complaints, are we all college freshmen here?

    We aren't, which is exactly why it's a problem.

    • ludicast 12 years ago

      Nobody is saying a form can't have "other", etc. as an option. Just that someone's MVP that guesses gender is allowed to do that without political-correctness police.

      I fail to see how this API needs to accommodate transpeople in its 0.1 release.

      • eksith 12 years ago

        > I fail to see how this API needs to accommodate transpeople in its 0.1 release.

        You have failed at empathy.

        Speaking to the fluidity of human gender, "Other" is the majority of the spectrum and defaulting to binary is just as naïve as defaulting to ASCII as expected input in an application/API written in 2013.

        Restricting yourself that early in the release cycle (and I'm still dubious of the merits of this project), doesn't bode well for its future.

        Edit: I just read your comment history and, if I'm not mistaken, you're already biased. Or would you care to elaborate what you wrote here? https://news.ycombinator.com/item?id=6451454

        • ludicast 12 years ago

          >> Speaking to the fluidity of human gender, "Other" is the majority of the spectrum

          I agree 100% about the fluidity of human gender, but rather than lecture people via a form/api etc. it is probably simplest to have words that most people use like male and female and something ("other", "trans", "enlightened", whatever) for the 3rd option.

          >> I just read your comment history and, if I'm not mistaken, you're already biased. Or would you care to elaborate what you wrote here? https://news.ycombinator.com/item?id=6451454

          First of all that was a joke in the spirit of the Hangover 2. Might not have been that funny, but was an attempt at humor based on what it was responding to.

          Secondly, I believe you are "biased" :) because though it is chronologically juxtaposed to this comment that's probably a coincidence because if you go through my comment history it might be the only one touching on the subject (I think, no guarantees).

          edit - I did make a prison rape joke a year+ ago (https://news.ycombinator.com/item?id=4148572) but I actually heard from people that it was hysterical* because of the play on "backbone".

          * hysterical is a sexist word, I know.

          • eksith 12 years ago

            > it is probably simplest to have words that most people use like male and female and something...

            The simplest is a text field with: "What prefix would you like us to use?" The End.

            There are no assumptions, no assignment of labels, not one bit of imposing your cultural norms on anyone else. The hardest part of getting over biases is acknowledging that you have them.

            Learn some sensitivity, please.

            • ludicast 12 years ago

              I'm either being trolled or bullied here.

              A text field adds time to type out (which can lose customers, alienate handicapped etc.), all to accommodate an exception, rather than a rule.

              I don't care if you have a slider, dropdown, circle, whatever, but for usability, a gender option should have poles that require 0 or 1 clicks to get to (though a text-field for further elucidation is okay). Continuing down this path, the further step is saying a shoe-size option insults amputees and lymphedema victims and should be a text-field...

              Edit - Another fact is that the person filling out this form might not be the person described by the form (say for a CRM tool) in which case it matters more to KISS.
