Text Mining South Park

kaylinwalker.com

202 points by eamonncarey 10 years ago · 52 comments

nanis 10 years ago

I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.

But it seems to me that the author is falling into a trap that many an unwary data "scientist" falls into by not understanding the discipline of statistics.

When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.

If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.

No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of because we have the entire population (in this specific instance, ALL the words spoken by all the characters).

FYI, all budding data "scientists" ...

  • bonoboTP 10 years ago

    Why so bitter and angry? As far as I can see, his calculations make sense and lead to interesting results. Instead of philosophical nitpicking, why not help him improve his understanding by explaining how you would have calculated/formalized/modeled this thing, so the scare-quote data "scientists" can learn something?

    By the way, we definitely don't hear all the words these characters speak in their lives. It's implied in the story that there are conversations we don't get to see in the actual episodes; these imaginary characters nevertheless speak a lot more than what is aired. For example, we don't see each and every breakfast, lunch, and dinner discussion, and we don't hear all their words in the classroom, etc.

    Now of course the sampling isn't random, because the creators obviously "select" the more interesting bits of the characters' lives, but in statistics we always make assumptions that simplify the procedure but are known to be technically wrong.

    • sdenton4 10 years ago

      But we must protect ourselves from the research parasites! Man the ramparts and ready the harsh words!

  • vsbuffalo 10 years ago

    You're treating this sample-is-the-population issue as if it's resolved in the statistics literature. It is not. Gelman has written on this [1][2], as the issue comes up frequently in political science data. As Gelman points out, the 50 states are not a sample of states; they're the entire population. Similarly, the Correlates of War [3] data comprises every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.

    Treating his population as a large sample of a process that's uncertain or noisy and then applying frequentist statistics is not inherently wrong in the way you say it is. It may be that there's a better way to model the uncertainty in the process than treating the population as a sample, but that's a different point than the one you make.

    [1]: http://andrewgelman.com/2009/07/03/how_does_statis/

    [2]: http://www.stat.columbia.edu/~gelman/research/published/econ... (see finite population section)

    • dragonwriter 10 years ago

      > Similarly, the Correlates of War [3] data comprises every militarized international dispute between 1816 and 2007 that fits certain criteria; it too is not a sample but the entire population.

      It's the entire population of wars meeting certain criteria in that time frame. If that is the topic of interest, then it is also the whole population. OTOH, datasets like that are often used in analyses intended to apply to, for instance, "what-if" scenarios about hypothetical wars that could have happened in that time frame. In that case the studied population is clearly not the population of interest, but is taken to be a representative sample of a broader population (and while there may be specific reasons to criticize this in specific cases, those are reasons other than "it's the whole population, not a sample").

    • bonoboTP 10 years ago

      Exactly. There is an interpretation where the "population" is interpreted as a mathematical ideal process (with potentially infinite information content) and any real, physical manifestation is considered a "sample".

      The old-school interpretation is stricter and considers both the "population" and the "sample" to be physical real things. It's understandable because these methods were developed for statistics about human populations (note the origin of the terminology), medical studies etc. (The word "statistics" itself derives from "state").

      Somehow, frequentist statisticians are usually very conservative and set in one way of thinking and do not even like to entertain an alternative interpretation or paradigm... I'm not sure why it is so.

    • nanis 10 years ago

      As an economist, I am also aware of the logical contortions we have to go through to be able to run regressions on historical data (i.e. pretty much all of economic data). None of this applies here. The data generating process consists of the minds of the writers.

      For your reasoning to be applicable here, you have to put together a model of the data generating process from which you can derive a proper model that allows inference. What exactly are the assumptions on P( word_i | character_j ) that make it compatible with these particular tests' assumptions?

  • walkerkq 10 years ago

    Hi, I'm the author. I appreciate the time you've taken to read and provide constructive criticism of my work. Here's my full write up (on GitHub, so it should continue to work): https://github.com/walkerkq/textmining_southpark/blob/master...

    I was working under the assumption that we do not know ALL the words since the show's been renewed through 2019. This covers the first 18 seasons.

    Additionally, counting up their most frequent words produced results with very little semantic meaning - things like "just" and "dont" - which can be seen in this (really boring) wordcloud: https://github.com/walkerkq/textmining_southpark/blob/master...

    Looking into the log likelihood of each word for each speaker produced results that were much more intuitive and carried more meaning, like ppod said below: "I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking."
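
    Roughly, the per-word calculation looks like this (a simplified sketch, not the exact code from my repo; a and b are the word's counts in the character's speech and in everyone else's, and c and d are the corresponding total word counts):

        # Sketch of the log-likelihood (G2) keyness statistic used in
        # corpus linguistics: observed vs. expected counts of one word
        # in a character's speech (a of c words) vs. the rest (b of d).
        log_likelihood <- function(a, b, c, d) {
          e1 <- c * (a + b) / (c + d)  # expected count for the character
          e2 <- d * (a + b) / (c + d)  # expected count for the rest
          2 * (ifelse(a > 0, a * log(a / e1), 0) +
               ifelse(b > 0, b * log(b / e2), 0))
        }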

    • nanis 10 years ago

      The point I am making is simple: You can calculate whatever you want to calculate, but there is no room for statistical testing because you do not have a probability sample, and hence no sampling variation.

      Yes, there will be future episodes, but you are not claiming that you are predicting what these characters will say in those future episodes (in which case your whole setup is rather inappropriate).

      Also, I suggest you think very hard about this statement:

      > The log likelihood value of 101.7 is significant far beyond even the 0.01% level, so we can reject the null hypothesis that Cartman and the remaining text are one and the same.

      Even if the statistical test you employed were appropriate, this is not the conclusion you draw from it.

      Also, are you confusing the p = 0.01 (i.e., 1%) level with 0.01%, or did you really choose p = 0.0001 as the significance level for your test?

    • wodenokoto 10 years ago

      A simple tf-idf would get you similar results without a t-test.

      I think that is what parent is implying.
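
      For reference, tf-idf is only a few lines in base R. A sketch, where m is a hypothetical term-by-character count matrix (rows = terms, columns = characters):

          # Sketch of tf-idf on a term-by-character count matrix m.
          tf    <- t(t(m) / colSums(m))           # within-character term frequency
          idf   <- log(ncol(m) / rowSums(m > 0))  # rarity across characters
          tfidf <- tf * idf                       # high value = characteristic term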

  • minimaxir 10 years ago

    > If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests simply by counting.

    From the text, the author is performing statistical testing (chi-sq) for which words are most unique to a character, not which words they say the most (although the two metrics are somewhat correlated).

    • nanis 10 years ago

      As I said, I could not read the whole thing. As I was skimming, I noticed the tests, tried to load the main page, and I was disconnected.

      Once again, "words that are most unique" to a character is a parameter that can easily be computed from the set of ALL words with no sampling uncertainty because, yes, we have the population.

      • minimaxir 10 years ago

        I wouldn't say easily. Keep in mind that to check whether something is "unique," it needs to be checked against every other character as well.

        For example, the Top 5 Unique Words for Randy Marsh per the analysis are:

        stan, stanley, lorde, shelly, son

        I downloaded the dataset and quickly calculated the Top 5 Most Frequently Said Words for Randy from the entire population. Those are:

        what, stan, yeah, ok, huh

        All characters on the show are saying those words (except "stan"). That's why log-likelihood/tf-idf is used on a per-character basis.
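
        The raw frequency count itself is short once the data is loaded - a sketch, assuming the repo's All-seasons.csv layout (Season, Episode, Character, Line):

            # Sketch: top 5 most frequent words for one character.
            x <- read.csv("All-seasons.csv", stringsAsFactors = FALSE)
            randy <- tolower(x$Line[x$Character == "Randy"])
            words <- unlist(strsplit(randy, "[^a-z']+"))
            head(sort(table(words[words != ""]), decreasing = TRUE), 5)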

      • ppod 10 years ago

        I think the idea is that what we are really trying to measure is something unobservable, like the underlying nature of the character or the writers' tendencies to give characters certain ways of speaking. We can say that Stan uses a word at a certain rate, corrected for that word's base rate in the corpus, and compare this with the rate for another character. If that difference in rates is very small, it's true that we still know the difference holds absolutely for this corpus, but it may not reflect any substantive difference between the characters.

        If this is the view taken, then the population is all of the text that might have been generated by the data-generating process of the scripts -- things like the writers' mental models of the characters. In this view the actual scripts are just a sample from all of the scripts that could have been written while keeping the variable of interest (the characters' character) constant.
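
        A toy simulation makes the point (a hedged sketch with made-up rates): two characters sharing the same underlying word distribution will still show different counts in any finite corpus.

            # Toy illustration of the "scripts as a sample" view: identical
            # "true" word rates still yield different observed counts.
            set.seed(1)
            p <- c(just = 0.5, dont = 0.3, dude = 0.2)  # shared hypothetical rates
            stan <- rmultinom(1, 1000, p)
            kyle <- rmultinom(1, 1000, p)
            cbind(stan, kyle)  # any differences here are pure sampling noise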

  • nanis 10 years ago

    Also, I am going to go out on a limb here and guess that R's `read.csv` doesn't do what one hopes it would when fed this CSV:

        10,3,Brian,"You mean like the time you had tea with
        Mohammad, the prophet of the Muslim faith?
        Peter:
        Come on, Mohammad, let's get some tea.
        Mr. T:
        Try my ""Mr. T. ...tea.""
        "
    
    Well, it seems people are not understanding the problem with this line. Here is the screenshot of the original script: http://imgur.com/pcu5N2U

        Brian: 	You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? [flashback #3]
        Peter: 	Come on, Mohammad, let's get some tea. [Mohammad is covered by a black box with the words "IMAGE CENSORED BY FOX" printed several times from top to bottom inside the box. They stop at a tea stand.]
        Mr. T: 	Try my "Mr. T. ...tea." [squints]
    
    There, three characters speak.

    However, R's read.csv will assign all three characters' speech to Brian: http://imgur.com/gLpPKdl

       > x[596, ]
           Season Episode Character
        596     10       3     Brian
                  Line
        596 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \n
    
        > x[597,]
            Season Episode Character
        597     10       3     Brian
                                                    Line
        597 You mean like the time you had tea with Mohammad, the prophet of the Muslim faith? \nPeter:\nCome on, Mohammad, let's get some tea. \nMr. T:\nTry my "Mr. T. ...tea." \n
    
    as well as seemingly duplicating part of the conversation.

    PS: In addition, both Muhammad and Mohammad appear, presumably under-counting the references to the prophet.
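
    A quick way to gauge how widespread this is (a rough sketch, with x being the data frame returned by read.csv):

        # Flag rows whose Line field contains an embedded "Speaker:"
        # header on its own line, suggesting mis-attributed speech.
        suspect <- grepl("\n[A-Z][A-Za-z. ']*:\n", x$Line)
        sum(suspect)             # how many rows look mis-attributed
        head(x[suspect, "Line"]) # inspect a few of them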

    • minimaxir 10 years ago

      I took a look at the code in the author's GitHub repository.

      The data sources are CSVs in this repository: https://github.com/BobAdamsEE/SouthParkData/

      Looks like all the data is preprocessed, with each record mostly containing a single line of dialogue. (Actually, it appears the line you note in 10-3 is broken!) You can make an argument that the script isn't processed correctly, but that's beyond the scope of the analysis, although a note might be helpful.

      • bobadams5 10 years ago

        It's my repository. I'll look at how the python script handles flashback events later today. Thanks for the feedback!

        • bobadams5 10 years ago

          It appears that there are two issues that affect small parts of the captured datasets:

          1) Colored character names are not handled properly. I looked for <th> tags, not <th bgcolor="beige"> tags.

          2) Character names that start with a lower-case character are not handled. This may have to do with other episodes using lower-case-prefixed table headers for stage directions; I have to double-check.

    • masukomi 10 years ago

      Why not? That's a valid single CSV record with 4 "columns". When surrounded by quotes, it IS legal for a CSV field to span multiple lines.

      • nanis 10 years ago

        And did you notice that the other lines comprise other characters' speech?

    • philh 10 years ago

      Just tested, it handles that fine. (R 3.1.3)

      • nanis 10 years ago

        Sure, if you mean attributing Mr. T's and Peter's speech to Brian is fine, then yes, it handles it fine.

  • ZoF 10 years ago

    This implies there aren't future episodes upon which this type of statistical analysis could be applied.

    This also strongly implies you think the author is a 'budding data scientist' out of his/her league.

    This is very much a 'sample' given the context that South Park is still releasing new episodes.

    FYI all elitist 'statisticians' ...

    • nanis 10 years ago

      If one is trying to figure out what characters will say in future episodes based on their speech in previous episodes, then one is in a prediction context, not a significance-testing context.

      As far as I can tell, there are a lot of people out of their leagues going around with the title "data scientist".

      This is not a sample. This is a census at this point in time. The fact that there will be another population tomorrow does not change the fact that you have the entire population of all words spoken by all characters up to today.

      I am not a statistician. I am an economist who knows enough about statistics and econometrics to know when a significance test is applicable.

      Also, do note the issue that R's CSV parsing is going to mis-attribute some characters' speech to others. GIGO speaks loudly.

      • ZoF 10 years ago

        You're the worst kind of intelligent person tbh.

        Why be a nitpicking pedant when it is clear this is intended as a throwaway exercise whose only application is predictive...?

      • toupeira 10 years ago

        You're the one calling people "data scientists"; OP didn't even use the word "science" anywhere in the article.

  • make3 10 years ago

    Would the fact that he/she does not have the future text in the sample/population, and that this dataset is used as a sample of all the South Park ever to be written (in a prediction mode), make this make sense?

  • JoeAltmaier 10 years ago

    Hm. The show is still running? Then the show can be considered a sample of what the characters (ok, the writers) will say/put in their mouths. The statistics then have predictive value.

    • nanis 10 years ago
      • JoeAltmaier 10 years ago

        By that definition: a complete sample?

        • nanis 10 years ago

          No.

          > A complete sample is a set of objects from a parent population that includes ALL such objects that satisfy a set of well-defined selection criteria.[3] For example, a complete sample of Australian men taller than 2m would consist of a list of every Australian male taller than 2m. But it wouldn't include German males, or tall Australian females, or people shorter than 2m ...

          So, the entire set of all words spoken by South Park characters, by definition, is the population of all words spoken by South Park characters.

          For this to be a complete sample, it needs to be a sample out of a larger population. What is that population?

seankross 10 years ago

Here's the accompanying GitHub repo: https://github.com/walkerkq/textmining_southpark

wodenokoto 10 years ago

> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log likelihood and using tf-idf?

  • minimaxir 10 years ago

    Relevant line in code:

       # remove sparse terms
       all.tdm.75 <- removeSparseTerms(all.tdm, 0.75) # 3117 / 728215
    
    I believe it's a sparsity threshold rather than a tf-idf factor: terms absent from more than 75% of the documents get dropped, which removes words said only once or twice.
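
    A toy illustration of the threshold (a sketch with made-up documents, not the author's data):

        # With sparse = 0.75, a term survives only if its sparsity
        # (the share of documents it is absent from) stays below 0.75.
        library(tm)
        docs <- c("cat dog", "cat eel", "cat fox", "dog fox", "cat dog")
        tdm  <- TermDocumentMatrix(Corpus(VectorSource(docs)))
        inspect(removeSparseTerms(tdm, 0.75))  # "eel" (absent from 80%) is dropped
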
cadab 10 years ago

I've found an image, which I'm guessing is taken from the site: http://imgur.com/IEudyni. Worth looking at if the site's still down.

LoSboccacc 10 years ago

I would have loved to see the log-likelihood characterization for the Canadian characters, even if they aren't part of the main cast.

dropdatabase 10 years ago

This is amazing. I wonder what results you'd get from The Simpsons.

rhema 10 years ago

Pretty interesting. This large-scale study of MySpace (http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_...) shows a similar method for finding characteristic terms, using mutual information.

peg_leg 10 years ago

This should be nominated for an Ig Nobel.

agentgt 10 years ago

I wonder how the results would change if it was based not on words but on lines (not string lines, but actors' lines in conversation).

It's also funny how Stan talks more than Kyle, given the show now has a recurring joke that makes fun of Kyle's long educational dialogues.

  • cdubzzz 10 years ago

    Maybe because of Kyle's decision to not give long speeches last season (:

  • flashman 10 years ago

    It would definitely change. For instance, I'd expect Kyle's words per sentence (or at least his 90th-percentile sentence length) to be higher than Stan's, due to his speeches.
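
    Easy enough to check with the dataset - a rough sketch, assuming x is the data frame loaded from All-seasons.csv as above:

        # 90th-percentile words-per-line for a given character.
        p90_words_per_line <- function(ch) {
          n <- lengths(strsplit(x$Line[x$Character == ch], "\\s+"))
          quantile(n, 0.90)
        }
        p90_words_per_line("Kyle")
        p90_words_per_line("Stan")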

gulbrandr 10 years ago

  Error establishing a database connection
Does anyone have a cached version, please?
