A PhD student's perspective on research in NLP in the era of LLMs

arxiv.org

126 points by morgangiraud 3 years ago · 54 comments

rhdunn 3 years ago

I've been interested in NLP for tagging stories based on topics and themes (detectives, werewolves, murder mystery, etc.), so I need accurate disambiguation of parts of speech and ways of detecting uses of metaphors, similes, etc. to describe those. I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.

I've looked at several existing NLP frameworks (OpenNLP, Stanford NLP) and none of them are accurate enough -- they fail on things like adjectives and Old English second-person pronouns. This makes them practically unusable for proper sense disambiguation, lemma- and part-of-speech-based rules, etc.

The OpenNLP tokenizer is also terrible at tokenizing title abbreviations ("Dr", etc.) and things like the use of "--" to delimit text, which is frequently found in various Project Gutenberg texts. You can train the OpenNLP tokenizer, but it only works on what it has seen, so you need to give it every variation of "(Mr|Mrs|Miss|Ms|Rev|Dr|...). [A-Z]" for it to tokenize those titles, and the same for other tokens.

  • nl 3 years ago

    Have you tried Spacy?

    I find it substantially better than other tools as a PoS tagger.

    Also worth noting that your assertion that you need these features to classify genres isn't obviously true to me at all.

    • rhdunn 3 years ago

      No I haven't. Thanks for the pointer.

      For detecting uses of nouns like werewolf/werewolves, or vampire/vampires, I at least need the lemma to avoid writing different cases or a regex for each noun. Likewise, lemmatization can be used to handle different spellings (e.g. vampyre, or were-wolf). Similarly for verbs.

      Lemmatization works best when it is coupled with part of speech tagging, so you avoid removing the -ing in adverbs for example.

      Part of speech tagging also helps avoid incorrect labeling, such as not tagging 'bit' in "a bit is a single binary value" as the verb "to bite".
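
      Roughly the kind of rule I have in mind, sketched against spaCy's API since it was just suggested (untested on my side; the model name and target lemmas are placeholders):

        import spacy

        # assumes: pip install spacy && python -m spacy download en_core_web_sm
        nlp = spacy.load("en_core_web_sm")

        doc = nlp("The vampires were hunted. He was bitten. A bit is a single binary value.")
        for token in doc:
            # lemma_ folds plurals/inflections; pos_ disambiguates noun vs verb uses
            if token.lemma_ == "vampire" and token.pos_ == "NOUN":
                print("creature mention:", token.text)
            if token.lemma_ == "bite" and token.pos_ == "VERB":
                # the noun 'bit' in the last sentence keeps the lemma 'bit', so it is skipped
                print("bite event:", token.text)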

      That's for the simple case.

      Then there are more complex cases, like generalizing "[NP] was bitten by the vampire.", where NP can be a personal pronoun (he, she, etc.) or a name. There can also be other ways to say the same thing, e.g. "The vampire bit [NP] neck.", where NP is now the possessive form (his, her, etc.), not the subject form. With Universal Dependencies or similar-style dependency relations, you could match and label sentence fragments of the form "verb=bite, nsubj=vampire, obj=NP" (like in the second sentence) and "verb=bite, nsubj:pass=NP, obl=vampire" (like in the first sentence).

      Without NLP, it becomes even harder to detect split variants like "cut off his head" and "cut his head off", which are the same thing written in different ways. I want to detect things like that and label the entire fragment "beheading", including other noun phrase variants.
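
      For concreteness, this is the sort of dependency pattern I mean, written against spaCy's DependencyMatcher as I understand it from the docs (a sketch, untested; the pattern and rule names are made up):

        import spacy
        from spacy.matcher import DependencyMatcher

        nlp = spacy.load("en_core_web_sm")
        matcher = DependencyMatcher(nlp.vocab)

        # "the vampire bit X": a 'bite' verb whose nominal subject has lemma 'vampire';
        # the passive variant would need a second pattern with the passive-subject label
        pattern = [
            {"RIGHT_ID": "bite", "RIGHT_ATTRS": {"LEMMA": "bite"}},
            {"LEFT_ID": "bite", "REL_OP": ">", "RIGHT_ID": "attacker",
             "RIGHT_ATTRS": {"DEP": "nsubj", "LEMMA": "vampire"}},
        ]
        matcher.add("VAMPIRE_BITE", [pattern])

        doc = nlp("The vampire bit her neck.")
        for match_id, token_ids in matcher(doc):
            print("matched:", [doc[i].text for i in token_ids])

      The split variants should come out the same in a dependency parse, since the particle ("off") attaches to the verb whether or not the object sits between them.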

      With more advanced NLP features -- like coreference resolution (resolving instances of he/she/etc. to the same person), and information extraction (e.g. Dracula is a vampire) -- it would be possible to tag even more sentences and sentence fragments.

      • nl 3 years ago

        I'd encourage you to try plain old text classification on ngrams. An ngram approach will pick up lemmas fine, although Spacy will do lemmatization if you prefer.

        But I've done a lot of work on this type of thing, and the only time I found this kind of sentence analysis useful as classifier features was in a legal context where there were variants of very specific language we wanted to find.

        There it worked because we could write rules on the features without relying on training data.

        Tf-idf on ngrams using a rolling window would certainly work to detect the beheading variants you gave as examples.

        Again: try without the parsing features. There's a good reason they are rarely used in classifiers: they are too unreliable to improve performance over simple approaches.
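
        For a sense of how little code that is, a scikit-learn sketch (the window size, texts and labels are placeholders):

          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.linear_model import LogisticRegression

          # rolling window: split each book into overlapping chunks of ~200 words
          def windows(text, size=200, step=100):
              words = text.split()
              for i in range(0, max(1, len(words) - size + 1), step):
                  yield " ".join(words[i:i + size])

          train_texts = ["...detective story text...", "...romance story text..."]  # placeholders
          train_labels = ["detective", "romance"]

          vec = TfidfVectorizer(ngram_range=(1, 3))
          clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(train_texts), train_labels)

          # score each window of an unseen book; the fraction of confident windows is a
          # crude measure of how much of the book is actually about the topic
          new_book = "...full text of an unseen book..."
          probs = clf.predict_proba(vec.transform(list(windows(new_book))))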

      • chaxor 3 years ago

        I don't see why a simple TF-IDF with ~10 LoC and a few minutes wouldn't get this at least reasonably/crudely done.

  • viksit 3 years ago

    at this point the gpt api will do all of that for you! or you can use the embeddings api and build your own systems.

    • rhdunn 3 years ago

      The problem with GPT and other LLMs is that they don't tokenize text at a word or morpheme level; they use byte-pair-encoded subword pieces (roughly four characters on average), so you get tokens like `!"` instead of two separate tokens. That makes it harder to write custom tools on top of, unlike e.g. the output/model of things like the Universal Dependencies project.
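
      For example, with OpenAI's tiktoken library (a quick sketch; the exact splits depend on the encoding, so treat it as illustrative):

        import tiktoken  # pip install tiktoken

        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode('He said "Stop!"')
        # pieces follow byte-pair frequencies, not word or morpheme boundaries,
        # so punctuation can end up glued together or split mid-word
        print([enc.decode([t]) for t in tokens])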

      • DougBTX 3 years ago

        Do you strictly need that level of tokenisation precision to meet your high-level goals?

        • morkalork 3 years ago

          This is my first reaction as well. Talking about tokenization and POS tagging is getting lost in the weeds when one has goals like this:

          >I also want to be able to assess how much of the text is about a given topic, so that if I'm interested in reading a detective story from e.g. the Project Gutenberg collection, I don't want it to pick up a story where a detective is only mentioned in one paragraph.

          This is more of an NLU problem than an NLP problem, isn't it? It's like tracking how much of a Harry Potter book contains Voldemort content without knowing ahead of time that he may be referred to as He Who Must Not Be Named, You-Know-Who, The Dark Lord and so on. One would have to first identify the thing you're interested in, then learn when the characters/the author invent new ways to refer to it, and carry all of those forward to find new instances. Fun!

          • rhdunn 3 years ago

            I also want to tag and highlight those parts of the document. For that, I need to know where the label starts and ends, which you can't really do when you don't have control over the tokens.

            It's also hard to write custom inference/tagging rules, like in the case you mentioned w.r.t. Voldemort, if you don't know what the tokens look like.

            • chaxor 3 years ago

              Spacy is a decent suggestion here. They have pretty good ways of writing tagging rules.
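
              Something along these lines with the Matcher (a rough sketch; the alias patterns and model name are placeholders):

                import spacy
                from spacy.matcher import Matcher

                nlp = spacy.load("en_core_web_sm")
                matcher = Matcher(nlp.vocab)
                # hypothetical alias list; real rules would be built up per book
                matcher.add("DARK_LORD", [[{"LOWER": "voldemort"}],
                                          [{"LOWER": "the"}, {"LOWER": "dark"}, {"LOWER": "lord"}]])

                doc = nlp("The Dark Lord was seen near the castle.")
                for match_id, start, end in matcher(doc):
                    span = doc[start:end]
                    # start_char/end_char give the offsets you need for highlighting
                    print(span.text, span.start_char, span.end_char)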

              All of this does seem extremely excessive for choosing a book genre, though. I would imagine the number of books left after a simplistic clustering technique would be small enough to flip through, so I really don't understand the use case at all.

              If you have very few books (a few thousand), then you can apply more fine-grained analyses, such as contextualized embedding methods, in a reasonable amount of computation. But if the point is to select a book, there's no real benefit, since the simple two-second term-frequency methods would narrow the choices down to only a few books.

              If you have billions of books, contextualized embeddings become quite expensive to produce and use (several weeks or months of processing, petabytes of storage, etc.), so it's not really feasible for an individual, but the extra querying capability does help narrow the large set down.

      • viksit 3 years ago

        perhaps a spacy pipeline using gpt and huggingface?

    • hawski 3 years ago

      I'm sure GPT will cover more ground much more easily, but aren't LLMs in general much more resource-intensive? It seems they will remain expensive to run for quite some time.

dontupvoteme 3 years ago

Are papers becoming blogs?

  • teruakohatu 3 years ago

    It has 10+ pages of references. More like a well researched lit review than a blog post with a few links.

  • codethief 3 years ago

    I'd say the culture of scientific publishing precedes blogs by "a few" decades. There have always been papers that were blog-like in nature.

  • gumby 3 years ago

    Consider the letters section of Physical Review, which turned into the weekly journal Physical Review Letters back in the 50s. It was a blog or forum back when papers were distributed on paper.

    Basically, because of the slow pace of review and publication, the letters column became a way to talk about recent results or problems, and then follow-up letters (i.e. comments on the blog posts) became common. So the editors decided to hive it off and speed up its publication schedule.

  • sdwr 3 years ago

    Thank **ing god! Hiding behind jargon and "the process" is an indicator of having nothing to say. I see this as a rolling up of the metaphorical sleeves, a sign that stuff is actually happening.

  • SecretDreams 3 years ago

    Always have been.

  • penguin_booze 3 years ago

    Yeah, wait for them to be twitter threads: 1/n - where (n >> 1).

  • vpastore 3 years ago

    Only the one on arxiv. People shouldn’t take anything coming from that site seriously

    • maxov 3 years ago

      What a strange recommendation. I do research in CS theory and machine learning, and I try to find arXiv preprints when I can; they are usually more complete than the conference versions of papers. If you stick to papers from authors you know or from well-known conferences, arXiv is often the best source.

    • chaxor 3 years ago

      This is an extremely wrong viewpoint. Many extremely important papers have been published on arXiv alone, with no other publisher. Just off the top of my head, without looking, ELMo from AllenAI (the paper that started the contextualized-representations revolution for BERT/attention) is only on arXiv.

    • diffeomorphism 3 years ago

      That sounds highly misleading. It is the standard preprint site in many disciplines. So you should take it as seriously as any preprint, and conversely, if a published article is not on arXiv, that would be suspicious.

    • mschuetz 3 years ago

      Many of the best papers appear on arXiv first. In some fields, it is customary to put your preprint on arXiv before/during submission to the peer-reviewed venue.

      arXiv is vital for quickly developing research fields.

    • Frost1x 3 years ago

      Eh, there are plenty of actual preprints you can find on arXiv. The typical flow is that you find a paper of interest published elsewhere, then search arXiv for the same title, authors, etc., and you can even find related work. A lot of publication isn't doing much in terms of genuine peer review beyond formatting and catching minor errors like grammar, typos, and missing references. arXiv is great as a legitimate source for information otherwise tucked behind a paywall. It's not the only option, but it's one of them (checking the primary author's personal site, ResearchGate, or the associated research institution's repository also works, as does contacting the author directly).

    • infamouscow 3 years ago

      So the math, physics, and computer science departments of every major university should be ignored? What a brain dead comment.

      • vpastore 3 years ago

        Serious and trustworthy research is published in peer-reviewed journals, not on websites where everyone can post whatever they want

        • infamouscow 3 years ago

          Serious and trustworthy according to whom, exactly?

          The term peer review was virtually non-existent prior to the 1960s. And despite that, nearly everything in modern society can ultimately be attributed to breakthroughs that happened prior to the advent of peer-review.

          https://books.google.com/ngrams/graph?content=peer+review&ye...

          • vpastore 3 years ago

            Prior to the 1960s there was no internet.

            • infamouscow 3 years ago

          I think you misunderstand how accessible the Internet was, and the history of open-access journals dating back to the late 40s. arXiv started in 1991. None of that explains why the term peer review started surfacing in published books when it did.

          It does correlate perfectly with the period that modern scholars point to as when the institutions were captured.

    • 11101010001100 3 years ago

      I extend the same courtesy to this site.

dcl 3 years ago

A PhD Student's Perspective... ~20 authors.

  • SecretDreams 3 years ago

    It takes a whole village (of uncredited MASc students) to get a PhD!

  • est31 3 years ago

    This particular headline pattern is used a lot, and 99% of the time it is indeed at the front of opinion pieces written by individuals. But this time it's different. In this specific syntactic form, the plural and the singular can come out the same: e.g. in the sentence "I accompanied my friend to his parent's house", it can be either the house of his single mom/dad or the house of his two parents living together.

  • reesul 3 years ago

    The title is a bit off, yeah, but the abstract does effectively say that the paper was written by a cohort of PhD students.

totorovirus 3 years ago

well I see topics like NLP in ethics, healthcare, etc, which I think is a sign they don't have much to do here.

  • whatyesaid 3 years ago

    Yeah, those aren't good, especially stuff like healthcare, where a PhD student can't get the needed data.

    The other issue is that if you do focus on LLMs, the area is so hyped that your research would end up overlapping and competing with too much other work, especially as you've got a dissertation to write. It's a hard problem.

teleforce 3 years ago

According to the article, the original research on language models was kick-started by Claude Shannon's early work on Markov chain models of English words.
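
(For reference, the kind of model Shannon described is tiny to sketch: a first-order Markov chain where the next word depends only on the current one. A toy version with a placeholder corpus:)

  import random
  from collections import defaultdict

  # count observed successors for each word (first-order Markov chain over words)
  corpus = "the vampire bit the detective and the detective fled the castle".split()
  successors = defaultdict(list)
  for w1, w2 in zip(corpus, corpus[1:]):
      successors[w1].append(w2)

  word, generated = "the", ["the"]
  for _ in range(10):
      if word not in successors:   # no observed successor: stop
          break
      word = random.choice(successors[word])
      generated.append(word)
  print(" ".join(generated))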

If you are in the field of Information and Communication Technology (ICT), there is hardly any area in the field whose fundamentals do not have Shannon's hands in them.

Leonard Kleinrock once remarked that he had to focus on the then-exotic field of queueing theory, which later led to packet switching and then the Internet, because most of the fundamental problems in electrical and computer engineering (the older version of ICT) had already been solved by Shannon.

etamponi 3 years ago

Isn't the main problem with NLP research now that you'll need a ton of money to run your experiments? How can an "average" PhD researcher hope to validate their hypothesis if they need several thousand dollars per test?

  • nl 3 years ago

    No.

    There are plenty of research directions outlined in this document that don't require a huge compute budget.

  • jasmer 3 years ago

    Much of the interesting pure research can be done at a smaller scale; the larger models are arguably more product engineering than research. At least from a certain perspective.

  • Randomizer42 3 years ago

    That is not a problem. There are literally thousands of ways to conduct any experiment at a fraction of the cost.

    • LoganDark 3 years ago

      Thanks. Whenever I think I want to run something on a cluster of A100s, I'll just remember the thousands of ways instead, and train my LLMs that way. There's a reason why humans didn't need computers until just recently, after all. I must be so blind to not have seen this earlier.

al__be__rt 3 years ago

this paper doesn't appear to have been edited by any reputable journal, so take its authenticity at face value...

  • lgessler 3 years ago

    I've got some bad news for you about how papers published at NLP conferences are reviewed.
