A Breakthrough in Temporal Expression Understanding
x.aimarcos here (one of the authors). i know the word "breakthrough" in the title is a "little" ambitious, but i really think we've done something interesting ... we'd like to publish, so this is a way to collect questions/comments! shoot away.
Do you filter out garbage lines in emails before setting up for training, or do you include the entire email?
memory blows up with the length of the encoder sequence. for that reason we truncate each email at ~300 tokens, which in the vast majority of cases is enough to capture the relevant info. other than that we don't get rid of any "garbage" lines; instead, we let the NN (e.g. the attention layers) figure out which lines are irrelevant.
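a minimal sketch of that truncation step, assuming a plain whitespace tokenizer for illustration (the actual subword tokenizer isn't shown here, and the 300-token cap is the rough figure from above):

```python
MAX_ENCODER_TOKENS = 300  # approximate cap; encoder memory grows with sequence length

def truncate_email(body: str, max_tokens: int = MAX_ENCODER_TOKENS) -> str:
    """Keep only the first max_tokens tokens of an email body.

    No "garbage" lines are filtered out here: the model's attention
    layers are left to learn which lines are irrelevant.
    """
    tokens = body.split()  # illustration only; the real model uses its own tokenizer
    return " ".join(tokens[:max_tokens])
```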
Thanks. How do you deal with threads where older emails, with their headers, appear in the body text? Do you just process the latest/last email?

Also: any comments about the output language and how you taught labelers this language?
we use mailgun's email thread parser, which is not perfect but good enough, and keep only the latest email (we ignore text from older emails in the thread).
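for concreteness, here's a minimal sketch assuming the parser meant above is Mailgun's open-source Talon library (https://github.com/mailgun/talon):

```python
# Assumes Mailgun's Talon library: pip install talon
from talon import quotations

def latest_message(raw_body: str) -> str:
    """Strip quoted older emails (and their inline headers) from a thread,
    keeping only the most recent message."""
    return quotations.extract_from_plain(raw_body)
```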
figuring out the output language is where a lot of the domain knowledge comes into play, although in the end the output language is just whatever you need it to be for your task. it may have a very small vocab depending on what you are trying to "translate" to. for us it is a sequence of words and numbers that express the various types of calendric intervals corresponding to the input emails (see the hypothetical sketch below).

teaching human labelers this sort of task is challenging if you want high-quality labels. we had a team of in-house labelers labeling data (the human-in-the-loop), and we interacted with them closely, monitored the quality of their data, and gave them feedback through various channels. we also continually improved the guidelines for them. it was a pretty intense iterative process, because we wanted their labels to be as accurate as the inherent ambiguity of human language allows, so a lot of work went into that.
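to make the "output language" idea concrete, here is a purely hypothetical input/target pair; this is not our actual vocabulary, just an illustration of translating an email into a small fixed token set for calendric intervals:

```python
# Purely hypothetical illustration -- NOT the paper's actual output language.
email = "can we move our call to next tuesday, 2-3pm?"

# One plausible target "translation": a day reference plus start/end times,
# drawn from a tiny vocabulary of interval words and numbers.
target = ["NEXT", "TUESDAY", "FROM", "14", "00", "TO", "15", "00"]
```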
Really appreciate the responses. I'm working on a much much simpler use case for a client that has only thousands, not millions of data points. Perhaps your paper could simplify/shrink your data and/or output to illustrate the key points. Cheers.