A Breakthrough in Temporal Expression Understanding
x.aimarcos here (one of the authors). i know the word "breakthrough" in the title is a "little" ambitious, but i really think we've done something interesting ... we'd like to publish, so this is a way to collect questions/comments! shoot away.
Do you filter out garbage lines in emails before setting up for training, or do you include the entire email?
memory blows up with the length of the encoder sequence. for that reason we truncate each email at ~300 tokens, which in the vast majority of cases is enough to capture the relevant info. other than that we don't get rid of any "garbage" lines; instead, we let the NN (e.g. the attention layers) figure out which lines are irrelevant.
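a minimal sketch of that truncation step, assuming a plain whitespace tokenizer for illustration (the actual subword tokenizer isn't shown here, and the 300-token cap is the rough figure from above):

```python
MAX_ENCODER_TOKENS = 300  # approximate cap; encoder memory grows with sequence length

def truncate_email(body: str, max_tokens: int = MAX_ENCODER_TOKENS) -> str:
    """Keep only the first max_tokens tokens of an email body.

    No "garbage" lines are filtered out here: the model's attention
    layers are left to learn which lines are irrelevant.
    """
    tokens = body.split()  # illustration only; the real model uses its own tokenizer
    return " ".join(tokens[:max_tokens])
```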
Thanks. How do you deal with threads where older emails, with their headers, appear in the body text? Do you just process the latest/last email?

Also: any comments about the output language and how you taught labelers this language?
we use mailgun's email thread parser, which is not perfect but good enough, and keep only the latest email (we ignore text from older emails in the thread).
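for concreteness, here's a minimal sketch assuming the parser meant above is Mailgun's open-source Talon library (https://github.com/mailgun/talon):

```python
# Assumes Mailgun's Talon library: pip install talon
from talon import quotations

def latest_message(raw_body: str) -> str:
    """Strip quoted older emails (and their inline headers) from a thread,
    keeping only the most recent message."""
    return quotations.extract_from_plain(raw_body)
```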
figuring out the output language is where a lot of the domain knowledge comes into play, although in the end the output language is just whatever you need it to be for your task. it may have a very small vocab depending on what you are trying to "translate" to. for us it is a sequence of words and numbers that express the various types of calendric intervals corresponding to the input emails (see the hypothetical sketch below).

teaching human labelers this sort of task is challenging if you want high-quality labels. we had a team of in-house labelers labeling data (the human-in-the-loop), and we interacted with them closely, monitored the quality of their data, and gave them feedback through various channels. we also continually improved the guidelines for them. it was a pretty intense iterative process, because we wanted their labels to be as accurate as the inherent ambiguity of human language allows, so a lot of work went into that.
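to make the "output language" idea concrete, here is a purely hypothetical input/target pair; this is not our actual vocabulary, just an illustration of translating an email into a small fixed token set for calendric intervals:

```python
# Purely hypothetical illustration -- NOT the paper's actual output language.
email = "can we move our call to next tuesday, 2-3pm?"

# One plausible target "translation": a day reference plus start/end times,
# drawn from a tiny vocabulary of interval words and numbers.
target = ["NEXT", "TUESDAY", "FROM", "14", "00", "TO", "15", "00"]
```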
Really appreciate the responses. I'm working on a much much simpler use case for a client that has only thousands, not millions of data points. Perhaps your paper could simplify/shrink your data and/or output to illustrate the key points. Cheers.