Weak supervision to isolate sign language communicators in crowded news videos

vrroom.github.io

58 points by matroid a year ago · 52 comments

akira2501 a year ago

> I believe that we can solve continuous sign language translation convincingly

American Sign Language is not English, in fact, it's not even particularly close to English. Much of the language is conveyed with body movements outside of the hands and fingers, particularly with facial expressions and "named placeholders."

> All this is to say, that we need to build a 5000 hour scale dataset for Sign Language Translation and we are good to go. But where can we find this data? Luckily news broadcasters often include special news segments for the hearing-impaired.

You need _way_ more than just 5000 hours of video. People who are deaf or hard of hearing, in my experience, dislike the interpreters in news broadcasts. It's very difficult, as an interpreter, to provide _worthwhile_ translations of what is being spoken _as_ it is being spoken.

It's more of a bad and broken transliteration that, if you struggle to think it through, you can parse out and understand.

The other issue is most interpreters are hearing and so use the language slightly differently from actual deaf persons, and training on this on news topics will make it very weak when it comes to understanding and interpreting anything outside of this context. ASL has "dialects" and "slang."

Hearing people always presume this will be simple. They should really just take an ASL class and work with deaf and hearing-impaired people first.

  • al_borland a year ago

    I know an interpreter who is a CODA. Her first language was sign language, which I think helps a lot. I once asked her if she thought in English or ASL and she said ASL.

    During the pandemic she’d get very frustrated by the ASL she saw on the news. Her mom and deaf friends couldn’t understand the interpreters. It wasn’t long before she was on the news regularly to make sure better information was going out. She kept getting COVID because she refused to wear a mask while working; covering up the face would make it more difficult to convey the message. I had to respect the dedication.

    • rasur a year ago

      During Covid, my son (who is deaf and attends a deaf school) was issued masks with transparent windows at the front, designed to assist with lip-reading for deaf users. This was in Switzerland, though - I don't know if this innovation reached across the Atlantic ;)

      • zie a year ago

        They did, but it wasn't like they were available in March of 2020, when the world was halting and getting news and communication was important.

    • fsckboy a year ago

      > I once asked her if she thought in English or ASL and she said ASL

      It's common to think that we think in languages, but at a fundamental level, we simply don't. Ever have the experience of not being able to remember the word for something? You know exactly what you are thinking, but can't come up with the word. If you thought in your language, this wouldn't happen.

      • lolinder a year ago

        There are layers to thought, and the layer that is most conscious is, at least for me (and, from what I hear, for a lot of other people), in my native language. Further, achieving fluency in a second language is often associated with skipping the intermediate step of translating from English, with thoughts materializing first in the second language.

        You're correct that there does seem to be a layer lower than that—one that can materialize as either a native or a second language—but it's not inaccurate to talk about which language we "think" in because many of us actually do constantly materialize thoughts as language without any intention of speaking them.

        • fsckboy a year ago

          You can talk to yourself in your head using your native language, but that's not evidence that you are thinking in your language; it's thinking of your language. When you do math, are you thinking in your language? When you drive a car or play a video game or a sport, or make love, are you thinking in your language? I'll answer for you: no, you aren't.

          Do people who grow up without language (there have been plenty of examples, deaf people for instance) simply not think? Do cats and dogs and chimpanzees not think?

          • lolinder a year ago

            I don't think we actually disagree on the facts so much as on the buckets we sort them into.

            I very much count my internal English monologue as thinking in my language and would understand that to be what someone is asking about if they asked what language I think in. At the same time, I totally agree that there are layers to thought that never rise into the monologue.

          • juliendorra a year ago

            Thinking is probably not a single unified process, but several simultaneous activities at different levels. Including the unconscious ones. So you might very well both be right, as we may be thinking sometimes simultaneously in language and in other ways (like when we talk and cook, when we write, etc.)

          • drdeca a year ago

            There are kinds of thought which are facilitated by language.

      • 01HNNWZ0MV43FF a year ago

        The other day I forgot the name of a street in my town, right as I was trying to tell someone to turn onto it.

        I said, "I don't know the name of the street - It's the son of (CHARACTER) from (VIDEO GAME)" and then I remembered the name. I was completely right about the game character, apparently I'd remembered that mnemonic but temporarily forgotten the character's actual name.

      • numpad0 a year ago

        It's often said that a picture is worth a thousand words. And yet there's no succinct picture that reliably conveys that concept. Human cognition must simply be cross-modal.

    • refulgentis a year ago

      I've got an insanely dumb question that probably has an obvious answer: why is it so critically important to have ASL? It seems to me you could skip having someone off to the side imperfectly adding layers of emoting, and just watch the anchor and the captions? Note, I'm definitely the one missing something. "When first met with a contradiction, check your premises" is my motto, not lecturing people on why they're wrong.

      • amdsn a year ago

        Think of it more like "why is it important to have Spanish as an option for captions in an area that knows it has a large audience of L1 Spanish speakers?" English (or whatever the dominant local language is) is pretty much a foreign language to many signers. Their native language is ASL or whatever other sign language they know, and these languages aren't just 1:1 mappings of words in the local dominant language to hand signals.

        They have many dimensions of expression for encoding meaning - facial expressions and body motions, the speed of a sign, the number of times a sign is repeated - that carry grammatical meaning which spoken languages express with inflection (noun cases, different verb forms) or additional words. For example, where an English speaker would use words like "very" or "extremely" or choose adjectives with more intense connotations, an ASL signer would repeat or exaggerate the sign they want to emphasize (often by signing it more quickly, but frequently also by intensifying multiple parts of the sign, like the entire motion or the facial expression).

        • thaumasiotes a year ago

          > For example, where an English speaker would use words like "very" or "extremely" or choose adjectives with more intense connotations an ASL signer would repeat or exaggerate the sign they want to emphasize

          小小比小更小 ("small-small is even smaller than small")

          大大比大更大 ("big-big is even bigger than big")

          If you search for "a big big guy", you'll find no end of English examples, either. I would be surprised if any language didn't use repetition for emphasis.

      • umanwizard a year ago

        What captions? There’s no widely-used written form of ASL.

        If you meant the English-language captions, well, it should be apparent why some people prefer content to be dubbed in their native language rather than reading subtitles in a different language that they understand less well.

        • refulgentis a year ago

          Yes I meant $NATIVE_LANGUAGE_OF_VIEWER, and that example didn't help me, I'm afraid; I'm quite a dense one! I appreciated you checking whether I meant written ASL captions. :) The example of dubs left me confused, because dubbing brings in an extra sense that isn't applicable in the ASL case: there, in either case, we're watching someone who isn't the character. I can't square that with the existential fervor the woman in the OP felt.

          • umanwizard a year ago

            > Yes I meant $NATIVE_LANGUAGE_OF_VIEWER

            Right, the native language of most ASL users isn’t English. It’s ASL. And there is no such thing as captions in ASL (because it has no widely-used written form), which is why you have interpretation.

      • jdietrich a year ago

        Most sign language users are functionally illiterate.

        https://academic.oup.com/jdsde/article/17/1/1/359085

  • matroidOP a year ago

    Thanks for the feedback. You raise great points, and this is exactly why we wrote this post: so that we can hear from people where the actual problems lie.

    On a related note, this sort of explains why our model is struggling to fit our current 500-hour dataset (even the training set). Even so, the current state of automatic translation for Indian Sign Language is that, in the wild, even individual words cannot be detected very well. We hope that what we are building might at least improve the state-of-the-art there.

    > It's more of a bad and broken transliteration that, if you struggle to think it through, you can parse out and understand.

    Can you elaborate a bit more on this? Do you think that if we build a system that produces a bad/broken transliteration and funnel its output through ChatGPT, it might give meaningful results? That is, ChatGPT might be able to correct for the errors, as it is a strong language model.

    • lolinder a year ago

      > Do you think if we make a system for bad/broken transliteration and funnel it through ChatGPT, it might give meaningful results?

      No, because ChatGPT has practically no way of knowing from its training data what a real sign language looks like, since there's no real written form of any sign language and ChatGPT learned its languages from writing.

      Sincerely: I think it's awesome that you're taking something like this on, and even better that you're open to learning about it and correcting flawed assumptions. Others have already noted some holes in your understanding of sign, so I'll also just note that I think a solid brush up on the fundamentals of what language models are and aren't is called for—they're not linguistic fairy dust you can sprinkle on a language problem to make it fly. They're statistical machines that can predict likely results based on their training corpus, which corpus is more or less all the text on the internet.
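
      To make the "statistical machine" point concrete, here's a toy sketch (just an illustration, not how ChatGPT is actually built): a bigram model that only counts next-word frequencies in a tiny corpus. Real LLMs use transformers rather than counts, but they share the limitation shown here - no training text, no prediction.

          # Toy illustration only: a bigram "language model" built by counting
          # next-word frequencies in a corpus. Real LLMs are transformers, but
          # the dependence on training text is the same.
          from collections import Counter, defaultdict

          corpus = "the cat sat on the mat the cat ran".split()

          counts = defaultdict(Counter)
          for prev, nxt in zip(corpus, corpus[1:]):
              counts[prev][nxt] += 1

          def next_word_distribution(word):
              c = counts[word]
              total = sum(c.values())
              return {w: n / total for w, n in c.items()}

          print(next_word_distribution("cat"))   # {'sat': 0.5, 'ran': 0.5}
          print(next_word_distribution("SIGN"))  # {} - never seen in the corpus

      A language with no widely used written form sits in the "never seen in the corpus" bucket, which is why piping broken glosses through ChatGPT can't recover grammar it never learned.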

      I'm afraid I'm not in a good position to recommend beginner resources (I learned this stuff in university back before it really took off), but I've heard good things about Andrej Karpathy's YouTube channel.

    • wizzwizz4 a year ago

      I think you think it's a magic box. There's not actually any such thing as a "strong language model", not in the way you're using the concept.

      > We hope that what we are building might at least improve the state-of-the-art there.

      Do you have any theoretical arguments for how and why it would improve it? If not, my concern is that you're just sucking the air out of the room. (Research into "throw a large language model at the problem" doesn't tend to produce any insight that could be used by other approaches, and doesn't tend to work, but it does funnel a lot of grant funding into cloud providers' pockets.)

      • canjobear a year ago

        “Throw an LM at it” is the only approach that has ever produced human-level machine translation.

        For theory on how a strong target-language-side LM can improve translation, even in the extreme scenario where no parallel “texts” are available, see https://proceedings.neurips.cc/paper_files/paper/2023/file/7...

        • wizzwizz4 a year ago

          You're mixing up cause and effect. The transformer architecture was invented for machine translation – and it's pretty good at it! (Very far from human-level, but still mostly comprehensible, and a significant improvement over the state-of-the-art at time of first publication.) But we shouldn't treat this as anything more than "special-purpose ML architecture achieves decent results".

          The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot. It's truly awful at translation, when "prompted" to do so. (Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.)

          I haven't read that paper yet, but it looks interesting. From the abstract, it looks like one of those perfectly-valid papers that laypeople think is making a stronger claim than it is. This paragraph supports that:

          > Note that these models are not intended to accurately capture natural language. Rather, they illustrate how our theory can be used to study the effect of language similarity and complexity on data requirements for UMT.

          • canjobear a year ago

            It’s true that the Transformer architecture was developed for seq2seq MT, but you can get similar performance with Mamba or RWKV or other new non-Transformer architectures. It seems that what is important is having a strong general sequence-learning architecture plus tons of data.

            > The GPT architecture, using transformers to do iterated predictive text, is a modern version of the Markov bot.

            The Markov nature only matters if the text falls outside the context window.

            > Perhaps surprisingly so, until you step back, look at the training data, and look at the information flow: the conditional probability of the next token isn't mostly coming from the source text.

            I’m not sure what you’re getting at here. If it’s that you can predict the next token in many cases without looking at the source language, then that’s also true for traditional encoder-decoder architectures, so it’s not a problem unique to prompting. Or are you getting at problems arising from teacher-forcing?

            Basically the question was how an LM could possibly help translation, and the answer is that it gives you a strong prior for the decoder. That’s also the basic idea in the theoretical UMT paper: you are trying to find a function from source to target language that produces a sensible distribution as defined by an LM.
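
            As a rough sketch of that idea (my own toy example, not the setup from the paper): rerank candidate translations by combining a faithfulness score from a translation model with a fluency score from a target-language LM, noisy-channel style. The two scorers below are hypothetical stand-ins.

                # Minimal sketch of "target-side LM as prior" (noisy-channel style
                # reranking). translation_score and lm_score are hypothetical
                # stand-ins for a real translation model and a real target LM.
                import math

                def rerank(source, candidates, translation_score, lm_score, lam=1.0):
                    def total(t):
                        # faithfulness to the source + fluency under the LM prior
                        return translation_score(source, t) + lam * lm_score(t)
                    return max(candidates, key=total)

                # Toy scorers: the LM prior breaks the tie between two candidates
                # that the translation model finds equally faithful.
                toy_lm = {"the cat is small": -2.0, "the cat small is": -9.0}
                best = rerank(
                    source="CAT SMALL",  # pretend gloss sequence
                    candidates=list(toy_lm),
                    translation_score=lambda s, t: -1.0,
                    lm_score=lambda t: toy_lm.get(t, -math.inf),
                )
                print(best)  # "the cat is small"

            The lam weight is just how much trust you put in the LM prior relative to faithfulness to the source.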

  • umanwizard a year ago

    > American Sign Language is not English

    I'm not sure I understand your point. Chinese is also not English but machine translation of Chinese to English can be done.

    You're right that laypeople often assume, wrongly, that a given country's sign language is an encoding of the local spoken language. In reality it's usually a totally different language in its own right. But that shouldn't mean that translation is fundamentally impossible.

    • lolinder a year ago

      They didn't say it's fundamentally impossible, they said you need way more than 5000 hours to do it and that you'd need better quality sources than news interpreters.

  • WesternWind a year ago

    Just to note: this is about ISL (Indian Sign Language), not ASL (American Sign Language).

  • bluGill a year ago

    Lifeprint.org has plenty of free ASL courses taught by a deaf person. Highly recommended for everyone, but as with any language it takes a lot of study to become useful.

  • kobalsky a year ago

    > It's more of a bad and broken transliteration that, if you struggle to think it through, you can parse out and understand.

    It seems to be more common to see sign language interpreters now. Is it just virtue signaling to have them instead of just closed captions?

    • jallmann a year ago

      Many deaf people do prefer sign language as an accessibility option over reading captions, even if the interpreting can be hit-or-miss.

    • matroidOP a year ago

      Also, in India, many hearing-impaired people know only ISL.

      • lolinder a year ago

        This is also true in the US. People pick up enough written English to get by most of the time, but it's often quite broken and clearly a second language. I know hearing impaired people native to the US with substantially worse English than the average European.

      • egberts1 a year ago

        Just so you know, "hearing-impaired" implies that a person has a flaw, whether they were born with it (in which case it is natural to them) or it developed later in life (hearing-challenged).

        The most non-offensive way to refer to a group of people without perfect hearing is "hard-of-hearing or deaf".

    • bluGill a year ago

      Sign languages are nothing like spoken languages, so reading a spoken language is hard.

  • voidingw a year ago

    The blog post references translating between English and Indian Sign Language (ISL). I interpreted that to mean translating between spoken English and ISL, not ASL and ISL.

    Regardless, I’m curious how (dis)similar ISL is to ASL.

    • matroidOP a year ago

      That is correct. We want to translate between English and ISL. English, because it is by and large the language of the Web, and I think we should try to connect ISL to it rather than to Indian languages.

      From my understanding, they are quite dissimilar. A person who knows ISL will not understand ASL, for example.

      • thaumasiotes a year ago

        > A person who knows ISL will not understand ASL, for example.

        That isn't the best example; it would still be true of language pairs that were very similar to each other.

egberts1 a year ago

Using news broadcasts as training material to populate an LLM sets a poor precedent.

Repetition of a sign usually indicates additional emphasis.

All the dialects need to be covered, with their signs multiply mapped to the corresponding words.

Furthermore, YouTube has an excellent collection of really bad or fake ASL interpreters from many news broadcasts - so bad, really, really bad, worse than the Al Gore hanging-chad news coverage or the "hard-of-hearing" inset box on the Saturday Night Live news broadcast.

You still need an RID-certified or CDI-certified ASL interpreter to vet the source.

https://m.youtube.com/watch?v=GwSh0dAaqIA

https://rid.org/certification/available-certifications/

zie a year ago

First: I sign ASL, not the ISL that the OP is talking about.

In the ASL world, most news interpretations into ASL are delayed or sped up relative to the person talking and/or the captions, if those happen to also be available.

You are going to have sync problems.

Secondly, it's not just moving the hands: body movements, facial expressions, etc. all count in ASL, and I'm betting they count in ISL as well.

Thirdly, the quality of interpretation can be really bad. Horrendous. It's not so common these days, but it used to be fairly common that speakers would hire an interpreter and mistakenly hire someone willing to just move their arms randomly. I had it happen once at a doctor's office. The "interpreter" was just lost in space. The doctor and I started writing things down, and the interpreter seemed at least a little embarrassed.

Sometimes they hire sign language students; you can imagine hiring a first-year French student to interpret for you - it's no different, really. Sometimes they mean well, sometimes they are just there for the paycheck.

I bet it's a lot worse with ISL, because it's still very new: most students are not taught in ISL, and there are only about 300 registered interpreters for millions of deaf people in India. https://islrtc.nic.in/history-0

We are still very much struggling with voice-to-English transcription using AI, despite loads of work from lots of companies and researchers. The systems are getting better, and in ideal scenarios are actually quite useful. Unfortunately, the world is far from ideal.

The other day I was on a meeting where two people were using the same phone. The AI transcription got highly confused, and it went very, very wrong.

I'm not trying to discourage you, and it's great to see people trying. I wish you lots of success; just know it's not an easy thing, and I imagine many lifetimes of work will be needed to produce signed-language-to-written-language services that are on par with the best of the voice-to-text systems we have today.

  • matroidOP a year ago

    Thanks Zie for the message. I'm sorry to hear about your "interpreter" encounter :(

    I do think these problems are much, much worse for ISL as you rightly noted.

    I think I should have been more careful when I said "solve" in my post, but that really came from a place of optimism/excitement.

    • zie a year ago

      I'm glad you are excited about it. I don't want to discourage you; keep working at it - who knows what might come of it!

      Just know, it won't be a weekend hack to solve the problem(s) though.

hi-v-rocknroll a year ago

I'm wondering how long it will take for LLMs to be able to generate any one of the many complete sign languages on the fly and put the various sign language translators out of a job. The crux seems to be that sign language differs significantly from spoken language and includes facial movements and nonverbal emotional tonality.

  • KeepFlying a year ago

    The fact that hearing impaired people prefer ASL interpreters to closed captioning tells me automated translation will never be enough.

    It's the same reason we prefer interpreters to Google Translate when the message is important.

    Interpretation adds the nuance of a human that all the automatic tools miss.

    I'm sure it could make a small dent in the market for interpreters but only a small one.

agarsev a year ago

Sign language researcher here! I would recommend you look a bit at the scientific literature on the topic. I know it can be a bit overwhelming and hard to separate the actual info from the garbage, so I can try to select a few hand-picked papers for you. IMO, trying to understand sign language oneself, or at least getting some basic notions, is fundamental to understanding where the real problems lie.

Unfortunately, there's no getting away from that. While the scarcity of data is indeed an issue, and your idea is nice (congratulations!), the actual problem is the scarcity of useful data. Since sign language doesn't correspond to the oral language, there are many problems with alignment and with just deciding what to translate to. Glosses (oral-language words used as representations for signs) are not enough at all, since they don't capture the morphology and grammar of the language, which among other things relies heavily on space and movement. Video plus audio or audio captions is nearly useless.
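
To make the gloss problem concrete, here is a rough, made-up illustration (the field names and the utterance are mine, not any standard annotation scheme): the same short utterance as a flat gloss line versus the kind of information a useful corpus would actually need to keep.

    # Illustrative only: what a flat gloss line throws away. Field names and
    # values are invented for this example, not a standard annotation scheme.
    flat_gloss = ["INDEX-3a", "BOOK", "GIVE"]

    rich_annotation = [
        {"gloss": "INDEX-3a",
         "hands": "point to locus 3a",               # sets up a referent in signing space
         "non_manual": None},
        {"gloss": "BOOK",
         "hands": "two-handed sign",
         "non_manual": "raised eyebrows"},           # topic marking
        {"gloss": "GIVE",
         "movement": "from locus 3a toward signer",  # encodes who gives to whom
         "non_manual": "head nod",
         "repetition": "single"},                    # repeating would change the aspect
    ]

The flat line is what most text-like resources give you; the spatial and non-manual information is where much of the grammar lives, and it mostly can't be reconstructed from the glosses after the fact.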

Good luck with your efforts; this is a fascinating area where we get to combine the best of CS, AI, linguistics... but it's hard! As I said, let me know if you want some literature (by PM/email if you prefer) and I'll get back to you.

jallmann a year ago

Sign languages have such enormous variability that I have always thought having fluent sign language recognition / translation probably means we have solved AGI.

Detecting the presence of sign language in a video is an interesting subset of the problem and is important for building out more diverse corpora. I would also try to find more conversational sources of data, since news broadcasts can be clinical, as others have mentioned. Good luck.
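
As a very rough sketch of that detection/filtering step (a crude heuristic of my own, not the weak-supervision method from the post): flag clips with sustained hand motion using off-the-shelf hand tracking and keep them as candidate signing segments for human review. The thresholds here are invented.

    # Crude corpus-filtering heuristic: estimate how much of a clip contains
    # sustained hand motion, using MediaPipe Hands. Not the method from the
    # post; thresholds are invented and would need tuning.
    import cv2
    import mediapipe as mp
    import numpy as np

    def signing_activity_ratio(video_path, motion_threshold=0.01):
        hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
        cap = cv2.VideoCapture(video_path)
        prev, active, total = None, 0, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            total += 1
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_hand_landmarks:
                pts = np.array([[lm.x, lm.y]
                                for hand in result.multi_hand_landmarks
                                for lm in hand.landmark])
                # Mean landmark displacement between frames ~ amount of hand motion.
                if prev is not None and prev.shape == pts.shape:
                    if np.mean(np.linalg.norm(pts - prev, axis=1)) > motion_threshold:
                        active += 1
                prev = pts
            else:
                prev = None
        cap.release()
        hands.close()
        return active / max(total, 1)

    # e.g. keep clips where signing_activity_ratio(path) > 0.5 as candidates

This obviously can't tell signing from ordinary gesturing, so it would only be a first-pass filter before human annotation.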
