Can you save on LLM tokens using images instead of text?

pagewatch.ai

48 points by lpellis 2 months ago · 20 comments

bikeshaving a month ago

Does this mean we’ll finally get empirical proof for the aphorism “a picture is worth a thousand words”?

https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_...

  • heltale a month ago

    I suppose it’s only worth 256 words at a time right now. ;)

    https://arxiv.org/abs/2010.11929

    • estebarb a month ago

      The CALM paper https://shaochenze.github.io/blog/2025/CALM/ says it is possible to compress 4 tokens in a single embedding, so... image = 4×256=1024 words > 1000 words. QED

      • bikeshaving a month ago

        2.4% relative error is not bad.
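
        The arithmetic in this sub-thread can be checked with a short sketch. It assumes the figures quoted above (256 patches per image from the "16x16 words" framing, and 4 tokens per embedding as cited for CALM); the constant names are illustrative only.

```python
# Back-of-the-envelope check, using only figures quoted in the thread.
PATCHES_PER_IMAGE = 16 * 16   # "16x16 words" per image (assumed from the linked paper's title)
TOKENS_PER_EMBEDDING = 4      # compression factor quoted for CALM in the thread

words_per_image = PATCHES_PER_IMAGE * TOKENS_PER_EMBEDDING
relative_error = (words_per_image - 1000) / 1000  # versus "a thousand words"

print(words_per_image)          # 1024
print(f"{relative_error:.1%}")  # 2.4%
```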

        • pastor_williams a month ago

          Reminds me of Babbage making allowance for meter.

          """

              ... it is said that he [Babbage] sent the following letter to Alfred, Lord Tennyson about a couplet in "The Vision of Sin":
          
                   Every minute dies a man,
                   Every minute one is born
          
              I need hardly point out to you that this calculation would tend to keep the sum total of the world's population in a state of perpetual equipoise, whereas it is a well-known fact that the said sum total is constantly on the increase. I would therefore take the liberty of suggesting that in the next edition of your excellent poem the erroneous calculation to which I refer should be corrected as follows:
          
                   Every minute dies a man,
                   And one and a sixteenth is born
          
              I may add that the exact figures are 1.167, but something must, of course, be conceded to the laws of metre.
          
          """

              Charles Babbage and his Calculating Engines

          • cbhl a month ago

            Shouldn't it be the other way around if the population is increasing? Every minute one is born = 1440 born/day, every minute and a sixteenth one dies ~= 1355 dead/day for a net population increase of about 85/day.

          • zahlman a month ago

            Wouldn't "one and a sixth" be more accurate in both respects?
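
            The two readings of the couplet, and the "one and a sixth" suggestion, can be compared numerically. This is a rough sketch that takes the quoted 1.167 births per death at face value; the variable names are made up for illustration.

```python
MINUTES_PER_DAY = 24 * 60  # 1440

# Reading A (Babbage's couplet): one death per minute, 1 + 1/16 births per minute.
births_a = MINUTES_PER_DAY * (1 + 1 / 16)
deaths_a = MINUTES_PER_DAY
net_a = births_a - deaths_a

# Reading B: one birth per minute, one death every 1 + 1/16 minutes.
births_b = MINUTES_PER_DAY
deaths_b = MINUTES_PER_DAY / (1 + 1 / 16)
net_b = births_b - deaths_b

# How far each fraction is from the "exact" 1.167 quoted in the letter.
QUOTED_RATIO = 1.167
error_sixteenth = abs((1 + 1 / 16) - QUOTED_RATIO)
error_sixth = abs((1 + 1 / 6) - QUOTED_RATIO)

print(f"Reading A: {births_a:.0f} born, {deaths_a} dead, net {net_a:+.0f}/day")
print(f"Reading B: {births_b} born, {deaths_b:.0f} dead, net {net_b:+.0f}/day")
print(f"error of 1 1/16: {error_sixteenth:.4f}, error of 1 1/6: {error_sixth:.4f}")
```

            Both readings give a growing population; "one and a sixth" is much closer to the quoted 1.167 than "one and a sixteenth".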

      • behnamoh a month ago

        how do you decompress all those 4 words from one token?

        • estebarb a month ago

          Not from one token, from one embedding. Text contains a low amount of information: it is possible to compress a few token embeddings into a single token embedding.

          The how is variable. The CALM paper seems to have used an MLP to compress an N×D input (N embeddings of size D) into a single embedding of size D, and another MLP to decompress it back.
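
          The shapes involved can be sketched as follows. This is a toy illustration with random, untrained weights, not the CALM architecture: the layer sizes, the ReLU, and every name here are assumptions, and without training the decoder output will not match the input.

```python
import random

def linear(in_dim, out_dim, rng):
    """Build a random (untrained) weight matrix and a zero bias."""
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def apply_linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mlp(layers, x):
    """Apply each layer in turn, with ReLU between layers but not after the last."""
    for layer in layers[:-1]:
        x = [max(0.0, v) for v in apply_linear(layer, x)]
    return apply_linear(layers[-1], x)

N, D, HIDDEN = 4, 8, 16  # hypothetical sizes: 4 token embeddings of width 8
rng = random.Random(0)

encoder = [linear(N * D, HIDDEN, rng), linear(HIDDEN, D, rng)]  # N*D -> D
decoder = [linear(D, HIDDEN, rng), linear(HIDDEN, N * D, rng)]  # D -> N*D

token_embeddings = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(N)]
flat = [v for emb in token_embeddings for v in emb]   # concatenate to length N*D

compressed = mlp(encoder, flat)                       # single embedding of size D
restored_flat = mlp(decoder, compressed)              # back to length N*D
restored = [restored_flat[i * D:(i + 1) * D] for i in range(N)]

print(len(flat), len(compressed), len(restored), len(restored[0]))  # 32 8 4 8
```

          In a real system the encoder and decoder would be trained jointly so that the restored embeddings (or the tokens predicted from them) match the originals.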

        • HarHarVeryFunny a month ago

          The mechanism would be prediction (learnt during training), not decompression.

          It's the same as LLMs being able to "decode" Base64, or work with sub-word tokens for that matter, it just learns to predict that:

          <compressed representation> will be followed by (or preceded by) <decompressed representation>, or vice versa.

floodfx a month ago

Why are there more completion tokens with image prompts when the text output was about the same?

  • cma a month ago

    Some multimodal models may have a hidden captioning step that may take completion tokens, others work on a fully native representation, and some do both I think.

  • Garlef a month ago

    "Thinking" Mode

ashed96 a month ago

In my experience, LLMs tend to take noticeably longer to process images than text.

  • weird-eye-issue a month ago

    It has to get the image data first, basically just IO time before processing it

    • ashed96 a month ago

      IIRC there's pre-processing (embedding/tokenization?) before feeding images to LLMs?

      Hit this issue optimizing LLM request times. Ended up lowering image resolution. Lost some accuracy but could bear that.
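
      A rough sketch of why lower resolution helps, assuming a ViT-style encoder that emits one token per fixed-size patch. The 16-pixel patch size and the function name are assumptions; real providers use their own tiling and pricing rules, so treat the numbers as illustrative only.

```python
import math

def estimate_patch_tokens(width, height, patch_size=16):
    """Estimate image tokens as the number of patch_size x patch_size patches.

    Hypothetical helper: assumes one token per patch, with partial patches
    at the edges rounded up.
    """
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)

full = estimate_patch_tokens(1024, 1024)
half = estimate_patch_tokens(512, 512)

print(full, half, full // half)  # 4096 1024 4
```

      Under this assumption, halving each side cuts the token count to a quarter, which is in line with the speedup from lowering resolution.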

  • psadri a month ago

    I wonder if these stay in the prefix cache?
