Ask HN: Training LLMs directly on file bytes

2 points by stealthcat 2 years ago · 0 comments · 1 min read

Multi-modal LLMs like PaLM, GPT-4, and MiniGPT-v2 rely on data encoders (image and speech models) to map data into the token embedding space.

Has there been any attempt to train directly on file bytes? That is, make the LLM's only vocabulary base-2, base-8, or hexadecimal symbols, then do next-token prediction over that.
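To make the vocabulary idea concrete, here is a minimal sketch (my own illustration, not from any of the papers mentioned) of how raw file bytes could be turned into token ids. Treating each byte as a token gives a fixed vocabulary of 256; splitting each byte into two hex nibbles shrinks the vocabulary to 16 at the cost of doubling sequence length:

```python
def bytes_to_tokens(data: bytes) -> list[int]:
    """Vocabulary of 256: each raw byte is its own token id."""
    return list(data)

def bytes_to_hex_tokens(data: bytes) -> list[int]:
    """Vocabulary of 16: each byte becomes two base-16 tokens (high nibble, low nibble)."""
    tokens = []
    for b in data:
        tokens.append(b >> 4)    # high nibble, 0-15
        tokens.append(b & 0x0F)  # low nibble, 0-15
    return tokens

# Example: the first four bytes of a PNG file header.
data = b"\x89PNG"
print(bytes_to_tokens(data))      # [137, 80, 78, 71]
print(bytes_to_hex_tokens(data))  # [8, 9, 5, 0, 4, 14, 4, 7]
```

Either sequence could then be fed to a standard decoder-only model for next-token prediction; the trade-off is that smaller vocabularies mean much longer sequences for the same file.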

I know there have been attempts like MEGABYTE and Charformer, but as far as I can tell they don't learn directly from raw file bytes, header info and all.

No comments yet.
