Ask HN: Training LLMs directly on file bytes

2 points by stealthcat 2 years ago · 0 comments · 1 min read

Multi-modal LLMs like PaLM, GPT-4, and MiniGPT-v2 rely on data encoders (image and speech models) to map data into the token embedding space.

Has there been any attempt to train directly on file bytes? That is, make the LLM's only vocabulary base-2, base-8, or hexadecimal symbols, then do next-token prediction over that.
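To make the vocabulary idea concrete, here is a minimal sketch (my own illustration, not from any of the papers mentioned) of how raw file bytes could be turned into token ids. Treating each byte as a token gives a fixed vocabulary of 256; splitting each byte into two hex nibbles shrinks the vocabulary to 16 at the cost of doubling sequence length:

```python
def bytes_to_tokens(data: bytes) -> list[int]:
    """Vocabulary of 256: each raw byte is its own token id."""
    return list(data)

def bytes_to_hex_tokens(data: bytes) -> list[int]:
    """Vocabulary of 16: each byte becomes two base-16 tokens (high nibble, low nibble)."""
    tokens = []
    for b in data:
        tokens.append(b >> 4)    # high nibble, 0-15
        tokens.append(b & 0x0F)  # low nibble, 0-15
    return tokens

# Example: the first four bytes of a PNG file header.
data = b"\x89PNG"
print(bytes_to_tokens(data))      # [137, 80, 78, 71]
print(bytes_to_hex_tokens(data))  # [8, 9, 5, 0, 4, 14, 4, 7]
```

Either sequence could then be fed to a standard decoder-only model for next-token prediction; the trade-off is that smaller vocabularies mean much longer sequences for the same file.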

I know there have been attempts like MEGABYTE and Charformer, but as far as I can tell they don't learn directly from raw file bytes, header info and all.

No comments yet.
