The Language of Intelligence: Could Mandarin Be the Secret to Smarter AI?


As large language models (LLMs) continue to reshape how we interact with technology, a quiet but profound question is starting to emerge among researchers: Does the language we use to train LLMs matter more than we think?

Michela Tjan Sakti Effendie


Photo by Cherry Lin on Unsplash

There has been chatter on Reddit that some Chinese labs are building faster, more efficient LLMs using significantly less training data, and that the reason may come down to the language itself. Not the algorithm, not the GPU count, but Mandarin.

It sounds almost mythical, but there may be linguistic and symbolic truth behind the claim.

Mandarin as a Training Language: More Signal, Less Noise?

Mandarin Chinese is written in a logographic script with a high density of meaning per symbol. A compact two-character word like “细胞” (xìbāo, “biological cell”) conveys a precise concept, avoiding the ambiguity of an English word like “cell,” which can mean anything from a prison cell to a battery to a unit of biology.

While the average Mandarin speaker uses about 2,000–3,000 characters, the full language spans 50,000+ symbols, many of which encode specialized knowledge in science, law, or literature. This rich symbolic structure could provide a more compact and unambiguous training dataset, enabling models to learn associations more quickly and precisely.

And when the goal is symbolic compression, getting maximum meaning with minimum tokens, Mandarin might just be the ideal internal representation language for LLMs.
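As a rough back-of-the-envelope illustration of this density claim, you can compare how many characters parallel English and Mandarin sentences need. The sentence pairs below are my own hypothetical examples, not a corpus or benchmark:

```python
# Rough illustration of meaning-per-symbol density.
# The parallel sentences below are hypothetical examples, not a corpus.
pairs = [
    ("The cell divides into two daughter cells.",
     "细胞分裂成两个子细胞。"),
    ("The contract takes effect immediately.",
     "合同立即生效。"),
    ("Artificial intelligence is reshaping society.",
     "人工智能正在重塑社会。"),
]

for english, mandarin in pairs:
    ratio = len(english) / len(mandarin)
    print(f"{len(english):2d} vs {len(mandarin):2d} chars "
          f"(~{ratio:.1f}x more symbols in English)")
```

A caveat: raw character counts overstate the gap for LLMs, since modern tokenizers use subwords for English and often multiple bytes or tokens per Chinese character; actual token counts depend entirely on the tokenizer used.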

SynthLang: A Prompt Language Inspired by Japanese Kanji

Inspired by this idea, a new class of symbolic languages for AI is emerging. One such concept is SynthLang, an artificial prompt language built on the structural philosophy of logographic systems such as Chinese Hanzi and Japanese Kanji.

SynthLang is a machine-optimized symbolic language designed to express prompts and internal reasoning paths with maximal information density and minimal ambiguity. Drawing inspiration from Kanji and scientific notation, it encodes semantic structures, tasks, and reasoning chains into compressed symbols. These symbols are not meant to be human-readable but are fully interpretable by LLMs. The goal is to act as an efficient “inner monologue” for AI, minimizing token usage while preserving logical structure, much like Mandarin does in natural language.
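SynthLang's actual syntax isn't shown here, so purely as a toy sketch of the underlying idea of mapping verbose task phrases onto dense symbolic codes (the symbol table and function name below are my own invention, not real SynthLang), a lookup-based compressor might look like:

```python
# Toy symbolic compressor -- NOT real SynthLang, just an illustration
# of mapping verbose task phrases onto dense single-symbol codes.
TASK_SYMBOLS = {
    "summarize the following text": "Σ",
    "translate the following text": "⇄",
    "list the reasoning steps": "∴",
}

def symbolize(prompt: str) -> str:
    """Replace known verbose task phrases with their symbolic codes."""
    out = prompt
    for phrase, symbol in TASK_SYMBOLS.items():
        out = out.replace(phrase, symbol)
    return out

print(symbolize("summarize the following text: LLMs are..."))
# → "Σ: LLMs are..."
```

A real symbolic language would of course need a grammar, not just a phrase table, but the payoff is the same: a multi-word instruction collapses to a single token-cheap symbol.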

LLMLingua: Microsoft’s Proof That Less Can Be More

Enter LLMLingua, a prompt-compression technique developed by Microsoft Research and presented at EMNLP 2023. The team demonstrated that LLMs don’t need verbose prompts to perform well. Using a smaller language model (e.g., GPT-2-small or LLaMA-7B) to score and prune low-information tokens, they compressed long prompts by up to 20x while retaining performance on reasoning, summarization, and dialogue tasks.

Here’s the kicker: the compressed prompts, while unreadable to humans, were still perfectly interpretable by the LLM. And when expanded back by GPT-4, they retained nearly all of the original reasoning steps, especially in complex chain-of-thought prompts.

This mirrors the speculative advantage of Mandarin: compression without loss. A more symbolic or semantically packed language might be doing for the model what LLMLingua does at the algorithmic level — streamlining understanding.

Could the “Ultimate LLM” Think in Symbols?

If languages like Mandarin can help models think more clearly, group concepts better, and reason more effectively, what’s the next step? Some believe it’s developing a universal symbolic language, one built specifically for machine reasoning. A language that’s optimized not for human expression, but for AI cognition.

Just as math underpins physics, perhaps a symbolic meta-language will someday underpin artificial general intelligence (AGI).

TL;DR:

  • Online chatter suggests some Chinese labs find that Mandarin’s symbolic precision helps LLMs train faster and perform better.
  • Microsoft’s LLMLingua shows that prompts can be compressed by up to 20x with little performance loss, suggesting symbolic efficiency matters.
  • This raises the question: should we design symbolic languages for LLMs to “think” in?

Language is not just a vessel for thought; it shapes it. And for LLMs, it may be the difference between thinking fast and thinking deep.