Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs (2024)

arxiv.org

248 points by doener 11 days ago


ozgune - 11 days ago

I had a related, but orthogonal question about multilingual LLMs.

When I ask a smaller model a question in English, it does well. When I ask the same model the question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, answer it, and then translate the answer back into Turkish, it again does well.

For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.

Has anyone else observed similar behavior?
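The English-pivot trick described above can be sketched as a three-step pipeline. This is a minimal illustration, not a specific API: `chat` is a hypothetical stand-in for whatever LLM completion call you use (e.g. against Llama 3.3 70B).

```python
# English-pivot pipeline: translate the question to English, answer in
# English (where the model is strongest), then translate the answer back.

def chat(prompt: str) -> str:
    # Hypothetical placeholder; in practice this would call an LLM API.
    return f"<model output for: {prompt}>"

def answer_via_english_pivot(question_tr: str) -> str:
    question_en = chat(f"Translate this Turkish text to English:\n{question_tr}")
    answer_en = chat(f"Answer the following question:\n{question_en}")
    return chat(f"Translate this English text to Turkish:\n{answer_en}")
```

The extra two calls add latency and cost, which is presumably why one would prefer a model that handles Turkish natively.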

kiru_io - 11 days ago

Maybe someone should edit the title to mention this is from 2024: [Submitted on 30 Sep 2024 (v1), last revised 15 Oct 2024 (this version, v2)]

KronisLV - 11 days ago

I also quite liked the EuroLLM project: https://huggingface.co/blog/eurollm-team/eurollm-9b

It was pretty good with Latvian (better than other models of this size, as well as the variants of Llama or Qwen that I could run), and I assume it's probably good with other EU languages as well.

JKolios - 11 days ago

More diversity in the LLM space is always good. In my experience though, speaking as a native speaker of one of the less-used European languages, Mistral's models already use it pretty well.

jug - 11 days ago

On this topic, don’t miss the quite useful benchmark:

https://euroeval.com

NKosmatos - 11 days ago

There is also a Greek LLM from 2024.

Meltemi: A large foundation Language Model for the Greek language

https://huggingface.co/ilsp/Meltemi-7B-v1.5

denimboy - 10 days ago

I wonder how this compares to RWKV-V5 7B

https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers

tannhaeuser - 11 days ago

I mean, Mistral AI is a Paris-based company, and their models were considered on par with or better than other open-weight models such as llama3.1 and qwen2.5, and mistral-24b is currently beating the oh-so-great gemma3-27b depending on the task.

Also, Stable Diffusion was originally (and still is, I believe) developed in Munich.

It's true though that raising capital and finding investors works wayyy better in the US (kind of needless to say on HN), as did attracting top talent - at least in the past. Don't get me started on energy prices ;) but I don't believe those contribute significantly in the end anyway.

miros_love - 11 days ago

>European versions of ARC

But isn't that an image-like benchmark? Has anyone looked at the part of the article about the EU-ARC: what is the difference, and why can't it be measured on the regular one?

I glanced through it and didn't find it right away, but judging by their tokenizer, they are training from scratch. In general, I don't like this approach for the task at hand: for the large languages, good models already exist that they don't want to compare with, and for low-resource languages it is very important to include more languages from the same language group, not necessarily only EU ones.
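One concrete reason a from-scratch multilingual tokenizer matters is fertility: the number of tokens produced per word. A tokenizer whose vocabulary fits a language poorly fragments its words, wasting context. A toy illustration (the `chunk_tokenize` function below is a made-up stand-in, not any real tokenizer):

```python
# Tokenizer "fertility" = tokens produced per whitespace-separated word.
# Lower fertility on a language means better-packed context for it.

def fertility(tokenize, text: str) -> float:
    words = text.split()
    return len(tokenize(text)) / len(words)

# Toy tokenizer: splits each word into chunks of at most 4 characters,
# mimicking how an ill-fitting vocabulary fragments unfamiliar words.
def chunk_tokenize(text: str, chunk: int = 4):
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]
```

A real comparison would measure fertility of, say, a Llama tokenizer versus a from-scratch European one on the same low-resource-language corpus.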

andai - 11 days ago

Can someone explain this? Do they just reduce the amount of English text during pretraining to balance things out? Shouldn't that hurt every other benchmark, though?
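For what it's worth, a common way to rebalance multilingual pretraining data is temperature-scaled sampling over language proportions, p_i ∝ q_i^α with α < 1, which upweights low-resource languages without dropping English entirely. A generic sketch of that trick (not necessarily Teuken's actual recipe):

```python
# Temperature-scaled language sampling: raise each language's raw data
# share q_i to the power alpha (< 1), then renormalize. Smaller alpha
# flattens the distribution toward uniform across languages.

def sampling_probs(token_counts: dict[str, float], alpha: float = 0.3) -> dict[str, float]:
    total = sum(token_counts.values())
    q = {lang: c / total for lang, c in token_counts.items()}
    w = {lang: qi ** alpha for lang, qi in q.items()}
    z = sum(w.values())
    return {lang: wi / z for lang, wi in w.items()}

# Example: English at 90% of raw tokens gets sampled well below 90%,
# while Latvian at 2% gets sampled well above 2%.
p = sampling_probs({"en": 900.0, "de": 80.0, "lv": 20.0})
```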

NetOpWibby - 11 days ago

Upset that my mind went, "TEKKEN 7 LLM." Imagine Heihachi Mishima vibe-coding for you.

htrp - 11 days ago

TIL there are european versions of ARC, HellaSwag, MMLU, and TruthfulQA.

smokel - 11 days ago

A paper on languages that begins with a grammatical error in the first sentence does not inspire confidence:

> LLMs represents a disruptive technology

YetAnotherNick - 11 days ago

They compared with Llama 3.1 and found it to be better on average on their tasks, such as European MMLU. And Llama 3.1 is the worst of that batch, with Qwen 2.5 and Gemma 3 being significantly better.