FastLangML
High-accuracy language detection for chat, SMS, and conversational text. FastLangML combines multiple detection backends (FastText, Lingua, langdetect, pyCLD3) into a powerful ensemble that outperforms any single detector.
Key Features:
- 170+ languages supported via multi-backend ensemble
- Context-aware detection - tracks conversation history to resolve ambiguous short messages like "ok", "si", "bien"
- Code-switching detection - identifies mixed-language messages (Spanglish, Franglais, Hinglish)
- Slang & abbreviations - built-in hints for chat lingo ("thx", "mdr", "jaja")
- Confusion resolution - handles similar language pairs (Spanish/Portuguese, Norwegian/Danish/Swedish)
- Extensible - add custom backends, voting strategies, and hint dictionaries
Table of Contents
- The Problem
- Installation
- Quick Start
- Core Concepts
- API Reference
- Configuration
- Extensibility
- Benchmarks
- Best Practices
- Contributing
- License
The Problem
Traditional language detectors are trained on well-formed sentences and documents. They fail on the kind of text you see in real conversations:
"Bonjour!" -> French (correct)
"Comment ca va?" -> French (correct)
"Bien" -> Spanish? French? German? (WRONG)
"ok" -> English? Universal? (AMBIGUOUS)
"thx" -> Unknown (FAIL)
Why this happens:
- Short text has low statistical signal
- Words like "ok", "taxi", "pizza" exist in many languages
- Chat slang ("thx", "mdr", "jaja") isn't in training data
- No context from surrounding messages
FastLangML solves this by:
- Tracking conversation context to disambiguate short messages
- Using hint dictionaries for slang and common words
- Combining multiple detection backends for robustness
- Returning "unknown" (
und) when uncertain instead of wrong guesses
Installation
# Full installation with all backends (recommended) pip install fastlangml[all] # Minimal installation (fasttext only) pip install fastlangml[fasttext] # Pick specific backends pip install fastlangml[fasttext,lingua] pip install fastlangml[langdetect]
Available backends:
| Backend | Languages | Speed | Accuracy | Install Extra |
|---|---|---|---|---|
| fasttext | 176 | Fast | High | [fasttext] |
| fastlangid | 177 | Fast | High (CJK) | [fastlangid] |
| lingua | 75 | Medium | Very High | [lingua] |
| langdetect | 55 | Fast | Medium | [langdetect] |
| pycld3 | 107 | Very Fast | Medium | [pycld3] |
| langid | 97 | Fast | Medium | [langid] |
Note:
fastlangidprovides improved accuracy for Japanese, Korean, Chinese, and Cantonese detection.
Optional features:
| Feature | Description | Install Extra |
|---|---|---|
| spaCy NER | Named entity recognition for proper noun filtering | [spacy] |
| fuzzy | Fuzzy matching for hint dictionaries | [fuzzy] |
| diskcache | Disk-based context storage (local dev) | [diskcache] |
| redis | Redis-based context storage (production) | [redis] |
Quick Start
Single-Shot Detection (Most Common)
The simplest API - just pass text, get language back:
from fastlangml import detect # Direct comparison detect("Hello") == "en" # True detect("Bonjour") == "fr" # True # Use as string print(detect("Hola")) # "es" f"Language: {detect('Ciao')}" # "Language: it" # Full details when needed result = detect("Hello, how are you?") result.lang # "en" result.confidence # 0.95 result.reliable # True
That's it! No context, no stores, no configuration required.
Usage Modes
FastLangML supports three usage modes - pick what fits your needs:
| Mode | Code | Use Case |
|---|---|---|
| Single-shot | detect("text") == "en" |
One-off detection (most common) |
| With context | detect("text", context=ctx) |
Multi-turn chat (in-memory) |
| With persistence | store.session(id) |
Multi-turn across HTTP requests |
# 1. Single-shot - no setup needed from fastlangml import detect detect("Bonjour") == "fr" # True # 2. Context - optional, for conversations from fastlangml import detect, ConversationContext ctx = ConversationContext() detect("Bonjour", context=ctx) # tracks history in memory detect("ok", context=ctx) == "fr" # True (uses context) # 3. Stores - optional, for persistence across requests from fastlangml.context import DiskContextStore store = DiskContextStore("./data") with store.session("user-123") as ctx: detect("Bonjour", context=ctx)
Note: Context and stores are 100% optional. Most users only need single-shot mode.
Context-Aware Detection (Optional)
For multi-turn conversations, FastLangML can track context to resolve ambiguous short messages. When you pass a ConversationContext, the library:
- Remembers the last N detected languages
- Uses this history to resolve ambiguous messages
- Auto-updates the context after each detection
from fastlangml import detect, ConversationContext # Create a context to track the conversation context = ConversationContext() # French conversation detect("Bonjour!", context=context) # "fr" (clear) detect("Comment ca va?", context=context) # "fr" (clear) detect("Bien", context=context) # "fr" <- context helps! detect("ok", context=context) == "fr" # True <- continues French # The context tracks that this is a French conversation, # so ambiguous words resolve to French
How context resolution works:
| Message | Without Context | With Context (after French turns) |
|---|---|---|
| "ok" | Ambiguous | French (conversation language) |
| "Bien" | Spanish/French/German? | French (matches history) |
| "Si" | Spanish/Italian? | Italian (if conversation was Italian) |
| "Gracias" | Spanish | Spanish (high confidence, ignores context) |
Core Concepts
Context-Aware Detection
What is conversation context?
ConversationContext maintains a sliding window of recent language detections. It tracks:
- The detected language of each message
- The confidence score of each detection
- A weighted history favoring recent messages (decay factor)
Configuration options:
context = ConversationContext( max_turns=2, # Remember last 2 messages (default) decay_factor=0.8, # Weight recent messages higher )
When context helps:
- Ambiguous short messages ("ok", "yes", "no")
- Words shared across languages ("taxi", "hotel", "pizza")
- Mixed script input (romanized non-Latin languages)
When context doesn't help:
- Clear language switches (user explicitly changes language)
- High-confidence detections (context is ignored)
Example: Customer service routing
from fastlangml import detect, ConversationContext def route_message(messages: list[str]) -> str: """Route conversation to correct language queue.""" context = ConversationContext() for msg in messages: result = detect(msg, context=context) # Return dominant language of conversation return context.dominant_language or "en" # French customer conversation messages = ["Bonjour", "J'ai un probleme", "ok", "merci"] queue = route_message(messages) # Returns "fr"
Persisting Context
For multi-turn conversations across requests, persist context using the built-in stores.
DiskContextStore (local dev/testing):
pip install fastlangml[diskcache]
from fastlangml import detect from fastlangml.context import DiskContextStore store = DiskContextStore("./contexts", ttl_seconds=3600) # Context manager - auto-saves on exit with store.session(session_id) as ctx: result = detect(text, context=ctx, auto_update=True)
RedisContextStore (production):
pip install fastlangml[redis]
from fastlangml import detect from fastlangml.context import RedisContextStore store = RedisContextStore("redis://localhost:6379", ttl_seconds=1800) with store.session(session_id) as ctx: result = detect(text, context=ctx, auto_update=True)
DIY storage (no extra deps):
# Serialize to your own storage data = ctx.to_dict() save_to_database(session_id, data) # Load back data = load_from_database(session_id) ctx = ConversationContext.from_dict(data) # Or use lightweight history format ctx = ConversationContext.from_history([("fr", 0.95), ("fr", 0.9)])
Stores internally save only [(lang, confidence), ...] for efficiency.
Multi-Backend Ensemble
FastLangML can combine multiple language detection backends for better accuracy. Each backend has different strengths:
| Backend | Best For | Weaknesses |
|---|---|---|
| fasttext | Speed, many languages | Less accurate on short text |
| lingua | Accuracy, short text | Slower, fewer languages |
| langdetect | General purpose | Non-deterministic by default |
| pycld3 | Speed, CLD3 compatibility | Lower accuracy |
How ensemble works:
- Text is sent to all configured backends
- Each backend returns a language prediction with confidence
- A voting strategy combines the predictions
- Final result is the language with highest combined score
from fastlangml import FastLangDetector, DetectionConfig # Configure ensemble with 3 backends detector = FastLangDetector( config=DetectionConfig( backends=["fasttext", "lingua", "langdetect"], backend_weights={ "fasttext": 0.5, # Trust fasttext most "lingua": 0.35, # Lingua for accuracy "langdetect": 0.15 # Langdetect as tiebreaker }, ) ) result = detector.detect("Ciao, come stai?") print(result) # "it" print(result.backend) # "ensemble"
Voting Strategies
Voting strategies determine how to combine predictions from multiple backends.
Available strategies:
| Strategy | Description | Best For |
|---|---|---|
weighted |
Weighted average of confidence scores | Production (default) |
hard |
Majority vote (each backend = 1 vote) | Equal-trust backends |
soft |
Average of all probabilities | Well-calibrated backends |
consensus |
Require N backends to agree | High-certainty requirements |
Weighted Voting (Default)
Multiplies each backend's confidence by its weight, then picks the language with highest weighted score.
from fastlangml import FastLangDetector, DetectionConfig detector = FastLangDetector( config=DetectionConfig( voting_strategy="weighted", backend_weights={ "fasttext": 0.6, "lingua": 0.4, }, ) )
Hard Voting
Each backend gets one vote. Ties broken by confidence.
detector = FastLangDetector( config=DetectionConfig(voting_strategy="hard") )
Consensus Voting
Only returns a result if at least N backends agree. Useful when you need high certainty.
from fastlangml import ConsensusVoting, FastLangDetector, DetectionConfig detector = FastLangDetector( config=DetectionConfig( custom_voting=ConsensusVoting(min_agreement=2) ) )
Hint Dictionaries
Hints are word-to-language mappings that override backend detection. Essential for:
- Chat slang ("thx" -> English, "mdr" -> French)
- Company-specific terms
- Ambiguous words you want to force
Built-in hints for short text:
from fastlangml import FastLangDetector, HintDictionary # Load default hints for chat/SMS hints = HintDictionary.default_short_words() detector = FastLangDetector(hints=hints) detector.detect("thx") # "en" (thanks) detector.detect("mdr") # "fr" (mort de rire = LOL) detector.detect("jaja") # "es" (Spanish laugh)
Adding custom hints:
from fastlangml import FastLangDetector detector = FastLangDetector() # Add hints for your domain detector.add_hint("asap", "en") detector.add_hint("btw", "en") detector.add_hint("stp", "fr") # s'il te plait # Hints override backend detection detector.detect("asap") # "en"
Hint priority:
Hints are checked before backend detection. If a hint matches:
- The hint's language gets a confidence boost
- For very short text (<=5 chars), hints dominate the result
- For longer text, hints are weighted with backend predictions
API Reference
detect()
The main function for language detection.
def detect( text: str, context: ConversationContext | None = None, mode: str = "default", auto_update: bool = True, ) -> DetectionResult
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
text |
str |
required | Text to detect |
context |
ConversationContext |
None |
Conversation context for multi-turn |
mode |
str |
"default" |
Detection mode: "short", "default", "long" |
auto_update |
bool |
True |
Automatically add result to context |
Returns: DetectionResult
DetectionResult
@dataclass class DetectionResult: lang: str # ISO 639-1 code ("en", "fr", "und") confidence: float # 0.0 to 1.0 reliable: bool # True if confidence >= threshold reason: str | None # Why "und" was returned (if applicable) script: str | None # Detected script ("latin", "cyrillic", etc.) backend: str # Which backend or "ensemble" candidates: list # Top-k alternatives meta: dict # Timing and debug info # Can be used directly as string detect("Hello") == "en" # True (via __eq__) print(detect("Bonjour")) # "fr" (via __str__) f"{detect('Hola')}" # "es"
ConversationContext
context = ConversationContext( max_turns: int = 2, # Max messages to remember decay_factor: float = 0.8, # Recency weight (0.0-1.0) ) # Properties context.dominant_language # Most common language in history context.language_distribution # {lang: weighted_count} context.last_turn # Most recent turn # Methods context.get_context_boost(lang) # Get boost score for a language context.get_language_streak() # (lang, streak_count) context.clear() # Reset context
FastLangDetector
detector = FastLangDetector( config=DetectionConfig(...), hints=HintDictionary(), ) # Methods detector.detect(text, context=None, mode="default") detector.detect_batch(texts, mode="default") detector.add_hint(word, lang) detector.remove_hint(word) detector.set_languages(["en", "fr", "es"]) # Restrict output detector.available_backends()
Configuration
DetectionConfig
from fastlangml import DetectionConfig config = DetectionConfig( # Backend configuration backends=["fasttext", "lingua"], backend_weights={"fasttext": 0.6, "lingua": 0.4}, # Voting voting_strategy="weighted", # or "hard", "soft", "consensus" custom_voting=None, # VotingStrategy instance # Thresholds thresholds={ "short": 0.5, # Confidence threshold for short mode "default": 0.7, "long": 0.8, }, min_text_length=1, # Features filter_proper_nouns=False, # Enable heuristic-based proper noun filtering use_script_filter=True, # spaCy NER for proper noun filtering (requires pip install fastlangml[spacy]) # filter = ProperNounFilter(use_spacy=True) # Filters: PERSON, ORG, GPE, LOC, FAC, NORP entities # Weights hint_weight=1.5, context_weight=0.3, )
Extensibility
Custom Backends
Create your own detection backend using the @backend decorator:
from fastlangml import backend, Backend from fastlangml.backends.base import DetectionResult @backend("my_detector", reliability=4) # reliability: 1-5 class MyBackend(Backend): """Custom language detection backend.""" @property def name(self) -> str: return "my_detector" @property def is_available(self) -> bool: # Check if required dependencies are installed return True def detect(self, text: str) -> DetectionResult: # Your detection logic here lang = "en" # Replace with actual detection confidence = 0.95 return DetectionResult(self.name, lang, confidence) def supported_languages(self) -> set[str]: return {"en", "fr", "de", "es"}
Using the custom backend:
from fastlangml import FastLangDetector, DetectionConfig detector = FastLangDetector( config=DetectionConfig( backends=["my_detector", "fasttext"], ) )
Programmatic registration:
from fastlangml import register_backend, unregister_backend, list_registered_backends # Register register_backend("my_backend", MyBackend, reliability=4) # List all registered print(list_registered_backends()) # ["my_backend"] # Unregister unregister_backend("my_backend")
Custom Voting Strategies
Implement your own voting logic by extending VotingStrategy:
from fastlangml import VotingStrategy, FastLangDetector, DetectionConfig class ConfidenceOnlyVoting(VotingStrategy): """Pick the language with highest individual confidence.""" def vote( self, results: list, weights: dict[str, float] | None = None, ) -> dict[str, float]: if not results: return {} # Find max confidence per language scores = {} for r in results: lang = r.language if lang not in scores or r.confidence > scores[lang]: scores[lang] = r.confidence return scores # Use custom voting detector = FastLangDetector( config=DetectionConfig( custom_voting=ConfidenceOnlyVoting() ) )
Language Confusion Resolution
FastLangML handles commonly confused language pairs with specialized logic:
Supported confused pairs:
- Spanish / Portuguese
- Norwegian / Danish / Swedish
- Czech / Slovak
- Croatian / Serbian / Bosnian
- Indonesian / Malay
- Russian / Ukrainian / Belarusian
- Hindi / Urdu
from fastlangml import ConfusionResolver, LanguageSimilarity # Resolve ambiguity between similar languages resolver = ConfusionResolver() # Check if languages are a known confused pair pair = resolver.get_confused_pair({"es", "pt"}) # frozenset({"es", "pt"}) # Adjust scores based on discriminating features scores = {"es": 0.45, "pt": 0.42} adjusted = resolver.resolve("Eu tenho um problema", scores) # Portuguese boosted due to "tenho" (have) # Get discriminating features es_features, pt_features = resolver.get_discriminating_features("es", "pt") # es_features: ["pero", "cuando", "donde", ...] # pt_features: ["mas", "quando", "onde", ...] # Check language relationships sim = LanguageSimilarity() sim.are_related("es", "pt") # True (Romance family) sim.are_related("en", "zh") # False (different families) sim.get_related_languages("es") # {"pt", "fr", "it", "ro", "ca", "gl"}
Code-Switching Detection
Detect mixed-language messages (Spanglish, Franglais, Hinglish, etc.):
from fastlangml import CodeSwitchDetector detector = CodeSwitchDetector() # Detect code-switching result = detector.detect("That's muy importante for the proyecto") result.is_mixed # True result.primary_language # "en" result.secondary_languages # ["es"] result.languages # ["en", "es"] result.language_distribution # {"en": 0.6, "es": 0.4} # Get language spans for span in result.spans: print(f"{span.text}: {span.language} ({span.confidence:.2f})") # "That's": en (0.85) # "muy": es (0.92) # "importante": es (0.95) # "for": en (0.88) # "the": en (0.90) # "proyecto": es (0.94) # Quick check detector.is_code_switched("Hello world") # False detector.is_code_switched("Hola, how are you?") # True # Pattern-based detection from fastlangml import detect_code_switching_pattern pattern = detect_code_switching_pattern("That's muy bueno") # ("en", "es") - matches Spanglish pattern
Supported code-switching patterns:
- Spanglish (English + Spanish)
- Franglais (English + French)
- Hinglish (English + Hindi)
- Denglish (German + English)
Benchmarks
Accuracy
Tested across standard language samples, short text, and CJK:
| Test Category | Accuracy | Notes |
|---|---|---|
| Standard Languages (25 samples) | 96.0% | Full sentences across 25 languages |
| Short Text (16 samples) | 100.0% | Single words like "Hello", "Bonjour", "안녕" |
| CJK (6 samples) | 100.0% | Japanese, Chinese, Korean sentences |
Backend Comparison
FastLangML ensemble outperforms individual backends:
| Backend | Accuracy | Avg Latency |
|---|---|---|
| ensemble | 100.0% | 8.28ms |
| fasttext | 100.0% | 0.01ms |
| lingua | 90.0% | 0.18ms |
| langdetect | 80.0% | 2.84ms |
| langid | 70.0% | 0.56ms |
Performance
| Operation | Latency | Notes |
|---|---|---|
| Initialization | ~100ms | One-time, first detection |
| First detection (cold) | ~3000ms | Loads all backend models |
| Warm detection | 8.28ms | Average per detection |
| Cache hit | 0.0006ms | ~10,000x faster |
| Script short-circuit | 0.0006ms | Korean, Thai, Hebrew, etc. |
| Batch (100 texts) | 550ms | ~182 texts/sec |
Optimizations
FastLangML includes several performance optimizations:
| Optimization | Speedup | Description |
|---|---|---|
| Script short-circuit | ~5,500x | Skip backends for unambiguous scripts (Hangul→ko, Thai→th) |
| Result caching | ~10,000x | LRU cache for repeated texts |
| Lazy backend loading | - | Only load backends when needed |
| Adaptive parallelism | - | Skip threading for ≤2 backends or short text |
Script short-circuit languages:
- Korean (Hangul), Thai, Hebrew, Armenian, Georgian
- Tamil, Telugu, Kannada, Malayalam, Gujarati, Bengali, Punjabi, Oriya, Sinhala
- Khmer, Lao, Myanmar, Tibetan
Running Benchmarks
# Run accuracy benchmarks pytest tests/test_benchmarks.py -v # Quick performance check python -c " from fastlangml import detect import time start = time.perf_counter() for i in range(100): detect(f'Hello world {i}') print(f'100 detections: {(time.perf_counter()-start)*1000:.0f}ms') "
Best Practices
1. Use Context for Chat/Messaging
Always pass a ConversationContext when detecting messages in a conversation:
# Good: Context-aware context = ConversationContext() for msg in conversation: result = detect(msg, context=context) # Bad: Stateless detection for msg in conversation: result = detect(msg) # Loses valuable context
2. Choose the Right Mode
- short: For SMS, chat messages (< 50 chars). Lower confidence threshold.
- default: General purpose. Balanced threshold.
- long: For paragraphs/documents. Higher confidence threshold.
detect("ok", mode="short") # More lenient detect("Hello world", mode="default") detect(long_paragraph, mode="long") # More strict
3. Add Domain-Specific Hints
If your users use specific slang or terms, add hints:
detector.add_hint("gg", "en") # Gaming detector.add_hint("lol", "en") detector.add_hint("mdr", "fr") # French LOL detector.add_hint("kek", "en") # Gaming laugh
4. Restrict Languages When Known
If you know the possible languages (e.g., a bilingual support queue), restrict the output:
# Only consider English and Spanish detector.set_languages(["en", "es"])
5. Handle "und" (Unknown)
When FastLangML is uncertain, it returns und instead of guessing wrong:
result = detect("ok") if result == "und": # Can compare directly # Fallback to default or ask user lang = result.candidates[0].lang if result.candidates else "en" print(f"Low confidence: {result.reason}")
6. Use Ensemble for Production
Single backends have blind spots. Ensemble improves reliability:
# Development: Fast single backend detector = FastLangDetector( config=DetectionConfig(backends=["fasttext"]) ) # Production: Reliable ensemble detector = FastLangDetector( config=DetectionConfig( backends=["fasttext", "lingua"], voting_strategy="weighted", ) )
7. Batch Detection for Throughput
When processing many messages, use batch detection:
# Good: Batch processing (parallelized) results = detector.detect_batch(messages, mode="short") # Bad: Sequential detection results = [detector.detect(msg) for msg in messages] # Slower
Contributing
We welcome contributions! See CONTRIBUTING.md for detailed guidelines.
Quick Start
# Clone and setup git clone https://github.com/pnrajan/fastlangml.git cd fastlangml poetry install --all-extras # Development commands make test # Run tests make lint # Check code style make format # Format code with ruff make fix # Auto-fix linting issues make typecheck # Run type checker make check # Run all checks (format + lint + typecheck + test)
What We're Looking For
- Bug fixes with test cases
- New detection backends
- Performance improvements
- Documentation improvements
- New voting strategies
See CONTRIBUTING.md for:
- Project architecture overview
- Testing guidelines
- Code style requirements
- Commit message conventions
- Pull request process
License
MIT License - see LICENSE for details.