Summary
I propose extending the LLMs.txt standard to include inline HTML data attributes that provide AI-friendly structured data directly within web page elements. This would complement the existing /llms.txt file approach by solving context preservation issues, particularly for complex content like comparison tables, pricing information, and structured data.
Problem Statement
Current LLMs.txt helps AI systems locate important content, but doesn't address the fundamental challenge of semantic disambiguation within that content. Specifically:
Table and Structured Content Issues
- Lost context: When RAG systems scrape comparison tables, they often confuse "our pricing" with "competitor pricing"
- Relationship fragmentation: Table headers become disconnected from their data during embedding
- Ambiguous ownership: Content like "$50/month" loses meaning without knowing which company/product it refers to
Real-World Example
On comparison pages (e.g., "Formester vs Fillout"), current AI systems might incorrectly extract:
- ❌ "Formester costs $20/month" (actually Fillout's price)
- ❌ "Our basic plan includes 20MB uploads" (actually competitor's feature)
Proposed Solution: data-llm Attributes
Add standardized data-llm attributes to HTML elements containing structured JSON that provides AI-friendly context and semantics.
Basic Syntax
    <element data-llm='{"type": "content_type", "context": {...}, "data": {...}}'>
      <!-- Regular HTML content for humans -->
    </element>
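To make the consumer side concrete, here is a minimal sketch of how a scraper might read these attributes, using only Python's standard library. The names `DataLLMExtractor` and `extract_data_llm` are hypothetical and not part of any existing tool. Skipping malformed JSON silently is an assumption, not a rule from this proposal.

```python
import json
from html.parser import HTMLParser


class DataLLMExtractor(HTMLParser):
    """Collect the decoded JSON payload of every data-llm attribute."""

    def __init__(self):
        super().__init__()
        self.payloads = []  # list of (tag_name, parsed_payload) tuples

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "data-llm" or value is None:
                continue
            try:
                self.payloads.append((tag, json.loads(value)))
            except json.JSONDecodeError:
                # Assumption: malformed JSON is skipped rather than repaired.
                pass


def extract_data_llm(html_text):
    """Return (tag, payload) pairs for all data-llm attributes in html_text."""
    parser = DataLLMExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.payloads
```

A RAG pipeline could call `extract_data_llm(page_html)` before chunking and attach each payload to the chunk built from the same element.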
Example Implementations
Pricing Comparison Tables
    <table data-llm='{
      "type": "pricing_comparison",
      "context": {
        "our_company": "Formester",
        "comparison_target": "Fillout",
        "page_purpose": "competitive_analysis"
      },
      "data": [
        {
          "feature": "Personal Plan Pricing",
          "formester": "$12/month for 1000 submissions",
          "fillout": "$20/month for 2000 submissions"
        },
        {
          "feature": "File Upload Limit",
          "formester": "100 MB (Free), 1 GB (Personal)",
          "fillout": "20 MB (Free, Starter, Pro)"
        }
      ]
    }'>
      <!-- Regular HTML table markup -->
    </table>
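As a rough illustration of how this payload addresses the ownership problem, the sketch below turns each row into sentences that name the company explicitly, so an embedded chunk never carries a bare price. The function name `comparison_to_statements` is hypothetical. It also assumes, as in the example above, that row keys are the lowercase forms of the names given in `context`.

```python
def comparison_to_statements(payload):
    """Turn a pricing_comparison payload into one attributed sentence per value."""
    context = payload["context"]
    # Assumption: row keys are the lowercase company names from the context.
    names = {
        context["our_company"].lower(): (context["our_company"], "our company"),
        context["comparison_target"].lower(): (context["comparison_target"], "competitor"),
    }
    statements = []
    for row in payload["data"]:
        for key, (display, role) in names.items():
            if key in row:
                statements.append(f'{display} ({role}), {row["feature"]}: {row[key]}')
    return statements
```

For the first row of the table above, this yields "Formester (our company), Personal Plan Pricing: $12/month for 1000 submissions" and the matching Fillout sentence marked as "competitor".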
Product Information
    <div class="product-card" data-llm='{
      "type": "our_product",
      "product_name": "Business Plan",
      "price": "$45/month",
      "features": ["15k submissions", "team collaboration", "advanced analytics"],
      "company": "formester"
    }'>
      <!-- Product card HTML -->
    </div>
Contact Information
    <section data-llm='{
      "type": "company_contact",
      "support_email": "help@formester.com",
      "response_time": "24 hours",
      "availability": "24/7"
    }'>
      <!-- Contact section HTML -->
    </section>
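One practical caveat with hand-written single-quoted attributes: a value containing an apostrophe (for example "We're open 24/7") would end the attribute early. A template helper could generate the attribute instead. The following is a minimal sketch using only the standard library; the helper name `data_llm_attr` is hypothetical. It emits a double-quoted, entity-escaped attribute, which HTML parsers decode back to the same JSON.

```python
import html
import json


def data_llm_attr(payload):
    """Return a data-llm attribute string that is safe to embed in HTML."""
    raw = json.dumps(payload, ensure_ascii=False)
    # quote=True escapes &, <, >, " and ' so the value cannot close the attribute.
    return f'data-llm="{html.escape(raw, quote=True)}"'
```

A server-side template would then write `<section {attr}>` where `attr` is the returned string, so the structured data and the visible markup are produced from the same object.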
Benefits
1. Solves Context Preservation
- AI systems can definitively distinguish "our" vs "competitor" information
- Table relationships are explicitly maintained in structured form
- Reduces pricing misattribution in RAG responses, since each price is tied to a named company or product
2. Backward Compatible
- Doesn't interfere with existing HTML, CSS, or JavaScript
- Works alongside current LLMs.txt files
- Search engines ignore unknown data attributes
3. Developer Friendly
- Easy to implement during development
- Single place to update: the structured data sits on the same element as the human-readable markup, so both can be changed together
- No separate file management required
4. Scalable
- Works for any type of content, not just tables
- Extensible schema system for different content types
- Can be validated against JSON schemas
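As a rough idea of what validation could look like before a formal JSON Schema exists, the sketch below checks a payload against a table of required fields per content type. The `REQUIRED_FIELDS` table is purely illustrative: its field lists are inferred from the examples in this proposal and are not a defined schema.

```python
# Hypothetical required fields per content type, inferred from the examples.
REQUIRED_FIELDS = {
    "pricing_comparison": {"context", "data"},
    "our_product": {"product_name", "price", "company"},
    "company_contact": {"support_email"},
}


def validate_payload(payload):
    """Return a list of problems; an empty list means the payload passed."""
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]
    content_type = payload.get("type")
    if not isinstance(content_type, str) or content_type not in REQUIRED_FIELDS:
        return [f"unknown or missing type: {content_type!r}"]
    missing = sorted(REQUIRED_FIELDS[content_type] - set(payload))
    return [f"missing field: {name}" for name in missing]
```

A build step or linter could run this over every `data-llm` payload on a page and fail when the returned list is non-empty.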
Integration with LLMs.txt
This proposal complements rather than replaces LLMs.txt:
- LLMs.txt - Guides AI to important pages and sections
- data-llm attributes - Provide semantic understanding of content within those pages
Updated LLMs.txt Example
    # Formester

    > AI-powered form builder with advanced features

    ## Pricing Information

    - [Pricing comparison](https://formester.com/pricing): Compare our plans with competitors
      - Note: Contains `data-llm` attributes for accurate pricing extraction
    - [Feature matrix](https://formester.com/features): Detailed feature breakdown
      - Note: Uses structured attributes for feature categorization
Implementation Strategy
Phase 1: Schema Definition
- Define common content types (pricing_comparison, our_product, company_contact, etc.)
- Create JSON schema specifications for validation
- Document best practices and examples
Phase 2: Tooling
- Build parsers for common RAG frameworks
- Create validation tools for developers
- Develop browser extensions for testing
Phase 3: Community Adoption
- Share with RAG system builders
- Integrate