Waking up at 3 AM because of an outage is already tough. There’s a lot of pressure to find and fix issues quickly, but what if you also have to deal with complex log data that has hundreds of fields? In these scenarios, writing efficient diagnostic queries is a lot to ask, even for experienced operations engineers. You’re sleep-deprived, working as fast as possible, and trying to make sense of a complex schema.
This is the problem we set out to solve when we created a Claude skill to help SREs and other practitioners better understand their CDN logs. The goal: let users ask questions of their full-fidelity CDN data in plain English and get correct SQL, accurate results, and helpful visualizations without needing to know schema details, query optimization patterns, or domain-specific knowledge about Hydrolix.
Here’s how we built it, why it works, and what we learned along the way.
Before we get started, here’s a real-world use case for this Claude skill. The prompt:
Review CDN performance for the past 12 hours based on the akamai.logs table. Create visualizations to highlight any issues I need to address.
And here’s the response. The following visualization shows hourly bandwidth and key findings.

At a glance, a practitioner can scan these key findings to understand critical information about cache hit ratio, which resources can be cached and optimized, and which services are having problems. It also provides a helpful dashboard with metrics, as the next image shows. The metrics include total requests, cache hit ratio, error rate, and P99 latency—all metrics that practitioners need for understanding how their CDN infrastructure is performing.

The Real Challenge Isn’t Writing SQL—It’s Writing the Right SQL
TrafficPeak customers have petabytes of CDN logs stored in Hydrolix. A single table can contain billions of rows with hundreds of fields covering everything from cache behavior to compression ratios to streaming quality metrics. Maximizing insights from this data traditionally requires specialized knowledge, including:
- Schema expertise: Understanding which of 100+ fields answers your question. For example, do you need `cacheStatus`, `cacheRefreshSrc`, or `cacheTTL`?
- Summary table mechanics: Knowing that `cnt_all` is an ALIAS column that expands to `countMerge(count())` and should never be wrapped in `sum()` or `count()`.
- Performance optimization: Writing queries that include time filters and appropriate LIMIT clauses so they complete in seconds rather than timing out.
- Table selection logic: Deciding when to use detailed raw logs versus a summary table.
Skills solve this problem by giving the AI the specific instructions it needs to answer your questions.
What Is a Skill? Encoding Expert Knowledge for AI
A skill is a knowledge package that guides an AI through a specific domain. Think of it as encoding the expertise of your best engineer into a format that an AI can follow consistently, every time.
The CDN Analytics Skill has three core components:
1. Orchestration logic: A 10-step workflow that guides Claude from understanding a user’s question through query generation, execution, visualization, and error recovery.
2. Comprehensive schema documentation: Detailed information on tables and fields.
3. Query template library: 30+ proven query patterns across 10 analysis categories, showing both correct usage and common mistakes to avoid.
The skill integrates with Hydrolix through MCP (Model Context Protocol), which connects Claude directly to the query engine. Before generating any SQL, Claude reads the relevant reference documentation. This isn’t optional. The workflow requires it. This simple requirement of reading before writing the query prevents most of the common mistakes we saw in early testing.
The 10-Step Workflow: How Questions Become Answers
Every user question goes through the same systematic process. Let’s walk through it.
Steps 1-2: Understanding and Clarifying
When you ask “what’s my cache hit ratio?”, the skill first categorizes your question. Is this cache performance analysis? Error analysis? Cost attribution? There are 10 major categories, and classification determines which schema documentation and query templates Claude will reference.
Next comes clarification. The skill asks targeted questions to fill in gaps such as:
- Time range: Should the range be “last 24 hours” (default), “last week”, or “January 15-20”?
- Scope: Do you need specific hostnames, geographic regions, or all traffic?
- Grouping: Should the data be grouped by time buckets, by country, by URL, or should everything be aggregated?
This upfront clarification prevents the back-and-forth of “I need more information” that makes AI interactions frustrating. Claude asks everything it needs in one go, then proceeds.
Step 3: Reading the Schema (The Critical Step)
Before generating SQL, Claude loads detailed documentation for the relevant tables. This documentation includes far more than column names and data types. It explains when to use each field, what values mean, and how fields relate to each other.
Consider a 1-minute summary table for CDN logs. It may contain billions of rows of 1-minute pre-aggregated rollups. This can be the workhorse table for most analytics queries. But using it correctly requires understanding how summary tables work in Hydrolix.
Summary tables store intermediate aggregation states as AggregateFunction types. This enables “aggregates of aggregates.” You can roll up 1-minute data to hourly or daily data without losing accuracy. The table has two types of columns:
- Raw AggregateFunction columns like `count()`, `sum(totalBytes)`, `uniqMerge(cliIP)`. These store binary states, and you cannot SELECT them directly.
- ALIAS columns like `cnt_all`, `sum_totalBytes`, `pct_offLoadHit`. These apply -Merge combinators to produce final values.
Here’s the critical insight that took us several iterations to encode properly: ALIAS columns are already aggregate functions. When you SELECT cnt_all, it expands to countMerge(count()). This automatically aggregates within each GROUP BY group.
The common mistake, one we saw repeatedly in early testing, is wrapping ALIAS columns in sum() or count():
```sql
-- WRONG: Nests aggregate functions, produces errors or wrong results
SELECT
    toStartOfHour(datetime) AS hour,
    sum(cnt_all) AS total_requests
FROM akamai.summary
GROUP BY hour

-- RIGHT: ALIAS column already performs aggregation
SELECT
    toStartOfHour(datetime) AS hour,
    cnt_all AS total_requests
FROM akamai.summary
GROUP BY hour
```

The skill documentation includes sections explicitly titled “WRONG” and “RIGHT” with these examples. When Claude reads this documentation before generating SQL, it avoids the mistake.
Step 4: Table Selection Logic
With multiple tables available, choosing the right one matters. The skill encodes a decision tree to choose between raw and multiple summary tables at different time granularities.
The performance difference is significant. A 7-day cache hit ratio analysis that scans 160 billion rows of raw log data is terribly inefficient. The same query against the right summary table scans roughly 10 million rows (7 days × 1,440 minutes per day × ~1,000 unique dimension combinations) and completes in under 2 seconds, while returning the same answer.
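To make the contrast concrete, here’s a hedged sketch of both approaches. The table names follow the `akamai.logs` and `akamai.summary` examples used elsewhere in this post; the `cacheStatus` value and the `pct_offLoadHit` ALIAS column are illustrative, not a definitive TrafficPeak schema.

```sql
-- Raw logs: scans every request row in the window (slow at billions of rows)
SELECT countIf(cacheStatus = 'TCP_HIT') / count() AS hit_ratio
FROM akamai.logs
WHERE datetime >= now() - INTERVAL 7 DAY;

-- Summary table: 1-minute rollups. pct_offLoadHit is an ALIAS column that
-- already applies the -Merge combinator, so no aggregate wrapper is needed.
SELECT pct_offLoadHit AS hit_ratio
FROM akamai.summary
WHERE datetime >= now() - INTERVAL 7 DAY;
```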
Step 5: Circuit Breakers as Performance Guards
Every query must include at least one performance guard:
- Time filter on the primary key (typically the `datetime` field)
- LIMIT clause
- Both (recommended for exploratory queries)
The skill enforces hard limits:
- Maximum time range: 30 days
- Query timeout: 120 seconds (enforced by Hydrolix)
When a user asks for something that exceeds these limits, such as “show me cache hit ratio for all of 2024,” Claude explains the constraints and offers alternatives such as multiple queries for different months, coarser time granularity (daily instead of minute-by-minute), or focusing on specific time windows of interest.
This prevents wasting time and resources on inefficient queries.
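As a sketch, a query that satisfies the circuit breakers might look like the following. Field names beyond `datetime` are illustrative, not the actual schema:

```sql
-- Both guards: a primary-key time filter plus a LIMIT
-- (the recommended combination for exploratory queries)
SELECT datetime, reqHost, statusCode
FROM akamai.logs
WHERE datetime >= now() - INTERVAL 1 HOUR  -- time filter on the primary key
ORDER BY datetime DESC
LIMIT 100;                                 -- caps the result set
```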
Step 6: Query Generation From Templates
The skill includes 30+ query templates across 10 categories that encode best practices.
Let’s look at an example: cache performance analysis. The template library includes:
- Cache hit ratio over time with configurable time buckets
- Byte hit ratio by hostname
- Cache status distribution by geographic region
- Comparison of edge hits versus origin hits
Each template shows the correct pattern and explains when to use it. When a user asks “what’s my cache hit ratio by country?”, Claude doesn’t generate SQL from scratch. It adapts the “cache hit ratio by geographic dimension” template, substituting the appropriate grouping field and time range.
This approach dramatically reduces errors. Instead of solving the general problem of generating arbitrary SQL, we’re solving the specific problem of adapting a proven template for the user’s requirements.
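For a sense of what adapting a template looks like, here’s a hedged sketch of a “cache hit ratio by geographic dimension” pattern. The `country` field is illustrative; `cnt_all` and `pct_offLoadHit` follow the summary-table ALIAS conventions described above, but this is not the actual template library.

```sql
-- Template: cache hit ratio by {dimension} over {time_range}
-- Claude substitutes the grouping field and window for each question
SELECT
    country AS dimension,         -- substituted: could be hostname, URL, etc.
    pct_offLoadHit AS hit_ratio,  -- ALIAS column: no sum()/count() wrapper
    cnt_all AS requests           -- ALIAS column: expands to countMerge(count())
FROM akamai.summary
WHERE datetime >= now() - INTERVAL 24 HOUR  -- circuit breaker: time filter
GROUP BY country
ORDER BY requests DESC
LIMIT 15;                                   -- keeps the resulting chart readable
```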
Steps 7-8: Execution and Visualization
Claude generates a query and runs it through the MCP integration with Hydrolix. Results return with full metadata, including row counts, execution time, and data types.
But raw query results still need to be translated into insights. The skill decides whether visualization would help. For time-series data, Claude creates interactive line or area charts. For category comparisons, bar charts. For proportions, pie charts.
These aren’t static images. They’re React components using Recharts, which are fully interactive with hover tooltips and formatted axis labels. When showing bytes transferred, values appear as “2.3 TB” not “2534231134208”. When showing cache hit ratios, they’re percentages with one decimal place, not raw floats with 15 digits.
The skill encodes visualization best practices: maximum 10-15 data points for readability, readable color palettes, clear titles, and formatted values. The user gets a professional-quality chart without specifying any of these details.
Step 9: Presenting Results with Context
Claude doesn’t just display results. It summarizes key findings in plain English: “Your cache hit ratio averaged 87.3% over the last 24 hours, with a dip to 82.1% between 2:00-4:00 AM on January 15.”
It also adds context: “This dip coincided with increased traffic to newly published content, which typically has lower cache hit ratios for the first few hours.”
And it suggests follow-ups: “Want to see cache hit ratio by specific hostname? Or examine which URLs had the most cache misses during that time window?”
This transforms raw data into actionable insights.
Step 10: Error Handling and Recovery
When queries fail, the skill includes a recovery workflow:
1. Check the error message against known patterns.
2. Apply the appropriate fix:
- Wrong table selection → switch to the correct table.
- Summary table ALIAS mistakes → fix the column usage.
- Timeout → add stricter time filters or reduce date range.
- Missing fields → substitute available alternatives.
3. Explain what went wrong and what was changed.
4. Retry automatically.
If the retry fails, Claude explains the limitation and suggests alternatives.
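Here’s a hedged sketch of the timeout branch of that workflow, reusing the illustrative `akamai` tables from the earlier examples:

```sql
-- Original query (timed out): hourly request counts from raw logs over 30 days
SELECT toStartOfHour(datetime) AS hour, count() AS requests
FROM akamai.logs
WHERE datetime >= now() - INTERVAL 30 DAY
GROUP BY hour;

-- Automatic retry: the same question against the summary table, where cnt_all
-- (an ALIAS column) already performs the per-group aggregation
SELECT toStartOfHour(datetime) AS hour, cnt_all AS requests
FROM akamai.summary
WHERE datetime >= now() - INTERVAL 30 DAY
GROUP BY hour;
```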
Why This Approach Works
You might be thinking: couldn’t any AI with access to the schema generate these queries? We thought so too, initially. Then we tested it.
Without the skill, Claude (and other LLMs) make predictable mistakes:
- Wrapping ALIAS columns in aggregate functions
- Querying wrong tables for multi-day aggregations
- Omitting time filters, leading to timeouts
- Using fields that don’t exist or misunderstanding what they represent
- Generating syntactically correct but semantically wrong SQL
The skill solves this by making Claude read authoritative documentation before generating queries. Common pitfalls are explicitly documented as “WRONG” patterns with explanations. Circuit breakers prevent costly mistakes upfront. The example library provides proven starting points. The error handling workflow guides automatic recovery.
The end result: a user asks a plain English question and gets correct table selection, proper ALIAS column usage, appropriate time filters, formatted results with context, optional visualization, and follow-up suggestions, all without needing to know SQL, schema details, or Hydrolix internals.
You can ask questions like:
- “What’s my cache hit ratio for the last week?”
- “Show me error rates by status code for hostname X.”
- “Which URLs consumed the most bandwidth yesterday?”
- “What’s my edge versus origin traffic breakdown by region?”
You’ll get correct SQL, accurate results, and helpful visualizations, all without writing a single line of code. And if you’re curious what SQL was generated, just ask. Claude will show you the query and explain how it works.
That’s the power of encoding expertise. The knowledge is there when you need it, invisible when you don’t, and reliable every single time.
Next Steps
- With CDN Insights, you can start monitoring your CDN infrastructure in minutes. Compare how multiple CDNs are performing at any scale, segment data by region, ASN, and more, then drill down into individual logs for further troubleshooting. You can also keep full-fidelity data long-term (15 months by default) without needing to sample or throw data away.
- If you’re using Akamai, consider TrafficPeak for Akamai, a managed observability solution for Akamai Connected Cloud powered by Hydrolix that can give you deep insights on your CDN infrastructure and more.