Written by: John Shriver
Most writing about AI automation starts with the technology: the model, the framework, the architecture diagram. This post starts somewhere different. Before we wrote a line of code for our paperwork digitization workflow (lovingly named the Fiber Ingest Pipeline), our engineering team spent time understanding the human workflow we intended to automate, mapping the decisions embedded in each step, and building a shared understanding of what AI could realistically take on and where it would fall short.
This is the story of how we built an AI system to extract and structure data from one of the lumber industry’s most stubborn paper artifacts: the scale slip. Along the way, we learned as much about process design and building user trust as we did about LLMs and data pipelines.
What Makes Scale Slips Hard
At Cambium, we’re working to digitize an industry that still runs on paper, phone calls, and informal relationships. A core part of that effort is ingesting transaction data from across the lumber supply chain onto the Cambium platform, and scale slips are where much of that data lives.
A scale slip is the receipt generated every time a sawmill buys or sells logs. It records each log’s species, grade, diameter, length, board footage, and price, providing the detailed accounting that determines how much a supplier gets paid. A great example is the slip pictured above. As you can see in the image, there is a printout of the actual logs sold to the mill, as well as a handwritten “minus $40 for log trim” in pen, an adjustment that changes the final payout and needs to be captured just as accurately as the printed data above it.
The challenge isn’t that any individual slip is hard to read. An analyst familiar with a given mill’s format can work through one in a few minutes, but what counts as “familiar” changes from mill to mill. Some smaller mills produce a straightforward list of logs, while others produce visually complex tally matrices where log length and diameter intersect in a grid of counts. Regardless of the layout, most mills also include handwritten annotations, corrections, and adjustments that sit outside any consistent format.
Traditionally, processing a scale slip required two people. One analyst would read the slip and manually transcribe every line into our internal accounting tools, mapping species names, grades, and values into a consistent format, while a second analyst independently reviewed the work to catch mistakes. Only after both agreed could the data be used to generate a sale summary and pay the supplier. On average, this process took about 25 minutes per slip for the two people involved. At the volumes we were handling, that was manageable, but it wouldn’t hold as Cambium scales.
Start Analog
The first thing we did on this project was embed an engineer with the operations team to learn how they were processing scale slips. By participating in the existing manual workflow, we gained a much deeper understanding of the decisions, judgments, and implicit knowledge required to process these slips on a regular basis.
Watching our ops team work through slips, we saw things that wouldn’t have been apparent from a process diagram: how log grades and species names change from mill to mill, how shipping costs factor into the price paid for a log, and the subtleties of how each mill grades logs based on the market it serves. Each of these patterns shapes how an analyst reads and interprets the data, and any system meant to replace that work needs to account for them.
Observing the work gave us a map of the full process, including the judgment calls embedded in each step. From there, the team could have a more honest conversation about what AI could realistically take on and where human judgment would still be required. Steps that involved reading structured data and converting it to a standard format were strong candidates for automation, while steps that required external context or relationship judgment, or that carried financial consequences (like authorizing a payment), needed to remain human. The interesting territory was in between: quality checks that required reasoning about whether the extracted data was internally consistent. That was where we thought an AI agent could do something genuinely useful, moving beyond faster data entry toward a kind of automated review that could catch the same inconsistencies a second analyst would.
Using what we learned from that hands-on work, we built a system that processes a scale slip in three steps: extract text from the PDF, map that text into a standardized data schema, and run automated QA to validate the result. The pipeline takes unstructured input like a PDF and creates machine-readable records ready for ingestion into the Cambium platform. For any step where human review is needed, the data is presented in a format that is easy for an analyst to review and edit before it moves downstream.
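To make that shape concrete, here is a minimal sketch of how the three stages fit together. Every name in it is illustrative rather than actual pipeline code, but the flow, including the QA feedback loop described later in this post, mirrors the design.

# Illustrative sketch of the three-stage flow; all names here are hypothetical.

def extract_layout_markdown(pdf_bytes: bytes) -> list[str]:
    """Stage 1: OCR each page into layout-preserving, markdown-like text."""
    ...

def map_to_standard_schema(pages: list[str], feedback: dict | None = None) -> list[dict]:
    """Stage 2: an LLM agent maps mill-specific text into standardized log records."""
    ...

def run_qa_checks(records: list[dict], pages: list[str]) -> dict:
    """Stage 3: a second agent checks the internal consistency of the mapped data."""
    ...

def process_scale_slip(pdf_bytes: bytes) -> tuple[list[dict], dict, bool]:
    pages = extract_layout_markdown(pdf_bytes)
    records = map_to_standard_schema(pages)
    qa = run_qa_checks(records, pages)

    # A failed QA pass feeds its findings back into a second mapping attempt.
    if qa["disposition"] == "fail":
        records = map_to_standard_schema(pages, feedback=qa)
        qa = run_qa_checks(records, pages)

    # Anything that still fails, or raises flags, goes to a human reviewer.
    needs_review = qa["disposition"] != "pass" or bool(qa["flags"])
    return records, qa, needs_review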
Extraction: Preserving the Mess
The first step is turning a PDF into machine-readable text, and the goal isn’t just pulling words off a page. The table structure, the column alignments, the relationship between headers and values: all of that layout information matters for everything downstream. We use AWS Textract for this step, converting each page into a structured markdown representation that preserves the document’s original layout. We didn’t start here, though. Our first iteration used the LLM itself for extraction, and the problems that created taught us something important about where deterministic tools belong in an AI pipeline. More on that later in this post.
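For readers who want to see what that step looks like, here is a simplified sketch that calls the real boto3 AnalyzeDocument API on a single page and rebuilds each detected table as pipe-delimited rows. The markdown conversion here is a bare-bones illustration, not our production converter; among other things, it ignores merged cells, and multi-page PDFs would go through Textract’s asynchronous StartDocumentAnalysis API instead.

import boto3

def extract_tables_as_markdown(page_bytes: bytes) -> str:
    """Run Textract table analysis on one page and render each detected table
    as pipe-delimited rows, preserving row/column structure."""
    textract = boto3.client("textract")
    response = textract.analyze_document(
        Document={"Bytes": page_bytes},
        FeatureTypes=["TABLES"],
    )
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def cell_text(cell: dict) -> str:
        # A CELL block's CHILD relationships point at the WORD blocks it contains.
        words = []
        for rel in cell.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    child = blocks[child_id]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    lines = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        # Collect the table's cells keyed by (row, column) so the layout survives.
        grid = {}
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cell_id in rel["Ids"]:
                    cell = blocks[cell_id]
                    if cell["BlockType"] == "CELL":
                        grid[(cell["RowIndex"], cell["ColumnIndex"])] = cell_text(cell)
        if not grid:
            continue
        n_rows = max(r for r, _ in grid)
        n_cols = max(c for _, c in grid)
        for r in range(1, n_rows + 1):
            row = [grid.get((r, c), "") for c in range(1, n_cols + 1)]
            lines.append("| " + " | ".join(row) + " |")
        lines.append("")  # blank line between tables
    return "\n".join(lines)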
Mapping: From Mill-Specific to Standard Schema
This is where the real complexity lives. The mapping step takes the extracted text and converts it into a standardized data record that our platform can use, regardless of which mill the slip came from. An AI agent handles this work, and it needs to accomplish several things at once: identify which type of mill format it’s looking at (whether that’s a simple log list, an automated receipt from a large industrial operation, or an international format in French), parse the data according to that format’s structure, and map each field into a common schema covering species, grade, diameter, length, board footage, deductions, and price.
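The JSON excerpt below shows one mapped log record. As a rough sketch of the target shape, the standardized record can be thought of as something like the following dataclass; the field names mirror the excerpt, and the comments are illustrative guesses at units rather than the full production schema.

from dataclasses import dataclass
from typing import Optional

# Illustrative only: field names follow the JSON excerpt below,
# not the complete production schema.

@dataclass
class LogRecord:
    log_id: str
    species: str               # normalized species name, e.g. "Poplar"
    grade: str                 # mill grade code mapped to a standard vocabulary, e.g. "1SC"
    diameter: float            # inches
    length: float              # feet
    gross_bf: float            # gross board footage
    net_bf: float              # board footage net of deductions
    defect_bf: Optional[float]
    unit_price: float          # e.g. dollars per thousand board feet (MBF)
    total_price: float         # dollars paid for this log
    count: int = 1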
"logs": [
{
"log_id": "729499",
"species": "Poplar",
"grade": "1SC",
"diameter": 13.0,
"length": 8.0,
"gross_bf": 40.0,
"net_bf": 40.0,
"defect_bf": null,
"unit_price": 170.0,
"total_price": 6.8,
"count": 1,
},
...Take a look at the example scale slip above. The first entry has a log with ID 729499, followed by details about the log, and what the mill paid. The agent reads the row, identifies the species, grade, and diameter, calculates the board footage, and maps it all into a consistent JSON record. It processes each row on the slip this way. After extracting all the log-specific information, it reaches a handwritten note at the bottom that indicates one log needed to be trimmed before the mill would accept it. The parser captures this as a $40 log trim deduction, adjusting the net value from $480.24 down to $440.24, matching what the inspector wrote by hand.
Getting this right across dozens of mill formats took significant iteration. The agent uses a set of specialized tools depending on the format it detects, with each tool handling the deterministic parts of the job (parsing a known structure, calculating board footage) while the LLM handles the reasoning (identifying the format, resolving ambiguities, deciding which tool to apply and when).
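Board footage is a good example of work that belongs in a deterministic tool rather than in the model. Log scaling follows published log rules (Doyle, Scribner, and International 1/4-inch, among others), and which rule applies depends on the mill, so the sketch below is illustrative rather than a statement of how any particular slip was scaled.

def doyle_board_feet(diameter_in: float, length_ft: float) -> float:
    """Doyle log rule: BF = ((D - 4) ** 2 * L) / 16, where D is the small-end
    diameter in inches and L is the log length in feet."""
    return ((diameter_in - 4) ** 2 * length_ft) / 16

# For the 13" x 8' poplar log in the excerpt above, the Doyle rule gives 40.5 BF,
# close to the 40 BF on the slip, though the slip alone doesn't tell us which
# rule (or rounding convention) that mill actually uses.
print(doyle_board_feet(13, 8))  # 40.5

Having the agent call a helper like this, instead of doing the arithmetic itself, is what keeps the deterministic parts of the job deterministic.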
QA: The System Checks Its Own Work
Overall, the data extraction works well, with roughly 95% of the data coming through correctly. At scale, though, the remaining 5% leads to cascading data quality issues that a human would need to identify and correct, diminishing the value of the automated process. Fortunately, we can equip the pipeline to identify most of these mistakes and fix them in real time.
To catch these errors, a second AI agent reviews the mapped data before any human sees it. The challenge is that we don’t have a ground truth dataset to validate against; each scale slip we process is new to Cambium and to the pipeline. Instead, the QA agent evaluates the internal consistency of the extracted data, checking calculated summaries of the extracted log-level data against the totals and summary tables printed on the slip. If the numbers don’t match, we know something went wrong.
{
  "disposition": "pass",
  "confidence": 1.0,
  "flags": [],
  "comparison_results": [
    {
      "field": "total_logs",
      "extracted": 27,
      "calculated": 27,
      "difference": 0.0,
      "difference_pct": 0.0,
      "status": "match"
    },
    ...
  ],
  "reasoning": "The extracted scale slip data has been thoroughly validated and all summary values match the calculated values from individual log entries with 100% accuracy. Specifically: 1. **Total Logs**: Extracted (27) matches calculated (27) - 0% difference ...",
  "recommendations": [
    "No corrections needed - extraction is accurate and complete",
    "Data quality is excellent with summary totals matching calculated values",
    "The payment adjustment section is properly structured and accounts for the deduction"
  ],
  "calculated_summary": {
    "total_logs": 27,
    "total_gross_bf": 1870.0,
    "total_net_bf": 1693.0,
    "total_value": 480.24,
    "gross_value": 480.24,
    "net_value": 440.24
  },
  "extracted_summary": {
    "total_logs": 27,
    "total_gross_bf": 1870.0,
    "total_net_bf": 1693.0,
    "total_cull_bf": null,
    "total_value": 440.24,
    "has_species_grade_breakdown": true
  }
}

In practice, this catches real issues. On one slip, the QA agent flagged that the sum of extracted gross board footage did not match the summary printed on the scale slip, causing the slip to fail QA. That discrepancy triggered a rerun of the mapping pipeline, where the mapping agent used the QA feedback to make an additional tool call, allowing the missing logs to be properly accounted for on the second pass. The agent’s report detailed what it found, what matched, and what it had concerns about based on the totality of the data. In the rare case a human does need to evaluate the results, the detailed QA output gives the reviewer specific things to look at rather than a binary pass or fail.
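The core of that consistency check is mechanical enough to sketch. The field names below follow the QA report above; the real agent layers LLM reasoning and tool calls on top of a comparison like this, but the underlying idea is recomputing totals from the extracted log lines and comparing them to the totals printed on the slip.

def compare_summaries(logs: list[dict], slip_summary: dict,
                      tolerance_pct: float = 0.5) -> dict:
    """Recompute summary values from extracted log lines and compare them
    to the totals printed on the slip. Field names follow the QA report above."""
    calculated = {
        "total_logs": sum(log.get("count", 1) for log in logs),
        "total_gross_bf": sum(log["gross_bf"] for log in logs),
        "total_net_bf": sum(log["net_bf"] for log in logs),
        "total_value": sum(log["total_price"] for log in logs),
    }
    results, flags = [], []
    for field, calc in calculated.items():
        extracted = slip_summary.get(field)
        if extracted is None:
            continue  # not every mill prints every summary value
        diff = extracted - calc
        pct = abs(diff) / calc * 100 if calc else 0.0
        status = "match" if pct <= tolerance_pct else "mismatch"
        if status == "mismatch":
            flags.append(f"{field}: slip shows {extracted}, log lines sum to {calc}")
        results.append({"field": field, "extracted": extracted, "calculated": calc,
                        "difference": diff, "difference_pct": pct, "status": status})
    return {
        "disposition": "pass" if not flags else "fail",
        "flags": flags,
        "comparison_results": results,
    }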
Human in the Loop: The Review Step That Matters
Even with these guardrails in place, the pipeline doesn’t push data straight through to our transaction platform. Every processed slip goes through a QC review where a single analyst can see the extracted log table, the aggregated summaries, and any flags the QA agent raised. The data is editable, and the analyst reviews, corrects if needed, and approves the generated sale summary.
This is a deliberate design choice. The original process required two people working independently, while the new process requires one person reviewing data that has already been extracted, mapped, and checked. That reviewer’s job is fundamentally different now: instead of transcribing from scratch and hoping they don’t make an error, they’re evaluating a pre-built result with the AI’s own assessment alongside it.
The transparency of the QA output matters as much as its accuracy. The agent doesn’t just say “pass” or “fail”; it explains its reasoning, detailing what it compared, what matched, and where it has concerns. Over time, that visibility helps the team build confidence in the system. They can see when it’s right, they can see when it catches something, and when it misses something, they understand why and can flag it for the team to improve.
The Numbers
Since deploying the Fiber Ingest Pipeline, processing time has dropped from roughly 25 minutes per slip for two people to about 8 minutes for one person, an 85% reduction in person-time per slip. The time that remains is almost entirely the human-in-the-loop review step, which is exactly the part we want a person doing. The pipeline has already ingested over 6 million board feet of log data from mills around the country, and we’re targeting 24 million board feet in the year ahead as we expand the number of mills and the volume of slips we process.
Those numbers matter, but the more important shift is a structural one. The ops team’s time has moved from low-value transcription work to higher-value review and decision-making. That’s the kind of change that compounds as the business grows.
Don’t Ask the LLM to Do Everything
The current architecture wasn’t the first one we tried. The initial version of the pipeline used the LLM for both OCR and data mapping, essentially asking the model to look at the raw document and produce a structured record in a single pass. It mostly worked, but the OCR errors were small and non-deterministic: a digit misread here, a column slightly misaligned there. Each error was minor on its own, but they cascaded. A misread board footage value would ripple through the QA step, triggering flags that obscured real issues, and the team spent increasing amounts of time building workarounds for edge cases that were really symptoms of unreliable extraction.
The fix was to separate the jobs. Moving the OCR work to Textract stabilized this step, and freed us to invest development effort where the LLM’s reasoning ability actually matters: mapping data across inconsistent formats, running QA logic that requires judgment, and building tools to address the gaps that remained (like using summary tables to flag incomplete log-level data). The lesson was a practical one. LLMs are powerful, but they’re not the right tool for every step in a pipeline, and using a purpose-built service for OCR while reserving the LLM for reasoning tasks made the entire system more reliable.
Earning Trust Is Part of the Deliverable
The second lesson was harder to anticipate, and it had less to do with the technology than with the people relying on it. When we took on the scale slip workflow, the operations team became our customers. They were the ones who had owned this process manually, and they were the ones who would need to trust the pipeline enough to change how they worked.
When people own a process manually, they handle an enormous amount of ambiguity without really thinking about it. A slip with unusual formatting, a grade code they haven’t seen before, a column that doesn’t quite line up: an experienced analyst works through it and moves on. Once you start to automate that process, the expectation shifts. Users don’t evaluate the system on its average performance; they judge it by its worst case. We experienced this firsthand when one mill’s slips had non-standard logic that was difficult for the LLM to parse consistently but straightforward for a human to manage. Errors on that single format reduced confidence in the entire pipeline, even though other mill formats were processing cleanly.
From a data science perspective, our accuracy metrics were strong. The system was performing well by any standard technical benchmark. But for the ops team, one unreliable format was enough to make them question whether they could depend on it. Standard metrics told us the pipeline was working. The ops team’s experience told us it wasn’t working well enough. Both were true, and the second one mattered more.
By replacing a manual workflow, we had taken ownership of the outcome. The ops team couldn’t easily revert while we iterated, and they shouldn’t have to — our job extended beyond building a pipeline that performed well on aggregate. We needed to set clear expectations about what the system could and couldn’t handle, get genuine buy-in on those boundaries, and align on what “done” looked like from the ops team’s perspective, not just ours. The threshold for trust in an automated system is different from the threshold for trust in a person. People get the benefit of the doubt when they make mistakes; software doesn’t. When you take on automation work for an internal team, you’re not just building a tool. You’re asking people to change how they work, and you’re committing to make that change successful on their terms.
What Comes Next
The Fiber Ingest Pipeline started with scale slips because they were the most immediate bottleneck, but the broader opportunity is much larger. Every step in the lumber supply chain, from harvest to installation of the final product, generates transaction data that lives in some combination of paper, PDFs, emails, and spreadsheets. The same approach we used here (understand the process, identify where AI reasoning adds value, build tools for deterministic tasks, keep humans in the loop where they matter) applies to each of those steps.
In the next post in this series, we’ll walk through the technical architecture of the pipeline in more detail: how the agents reason through a scale slip using tool-based decision making, the eval-driven development process we used to test and iterate, and the patterns we’ve found useful for building reliable AI systems on messy real-world data.