Why I Treat Data Ingestion as an Adversarial Process

It took me longer than I’d like to admit to accept this.

Like most engineers, I used to think of data ingestion as a technical chore: fetch records, normalize fields, store them, move on. Errors were edge cases. Retries were a safety net. Missing data was something we’d “clean up later.”

That mindset works until you scale. More data means retries fire more often, and missing data stops being a nuisance and becomes a real problem.

While building AIGrantMatch, I realized something uncomfortable: the moment you treat ingestion as a cooperative process, you’ve already ceded control over the truth of your system.

Grant data doesn’t arrive clean, consistent, or complete.
Statuses drift. APIs change semantics quietly. Records disappear and reappear. Fields that were “required” last week suddenly go null.

None of this is malicious, but none of it is reliable either.

If your system assumes:

successful fetches are permanent,
failures are rare,
retries eventually converge on correctness,

then you’re not building a pipeline. You’re building hopes and dreams riding on magic rainbow unicorns.

The most dangerous failures aren’t crashes.
They’re quiet partial successes.

A record that fetched once but never refreshed.
A status that failed to update but didn’t error loudly enough to alert anyone.
A background job that “completed” but didn’t actually finish its work.
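One way to catch that last case is to make a job prove what it did instead of trusting its exit status. Here is a minimal sketch, assuming a hypothetical job that reports which record IDs it actually refreshed. The JobReport name and its fields are illustrative, not taken from AIGrantMatch:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JobReport:
    """What a refresh job was asked to do versus what it actually touched.

    Hypothetical shape, for illustration only.
    """
    expected_ids: frozenset[str]
    refreshed_ids: frozenset[str]

    @property
    def missing_ids(self) -> frozenset[str]:
        # Records the job was supposed to refresh but never did.
        return self.expected_ids - self.refreshed_ids

    @property
    def is_partial(self) -> bool:
        # The job may have exited cleanly and still skipped records.
        return bool(self.missing_ids)


report = JobReport(
    expected_ids=frozenset({"grant-1", "grant-2", "grant-3"}),
    refreshed_ids=frozenset({"grant-1", "grant-3"}),
)
if report.is_partial:
    print("job finished but skipped:", sorted(report.missing_ids))
```

The point is not the set arithmetic. It is that "completed" becomes a claim you check rather than a status you believe.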

In AIGrantMatch, I stopped thinking in terms of “success vs failure” and started tracking freshness, intent, and ownership instead:

When was this data last attempted?
Is something actively working on it right now?
If it failed, is that a terminal state or just the last known outcome?

Once you ask those questions, a simple boolean like isFetched stops being useful.

Here’s the part most teams miss:

Every retry policy encodes a value judgment.

How long do you keep trying?
Who gets blocked while you do?
When do you surface uncertainty to the user?

Those aren’t engineering details — they shape trust.
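Writing the policy down as data makes those judgments visible. The sketch below is one possible shape, not the policy AIGrantMatch uses. Every name and default value in it is an assumption chosen for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical retry policy; all defaults are illustrative assumptions."""
    max_attempts: int = 5                            # how long do you keep trying?
    base_delay: timedelta = timedelta(minutes=1)     # wait before the first retry
    max_delay: timedelta = timedelta(hours=6)        # cap on any single wait
    warn_user_after: timedelta = timedelta(days=1)   # when do you surface uncertainty?

    def next_delay(self, attempts_so_far: int) -> timedelta | None:
        """Exponential backoff capped at max_delay. None means stop retrying."""
        if attempts_so_far >= self.max_attempts:
            return None
        return min(self.base_delay * (2 ** attempts_so_far), self.max_delay)

    def should_warn_user(self, data_age: timedelta) -> bool:
        """True once the data is old enough that the user should be told."""
        return data_age >= self.warn_user_after


policy = RetryPolicy()
print(policy.next_delay(0))   # delay before the first retry
print(policy.next_delay(5))   # None: attempts exhausted
```

Each default here is a decision someone has to own. Changing warn_user_after from a day to a week is a product choice about honesty, not a tuning parameter.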

In AIGrantMatch, I had to accept that some records would be stale by design. The alternative was lying by omission: presenting data as authoritative when it wasn’t.

That tradeoff is uncomfortable. But it’s honest.

Here’s one change that made everything else click for me:

I stopped modeling ingestion as fetched vs not fetched.

Instead of a single boolean or timestamp, I track ingestion as a state machine with intent.

At a minimum, each record carries:

when it was last attempted
whether it’s currently being worked on
what the last known outcome was
whether that outcome is considered final or retryable

Conceptually, it looks less like this:

isFetched: true

and more like this:

detailsStatus: FETCHING | COMPLETE | FAILED
detailsFetchedAt: timestamp | null
detailsErrorAt: timestamp | null

Why this matters:

A record stuck in FETCHING too long isn’t “in progress” — it’s abandoned.
A FAILED record isn’t dead unless you explicitly say it is.
Freshness becomes measurable instead of implied.
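To make those three points concrete, here is a rough Python sketch of the model. The three status values and the fetched/error timestamps mirror the fields above. The details_attempted_at and is_terminal fields, the 15-minute lease, and the helper functions are my assumptions for illustration:

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class DetailsStatus(Enum):
    FETCHING = "FETCHING"
    COMPLETE = "COMPLETE"
    FAILED = "FAILED"


@dataclass
class GrantRecord:
    details_status: DetailsStatus
    details_attempted_at: datetime | None = None  # assumed: when work last started
    details_fetched_at: datetime | None = None
    details_error_at: datetime | None = None
    is_terminal: bool = False                     # assumed: FAILED for good?


FETCH_LEASE = timedelta(minutes=15)  # assumed cutoff for "in progress"


def is_abandoned(record: GrantRecord, now: datetime) -> bool:
    """A FETCHING record older than the lease is abandoned, not in progress."""
    if record.details_status is not DetailsStatus.FETCHING:
        return False
    if record.details_attempted_at is None:
        return True  # claims to be in progress but has no start time
    return now - record.details_attempted_at > FETCH_LEASE


def is_retryable(record: GrantRecord, now: datetime) -> bool:
    """FAILED is retryable unless explicitly terminal; abandoned work is too."""
    if record.details_status is DetailsStatus.FAILED:
        return not record.is_terminal
    return is_abandoned(record, now)


now = datetime.now(timezone.utc)
stuck = GrantRecord(DetailsStatus.FETCHING, details_attempted_at=now - timedelta(hours=2))
print(is_abandoned(stuck, now))  # True: two hours exceeds the 15-minute lease
```

Note that time is passed in rather than read inside the helpers, which keeps the staleness rules easy to test.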

Once you do this, retries stop being magic. You can reclaim stale work. You can surface uncertainty to users honestly. And you can answer questions like “Is this data trustworthy right now?” instead of pretending the answer is always yes.
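Answering the trust question can be as small as one function. This is a sketch under assumptions: the seven-day freshness window, the label wording, and the trust_label name are all hypothetical, and a real system would likely pick different thresholds per data source:

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)  # assumed freshness window


def trust_label(status: str, fetched_at: datetime | None, now: datetime) -> str:
    """Turn ingestion state into something honest to show a user."""
    if fetched_at is None:
        return "unverified"  # never successfully fetched
    age = now - fetched_at
    # Only call data fresh if the last outcome was a success and it is recent.
    if status == "COMPLETE" and age <= MAX_AGE:
        return "fresh"
    return f"stale (last confirmed {age.days} days ago)"


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
print(trust_label("COMPLETE", now - timedelta(days=2), now))
print(trust_label("FAILED", now - timedelta(days=10), now))
```

A FAILED record with an older successful fetch still has something to show. It just has to be labeled as what it is.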

That’s what designing for adversarial reality looks like in practice:
you stop assuming success, and you start accounting for time, ownership, and decay.

Treating ingestion as adversarial doesn’t mean pessimism.
It means clarity.

It means assuming:

sources will contradict themselves,
failures will cluster, not scatter,
and “we’ll fix it later” usually means “we won’t notice.”

Once I adopted that posture, the system got simpler. States became explicit. Ownership became clear. And most importantly, I stopped confusing absence of errors with presence of truth.

If your pipeline only works when everything behaves, it doesn’t work.

It just hasn’t failed yet.
