Why I think Anthropic's approach to data poisoning is solving the right problem from the wrong end

11 min read Original article ↗

A few months ago I read Anthropic’s research on data poisoning defenses for large language models. I have a lot of respect for their team and their research, the resulting paper is a fantastic effort with good findings. But as I read it, I found myself thinking they were approaching the problem the way ML engineers would approach this problem — from inside the model and training — when the more scalable answer lives one layer below, at the infrastructure layer, in my opinion.

In another post, I’ll also cover how I believe this exact same approach can solve real time prompt injection which occurs outside of user interactions….. which Open AI also believes is too difficult to fully stop.

I have a background in systems, cybersecurity, software, SaaS, and infrastructure. I’ve built security systems at Fortune 5 companies, led highest tier payment security compliance programs for payments serving the NFL, MLB, and NBA clients, and prototyped from IoT edge security to brainwave signal firewalls. In what may seem as a disjoint comment, I also have a lifestyle side project that gets > 1M visitors a year, without marketing - which helps me understand the scope of distributed data poisoning to come.

I believe the scope is much greater than others believe. I make my case for that scope below. I mention hedge funds, Fortune 3000, robotic operating systems, sensors, etc …. as impacted by these issues in the future.

I am not an ML researcher. I think that is why I see this differently.

Here’s my case.

The Anthropic approach, as I understand it, operates at the LLM pre-training and data storage level — embedding defenses inside the model and build steps itself, using LLM-native methods to detect when training or fine-tuning data has been corrupted. It’s a sophisticated solution built by people who think in model internals.

The issue isn’t that it doesn’t work. The problem, I believe, is that it is brittle in a specific way: it ties your defense to the architecture of one model family and pipeline, requires integration with the pipeline of pre-training, instruction, and safety fine tuning. I believe this has to be rebuilt or re-validated every time the underlying model changes. You’re playing defense inside a moving target.

In cybersecurity, there is an approach, which simplified, is called “zero trust”. To keep the summary short, it simply implies every step and architecture in a system should in no way trust any other. In some cases, a system (whether it be a server, operating system or hardware element) might not automatically even trust itself.

The approach I see with using LLM pre-training pipelines currently - is that all the content that isn’t obviously bad (eg: pr0n, etc) is trusted with the approach that volume and statistics of the models will dampen out poisoned results.

Results from Anthropic’s paper and testing, suggest that models with between 600M - 13B parameters, could be backdoored with as little as 250-500 documents. In my belief, this suggests the major frontier providers as well as EVERY business employing LLMs (whether publicly hosted or privately hosted) now have a huge back door.

A few quick examples are coming, I swear…….

In antivirus and endpoint security realms, security teams learned in the 1990s and 2000s: you cannot build reliable security that lives entirely inside the thing being attacked. The answer was to move the defense perimeter outward — to the OS layer, the network layer, the hardware layer. You would combine defense in depth and Zero trust. Never assume anything that enters the system is clean.

Some simple, but relatable examples might be firewalls in front of application servers, as well as intrusion detection tools on a network or on a system.

Applied to LLMs, this means: don’t try to detect poisoning after data enters the model ecosystem. Evaluate the data before it ever gets near the model or it’s storage.

Data has historically been treated by most except the large social platforms as clean / curated. Yes it might have sentiment issues or image issues, but the words themselves were not an issue in business settings.

I believe that has changed. Even the words consumed themselves are now a large issue - because you can use those to move decisioning systems away from their desired output.

My core premise: treat every piece of data entering an AI pipeline the same way an antivirus engine treats an unknown executable or a firewall treats a network packet. Data is hostile until proven otherwise. Score it. Quarantine the uncertain. Reject the clearly malicious.

I think the solution is Decision Integrity Infrastructure.

What I’ve built is an infrastructure-layer defense that operates as a data pipeline agent — external to any specific model, agnostic to what’s upstream or downstream. It assigns two independent scores to every piece of data in near real-time - sort of like if antivirus, firewalls, data pipelines, and isolation build testing had an LLM focused baby…… with a signature and behavior score. These terms will mean something different to each but in my terms…..

Signature score. Pattern-matched against known poisoning signatures — analogous to a traditional AV signature database. Fast, deterministic, cheap. Catches known attack vectors immediately, at scale, with negligible latency.

Behavioral score. Slower but more powerful. Analyzes what the data causes when processed — it is downstream behavioral fingerprint rather than its static appearance. This is where novel, zero-day poisoning attempts get caught. An attacker can craft data with no known signature; they cannot easily craft data that produces no anomalous behavioral signal.

The two scores run in parallel. Data with a clean signature score and a clean behavioral score passes through. Data that trips either threshold gets quarantined for review. The architecture is designed so the signature layer handles the vast majority of traffic with near-zero latency, while the behavioral layer runs asynchronously on the uncertain tail.

This is similar in architectural evolution that endpoint security went through: signature AV in the 1990s → behavioral sandboxing in the 2010s → combined platforms today. I’m proposing the same for LLM data ingestion.

The enterprise implementation integrates natively with a streaming pipeline like Kafka or Apache Beam — meaning it slots into existing data infrastructure without requiring teams to rebuild their pipelines around it. It runs as an agent (not to be confused with “ai agent”). If your organization already has a streaming data architecture, this becomes another stage in that pipeline.

Coverage is not limited to text. The system should handle static files, streaming data, and image inputs — addressing multimodal poisoning vectors that purely text-focused defenses miss entirely.

The signature scoring can be common or specialized for each LLM. The behavioral scoring can alter depending on the desired goal. You might append the content to a subset of a model and programmatically test the outcome. You might want to run images through classifiers and remove any that do not meet a goal. You might want to run code or computer commands on a newly spun up VM to note it’s behavior.

I know architecturally, people want more - that will be a whole other post.

Robot Operating Systems and Sensors. LLM interpreted images, video, etc.

When I started this work I was thinking about enterprise LLM deployments. But the architecture has a property I find interesting because the insight had come from struggles I had with an earlier personal project: there are ways to make it lightweight enough to run on low-power, resource-constrained edge devices.

[Side note - skip if you don’t want to read about a brainwave firewall: in 2019 I worked on a personal project to implement EEG brain wave validation for brain machine interface systems. I had noticed that with a system of the nature of Neuralink etc, a vulnerability occurred on both sides of an Internal to External communication. Without going into the weeds, BMI machines would need a firewall also because they’d be open to denial and volume attacks, but also to data attacks, one could architecturally circumvent the “networking” and possibly alter EEG waves in someones head , under certain conditions].

I have a working agent that runs on hardware typical of robot operating system (ROS) deployments — the kind of compute you’d find in autonomous vehicles, industrial robots, and embedded AI systems. The implications are significant. Data poisoning in a language model means bad outputs. Data poisoning in a robot’s perception or decision stack means bad physical actions. The safety stakes are categorically different, and the defense layer needs to move closer to the sensor, not further away.

Running the defense on-device, at the source, before data ever travels over a network, is one approach that makes sense in environments where latency and connectivity cannot be guaranteed.

In practice, in the future ROS can have the visual data input altered to make insane situations appear normal as believed by the on board LLMs.

Right now, most AI systems are trained on reasonably controlled data and fine-tuned by organizations with some visibility into what they’re ingesting. This will not always be the case. As RAG pipelines pull from live external sources, agentic systems write to and read from shared context or code bases, and as AI models are fine-tuned on user-generated data at scale, the attack surface for data poisoning grows by an order of magnitude.

The defenses being built today — including Anthropic’s — were designed for today’s threat surface. The infrastructure-layer approach scales to be independent of LLM code pushes or hot fixes, because it doesn’t care what model architecture it is protecting or what data modality is being inspected. It just asks: is this data clean enough to obfuscate statistically?

At my core, I’m a version of what Paul Graham called someone whose goal is to make technology do something whether the technology wants to or not. I think in systems, which means I naturally also think in the strengths and weaknesses.

Here’s what I think happens in the next 5 years:

Hedge funds can be compromised by subtly changing either the external data they consume or the internal trading rules sitting in LLMs they use for evaluating trades. This could cost $hundreds of millions.

Banks will be compromised by having their decisioning systems altered on payment risk scores or routing altered by simply altering the rules with data poisoned over time.

Even if no external data is consumed to poison a system, data rules based on internal files and converted to LLMs will be compromised either at rest, or in transit to the model. Once the trained data is altered just enough - finding those situations become hidden and difficult to find.

Frontier models can be manipulated at a grey hat level to alter product suggestions and where to purchase those products; separately, backdoors laying in wait for particular people or from specific companies may be served malicious content to an end user which bypasses all other security controls. Examples: commands which download files from other sources, or a command to erase a hard drive, etc.

Even Fortune 3000 companies, whom only use local LLM models with their own private data lakes, files, and networks can have their decisioning system LLMs compromised through incremental data poisoning, state actors or others altering the local files input to an LLM, or in transit.

Some would argue “if you have access to my files, there is a bigger problem”. I agree, but take into account a system admin or a senior team lead in finance who translates business rules and accounting rules to markdown, or uploads them to company intranets which also get ingested into the LLM - simply being social engineered or malicious content link jacked - can now be responsible for shifting the entire way a company’s decisioning systems, decide.

I’ve kept this work private until now. I’m sharing it because I think the AI safety community is having a serious and important conversation about data integrity, and I believe the cybersecurity frame is underrepresented in it.

If you work in AI safety, LLM infrastructure, or autonomous systems and want to go deeper on the technical specifics — the signature methodology, the behavioral scoring approach, or the ROS agent architecture — I’m genuinely interested in that conversation. Or in the spirit of HN - you think I’m a dolt…. You can see me on LinkedIn.

And if someone at Anthropic reads this and feels I mis-represented the findings or other : I’d enjoy correcting myself as well.

Discussion about this post

Ready for more?