What election polling teaches us about ML-based email security

Like election polls, email security models hide uncertainty and struggle with rising AI complexity. (Image credit: Thapana Onphalai via Getty Images)

The margin of error is wider than your vendor is telling you. And the agentic inbox is about to make it catastrophically worse.

Two weeks before a major election, a respected polling organization publishes its latest numbers. Candidate A leads by three points. Margin of error: plus or minus four. The lead is smaller than the margin of error.

The headline says “Candidate A leads.” Millions of people form opinions based on a lead that is statistically indistinguishable from a tie.
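The arithmetic behind that claim is short enough to check by hand. The sketch below assumes a simple random sample and uses hypothetical figures (600 respondents, 48% versus 45%) that do not come from any real poll. It also shows a detail headlines usually skip: the uncertainty on the lead itself is roughly twice the per-candidate margin that gets reported.

```python
import math

def margin_of_error(share, n, z=1.96):
    """Approximate 95% margin of error for one candidate's share (as a proportion)."""
    return z * math.sqrt(share * (1 - share) / n)

def lead_margin_of_error(p_a, p_b, n, z=1.96):
    """Approximate 95% margin of error for the lead p_a - p_b.

    Both shares come from the same sample, so they are negatively
    correlated: Var(p_a - p_b) = (p_a + p_b - (p_a - p_b)**2) / n.
    """
    variance = (p_a + p_b - (p_a - p_b) ** 2) / n
    return z * math.sqrt(variance)

# Hypothetical poll, chosen only for illustration.
n = 600
p_a, p_b = 0.48, 0.45

share_moe = margin_of_error(p_a, n)
lead = p_a - p_b
lead_moe = lead_margin_of_error(p_a, p_b, n)

print(f"per-candidate margin: +/- {share_moe * 100:.1f} points")
print(f"lead: {lead * 100:.1f} points, +/- {lead_moe * 100:.1f} points")
print("interval on the lead includes zero:", lead - lead_moe < 0 < lead + lead_moe)
```

With these inputs the per-candidate margin comes out near four points and the margin on the lead near 7.7 points, so a three-point lead sits well inside the noise.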

The attacks your model was never trained on

Adversary-in-the-Middle phishing works by placing a reverse proxy between the victim and a legitimate authentication service. The victim clicks a link, enters credentials, completes their MFA challenge, and authenticates successfully.

The proxy captures the live session cookie. The attacker now has authenticated access without needing the password. MFA is not broken. It is bypassed. The authentication event happened legitimately. The attacker just intercepted the proof of it.

Here is what makes this an ML detection problem of a different kind. The email that initiates the attack is often completely clean by every surface measure. The sending infrastructure may be legitimate. The URL may be a real SharePoint link.

The social engineering is contextually appropriate. There is no malicious attachment, no suspicious payload, no domain registered yesterday. Every signal ML models have been trained to recognize as indicative of malicious email is absent. The attack specifically engineers around those signals because the attackers understand exactly what the models are looking for.

And the training data problem compounds this. AiTM at operational scale is only a few years old. The labeled sample of confirmed AiTM-initiating emails is thin compared with the sample for commodity attacks.

Worse, the training data is systematically biased: models learn primarily from AiTM variants that were eventually detected by other means and retrospectively labeled. The sophisticated variants that passed through undetected never entered the training set. They were not caught, so they were not labeled, so the model never learned from them.

This is the nonresponse-bias problem that breaks election polling. Pollsters who only reach people who answer their phones are not sampling the electorate. They are sampling people who answer their phones. Your AiTM detector has the same structural flaw. It is modeling the attacks it could see, not the attacks it needed to catch.

The model is reporting a three-point lead with a four-point margin of error. Your dashboard just does not show you the margin of error column.
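The labeling bias described in this section can be made concrete with a toy calculation. Every number below is a hypothetical assumption, not measured data: two kinds of AiTM email, each with an assumed chance of being retrospectively labeled and an assumed chance of being flagged by the model, with labeling and flagging treated as independent within each kind. The sketch compares the recall you would measure on the labeled set with the recall across everything that was actually sent.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    name: str
    sent: int            # emails of this kind actually delivered (hypothetical)
    label_rate: float    # fraction later confirmed and labeled by other means
    detect_rate: float   # fraction the model flags

# Hypothetical mix: sophisticated variants are rarely labeled and rarely caught.
variants = [
    Variant("commodity-like AiTM", sent=700, label_rate=0.90, detect_rate=0.95),
    Variant("sophisticated AiTM", sent=300, label_rate=0.10, detect_rate=0.20),
]

def measured_recall(vs):
    """Recall computed only over emails that made it into the labeled set."""
    labeled = sum(v.sent * v.label_rate for v in vs)
    caught = sum(v.sent * v.label_rate * v.detect_rate for v in vs)
    return caught / labeled

def true_recall(vs):
    """Recall over every attack email that was sent, labeled or not."""
    total = sum(v.sent for v in vs)
    caught = sum(v.sent * v.detect_rate for v in vs)
    return caught / total

measured = measured_recall(variants)
actual = true_recall(variants)
print(f"recall on the labeled set: {measured:.1%}")
print(f"recall on all attacks:     {actual:.1%}")
```

Under these assumptions the dashboard figure is about 92% while the real figure is 72.5%. The gap exists purely because the hard cases rarely reach the evaluation set.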

Your own IT decisions are breaking your baselines

The secondary ML approach for catching account takeover is behavioral anomaly detection. Build a baseline of what normal looks like for each account. Flag meaningful deviations. A wire transfer approved at 3 a.m. from an account that never acts at that hour. Sub-minute response latency from someone who normally takes hours. A login from an unfamiliar geography followed immediately by financial activity.
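A minimal sketch of the first of those signals, the off-hours approval, might look like the following. It is an illustration under stated assumptions rather than how any particular product works: the history of approval hours, the smoothing constant and the 1% threshold are all hypothetical.

```python
from collections import Counter

def hour_probability(hour, history_hours, smoothing=1.0):
    """Smoothed estimate of how often this account acts during a given hour.

    Additive smoothing keeps unseen hours from getting probability zero.
    """
    counts = Counter(history_hours)
    total = len(history_hours) + smoothing * 24
    return (counts[hour] + smoothing) / total

def is_rare_hour(hour, history_hours, threshold=0.01):
    """Flag an event whose hour is rarer than the (hypothetical) threshold."""
    return hour_probability(hour, history_hours) < threshold

# Hypothetical history: 200 approvals, all during business hours.
history = [9] * 20 + [10] * 35 + [11] * 30 + [14] * 40 + [15] * 45 + [16] * 30

for hour in (3, 15):
    p = hour_probability(hour, history)
    print(f"hour {hour:02d}: p={p:.4f}, flagged={is_rare_hour(hour, history)}")
```

An approval at 03:00 is flagged and one at 15:00 is not. The signal is only as good as the assumption that the history contains nothing but the human's own activity.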

These were reliable signals in a simpler world. The world is no longer simple.

Enterprise AI agents are being deployed at scale right now. Microsoft Copilot drafts and sends responses on behalf of users. Workflow automation agents process approvals. Scheduling agents manage calendar-adjacent email. Financial agents handle routine transaction communications. Some organizations already have multiple agents running concurrently on executive inboxes.

Think about what this looks like from the perspective of a behavioral baseline model. The human has a characteristic signature built over years: inconsistent timing, occasional typos, variable response latency. Emails from a phone in traffic look different from emails written at a desk. The signature is distinctly human.

The Copilot agent sends grammatically perfect, consistently formatted responses at sub-minute latency regardless of time of day. The scheduling agent fires at precise intervals. The financial workflow agent responds to trigger phrases with templated precision at whatever hour the condition is met. From the model’s perspective, the inbox now looks like three or four distinct actors operating through a single account.

This is operationally close to what a compromised account with an attacker-installed persistence layer looks like.

Security teams can partially mitigate this. You can label agent-generated activity explicitly in your detection pipeline, segment baselines by actor type, and build separate behavioral profiles for human and automated traffic. Some teams are already doing this.
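One way to picture that segmentation is a baseline keyed by account and actor rather than by account alone. The sketch below is hypothetical throughout: the `MailEvent` fields, the actor names and the idea that an actor tag is available on each event are assumptions for illustration, not a description of any real pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MailEvent:
    account: str
    actor: Optional[str]   # e.g. "human" or "copilot"; None when no tag exists
    latency_min: float     # minutes between receiving and replying

# Hypothetical inventory of actors the security team knows about.
KNOWN_ACTORS = {"human", "copilot", "scheduler"}

def segment(events):
    """Split events into per-(account, actor) histories.

    Events with a missing or unknown actor tag are returned separately
    instead of being folded into any baseline.
    """
    profiles = defaultdict(list)
    untagged = []
    for event in events:
        if event.actor in KNOWN_ACTORS:
            profiles[(event.account, event.actor)].append(event.latency_min)
        else:
            untagged.append(event)
    return dict(profiles), untagged

events = [
    MailEvent("cfo@example.com", "human", 95.0),
    MailEvent("cfo@example.com", "copilot", 0.5),
    MailEvent("cfo@example.com", "human", 180.0),
    MailEvent("cfo@example.com", None, 0.6),   # agent nobody inventoried
]

profiles, untagged = segment(events)
print(profiles)
print("events needing review:", len(untagged))
```

Routing untagged events to a review queue rather than to the human profile is a design choice: it turns an inventory gap into something visible.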

But the mitigation only works if every agent is inventoried, every integration is tagged, and the labeling stays current as agents get added and updated. In practice, agent deployments outpace security teams' awareness of them. And if even one agent’s activity leaks into the human baseline unlabeled, the contamination compounds silently. The model absorbs agent behavior as human behavior. What was once anomalous becomes the new normal. The baseline is rebuilt on contaminated ground.
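The contamination effect can be shown with a toy baseline. The sketch assumes a robust z-score (median and median absolute deviation) on log-scaled reply latency, which is one common choice and not necessarily what any vendor uses; the latency values and the 3.5 cutoff are hypothetical.

```python
import math
from statistics import median

def robust_z(value, history):
    """Robust z-score of a reply latency against a history, on a log10 scale."""
    logs = [math.log10(x) for x in history]
    med = median(logs)
    mad = median([abs(x - med) for x in logs])
    if mad == 0:
        raise ValueError("history has no spread; robust z is undefined")
    return 0.6745 * (math.log10(value) - med) / mad

CUTOFF = 3.5  # hypothetical alerting threshold on |z|

# Hypothetical reply latencies in minutes.
human_history = [95, 180, 240, 60, 310, 150, 205, 120]
unlabeled_agent = [0.4, 0.5, 0.6, 0.5, 0.7, 0.4, 0.6, 0.5]

suspicious_latency = 0.7  # a sub-minute reply on a normally slow account

z_clean = robust_z(suspicious_latency, human_history)
z_contaminated = robust_z(suspicious_latency, human_history + unlabeled_agent)

print(f"clean baseline:        z={z_clean:.2f}, alert={abs(z_clean) > CUTOFF}")
print(f"contaminated baseline: z={z_contaminated:.2f}, alert={abs(z_contaminated) > CUTOFF}")
```

Against the clean human history the sub-minute reply scores far beyond the cutoff. Once the unlabeled agent traffic is mixed in, the same reply scores well inside it and no alert fires.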

The harder structural problem remains: even with perfect labeling, you have expanded the definition of “normal account behavior” to include automated, off-hours, grammatically perfect, sub-minute-latency activity. A real attacker operating alongside legitimate agents now falls within that expanded definition. The behavioral signal surface has genuinely narrowed.

What good pollsters do when their models break

The best polling organizations do not abandon quantitative models when confidence intervals widen. They do something more disciplined. They acknowledge the uncertainty explicitly in how they communicate findings.

They triangulate against independent data sources rather than trusting a single model. They weigh certain signals more heavily when the model is operating outside its training conditions. And critically, they treat poll output as a prior probability, not a conclusion. The model tells you where to look. It does not tell you what to decide.

Secure email needs the same architectural relationship with ML output. A probability score should be the starting point of an assessment, not the end of one.
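Treating a score as a prior has a simple mechanical form: convert it to odds, then multiply by a likelihood ratio for each independent piece of evidence. The sketch below assumes the model score is a calibrated probability and that the extra signals are conditionally independent, both of which are strong assumptions. The signal names and their likelihood ratios are invented for illustration.

```python
def update(prior, likelihood_ratios):
    """Bayesian update of a probability using independent likelihood ratios."""
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1 + odds)

# Hypothetical detector output: the email looks almost clean.
model_score = 0.08

# Hypothetical contextual evidence, expressed as likelihood ratios
# (how much more likely the observation is if the email is malicious).
evidence = {
    "first authentication request ever from this counterparty": 6.0,
    "urgency framing unlike this sender's past messages": 4.0,
}

posterior = update(model_score, evidence.values())
print(f"model score (prior): {model_score:.2f}")
print(f"after context:       {posterior:.2f}")
```

Here a score of 0.08 rises to about 0.68 once the two contextual observations are counted, which is the difference between ignoring the email and investigating it.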

But I want to be honest about what this actually requires, because it is harder than it sounds and the industry has not solved it yet.

The questions that matter for catching the attacks described above are questions like: does this authentication request make sense given who sent it, who received it, what their relationship looks like, and what the organizational workflow normally requires at this step? Is the urgency framing consistent with how this counterparty has historically communicated? Would a reasonable, informed person who understood this organization’s context find this email suspicious even if every surface feature looks clean?

Those are not pattern-matching questions. They are reasoning questions. And no one in the industry, including my own company, has fully closed the gap between what ML pattern matching can do and what contextual reasoning requires. We are all building toward it from different directions. Some approaches will work. Some will not. The honest assessment is that the problem is genuinely hard and the tools are still maturing.

What security leaders can do right now is stop treating detection scores as verdicts. Demand that your vendors disclose confidence intervals alongside probability scores. Instrument your agent deployments as security-relevant events with the same rigor you apply to new user provisioning.
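What a disclosed interval could look like is sketched below using the Wilson score interval, a standard method for a proportion estimated from a small count. The scenario is hypothetical: of 40 past emails that received a score near 0.9, 31 turned out to be malicious on review.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval (approx. 95% by default) for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical review of emails the model scored around 0.9.
confirmed_malicious, reviewed = 31, 40

low, high = wilson_interval(confirmed_malicious, reviewed)
print(f"observed rate: {confirmed_malicious / reviewed:.3f}")
print(f"95% interval:  {low:.3f} to {high:.3f}")
```

With only 40 reviewed samples, the interval runs from roughly 0.62 to 0.88. A single number on a dashboard hides that spread.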

Build your own retrospective analysis of what got through, because modeling the gap between detection and reality is more valuable than optimizing the detection you already have.
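A retrospective of that kind can start very small. The sketch below assumes you keep a list of confirmed incidents, each with an attack category and whether the detector flagged the initiating email at delivery time; the records and category names are hypothetical.

```python
from collections import defaultdict

# Hypothetical confirmed incidents: (attack_category, flagged_at_delivery)
incidents = (
    [("commodity_phish", True)] * 9
    + [("commodity_phish", False)] * 1
    + [("aitm", True)] * 1
    + [("aitm", False)] * 4
)

def recall_by_category(records):
    """Fraction of confirmed incidents per category that were flagged at delivery."""
    totals = defaultdict(int)
    caught = defaultdict(int)
    for category, flagged in records:
        totals[category] += 1
        if flagged:
            caught[category] += 1
    return {category: caught[category] / totals[category] for category in totals}

recall = recall_by_category(incidents)
for category, value in sorted(recall.items()):
    print(f"{category}: {value:.0%} flagged at delivery")
```

Keep in mind that the incident list only contains attacks someone eventually found, so the per-category figures are an optimistic bound on real detection, not a measurement of it.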

The margin of error is wider than your dashboard shows. The first step is making it visible.

This article was produced as part of TechRadar Pro Perspectives, our channel to feature the best and brightest minds in the technology industry today.

The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing, find out more here: https://www.techradar.com/pro/perspectives-how-to-submit

CEO of StrongestLayer.
