Research Question

Reporting on AI-generated phishing has grown faster than it has grown clear. Industry sources describe large surges in AI-crafted messages reaching inboxes, point to AI-driven kits that mass-produce convincing brand pages, and survey practitioners who name AI phishing as a top concern for 2026. The direction of travel is not in dispute.

The question this note examines is narrower and more useful: of the numbers defenders are handed, which ones are decision-grade, and which are noise dressed as signal? A program that reacts to every headline figure ends up steering by measurements it cannot verify and did not design. The goal here is to sort the inputs, not to add another forecast.

Three Measurements Hiding in One Headline

A typical AI-phishing statistic conflates three distinct things, and separating them is the first practical step.

The first is volume: how many AI-generated messages exist or get sent. The second is effectiveness: whether those messages succeed more often per attempt than human-written ones. The third is sentiment: how worried practitioners say they are in a survey. A headline stating that an attack type is the top concern for most CISOs measures sentiment. That concern is real, but it describes a state of mind, not a rate of compromise.

These three move independently. Volume can rise sharply while effectiveness per message falls, because cheap generation encourages spraying. Sentiment can spike from a single well-covered incident while measured volume is flat. When a report fuses them into one alarming number, defenders lose the ability to act, because a volume problem, an effectiveness problem, and a sentiment problem each call for a different response.
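The independence of volume and effectiveness can be made concrete with a little arithmetic. The sketch below assumes the simplest possible model, in which expected compromises equal messages sent times per-message success rate. Every figure in it is hypothetical and chosen only to illustrate the point; none comes from a real campaign or report.

```python
# Minimal sketch: volume and per-message effectiveness as separate inputs.
# All numbers are hypothetical and exist only to show the arithmetic.

def expected_compromises(messages_sent: int, success_rate: float) -> float:
    """Expected successful attempts = volume x per-message effectiveness."""
    return messages_sent * success_rate

# Hypothetical "before": fewer, hand-tuned messages.
before = expected_compromises(messages_sent=1_000, success_rate=0.02)

# Hypothetical "after": five times the volume, one fifth the success rate.
after = expected_compromises(messages_sent=5_000, success_rate=0.004)

print(f"before: {before:.1f} expected compromises")
print(f"after:  {after:.1f} expected compromises")
```

Under these assumed numbers a fivefold rise in volume leaves the expected number of compromises unchanged, which is why a volume figure on its own says little about risk.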

The Polymorphism Problem

Volume figures deserve particular scrutiny because of how AI changes the cost of variation. Generating a fresh subject line, body, and landing-page URL for every recipient is nearly free. The result is polymorphic phishing: campaigns where almost every message looks unique at the level of surface indicators.

Industry analysis of this pattern has found that the large majority of initial infection URLs in some campaigns are unique, while the large majority still resolve to a shared set of IP addresses. That gap matters. If a defender counts unique URLs, the campaign looks like thousands of separate attacks. If they count infrastructure, it collapses into a handful. The same activity produces wildly different "volume" depending on which indicator is counted.
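A small sketch shows how the choice of indicator changes the count. The observations below are invented for illustration, using reserved documentation domains and address ranges, and the function names are hypothetical rather than part of any tool.

```python
from collections import defaultdict

# Hypothetical observations of (initial URL, resolved IP address).
# Domains and addresses come from ranges reserved for documentation.
observations = [
    ("https://a1.example.com/login?id=9f2", "192.0.2.10"),
    ("https://b7.example.net/verify?id=c41", "192.0.2.10"),
    ("https://q3.example.org/secure?id=77a", "198.51.100.7"),
    ("https://z9.example.com/account?id=e05", "192.0.2.10"),
    ("https://m4.example.net/signin?id=1bd", "198.51.100.7"),
]

def count_by_indicator(obs):
    """Count the same activity two ways: by surface URL and by hosting IP."""
    unique_urls = {url for url, _ in obs}
    unique_ips = {ip for _, ip in obs}
    return {"unique_urls": len(unique_urls), "unique_ips": len(unique_ips)}

def group_by_ip(obs):
    """Collapse URLs onto the infrastructure they resolve to."""
    groups = defaultdict(list)
    for url, ip in obs:
        groups[ip].append(url)
    return dict(groups)

counts = count_by_indicator(observations)
by_ip = group_by_ip(observations)
print(counts)  # {'unique_urls': 5, 'unique_ips': 2}
```

Counted by URL, this toy data set looks like five separate attacks. Counted by IP address, it looks like two.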

AI did not invent polymorphism, but it removed the cost that used to limit it. Any metric built on the novelty of surface indicators, including unique URLs, unique sender display names, or unique message hashes, will now overstate the number of distinct campaigns. The more reliable counts cluster on what is expensive for the attacker to vary: hosting infrastructure, TLS certificate patterns, redirect topology, and the kits behind the pages.

Volume Is Not Effectiveness

The more important measurement is also the harder one: does AI-generated phishing succeed more often per attempt? Volume tells you how much arrived. Effectiveness tells you whether the arriving messages are better.

There is reasonable evidence that AI-crafted lures have closed and then crossed the quality gap with human-written ones over the last few years, moving from clearly worse to competitive or better. But effectiveness is sensitive to context in a way volume is not. It depends on the target population, the pretext, the channel, and the controls already in place. An effectiveness number borrowed from another organization's environment does not transfer cleanly to yours.

This is why per-attempt effectiveness should be measured locally, not imported. Your own simulation and real-incident data describe your users and your controls. An external effectiveness figure is at best a hypothesis to test against your environment, never a substitute for measuring it.
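One way to treat an external figure as a hypothesis is to check whether it falls inside a confidence interval around the locally measured rate. The sketch below uses the Wilson score interval for a binomial proportion, which is a choice made here for illustration and not a method the note prescribes. The local counts and external rates are hypothetical.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

def consistent_with_local(external_rate: float, successes: int, trials: int) -> bool:
    """True if the external rate lies inside the local interval."""
    low, high = wilson_interval(successes, trials)
    return low <= external_rate <= high

# Hypothetical local data: 12 credential submissions in 400 simulated lures.
low, high = wilson_interval(successes=12, trials=400)
print(f"local rate 3.0%, interval {low:.1%} to {high:.1%}")

# Hypothetical external figures to test against the local measurement.
for external in (0.04, 0.11):
    verdict = consistent_with_local(external, successes=12, trials=400)
    print(f"external {external:.0%}: consistent with local data = {verdict}")
```

A figure that falls outside the interval is not necessarily wrong. It more likely describes a different population, pretext, or control set, which is the reason it should not be imported.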

Metrics Defenders Actually Own

The way out of the noise is to track outcomes the organization controls and can verify directly. A small, stable set is more useful than a wide one.

Time-to-report is the interval between delivery and the first user report. It measures human detection speed and tends to degrade quietly as lure quality rises. Credential-submission rate is the share of users who reach a phishing page and enter credentials, which measures the failure that actually leads to compromise rather than the click that merely precedes it. Dwell time is the gap between a successful credential or token theft and detection, which measures the response pipeline. Reporting precision, the share of user reports that turn out to be real threats rather than false alarms, measures whether the human sensor network is still trustworthy.
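The four definitions above translate into very little code, which is part of their appeal. The following is a minimal sketch with hypothetical function names, timestamps, and counts. It assumes reporting precision is computed in the usual sense, real threats divided by all user reports, and that an empty denominator yields 0.0 rather than an error.

```python
from datetime import datetime, timedelta

def time_to_report(delivered_at: datetime, first_report_at: datetime) -> timedelta:
    """Interval between delivery and the first user report."""
    return first_report_at - delivered_at

def credential_submission_rate(reached_page: int, submitted: int) -> float:
    """Share of users who reached a phishing page and entered credentials."""
    return submitted / reached_page if reached_page else 0.0

def dwell_time(theft_at: datetime, detected_at: datetime) -> timedelta:
    """Gap between credential or token theft and its detection."""
    return detected_at - theft_at

def reporting_precision(real_threats: int, false_alarms: int) -> float:
    """Assumed definition: real threats as a share of all user reports."""
    total = real_threats + false_alarms
    return real_threats / total if total else 0.0

# Hypothetical incident data, for illustration only.
delivered = datetime(2026, 1, 12, 9, 0)
first_report = datetime(2026, 1, 12, 9, 42)
theft = datetime(2026, 1, 12, 10, 5)
detected = datetime(2026, 1, 12, 16, 35)

print("time to report:", time_to_report(delivered, first_report))
print("submission rate:", credential_submission_rate(reached_page=40, submitted=6))
print("dwell time:", dwell_time(theft, detected))
print("reporting precision:", reporting_precision(real_threats=18, false_alarms=54))
```

Keeping each metric as a small pure function makes the definition explicit and easy to hold steady from one review period to the next.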

These metrics share three properties that the headline statistics lack. They are measured inside your environment, so they describe your risk. Their definitions are stable, so quarter-over-quarter comparison is meaningful. And each one points at a specific control to improve. That is what makes a measurement decision-grade.

Defender Application

The practical recommendation is restraint. External AI-phishing figures are useful as weather, not as instruments. They are worth reading for direction, and most sources agree that volume is up and quality is rising. They are not worth wiring directly into program decisions, because most of them merge volume, effectiveness, and sentiment, and most of them count indicators that AI has made cheap to vary.

Pick three or four internal metrics, hold their definitions steady, and review the trend rather than the absolute value. When a striking external statistic appears, treat it as a prompt to check the matching internal metric, not as a result in itself. The honest version of measuring AI-generated phishing is mostly measuring your own environment carefully, and being skeptical of any number whose methodology you cannot see.