
Autonomous agents are not replacing penetration testers. They are giving experienced testers the leverage they have always wanted.
That distinction matters because vendors keep blurring it, and red team leads end up scoping engagements based on marketing instead of capability.
The question practitioners are actually asking is narrower. Where does agentic AI hold up under the pressure of a real engagement, and where does it still need a human steering the attack chain? Anyone who has run a complex internal pentest already knows the answer is not symmetric. Some stages of the work compress dramatically with agents. Others get worse if you hand them off.
This piece walks through what agentic pentesting actually is in 2026, where it accelerates an engagement, where it fails predictably, and how to scope work so neither the vendor nor the client overestimates the agents.
Enterprise testing solutions describe the category of platforms and services organizations use to validate the security of their environment under conditions that approximate adversary behavior. The category spans four operational tiers, and they do different work.
The shift in 2026 is that the third tier has moved from prototype to production. Agentic platforms now hold up against real environments, but only inside the boundaries they are scoped to. Treating them as a substitute for either tier two or tier four still fails predictably.
Real agentic pentesting uses narrowly scoped agents to handle reconnaissance, attack surface mapping, exploit candidate generation, and evidence collection in parallel. Each agent is bounded by an explicit task, an explicit scope, and an explicit handoff back to a human or another agent. That is the operational definition. Conflating it with anything else creates risk.
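The "explicit task, explicit scope, explicit handoff" definition above can be sketched as a small data structure. This is a minimal illustration, not how any particular platform implements it: `AgentTask`, `in_scope`, and `run` are hypothetical names, and scope is assumed to be expressed as a list of CIDR ranges.

```python
import ipaddress
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AgentTask:
    """One narrowly scoped unit of agent work (hypothetical structure)."""
    name: str                 # explicit task, e.g. "service-enumeration"
    scope: Tuple[str, ...]    # explicit scope, assumed here to be CIDR strings
    handoff_to: str           # who receives the output: a human or another agent

    def in_scope(self, target: str) -> bool:
        """Return True only if the target IP is inside an authorized network."""
        addr = ipaddress.ip_address(target)
        return any(addr in ipaddress.ip_network(cidr) for cidr in self.scope)


def run(task: AgentTask, targets: List[str]) -> Dict[str, object]:
    """Split targets by scope. Out-of-scope targets are refused, never touched."""
    allowed = [t for t in targets if task.in_scope(t)]
    refused = [t for t in targets if not task.in_scope(t)]
    return {
        "task": task.name,
        "handoff_to": task.handoff_to,
        "allowed": allowed,
        "refused": refused,
    }


task = AgentTask("service-enumeration", ("10.0.0.0/22",), "human-reviewer")
result = run(task, ["10.0.1.5", "10.0.4.1", "192.168.1.1"])
```

The design choice worth noting is that the scope check lives on the task itself, so an agent cannot be handed a target list without also being handed the boundary it must respect.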
Agentic pentesting is not a chatbot wrapping a Nessus scan. A scanner produces findings. An agent reasons about what to do with those findings, decides what to investigate next, and feeds evidence into the next stage without waiting for an analyst to copy the output into a spreadsheet.
It is also not autonomous offensive AI. An agent that can chain together a novel attack against a hardened target without supervision does not exist outside marketing. The state of the art is narrow autonomy with human checkpoints at the decision boundaries where context matters.
The boundary is worth being precise about, because most failed engagements involving agentic platforms come from either overstating or understating it.
Most engagements follow the same operational sequence: reconnaissance, host discovery, service enumeration, vulnerability validation, exploit candidate generation, and evidence collection. Agents perform well on the parallelizable stages. Humans drive the stages that require interpretation.
Reconnaissance. Ground-level visibility makes or breaks an engagement. Most remote scanners miss what is actually on the network because they cannot see ARP traffic, mDNS broadcasts, rogue access points, or the layer-2 anomalies that only show up on the wire.
Strike48 Pick is an open-source penetration testing agent built for reconnaissance and remote tool execution. It runs cross-platform from a single Rust and Dioxus codebase, compiling to desktop (Windows, macOS, Linux), mobile (Android, iOS), a terminal UI, and a headless agent, so testers can drop it on whatever beachhead the engagement gives them. That removes the manual stitching of nmap, arp-scan, and Wireshark outputs into a wiki page.
The operational difference matters. A traditional remote recon pass against a flat /22 might return 400 responding hosts. Pick running on a beachhead inside the segment returns the same 400, plus 40 unmanaged devices the scanner missed, plus the rogue AP advertising mDNS that no one in IT knows about. That is the recon delta that changes the engagement.
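The recon delta described above is a set difference between what a remote scan saw and what an on-segment vantage point saw. The sketch below uses synthetic addresses that mirror the article's 400 plus 40 figures; `recon_delta` is a hypothetical helper, and the host lists are generated for illustration, not taken from Pick output.

```python
import ipaddress
from typing import Dict, Set


def recon_delta(remote_hosts: Set[str], on_segment_hosts: Set[str]) -> Dict[str, object]:
    """Report hosts visible from inside the segment that the remote scan missed."""
    missed = on_segment_hosts - remote_hosts
    return {
        "seen_remotely": len(remote_hosts),
        "seen_on_segment": len(on_segment_hosts),
        # Sort numerically by address rather than as plain strings.
        "missed_by_remote": sorted(missed, key=ipaddress.ip_address),
    }


# Synthetic data: a flat /22 where the remote scan sees 400 hosts and the
# beachhead sees those same 400 plus 40 more.
segment = [str(h) for h in ipaddress.ip_network("10.0.0.0/22").hosts()]
remote_view = set(segment[:400])
beachhead_view = set(segment[:440])

delta = recon_delta(remote_view, beachhead_view)
print(len(delta["missed_by_remote"]))  # prints 40
```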
Host discovery and service enumeration. Once reconnaissance produces the asset inventory, discovery and enumeration are the most parallelizable work in the engagement. Agents iterate across hosts simultaneously, normalize service banners, fingerprint software versions, and flag candidates for validation. A human reviewing 800 hosts of nmap output takes a day. Agents take minutes and surface the same priority signal.
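Banner normalization across many hosts can be sketched as follows. This is a simplified illustration under stated assumptions: the regular expression is a deliberately naive pattern that only handles `product/version`, `product_version`, and `product version` banners, the function names are hypothetical, and a real enumeration stage would spend its parallelism on network I/O rather than on parsing.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Optional

# Naive pattern: a run of letters, a separator, then a dotted version number.
BANNER_RE = re.compile(r"(?P<product>[A-Za-z]+)[/_ ](?P<version>\d+(?:\.\d+)*)")


def normalize_banner(banner: str) -> Dict[str, Optional[str]]:
    """Extract a lowercase product name and numeric version from a raw banner."""
    match = BANNER_RE.search(banner)
    if match is None:
        return {"raw": banner, "product": None, "version": None}
    return {
        "raw": banner,
        "product": match.group("product").lower(),
        "version": match.group("version"),
    }


def normalize_all(banners: Dict[str, str], workers: int = 8) -> Dict[str, dict]:
    """Normalize banners for many hosts concurrently, keyed by host address."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip lines results up with hosts.
        parsed = list(pool.map(normalize_banner, banners.values()))
    return dict(zip(banners.keys(), parsed))


services = normalize_all({
    "10.0.0.5": "SSH-2.0-OpenSSH_8.9p1 Ubuntu-3",
    "10.0.0.9": "Apache/2.4.57 (Debian)",
    "10.0.0.12": "220 mail ready",
})
```

Note that the pattern drops suffixes such as `p1`, and unparseable banners come back with `None` fields so a reviewer can still see the raw string.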
Vulnerability validation. This is where agentic AI earns the time it saves. The agent takes a candidate finding (an exposed admin panel, a vulnerable CVE-tagged service, an open SMB share with weak ACLs), generates a non-destructive validation attempt, executes it, and captures evidence. Validation that previously sat in a backlog waiting for an analyst now completes in the same pass as discovery.
Exploit candidate generation. Agents propose exploit chains based on validated findings, public exploit databases, and the environmental context the recon stage built. They do not run weaponized exploits autonomously in a responsible engagement. A human reviews the candidate, decides whether to proceed, and either executes or hands it back. This is the most important human-in-the-loop checkpoint in the workflow.
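The human-in-the-loop checkpoint described above can be modeled as an approval gate that refuses to run anything a reviewer has not signed off on. The sketch below is an assumption about how such a gate might look: `ExploitCandidate`, `review`, and `execute` are hypothetical names, and the `runner` callable stands in for whatever tooling the tester would actually use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ExploitCandidate:
    """An agent-proposed chain awaiting human review (hypothetical structure)."""
    finding: str
    proposed_chain: List[str]
    decision: Decision = Decision.PENDING
    reviewer: Optional[str] = None


def review(candidate: ExploitCandidate, reviewer: str, approve: bool) -> None:
    """Record the human decision and who made it."""
    candidate.decision = Decision.APPROVED if approve else Decision.REJECTED
    candidate.reviewer = reviewer


def execute(candidate: ExploitCandidate, runner: Callable[[ExploitCandidate], str]) -> str:
    """Run the candidate only if a human has explicitly approved it."""
    if candidate.decision is not Decision.APPROVED:
        raise PermissionError("candidate has not been approved by a human reviewer")
    return runner(candidate)


candidate = ExploitCandidate(
    finding="open SMB share with weak ACLs",
    proposed_chain=["enumerate share", "read config backup"],
)
```

Calling `execute(candidate, runner)` before `review(candidate, "lead-tester", approve=True)` raises `PermissionError`, which keeps the default state of every proposal at "do nothing".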
Evidence collection. Screenshots, packet captures, command output, and chain-of-custody metadata flow into a structured artifact store as the agents work. The report writes itself from those artifacts. Testers stop spending the last three days of an engagement assembling proof.
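Chain-of-custody metadata for collected evidence can be sketched as a record that stores a content hash alongside who collected the artifact, where, and when. The field names and the in-memory list used as the "artifact store" are assumptions for illustration, not the format any specific platform uses.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, List


def record_artifact(store: List[Dict[str, object]], kind: str, data: bytes,
                    collected_by: str, host: str) -> Dict[str, object]:
    """Append a custody record for one artifact and return it."""
    entry = {
        "kind": kind,                      # e.g. "screenshot", "pcap", "command-output"
        "host": host,
        "collected_by": collected_by,      # agent or tester identifier
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
    }
    store.append(entry)
    return entry


def verify_artifact(entry: Dict[str, object], data: bytes) -> bool:
    """Check that the artifact bytes still match the hash recorded at collection."""
    return hashlib.sha256(data).hexdigest() == entry["sha256"]


artifact_store: List[Dict[str, object]] = []
output = b"uid=33(www-data) gid=33(www-data)\n"
entry = record_artifact(artifact_store, "command-output", output,
                        collected_by="recon-agent-1", host="10.0.0.9")
```

Hashing at collection time is what lets a report later show that a piece of evidence was not altered between capture and delivery.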
Agentic AI fails predictably on the stages that require interpretation, creativity, or context that does not appear in the scan output.
The pattern is consistent. Agents handle the parts of the work that benefit from speed and parallelism. Humans handle the parts that require judgment about consequences. Engagements that respect that boundary produce results. Engagements that ignore it either miss real findings or generate noise the client cannot act on.
The teams getting value from agentic pentesting in 2026 follow the same pattern. Agents handle the parallelizable, well-bounded stages. Humans drive scope, exploitation, and storytelling. The integration question is operational, not philosophical.
What to test agentically.
What to keep human-led.
How to integrate Pick alongside an existing toolkit. Pick is built to deploy on the same beachheads testers already use. It does not replace nmap, Responder, Bloodhound, or any standard tooling. It produces structured recon data those tools cannot produce remotely, then feeds that data into Strike48 for correlation against the rest of the engagement evidence. Teams that already run a Kali jumphost inside the assessment scope can add Pick to it in under ten minutes.
The integration delivers immediate leverage. The recon data Pick captures inside a segment is the data testers would otherwise have to assemble manually from multiple tools after the engagement ends. Compressing that work moves the productive time of an engagement toward exploitation and reporting, which is where the client experiences value.
Agentic pentesting in 2026 is not about removing humans from the engagement. It is about giving experienced testers the leverage to spend their time on the work that actually requires their judgment. Recon, validation, and evidence collection compress. Exploitation and narrative stay human. The teams that scope engagements around that boundary deliver better assessments in less time.
Strike48 Pick is open source. The repository ships with a working build, a documented module structure, and recon outputs that drop directly into Strike48 for correlation. Clone the repo, run it inside a scoped engagement, and see what your remote scanners have been missing.
Agentic pentesting uses narrowly scoped AI agents to handle reconnaissance, attack surface mapping, vulnerability validation, exploit candidate generation, and evidence collection in parallel. It is not vulnerability scanning with a chatbot interface, and it does not replace human testers. The pattern is hybrid. Agents handle parallelizable, well-bounded work. Humans drive scope, judgment, and creative attack chains.
Yes. Pick is open source and available on GitHub. It runs on Windows, macOS, and Linux from a single Rust and Dioxus codebase. Teams can deploy it inside a scoped engagement without a Strike48 platform license, though the recon data it captures flows natively into Strike48 for AI correlation.
No. The stages of an engagement that benefit from agents (recon, validation, evidence collection) compress dramatically. The stages that require human judgment (scope, creative exploitation, narrative) do not. Testers using agentic platforms run more engagements per quarter and spend more of each engagement on the work that actually requires expertise.
Pick runs on a beachhead inside the segment under assessment and surfaces ARP, mDNS, rogue access point, and PCAP data that remote scanners cannot reach. It is built to complement existing tooling, not replace it. The structured output feeds directly into Strike48 for correlation rather than requiring manual stitching across tools.