Position · v1.0
Audience · operators, reviewers
Revised · 2026-04-22

Why a new implementation.

fail2ban built the category and still works for most of the hosts running it. This page is not an argument that it was wrong. It is an argument that the architecture it chose in 2004 has reached its ceiling, and the tools to go further are now mature.

01 · What fail2ban did right

The category exists because fail2ban proved it could.

In 2004, automatically correlating log failures and writing firewall bans on the host was not an established pattern. fail2ban made it one.

What it introduced
jail, filter, maxretry, findtime, bantime, ignoreip

Those concepts survived because they were operationally correct. fail2zig keeps the model. What changes is the implementation ceiling — not the category definition.
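The vocabulary maps onto a small data model. The sketch below is illustrative only: it is neither fail2ban's nor fail2zig's implementation, and the names Jail and record_failure are hypothetical. It assumes a ban fires once an address accumulates maxretry failures inside a findtime window, and that ignoreip entries are CIDR strings.

```python
import ipaddress
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Jail:
    """Hypothetical jail: thresholds plus per-address failure history."""
    maxretry: int      # failures tolerated inside the window
    findtime: float    # window length in seconds
    bantime: float     # ban duration in seconds
    ignoreip: list = field(default_factory=list)  # CIDR strings never banned
    failures: dict = field(default_factory=lambda: defaultdict(deque))
    banned_until: dict = field(default_factory=dict)

    def record_failure(self, ip: str, now: float) -> bool:
        """Record one filter match; return True if this match triggers a ban."""
        addr = ipaddress.ip_address(ip)
        if any(addr in ipaddress.ip_network(net) for net in self.ignoreip):
            return False
        if self.banned_until.get(ip, 0.0) > now:
            return False  # already banned, nothing new to do
        window = self.failures[ip]
        window.append(now)
        # Drop failures that fell out of the findtime window.
        while window and now - window[0] > self.findtime:
            window.popleft()
        if len(window) >= self.maxretry:
            self.banned_until[ip] = now + self.bantime
            del self.failures[ip]
            return True
        return False
```

The filter is the part that decides when record_failure is called; everything else in the model is thresholds and timestamps.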

02 · What the model costs in 2026

Four concrete problems the shell-action model cannot resolve.

PR.01

An interpreter in the TCB

CPython, the Python standard library, python3-systemd, and a chain of C extension modules all execute in the daemon's address space as root. The interpreter can load any bytecode and parse any input. A root daemon whose runtime treats arbitrary bytecode as legitimate input carries a larger trust contract than a single static binary.

PR.02

Process-per-action overhead

Every ban forks /bin/sh and execs iptables / nft / ipset. On a host being scanned at a few hundred attempts per second — a modest VPS workload in 2026 — the fork-exec cost becomes the binding constraint on how fast bans can actually land. A shell-action can be written efficiently; it cannot be written faster than execve.
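The spawn floor can be measured on any host. The sketch below is a rough probe, not a benchmark of fail2ban: it times a no-op shell invocation, so a real ban action (shell plus a firewall binary) would cost more. The helper name mean_spawn_seconds is hypothetical, and the figures depend entirely on the machine.

```python
import subprocess
import time


def mean_spawn_seconds(n: int = 50) -> float:
    """Average wall-clock cost of spawning one no-op shell command."""
    start = time.perf_counter()
    for _ in range(n):
        # ":" is the shell's built-in no-op, so this measures process
        # creation and teardown only, not any firewall work.
        subprocess.run(["/bin/sh", "-c", ":"], check=True)
    return (time.perf_counter() - start) / n


if __name__ == "__main__":
    per_spawn = mean_spawn_seconds()
    print(f"mean spawn cost: {per_spawn * 1e6:.0f} microseconds")
    print(f"optimistic ceiling on shell-driven bans/second: {1 / per_spawn:.0f}")
```

Whatever number this prints is a floor for a shell-driven ban on that host, since the action still has to exec the firewall tool after the shell starts.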

PR.03

Unbounded memory growth

Attackers shaping log traffic to pressurise the state tracker is a real scenario. A dictionary-backed data model grows under that pressure — and resizes during growth, which pauses the event loop. Under sustained attack, those pauses stack and the daemon falls behind its own log stream.
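The resize behaviour is observable from inside CPython. The sketch below illustrates the general mechanism and is not a measurement of fail2ban's tracker: it inserts keys into a plain dict and records every point at which the container's own allocation grows. The helper name resize_points is hypothetical.

```python
import sys


def resize_points(n: int) -> list:
    """Return (entry_count, container_bytes) at each dict reallocation."""
    table = {}
    last = sys.getsizeof(table)
    points = []
    for i in range(n):
        table[i] = None
        size = sys.getsizeof(table)
        if size != last:
            # The hash table was reallocated and every entry re-inserted.
            points.append((len(table), size))
            last = size
    return points


if __name__ == "__main__":
    for count, size in resize_points(100_000):
        print(f"{count:>7} entries -> {size:>9} bytes")
```

Each reported point is a reallocation whose cost scales with the number of entries already stored, which is why an attacker who controls the entry count also influences when and how long those pauses occur.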

PR.04

No runtime in minimal images

Distroless images, scratch containers, OpenWrt routers, hardened minimal servers, FreeBSD jails. The platforms where a host IPS matters most are precisely the platforms that do not tolerate a 30 MB Python runtime. Operators running those today run no host IPS at all, which is the wrong answer.

03 · Why this is architectural

These are structural, not patchable.

All four problems in the previous section come from the same choice: orchestrate in Python, act through shell. That is the shape of the architecture, not the quality of any single piece of it.

What can't be patched
Tune the shell template. execve is still on every ban. The process spawn is the operation, not an implementation detail.
Rewrite the filter. CPython is still in the daemon's address space. The interpreter is the trust contract, not the code it loads.
Cache the state tracker. The dictionary still resizes. Growth under attacker pressure is the data model, not a bug in the lookup path.

A better ceiling requires a different shape: a statically compiled binary, explicitly bounded memory, firewall state written directly to the kernel, and a data model designed for attacker-shaped input.

04 · What fail2zig chose

Different ingredients, same problem.

The work the category has always done, rebuilt with the tools that were not available when it was first invented.

The binary is the TCB

A single statically-linked musl binary. No interpreter, no VM, no dynamic loading, no third-party Zig packages in the critical path. What you audit is what runs. The trusted computing base is ~12,000 lines of Zig you can read in an afternoon.

Direct netlink, no shell

Bans are written to the kernel through AF_NETLINK directly — no nft, iptables, or ipset binaries are ever spawned. Ban latency drops from milliseconds to tens of microseconds, and the supply chain for the ban path contains exactly one program. See netlink interop.
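For readers unfamiliar with netlink, the unit on the wire is a framed message, not a command line. The sketch below shows only the generic framing (struct nlmsghdr from linux/netlink.h); it is not fail2zig's code, it does not build an nftables payload, and it does not open a socket. The function name nlmsg and the message type in the usage note are placeholders.

```python
import struct

# Flag values from <linux/netlink.h>.
NLM_F_REQUEST = 0x01
NLM_F_ACK = 0x04

NLMSG_HDRLEN = 16  # sizeof(struct nlmsghdr)


def nlmsg(msg_type: int, flags: int, seq: int, payload: bytes = b"") -> bytes:
    """Frame a payload as one netlink message (header + 4-byte-aligned body)."""
    # nlmsg_len covers the header and the unpadded payload.
    length = NLMSG_HDRLEN + len(payload)
    # struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid,
    # in host byte order. The port id is left at 0 here for simplicity.
    header = struct.pack("=IHHII", length, msg_type, flags, seq, 0)
    padding = b"\x00" * (-len(payload) % 4)
    return header + payload + padding
```

A call such as nlmsg(0x10, NLM_F_REQUEST | NLM_F_ACK, seq=1, payload=b"...") yields bytes that a privileged process would write to an AF_NETLINK socket. The point of the comparison is that the ban becomes one buffer and one send, with no process creation in between.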

A hard memory ceiling

A fixed-size arena, configured at startup, enforced at the allocator level. The daemon cannot exceed it regardless of attack volume. When the state tracker reaches capacity, the eviction policy fires: the daemon does not resize, does not pause the event loop, and does not get killed by the OOM killer.
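The behavioural difference from a growing dictionary can be modelled in a few lines. The sketch below is an assumption-laden illustration, not fail2zig's allocator: Python cannot enforce a byte-level arena, so capacity is counted in entries, and least-recently-seen eviction is chosen only as an example policy. BoundedTracker is a hypothetical name.

```python
from collections import OrderedDict


class BoundedTracker:
    """Fixed-capacity failure counter; evicts the least recently seen address."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._counts = OrderedDict()

    def record(self, ip: str) -> int:
        """Count one failure for ip and return its running total."""
        if ip in self._counts:
            self._counts.move_to_end(ip)  # mark as most recently seen
        elif len(self._counts) >= self.capacity:
            self._counts.popitem(last=False)  # evict the oldest entry
        self._counts[ip] = self._counts.get(ip, 0) + 1
        return self._counts[ip]

    def __len__(self) -> int:
        return len(self._counts)
```

However many distinct addresses an attacker generates, len(tracker) never exceeds capacity. The cost of that guarantee is that a flood of new addresses can evict partially counted ones, which is the trade-off any eviction policy has to make explicit.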

A filter engine compiled at build time

Filters are specialised at compile time via Zig's comptime. No runtime regex engine, no pattern interpreter — attackers cannot influence how their input is parsed because the parser was fixed before the binary shipped.
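Python has no equivalent of comptime, so the sketch below can only show the shape of the idea: a matcher whose structure is fixed in code, with no regex engine and no pattern supplied at runtime. It is not fail2zig's filter format. The function name and the single sshd message it recognises are assumptions for illustration.

```python
import ipaddress
from typing import Optional

# Hypothetical fixed pattern for one sshd failure message.
PREFIX = "Failed password for "
SEP = " from "


def match_failed_password(line: str) -> Optional[str]:
    """Return the source address if the line has the fixed shape, else None."""
    start = line.find(PREFIX)
    if start < 0:
        return None
    rest = line[start + len(PREFIX):]
    # Split on the LAST " from " so a crafted username containing " from "
    # cannot move the address field.
    _, sep, tail = rest.rpartition(SEP)
    if not sep:
        return None
    candidate = tail.split(" ", 1)[0]
    try:
        # Only a syntactically valid IPv4 or IPv6 address is accepted.
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return None
```

For example, match_failed_password("sshd[411]: Failed password for invalid user admin from 203.0.113.7 port 51234 ssh2") returns "203.0.113.7", and a line that deviates from the fixed shape returns None rather than being interpreted.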

Zero runtime dependencies

No Python, no Perl, no Ruby, no Node. No D-Bus. No /bin/sh. No plugins. The binary plus the Linux kernel is the complete dependency graph. See zero runtime dependencies for which kernel ABIs the binary does depend on and why those are the acceptable boundary.

fail2ban is the adoption path

Your jails, filters, thresholds, and ignore lists port over with one command: fail2zig --import-config /etc/fail2ban. The mental model is identical. The operational vocabulary is identical. The only thing that changes is what runs as root on your machine.
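What an import has to carry across is small. The sketch below is not the fail2zig importer and says nothing about its output format; it only shows the fields involved, read from an INI fragment in the style of jail.conf. It assumes plain integer seconds, whereas real fail2ban configurations may use time suffixes such as 10m and layered override files. The sample values and the name load_jails are hypothetical.

```python
import configparser

SAMPLE = """
[DEFAULT]
maxretry = 5
findtime = 600
bantime = 3600
ignoreip = 127.0.0.1/8 ::1

[sshd]
enabled = true
maxretry = 3
"""


def load_jails(text: str) -> dict:
    """Return {jail_name: settings} for every enabled jail in INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    jails = {}
    for name in parser.sections():  # DEFAULT is not listed as a section
        section = parser[name]
        if not section.getboolean("enabled", fallback=False):
            continue
        # Missing keys fall back to the [DEFAULT] values.
        jails[name] = {
            "maxretry": section.getint("maxretry"),
            "findtime": section.getint("findtime"),
            "bantime": section.getint("bantime"),
            "ignoreip": section.get("ignoreip", "").split(),
        }
    return jails
```

Calling load_jails(SAMPLE) yields one sshd jail with maxretry 3 and the inherited findtime, bantime, and ignoreip values, which is the same mental model on both sides of the migration.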

05 · What fail2zig is not

Scoping is a feature.

Not a WAF. fail2zig does not inspect request bodies, enforce rules on HTTP payloads, or sit in the request path. WAFs belong at the edge. fail2zig lives on the host.
Not a SIEM. fail2zig does not aggregate, store, index, or correlate logs across hosts. It reads one host's logs, decides who is attacking it, and tells the kernel to drop their packets.
Not an IDS. Intrusion detection is a deeper problem space — protocol decoding, packet inspection, behavioural anomaly detection. Projects like Suricata and Zeek do that work. fail2zig consumes the signals an IDS (or the services themselves) produce and enforces the blunt response.
Not a network firewall. nftables is the firewall. fail2zig writes to it. A network firewall, a cloud security group, and a host IPS are three different layers; running fail2zig does not replace the first two.
Not a replacement for hardening. fail2zig bans attackers after they have already probed. Disabling password auth for SSH, keeping services patched, and minimising exposed surface are what prevent the probe-to-compromise path in the first place. fail2zig raises the cost of probing; it does not replace the reasons probing is ineffective against a well-run host.

Where to go from here.

If you run fail2ban today and the design choices above speak to problems you have actually hit, the import path is the shortest route.