The test suite as AI safety net

A senior engineer working with Claude Code as their primary tool produces in a day what a 2023-era engineer produced in a week. The unit of work has shifted from "write code, then review it" to "review the agent's plan, let the agent code, run the test suite, review the diff." The test suite is what stands between AI velocity and shipping wrong code. Here's what that suite has to look like, what it has to catch that traditional tests don't, and how writing tests changes when the AI is the one writing the implementation.

$The new role of the test suite

Pre-AI, the test suite was a quality check. The thing you ran before merging to catch regressions you didn't anticipate. The author of the test wrote it after writing the code, with full knowledge of the implementation. The suite was complementary to careful code review.

Post-AI, the relationship inverts. The agent generates implementations faster than you can carefully read them. A 4-hour session produces 800 lines across 12 files. You can read every line, but you didn't author them, and your understanding is shallower than it would be for code you wrote yourself. The test suite stops being a complement to careful review and becomes the primary mechanism by which you trust the work. If the tests pass and they cover the right invariants, you can ship.

This shift only works if the test suite has properties most existing suites don't have. Three properties matter:

Coverage of business invariants, not just function behaviors.
Speed. The full suite has to run in under 5 minutes, ideally under 2.
Determinism. Flaky tests poison the agent's feedback loop more than they poison a human's.

$Business invariants, not function behaviors

A traditional unit test asserts "this function returns X for input Y." A business-invariant test asserts "the system never reaches state Z, regardless of which functions were called in what order." The two look superficially similar but catch different bugs.

Example from a multi-marketplace catalog. The function-level test:

def test_listing_price_updates():
    listing = create_listing(price=19.99)
    listing.update_price(29.99)
    assert listing.price == 29.99

Useful, but the agent can pass this with a bare attribute assignment that ignores every other code path that touches the price. The invariant test:

def test_listings_never_diverge_from_product_master_after_publish():
    """After any publish() call, listing.price must equal product.list_price."""
    product = create_product(list_price=19.99)
    listing = create_listing(product=product)
    # Exercise N code paths that could mutate state.
    for action in [republish, edit_then_publish, async_sync, cron_resync]:
        action(listing)
        assert listing.price == product.list_price, f"Diverged after {action.__name__}"

The invariant test is what catches the AI's plausible-but-wrong implementation. A function-level test verifies the agent did what was specified. An invariant test verifies the agent didn't break something it wasn't thinking about. AI-generated implementations are usually good at doing what was specified and weaker at preserving what nobody mentioned; the invariant tests pick up the slack.
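The invariant test above relies on project helpers that are not shown. A minimal self-contained sketch of the same idea follows; the Product and Listing classes and the three action functions are hypothetical stand-ins, not the real catalog code.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Product:
    list_price: Decimal


@dataclass
class Listing:
    product: Product
    price: Decimal = Decimal("0")

    def publish(self) -> None:
        # Publishing copies the master price onto the listing.
        self.price = self.product.list_price


# Hypothetical code paths that all end in a publish().
def republish(listing: Listing) -> None:
    listing.publish()


def edit_then_publish(listing: Listing) -> None:
    listing.price = Decimal("0.01")  # a stale local edit
    listing.publish()


def master_price_change(listing: Listing) -> None:
    listing.product.list_price += Decimal("5.00")
    listing.publish()


def check_price_invariant(actions) -> None:
    """Assert the invariant after every action, not just at the end."""
    product = Product(list_price=Decimal("19.99"))
    listing = Listing(product=product)
    for action in actions:
        action(listing)
        assert listing.price == product.list_price, f"Diverged after {action.__name__}"


check_price_invariant([republish, edit_then_publish, master_price_change])
```

The check runs after each action, so a failure names the code path that broke the contract rather than only reporting that the final state is wrong.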

$Speed: the agent has to be able to iterate

The agent's edit-test-edit cycle is the inner loop of AI engineering. If the test suite takes 18 minutes, the agent waits 18 minutes between each attempt. A simple task that would converge in 6 iterations now takes 108 minutes of wall time spent on tests alone, and you, the operator, are paying for the session the whole time the agent is idle.
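The arithmetic is simple enough to sketch. The function below is illustrative only and ignores the time the agent spends editing between runs.

```python
def wall_time_minutes(suite_minutes: float, iterations: int) -> float:
    """Time spent waiting on the test suite alone across all iterations."""
    return suite_minutes * iterations


slow = wall_time_minutes(18, 6)    # 18-minute suite, 6 iterations -> 108
fast = wall_time_minutes(1.5, 6)   # 90-second suite, 6 iterations -> 9.0
print(f"slow suite: {slow} min, fast suite: {fast} min")
```

The same task costs 108 minutes of waiting with the slow suite and 9 minutes with a 90-second suite.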

Tactics for fast test suites:

Lightweight database fixtures for unit tests. In-memory SQLite, or Odoo's TransactionCase, which rolls each test back inside a transaction. Avoid full Postgres setup/teardown per test.
Categorize tests by speed and pin a fast set for the inner loop. A fast/ suite that runs in <90s + a full/ suite that runs in under 5 minutes. The agent runs fast/ after every edit; CI runs full/ on PR.
Avoid network calls. Mock external APIs at the HTTP-client boundary. A test that exercises the Amazon connector should never hit the SP-API.
Parallelize. pytest-xdist on 4-8 workers cuts wall time roughly linearly for tests that don't share state.
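Two of these tactics combine naturally in one test: an in-memory SQLite fixture and a mock at the HTTP-client boundary. The sketch below uses only the standard library; HttpClient, MarketplaceConnector, and the marketplace.invalid URL are hypothetical names invented for illustration, not part of any real connector.

```python
import sqlite3
import unittest
from unittest import mock


class HttpClient:
    """Hypothetical thin wrapper; the only place that would touch the network."""

    def get_json(self, url: str) -> dict:
        raise RuntimeError("network access is not allowed in fast tests")


class MarketplaceConnector:
    """Hypothetical connector that stores remote prices locally."""

    def __init__(self, http: HttpClient, db: sqlite3.Connection) -> None:
        self.http = http
        self.db = db

    def sync_price(self, sku: str) -> None:
        payload = self.http.get_json(f"https://marketplace.invalid/prices/{sku}")
        self.db.execute(
            "INSERT OR REPLACE INTO prices (sku, price_cents) VALUES (?, ?)",
            (sku, payload["price_cents"]),
        )


class SyncPriceTest(unittest.TestCase):
    def setUp(self) -> None:
        # A fresh in-memory database per test: no server, no teardown cost.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE prices (sku TEXT PRIMARY KEY, price_cents INTEGER)"
        )

    def tearDown(self) -> None:
        self.db.close()

    def test_sync_stores_remote_price(self) -> None:
        http = HttpClient()
        # Mock only the HTTP boundary; the connector logic runs for real.
        with mock.patch.object(
            http, "get_json", return_value={"price_cents": 1999}
        ) as fake:
            MarketplaceConnector(http, self.db).sync_price("SKU-1")
        fake.assert_called_once()
        row = self.db.execute(
            "SELECT price_cents FROM prices WHERE sku = ?", ("SKU-1",)
        ).fetchone()
        self.assertEqual(row[0], 1999)
```

Saved as a test module, this runs with python -m unittest. The unmocked client raises instead of connecting, so a test that forgets the mock fails fast rather than quietly hitting a live API.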

The target: the whole fast/ suite under 90 seconds, the whole full/ suite under 5 minutes. On a 143-test Odoo suite, achieving this required the fixture pattern above + parallelization + a strict no-network rule for fast tests. The investment in test infrastructure paid back inside the first week of AI-driven development.

$Determinism: flaky tests poison the feedback loop

A flaky test is one that passes 95% of the time and fails 5% of the time for reasons unrelated to the code under test. Humans tolerate flakiness; you re-run, it passes, you move on. The agent does not have the social context to recognize "this test is flaky, ignore it." The agent reads the failure literally, generates a fix for the wrong root cause, and you end up with a "fix" for a non-bug.

Sources of flakiness I've debugged in AI-driven engineering sessions:

Tests depending on dictionary iteration order (dicts preserve insertion order in Python 3.7+, but tests written as if the keys were sorted break when the ordering of the source data shifts)
Tests using datetime.now() without freezing the clock
Tests with race conditions in async setup (the test inserts a record, then queries before the database has committed)
Shared fixtures that one test mutates and another depends on (test ordering becomes load-bearing)
Network-dependent tests that nominally mock but fall back to live calls on cache miss
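The clock problem has a simple fix that needs no extra library: pass the clock in. A minimal sketch, with is_expired as a hypothetical function under test:

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


def is_expired(expires_at: datetime, now: Callable[[], datetime] = utc_now) -> bool:
    """Production code uses the real clock; tests inject a frozen one."""
    return now() >= expires_at


FROZEN = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)


def test_expiry_is_deterministic() -> None:
    def frozen() -> datetime:
        return FROZEN

    # The outcome depends only on the inputs, never on when the test runs.
    assert is_expired(FROZEN - timedelta(seconds=1), now=frozen)
    assert is_expired(FROZEN, now=frozen)
    assert not is_expired(FROZEN + timedelta(seconds=1), now=frozen)


test_expiry_is_deterministic()
```

Patching datetime with a mocking library works too; injection just keeps the dependency visible in the function signature.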

The rule: a test that fails once for reasons unrelated to the code under test goes into quarantine immediately. Either fix the flakiness or delete the test. Letting a flaky test live in the suite is corrosive. The agent will eventually try to "fix" it and waste a session iteration on the wrong target.
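Before quarantining, it helps to confirm that a failure really is intermittent. The sketch below reruns one test against unchanged code and labels the result; classify and the demo test are hypothetical helpers, and the demo fails on a fixed schedule so the example itself stays deterministic.

```python
from typing import Callable


def classify(test: Callable[[], None], runs: int = 20) -> str:
    """Run a test repeatedly against unchanged code and label the outcome."""
    failures = 0
    for _ in range(runs):
        try:
            test()
        except AssertionError:
            failures += 1
    if failures == 0:
        return "stable-pass"
    if failures == runs:
        return "stable-fail"  # consistent failure: probably a real bug
    return "flaky"  # mixed results: quarantine the test, leave the code alone


def make_test_failing_every_fifth_call() -> Callable[[], None]:
    """Stand-in for a flaky test."""
    calls = {"n": 0}

    def test() -> None:
        calls["n"] += 1
        assert calls["n"] % 5 != 0

    return test


print(classify(make_test_failing_every_fifth_call()))  # flaky
```

Only a stable-fail result should be handed to the agent as a bug to fix.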

$What the agent does and doesn't do well with tests

Empirical observations from ~14 months of running Claude Code as primary tool on production Odoo work:

Does well:

Writing function-level tests when given a clear spec.
Adding edge-case tests when prompted ("what happens with empty input, null fields, unicode strings").
Generating fixture data that exercises a model's full field surface.
Diagnosing a test failure when given the failure output + the related code.

Does poorly:

Identifying which business invariants matter for a given module without explicit prompting. The agent defaults to function-level tests; invariant tests need to be specified by a human who understands the domain.
Recognizing flakiness vs real failures. The agent treats every failure as a real bug.
Keeping tests decoupled from the implementation. The agent tends to assert on private state, and a test that does so will break the next time the implementation changes, which is often, in AI-driven sessions.
Not over-mocking. The agent will mock anything that fails on first try, including the thing being tested.

The asymmetry: the agent is good at filling in test cases for a specified behavior, weak at deciding what behavior is worth specifying. The human's leverage is in the second part. Write the 8-line invariant tests yourself; let the agent expand them with edge cases.
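The over-mocking failure is easy to miss in review because the test still looks like a test. A minimal sketch with hypothetical classes: the first test patches the method it claims to verify, so it passes even against a broken implementation, while the second exercises the real code.

```python
from unittest import mock


class PriceCalculator:
    def __init__(self, tax_percent: int) -> None:
        self.tax_percent = tax_percent

    def total_cents(self, net_cents: int) -> int:
        return net_cents + net_cents * self.tax_percent // 100


class BrokenCalculator(PriceCalculator):
    def total_cents(self, net_cents: int) -> int:
        return net_cents  # bug: forgets the tax


def over_mocked_test(calc_cls) -> None:
    calc = calc_cls(20)
    # The method under test is replaced, so this asserts on the mock itself.
    with mock.patch.object(calc, "total_cents", return_value=1200):
        assert calc.total_cents(1000) == 1200


def behavior_test(calc_cls) -> None:
    # No mocks: the assertion depends on what the code actually computes.
    assert calc_cls(20).total_cents(1000) == 1200


over_mocked_test(BrokenCalculator)  # passes, and proves nothing
behavior_test(PriceCalculator)      # passes for the right reason
```

Calling behavior_test(BrokenCalculator) raises AssertionError, which is the signal the over-mocked version can never produce.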

$The pattern: human writes invariants, agent writes everything else

The division of labor that works:

Human writes invariant tests. 1-2 per module. They encode the contract the system must maintain. They're the things that, if they fail, indicate a real regression.
Human writes the test-fixture infrastructure. The in-memory DB setup, the fast-suite runner, the mocking helpers. This is the part that's slow to write correctly and easy to break by accident; it pays to have a human author it once and lock it down.
Agent writes function-level tests, edge cases, parameterized variants. Volume work that exercises the full surface of each function.
Agent maintains tests through refactors. When a function signature changes, the agent updates the tests. The human reviews the diff for "did this update preserve the invariants or sneak around them."
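In practice the split can look like the sketch below: a short human-written contract, plus a case table the agent is free to grow. normalize_sku and the specific cases are hypothetical, chosen only to illustrate the shape.

```python
import unittest


def normalize_sku(raw: str) -> str:
    """Hypothetical function under test."""
    return raw.strip().upper()


# Human-authored: the contract. Normalizing twice must change nothing.
def assert_idempotent(raw: str) -> None:
    once = normalize_sku(raw)
    assert normalize_sku(once) == once, f"not idempotent for {raw!r}"


# Agent-authored: volume. Edge cases that exercise the same contract.
EDGE_CASES = ["", "  abc  ", "ABC", "ünï-1", "a\tb", "x" * 1000]


class NormalizeSkuTest(unittest.TestCase):
    def test_idempotent_on_edge_cases(self) -> None:
        for raw in EDGE_CASES:
            # subTest reports each failing case separately.
            with self.subTest(raw=raw):
                assert_idempotent(raw)
```

When reviewing an agent's diff to a file like this, additions to EDGE_CASES are low risk; changes to assert_idempotent are the ones that deserve a careful read.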

On the 143-test Odoo suite that pairs with active production work, the breakdown is roughly 15 human-authored invariant tests + 128 agent-authored function tests. The human-authored tests are about 10% of the count and, by my estimate, 80% of the value.

$Property-based testing earns its keep here

For invariants over wide input domains, property-based testing (Hypothesis in Python) is a force multiplier for AI-driven work. The pattern: write the invariant as a property, let Hypothesis generate the inputs:

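from decimal import Decimal

from hypothesis import given, strategies as st

@given(
    # Decimal bounds keep the limits exact; float bounds carry binary rounding error.
    price=st.decimals(min_value=Decimal("0.01"), max_value=Decimal("99999.99"), places=2),
    qty=st.integers(min_value=0, max_value=99999),
)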
def test_listing_never_publishes_with_invalid_state(price, qty):
    listing = build_listing(price=price, qty=qty)
    result = listing.publish()
    if result.success:
        assert result.published_price > 0
        assert result.published_qty >= 0

Hypothesis biases its generation toward boundary values (the minimum and maximum price, qty=0, qty=99999) and mixes in random samples in between, then shrinks any failure to a minimal example. It finds the cases the agent didn't think to test for, and often the cases the human didn't either.

$The bottom line

The test suite is what lets AI-augmented engineering be safe at velocity. The shape changes: fewer brittle function-level tests, more business-invariant tests, strict speed and determinism rules, and a division of labor in which humans own the contract and the agent owns the volume.

If your existing test suite is slow, flaky, or function-level-only, the upgrade is worth doing before the next AI-driven engineering session. The 4-hour investment in test-suite hygiene saves more time than that in the first AI session that follows.