2026-05-31 · 8-min read · AI engineering · Testing · Odoo

The test suite as AI safety net.

A senior engineer working with Claude Code as their primary tool produces in a day what a 2023-era engineer produced in a week. The unit of work has shifted from "write code, then review it" to "review the agent's plan, let the agent code, run the test suite, review the diff." The test suite is what stands between AI velocity and AI-shipped wrong code. Here's what that test suite has to look like, what it has to catch that traditional tests don't, and how writing tests changes when the AI is the one writing the implementation.

$The new role of the test suite

Pre-AI, the test suite was a quality check: the thing you ran before merging to catch regressions you didn't anticipate. The author of the test wrote it after writing the code, with full knowledge of the implementation. The suite was complementary to careful code review.

Post-AI, the relationship inverts. The agent generates implementations faster than you can carefully read them. A 4-hour session produces 800 lines across 12 files. You can read every line, but you didn't author them, and your understanding is shallower than the agent's. The test suite stops being a complement to careful review and becomes the primary mechanism by which you trust the work. If the tests pass and they cover the right invariants, you can ship.

This shift only works if the test suite has properties most existing suites don't have. Three properties matter:

  1. Coverage of business invariants, not just function behaviors.
  2. Speed. The full suite has to run in under 5 minutes, ideally under 2.
  3. Determinism. Flaky tests poison the agent's feedback loop more than they poison a human's.

$Business invariants, not function behaviors

A traditional unit test asserts "this function returns X for input Y." A business-invariant test asserts "the system never reaches state Z, regardless of which functions were called in what order." The two look superficially similar but catch different bugs.

Example from a multi-marketplace catalog. The function-level test:

def test_listing_price_updates():
    listing = create_listing(price=19.99)
    listing.update_price(29.99)
    assert listing.price == 29.99

Useful, but the agent can pass this by hardcoding the assignment. The invariant test:

def test_listings_never_diverge_from_product_master_after_publish():
    """After any publish() call, listing.price must equal product.list_price (the product.template master price)."""
    product = create_product(list_price=19.99)
    listing = create_listing(product=product)
    # Exercise every code path that could mutate state; each action is a
    # test helper that takes the listing as its only argument.
    for action in [republish, edit_then_publish, async_sync, cron_resync]:
        action(listing)
        assert listing.price == product.list_price, f"Diverged after {action.__name__}"

The invariant test is what catches the AI's plausible-but-wrong implementation. A function-level test verifies the agent did what was specified. An invariant test verifies the agent didn't break something it wasn't thinking about. AI-generated implementations tend to be strong on the first check and weak on the second; the invariant tests pick up the slack.
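To make the difference concrete, here is a minimal self-contained sketch. The `Product` and `Listing` classes, the three actions, and `diverging_actions` are all hypothetical stand-ins invented for illustration, not the Odoo models or helpers used in the real suite. One action is deliberately wrong, so the invariant check has something to catch:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Product:
    list_price: Decimal


@dataclass
class Listing:
    product: Product
    price: Decimal = Decimal("0")

    def publish(self) -> None:
        # Publishing copies the master price onto the listing.
        self.price = self.product.list_price


def republish(listing: Listing) -> None:
    listing.publish()


def edit_then_publish(listing: Listing) -> None:
    listing.product.list_price = Decimal("29.99")
    listing.publish()


def buggy_resync(listing: Listing) -> None:
    # Deliberately wrong: changes the master price after publishing
    # and never republishes, so the listing is left stale.
    listing.publish()
    listing.product.list_price = Decimal("24.99")


def diverging_actions(actions) -> list:
    """Return the names of actions that leave listing.price != product.list_price."""
    failures = []
    for action in actions:
        product = Product(list_price=Decimal("19.99"))
        listing = Listing(product=product)
        action(listing)
        if listing.price != product.list_price:
            failures.append(action.__name__)
    return failures


print(diverging_actions([republish, edit_then_publish, buggy_resync]))
# ['buggy_resync']
```

A function-level test of `publish()` alone would pass for all three paths; only the invariant check, run after each action, flags the stale listing.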

$Speed: the agent has to be able to iterate

The agent's edit-test-edit cycle is the inner loop of AI engineering. If the test suite takes 18 minutes, the agent waits 18 minutes between attempts. A simple task that would converge in 6 iterations now takes 108 minutes of wall time, and you, the operator, are paying for compute the whole time the agent is idle.
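The arithmetic is worth writing out, because it scales linearly with suite time. A rough sketch (the function name is made up for illustration, and it counts only time spent waiting on tests, not the agent's own editing time), comparing the 18-minute suite with the 90-second fast-suite target discussed later in this post:

```python
def wall_time_minutes(suite_minutes: float, iterations: int) -> float:
    """Minutes spent waiting on the test suite across all iterations."""
    return suite_minutes * iterations


print(wall_time_minutes(18, 6))   # 108
print(wall_time_minutes(1.5, 6))  # 9.0
```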

Tactics for fast test suites:

The target: the fast/ suite under 90 seconds, the full/ suite under 5 minutes. On a 143-test Odoo suite, achieving this required the in-memory fixture pattern, parallelization, and a strict no-network rule for fast tests. The investment in test infrastructure paid back inside the first week of AI-driven development.
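One way a no-network rule could be enforced, shown as a minimal sketch rather than the mechanism the suite above actually uses: patch `socket.socket.connect` for the duration of a test so any outbound connection fails immediately instead of hanging. The names `no_network` and `NetworkAccessBlocked` are hypothetical, and the guard only covers `connect()`, not every socket entry point.

```python
import socket
from contextlib import contextmanager
from unittest import mock


class NetworkAccessBlocked(RuntimeError):
    """Raised when code under a fast test tries to open a connection."""


@contextmanager
def no_network():
    """Make socket.connect() fail fast while the context is active."""

    def blocked_connect(self, address):
        raise NetworkAccessBlocked(
            f"fast tests must not touch the network: {address!r}"
        )

    # mock.patch.object restores the original method on exit,
    # even if the test body raises.
    with mock.patch.object(socket.socket, "connect", blocked_connect):
        yield


with no_network():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.connect(("127.0.0.1", 9))
    except NetworkAccessBlocked as exc:
        print("blocked:", exc)
    finally:
        sock.close()
```

In a pytest project, a guard like this could be wrapped in an autouse fixture that applies only to the fast tests, so a stray HTTP call shows up as an immediate, clearly named failure rather than a slow or flaky one.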

$Determinism: flaky tests poison the feedback loop

A flaky test is one that passes 95% of the time and fails 5% of the time for reasons unrelated to the code under test. Humans tolerate flakiness; you re-run, it passes, you move on. The agent does not have the social context to recognize "this test is flaky, ignore it." The agent reads the failure literally, generates a fix for the wrong root cause, and you end up with a "fix" for a non-bug.

Sources of flakiness I've debugged in AI-driven engineering sessions:

The rule: a test that fails once for reasons unrelated to the code under test goes into quarantine immediately. Either fix the flakiness or delete the test. Letting a flaky test live in the suite is corrosive. The agent will eventually try to "fix" it and waste a session iteration on the wrong target.
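Deciding whether a failure is "unrelated to the code under test" is easier with a repeat-run check against unchanged code. The sketch below is one possible shape for that; `classify_test` and the stand-in test are hypothetical names, and 20 runs is an arbitrary choice that will miss rarer flakes.

```python
import itertools
from typing import Callable


def classify_test(test_fn: Callable[[], None], runs: int = 20) -> str:
    """Run test_fn repeatedly against unchanged code and classify it."""
    failures = 0
    for _ in range(runs):
        try:
            test_fn()
        except Exception:
            failures += 1
    if failures == 0:
        return "stable-pass"
    if failures == runs:
        return "stable-fail"
    return "flaky"  # mixed results on identical code: quarantine candidate


_calls = itertools.count(1)


def sometimes_fails() -> None:
    # Stand-in for a timing- or order-dependent test: fails on every 5th call.
    assert next(_calls) % 5 != 0


print(classify_test(lambda: None))     # stable-pass
print(classify_test(sometimes_fails))  # flaky
```

A test classified as "flaky" fails without any code change, which is exactly the signal that it should be quarantined before the agent sees it.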

$What the agent does and doesn't do well with tests

Empirical observations from ~14 months of running Claude Code as primary tool on production Odoo work:

Does well:

Does poorly:

The asymmetry: the agent is good at filling in test cases for a specified behavior, weak at deciding what behavior is worth specifying. The human's leverage is in the second part. Write the 8-line invariant tests yourself; let the agent expand them with edge cases.

$The pattern: human writes invariants, agent writes everything else

The division of labor that works:

  1. Human writes invariant tests. 1-2 per module. They encode the contract the system must maintain. They're the things that, if they fail, indicate a real regression.
  2. Human writes the test-fixture infrastructure. The in-memory DB setup, the fast-suite runner, the mocking helpers. This is the part that's slow to write correctly and easy to break by accident; it pays to have a human author it once and lock it down.
  3. Agent writes function-level tests, edge cases, parameterized variants. Volume work that exercises the full surface of each function.
  4. Agent maintains tests through refactors. When a function signature changes, the agent updates the tests. The human reviews the diff for "did this update preserve the invariants or sneak around them."

On the 143-test Odoo suite that pairs with active production work, the breakdown is roughly 15 human-authored invariant tests + 128 agent-authored function tests. The human-authored tests are about 10% of the count and 80% of the value.

$Property-based testing earns its keep here

For invariants over wide input domains, property-based testing (hypothesis in Python) is force-multiplying with AI. The pattern: write the invariant as a property, let hypothesis generate the inputs:

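from decimal import Decimal

from hypothesis import given, strategies as st

@given(
    # Decimal bounds keep the limits exact; float literals like 0.01 are not.
    price=st.decimals(min_value=Decimal("0.01"), max_value=Decimal("99999.99"), places=2),
    qty=st.integers(min_value=0, max_value=99999),
)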
def test_listing_never_publishes_with_invalid_state(price, qty):
    listing = build_listing(price=price, qty=qty)
    result = listing.publish()
    if result.success:
        assert result.published_price > 0
        assert result.published_qty >= 0

Hypothesis biases its search toward boundary values (0.01, max_value, qty=0, qty=99999) and mixes in random samples in between; when a case fails, it shrinks the input to a minimal reproducer. It finds the cases the agent didn't think to test for, and often the cases the human didn't either.

$The bottom line

The test suite is the thing that lets AI-augmented engineering be safe at velocity. The shape changes: fewer brittle function-level tests, more business-invariant tests, strict speed and determinism rules, division-of-labor where humans own the contract and the agent owns the volume.

If your existing test suite is slow, flaky, or function-level-only, the upgrade is worth doing before the next AI-driven engineering session. A 4-hour investment in test-suite hygiene pays for itself in the first AI session that follows.

By David H. Frost · Frost Labs LLC