Author: mike

  • The Bar That Caught Fire: Why Testing Is About Understanding, Not Just Verification

    A software tester walks into a bar and orders 1 beer, then 0 beers, then 999999999 beers, a lizard, -1 beers, and a “qwertyuiop”. The first actual customer walks in and asks where the bathroom is; the bar bursts into flames, killing everyone.

    This joke circulates among developers because it captures something true and uncomfortable. The tester did everything right by the textbook: boundary values, invalid types, null cases, injection attempts. The system handled all of it. Then a real user arrived with a completely reasonable request that nobody had thought to anticipate, and the whole thing collapsed.

    The gap between what we test and what actually breaks systems is not a failure of process. It is a failure of imagination and, more fundamentally, a failure to understand how software, computers, and users actually behave in the wild. Testing is not a checklist to be completed. It is a discipline of thinking adversarially about systems you have built, which requires understanding those systems at a level that most developers never reach.

    There is a common assumption that senior developers write better code than junior developers. This is only partially true, and only in a narrow sense. What senior developers actually do better is anticipate failure. They have seen enough systems break in enough ways that they develop an intuition for where the weak points are. They know that the code which handles the happy path is rarely where problems surface. The problems live in the transitions, the timeouts, the retry logic, the race conditions, the assumptions about what upstream systems will actually return versus what their documentation claims.

    This is why testing cannot be separated from design. A well-designed system is testable not because someone added test hooks after the fact, but because the designer understood what needed to be verified and structured the code to make that verification possible. A poorly designed system resists testing at every turn: state is hidden, dependencies are implicit, and the only way to know if something works is to run the whole thing and hope for the best.
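
    To make that concrete, here is a minimal sketch in Python; the names (PriceService, RatesClient) are invented for illustration. The testable version receives its dependency explicitly rather than reaching for a hidden global, so a test can substitute a controlled fake without standing up the whole system.

      # Hypothetical example: a price converter whose exchange-rate source is
      # injected rather than hidden, which is what makes it testable in isolation.

      class RatesClient:
          """The real implementation would call an external API; omitted here."""
          def rate(self, currency: str) -> float:
              raise NotImplementedError

      class PriceService:
          # The dependency arrives through the constructor, so tests can pass
          # in a fake with known behaviour instead of a live service.
          def __init__(self, rates: RatesClient):
              self._rates = rates

          def in_currency(self, amount_gbp: float, currency: str) -> float:
              return round(amount_gbp * self._rates.rate(currency), 2)

      class FakeRates(RatesClient):
          def rate(self, currency: str) -> float:
              return 1.25  # a fixed rate: the test controls its world

      def test_conversion_uses_supplied_rate():
          assert PriceService(FakeRates()).in_currency(10.00, "USD") == 12.50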

    The taxonomy of testing reflects this reality, and each category exists because someone, somewhere, learned an expensive lesson about what happens when it is skipped.

    Unit tests verify that individual components behave as specified in isolation. They are fast, cheap, and useful for catching mistakes early, but they tell you nothing about whether the components work together. Integration tests check that the seams between components hold under realistic conditions: that the service actually talks to the database, that the message queue delivers what was sent, that the authentication layer and the business logic agree on what a valid user looks like. End-to-end tests simulate real user journeys through the complete system. They are expensive to write, slow to run, and brittle in the face of change, but they are the only tests that can catch the bar-on-fire problem: the scenario where every component works correctly in isolation but the system as a whole does something catastrophic.
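
    The difference is easiest to see side by side. Below is a hedged sketch using pytest: one test covers a pure calculation in isolation, the other exercises the seam with a real, if local, SQLite database. The order_total function and the schema are invented for the example.

      # A unit test and an integration test for the same hypothetical logic.
      # Requires pytest (for the tmp_path fixture); everything else is stdlib.
      import sqlite3

      def order_total(prices: list[float]) -> float:
          """Pure logic: the natural territory of a unit test."""
          return round(sum(prices), 2)

      def test_order_total_unit():
          # Unit test: one component, no external systems, runs in microseconds.
          assert order_total([2.50, 3.00]) == 5.50

      def test_order_total_integration(tmp_path):
          # Integration test: checks the seam with a real (local) database,
          # i.e. that what was stored is what actually comes back out.
          db = sqlite3.connect(str(tmp_path / "orders.db"))
          db.execute("CREATE TABLE items (order_id INTEGER, price REAL)")
          db.executemany("INSERT INTO items VALUES (1, ?)", [(2.50,), (3.00,)])
          prices = [row[0] for row in
                    db.execute("SELECT price FROM items WHERE order_id = 1")]
          assert order_total(prices) == 5.50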

    Regression testing ensures that what worked yesterday still works today. Every change to a codebase carries the risk of breaking something unrelated, often in ways that are not obvious until a user reports that a feature they relied on has silently disappeared. A regression suite is institutional memory in executable form: a record of every behaviour the system has promised, verified continuously. Smoke tests serve a narrower purpose. They run after every deployment, answering a single question: does this release work at all? They check that the system starts, responds to basic requests, and has not been rendered completely non-functional by whatever just shipped. A failed smoke test means rolling back before users notice. A missing smoke test means users notice first.
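
    A smoke test can be almost embarrassingly small and still earn its keep. The sketch below assumes a hypothetical /health endpoint and uses the requests library; the point is only that the deployment pipeline gets a yes-or-no answer it can act on.

      # Minimal post-deployment smoke check. The base URL and /health endpoint
      # are placeholders; real services expose their own equivalents.
      import sys
      import requests

      BASE_URL = "https://staging.example.com"

      def smoke_check() -> bool:
          try:
              health = requests.get(f"{BASE_URL}/health", timeout=5)
              home = requests.get(BASE_URL, timeout=5)
          except requests.RequestException:
              return False  # unreachable: fail fast, do not guess
          return health.status_code == 200 and home.status_code == 200

      if __name__ == "__main__":
          # A non-zero exit code lets the pipeline roll back before users notice.
          sys.exit(0 if smoke_check() else 1)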

    Performance and load testing ask a different question: not whether the system works, but whether it works at scale. Code that runs perfectly in development can collapse when a thousand users hit it simultaneously. Queries that return instantly against a test database can take minutes against production data. Connection pools exhaust, memory leaks compound, and race conditions that were statistically invisible become statistically inevitable. These failures are difficult to predict from reading code, which is why they must be measured empirically under conditions that approximate reality.
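
    Measuring empirically does not have to mean heavy tooling on day one. The rough probe below, standard library only, fires concurrent requests at a placeholder endpoint and reports latency percentiles; a serious exercise would use a dedicated tool such as Locust or k6 and production-like data volumes.

      # Rough load probe: fire N concurrent requests and look at the latency
      # distribution. The URL is a placeholder; any failed request will raise,
      # which for a quick probe is an acceptable way to learn something is wrong.
      import statistics
      import time
      from concurrent.futures import ThreadPoolExecutor
      from urllib.request import urlopen

      URL = "https://staging.example.com/search?q=widgets"
      REQUESTS = 200
      CONCURRENCY = 50

      def timed_request(_: int) -> float:
          start = time.perf_counter()
          with urlopen(URL, timeout=10) as response:
              response.read()
          return time.perf_counter() - start

      if __name__ == "__main__":
          with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
              latencies = sorted(pool.map(timed_request, range(REQUESTS)))
          print(f"median: {statistics.median(latencies):.3f}s")
          print(f"p95:    {latencies[int(len(latencies) * 0.95)]:.3f}s")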

    Security testing, including penetration testing, treats the system as an adversary would. The joke’s “qwertyuiop” input is a nod to this: an attempt to inject something unexpected and see what happens. But real security testing goes further, probing authentication flows, session management, input handling, and access controls for weaknesses that a motivated attacker could exploit. The consequences of neglecting this category are well documented and increasingly regulated.
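
    A sliver of that mindset can live in the ordinary test suite. The parametrised sketch below throws hostile inputs at a hypothetical search endpoint and asserts only that the service neither falls over nor leaks internals; it is a floor, not a substitute for proper penetration testing.

      # Hostile inputs should be rejected or handled cleanly, never answered
      # with a 500 or a stack trace. The endpoint is hypothetical; this is a
      # baseline check, not a penetration test.
      import pytest
      import requests

      BASE_URL = "https://staging.example.com"

      HOSTILE_INPUTS = [
          "qwertyuiop",
          "' OR '1'='1",                # SQL injection attempt
          "<script>alert(1)</script>",  # reflected XSS attempt
          "../../etc/passwd",           # path traversal attempt
          "A" * 100_000,                # oversized payload
      ]

      @pytest.mark.parametrize("payload", HOSTILE_INPUTS)
      def test_search_survives_hostile_input(payload):
          response = requests.get(f"{BASE_URL}/search",
                                  params={"q": payload}, timeout=5)
          assert response.status_code < 500        # no unhandled server error
          assert "Traceback" not in response.text  # no internals leaked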

    User acceptance testing puts the software in front of actual users, or representatives of them, to verify that what was built matches what was needed. This is where you discover that the feature works exactly as specified but the specification was wrong, or that the workflow makes perfect sense to developers but is incomprehensible to anyone else. Exploratory testing is a close relative, but less structured. It is the practice of using the system without a script, following curiosity, and actively trying to break things. A good exploratory tester is not verifying requirements; they are probing assumptions. They are the person who might have asked where the bathroom was, not because it was in the test plan, but because they were thinking about what a real customer might actually do.

    Finally, there is chaos engineering: the deliberate injection of failure to verify that the system degrades gracefully. Kill a service. Drop a network connection. Corrupt a configuration file. Slow the database to a crawl. The question is not whether these things will happen in production, but when, and whether the system will recover or cascade into a wider outage. This practice formalises what experienced engineers already know: that resilience is not a feature you add at the end, but a property that must be designed in and continuously verified.
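
    In miniature, one such experiment looks like the sketch below: the recommendation service is replaced by a stub that always fails, and the test asserts that the page degrades to a sensible default rather than erroring out. All the names are invented for illustration.

      # One chaos-flavoured experiment in miniature: kill a dependency and
      # check that the caller degrades gracefully instead of cascading.

      class RecommendationsDown(Exception):
          pass

      class DeadRecommendationService:
          def top_picks(self, user_id: int) -> list[str]:
              raise RecommendationsDown("connection refused")

      def render_homepage(user_id: int, recommender) -> dict:
          try:
              picks = recommender.top_picks(user_id)
          except RecommendationsDown:
              picks = []  # degrade: serve the page without recommendations
          return {"status": "ok", "recommendations": picks}

      def test_homepage_survives_dead_dependency():
          page = render_homepage(42, DeadRecommendationService())
          assert page["status"] == "ok"
          assert page["recommendations"] == []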

    The bar joke lands because it exposes a gap between verification and validation. The tester verified that the system handled a range of inputs correctly. What they failed to validate was whether the system could handle the kind of requests it would actually receive. The customer asking for the bathroom is not an edge case in any technical sense. It is a completely normal interaction that simply was not part of the test plan, because the test plan was designed around inputs to the ordering system rather than interactions with the bar as a whole.

    This is where experience becomes irreplaceable. A junior developer writes tests that confirm the code does what they intended it to do. A senior developer writes tests that probe what the code does when their intentions are violated. They ask: what happens when this external service is slow? What happens when it returns malformed data? What happens when two users try to do the same thing at the same moment? What happens when the disk fills up, the network drops, the clock skews, or the configuration file is missing a field that was only added last month? These are not exotic scenarios. They are Tuesday.
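
    One of those Tuesdays, written down as a test: the upstream service returns malformed data, and the client is expected to fail in a single, well-defined way rather than pass garbage downstream. The names in the sketch are illustrative.

      # Malformed upstream data should surface as one controlled error type.
      import json
      import pytest

      class UpstreamDataError(Exception):
          pass

      def parse_account(raw: str) -> dict:
          try:
              data = json.loads(raw)
              return {"id": int(data["id"]), "balance": float(data["balance"])}
          except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
              raise UpstreamDataError(f"unusable upstream payload: {exc}") from exc

      @pytest.mark.parametrize("raw", [
          "not json at all",
          '{"id": "abc", "balance": 10.0}',  # wrong type for id
          '{"balance": 10.0}',               # missing field
          '{"id": 1, "balance": null}',      # null where a number was promised
      ])
      def test_malformed_upstream_data_is_rejected_cleanly(raw):
          with pytest.raises(UpstreamDataError):
              parse_account(raw)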

    The economics of testing are often misunderstood. Many organisations treat testing as a cost centre, something to be minimised or outsourced. This gets the model backwards. Testing is not an expense incurred after development. It is a form of knowledge acquisition that happens during development. Every test written is a statement about what the system is supposed to do, preserved in executable form. Every test that fails during development is a bug caught before it reaches users. Every test that fails in CI is a regression prevented. The cost of testing is visible and immediate. The cost of not testing is diffuse and delayed, which makes it easy to ignore until a production incident makes it impossible to ignore.

    There is also a deeper point about what testing teaches the people who do it. Writing tests forces you to think about interfaces, contracts, and failure modes in a way that writing production code does not. You cannot test a function without understanding what it promises to do and what it requires in return. You cannot write an integration test without understanding how components communicate. The discipline of testing makes you a better designer, because it makes the consequences of design decisions immediate and concrete.

    None of this is to argue that more testing is always better. Test suites can become liabilities: slow, flaky, and full of tests that verify implementation details rather than behaviour. The goal is not coverage as a metric but confidence as an outcome. A small number of well-chosen tests that exercise the critical paths and failure modes of a system are worth more than thousands of tests that merely confirm the code was written the way it was written.
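
    The difference shows up in miniature. In the hedged sketch below (the class and its helper are invented for the example), the first test asserts on the contract and survives a rewrite of the internals; the second pins a private detail and breaks the moment the implementation changes, even though the behaviour is identical.

      # The same hypothetical code, tested two ways.
      from unittest.mock import patch

      class EmailNormaliser:
          def normalise(self, raw: str) -> str:
              return self._strip(raw).lower()

          def _strip(self, raw: str) -> str:
              return raw.strip()

      def test_behaviour():
          # Asserts on the contract: what goes in, what comes out.
          assert EmailNormaliser().normalise("  Alice@Example.COM ") == "alice@example.com"

      def test_implementation_detail():
          # Pins a private helper: brittle, and verifies nothing a user cares about.
          with patch.object(EmailNormaliser, "_strip",
                            return_value="Alice@Example.COM") as spy:
              EmailNormaliser().normalise("  Alice@Example.COM ")
          spy.assert_called_once()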

    The bar caught fire because nobody tested whether the system could handle a question it was not designed to answer. That is not a failure of the tester. It is a failure of everyone involved to understand that real systems exist in real environments where users do unexpected things, dependencies behave unexpectedly, and the assumptions baked into the code will eventually be violated. Testing is the practice of systematically discovering those assumptions before users discover them for you. It is not a phase of development. It is a mode of thinking, and it is one of the clearest markers of genuine seniority in the field.

  • Directed Intelligence: How Senior Developers and AI Actually Work Together

    This briefing sets out a practical view of how software development is changing, and where the real value is likely to sit over the next few years. It argues that the most effective model is not “AI replaces developers”, nor the current fashion for loosely directed prompt-driven “vibe coding”, but a partnership: an experienced architect or senior developer providing structure, judgement, and intent, with an AI coding assistant handling research-heavy and execution-heavy work.

    The distinction matters. Used well, AI compresses delivery timelines and raises baseline quality. Used badly, it produces brittle systems that look complete until they fail under real use.

    At its core, professional software development has always been about decision-making under constraint. Requirements are incomplete, trade-offs are unavoidable, and the consequences of early choices often surface months later. Senior developers and architects earn their keep by shaping those decisions: choosing boundaries, defining contracts, sequencing work, and knowing which problems deserve precision and which can be left flexible. None of that disappears with AI. In fact, it becomes more important.

    What does change is where time is spent. A capable AI assistant is exceptionally good at tasks that traditionally consumed disproportionate effort: scanning API documentation, comparing libraries, recalling edge-case behaviour, generating idiomatic code in unfamiliar frameworks, and stitching together boilerplate that is correct but tedious. It can draft service clients, data models, validation layers, and test scaffolding faster than any human, provided it is given clear direction and constraints.

    The senior developer’s role shifts upward. Instead of writing every line, they define the shape of the system, the invariants that must hold, and the failure modes that must be avoided. They tell the AI not just what to build, but how and why: which architectural pattern to follow, how state flows through the system, what performance or security assumptions apply, and what “done” actually means. The AI then executes within those bounds, pulling in the right APIs, examples, and documentation to produce working code that aligns with the intent.

    This is a very different proposition from what is often described as “vibe coding”. In that model, the human input is vague and outcome-focused: “the user needs to log in”, “add a dashboard”, “make it scalable”. The AI fills in the gaps as best it can. Sometimes the result looks impressive on first run. More often, it’s a tangle of assumptions: authentication logic mixed into UI code, hard-coded secrets, undocumented state transitions, and dependencies chosen because they were common in training data rather than appropriate for the problem.

    The issue with vibe coding is not that the AI writes bad code in isolation. It’s that the system has no spine. There is no explicit model of the domain, no clear separation of concerns, and no shared understanding of what must remain stable over time. Each new prompt layers more behaviour on top of an already fragile base. The codebase becomes difficult to reason about, which means difficult to change safely. That is exactly the opposite of what most organisations need.

    By contrast, a directed AI workflow starts with structure. An experienced developer will articulate things that feel obvious to them but are critical for the AI: where authentication lives, how identity is represented, what guarantees the API makes to clients, how errors propagate, and which parts of the system are allowed to know about which others. They will specify non-functional requirements early, because they know those are expensive to retrofit. They will review output not line by line, but at the level of intent: does this code respect the architecture, or has it quietly undermined it?
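
    One way that structure gets made explicit is to hand the assistant contracts rather than vibes. The Python sketch below is a hypothetical example of the kind of boundary a senior developer might pin down before any code is generated; every name, type, and guarantee in it is invented for illustration.

      # Structure first: the contract is fixed by a human before implementation
      # is generated against it. Names, types, and guarantees are illustrative.
      from dataclasses import dataclass
      from typing import Protocol

      class InvalidToken(Exception):
          pass

      @dataclass(frozen=True)
      class Identity:
          user_id: str           # opaque identifier, never an email address
          roles: frozenset[str]  # authorisation decisions are made elsewhere

      class TokenVerifier(Protocol):
          def verify(self, bearer_token: str) -> Identity:
              """Return the Identity for a valid token.

              Must raise InvalidToken on failure (never return None), must not
              make network calls on the hot path, and is expected to complete
              in well under 10 ms. The assistant implements this; the contract
              does not move.
              """
              ...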

    There is also a governance benefit. When the human is clearly accountable for design decisions and the AI is treated as an implementation tool, review and audit become tractable. You can explain why a library was chosen, why a pattern was followed, and where responsibility lies. That is much harder when the development process is effectively a series of improvised prompts.

    None of this is to say the model is risk-free. AI output still needs scrutiny, especially around security, licensing, and subtle correctness issues. It can be confidently wrong. The difference is that an experienced developer knows where to look and what questions to ask. They understand which parts of the system are safety-critical and which are not. That judgement remains a human responsibility for the foreseeable future.

    The direction of travel is clear. Teams that treat AI as a junior but extremely fast contributor, guided by senior hands, will build better systems more quickly. Teams that treat it as an oracle and replace design thinking with vague intent will accumulate problems they don’t see until production. The technology is the same in both cases. The outcome depends on whether experience is used to steer it, or sidelined in favour of convenience.