Trevor Wagner

Project-Focused Software Engineer, QA Automation

Exploring Three Types of Narrative Often Reflected in Highly-Readable Test Code

A view of Minneapolis in yellow, taken from within the Amber Box at the Guthrie Theater

Photo by Mick Haupt on Unsplash

When I work with test code (especially test specifications) I'm unfamiliar with, more often than not I find myself trying first to develop a reading of that code. The more clearly I understand (and can walk through, as a series of events) what's important about the code I'm looking at, the easier time I find I have working with the code on my own or collaborating with others.

To develop a reading, I often find myself walking through the series of events defined within code, definitions and naming, the overall organization of tests and test support code, and the relationship of a particular specification to others near it in order to be able to summarize in terms that make sense to me (and likely also to others) what the code is doing.

Most often, I find myself trying to get familiar enough in this manner to be clear on two things:

  • What the test code does that is valuable. What, effectively, does the code accomplish? Within the ways test code does things like get and set values, define and execute assertions, interact with external systems, interact with test runtime, and report on state or status produced within test runtime, what does it actually do that is valuable, in terms of defining and running a test?
  • Why it is that whatever the test code does is valuable. If we zoom out on what test code does, how might we characterize the ways in which what it does is valuable? For that matter, how does test code connect what we understand it does back to the sorts of data and feedback we depend on it (as test code) to provide? Why is that data and/ or feedback valuable?

My general experience has been that, as a sort of headline- or high-level summary, these statements most often serve as a good first step to develop a well-grounded reading of any code.

Regardless, the focus of this post will be limited to test code.

If it seems reasonable to characterize these two statements as lines of inquiry that focus an evaluation on the what and the why (respectively) giving test code its interpretive context, then developing a reading in response to these lines of inquiry requires us to engage directly with the how. Specifically, test code engages with three different sets of concerns to deliver the sort of value we connect to the why (and to help us understand more clearly how it supports that why):

  • How the solution/ system under test currently functions or behaves with respect to technical expectations. How do test operations (or the ways they define evaluation of current functional state, as encoded within test code) expose expectations for how the system under test currently functions as a system? And for that matter, how well does the solution conform to those expectations?
  • How the way the solution functions currently satisfies stakeholder expectations. How do test operations (or their functionality -- again, as encoded within test code) provide a clear indication of the way we expect the system under test (as a system) to successfully deliver value in response to a set of clearly-defined stakeholder expectations?
  • How the ways in which any specification conducts its runtime serve to arrive at and execute one or more assertion statements with as high a degree of runtime integrity as possible. As the specification is defined, how does it provide a clear set of conditions for success, and how does the specification make sure test runtime produces clear and actionable feedback whatever (as far as we can predict) ends up happening within test runtime?

Code (of any type) that is highly readable tends also to be efficient for the development organizations that depend on it, mainly because many activities (with test code specifically: beyond writing and running the tests) involve reading the code. Efficiency in troubleshooting/ debugging, refactoring, reorganizing, and dispositioning tests (as well as test support code) all depends on the ability of reviewers to develop a clear reading responding to the two lines of inquiry outlined at the beginning (above).

Test code is no different. The easier it is to develop a reading not just of how the code defines operations (which on its own is highly valuable) and how it functions as a test but also of how a test delivers on the mission it was given in a way that is valuable (and why that mission is valuable bigger-picture), the easier it is to develop a sense of context that makes any activities that depend on code readability easier both to plan and to execute.

It's a lot to consider, but these are things I've observed both other engineers and myself walk through as we try to develop a clear understanding of test code we are working with, before we attempt either to work on it ourselves or to explain our readings to others.

The more time and cognitive load it takes to develop these sorts of readings, the less efficient activities that depend on them (and, by extension, the code that requires them) tend to be -- especially at scale.

Conversely, the simpler we can potentially make any of this (either as writers or as readers), generally the easier the code is to read and the more efficient it is to work with.

Nevertheless, in order to be able to relate what test code is doing, I often find myself engaging in analysis and interpretation of the code just to develop an understanding of even the three lower-level statements describing how, listed above. And sometimes the how (let alone its connection to the what and the why) can be fairly complex, if not potentially also complicated.

Sometimes this also includes test code I've written. No one (and, for that matter, nobody's code) is immune.

As an example, consider the following specification:


def test_confirm_unauthenticated_post_request_returns_http_403(unauthenticated_client):
    # arrange
    path = "/endpoint"
    request_body = None

    # act
    response = unauthenticated_client.do_POST(path=path, payload=request_body)

    # assert
    assert response.status == 403

As a counterexample, also consider the following code (which we will presume is functionally equivalent):


def test_expected_unauthorized_rejection_message():
    result = execute_post_request("/endpoint", None).status

    assert result == 403

For reference, it could also be argued that the early declaration of path and request_body is unnecessarily verbose and runs the risk of making the specification at least nominally less performant during test runtime. To focus more directly on readability, we're going to ignore that here.

A middle ground between these two might include defining these values at the point they are passed as arguments, i.e.: unauthenticated_client.do_POST(path="/endpoint", payload=None).
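As a hedged sketch of that middle ground (reusing the hypothetical unauthenticated_client and /endpoint from the first snippet, and written as if unauthenticated_client were provided by a pytest fixture), the specification might look like this:

def test_confirm_unauthenticated_post_request_returns_http_403(unauthenticated_client):
    # act: declare the request inputs at the point they are passed as arguments
    response = unauthenticated_client.do_POST(path="/endpoint", payload=None)

    # assert
    assert response.status == 403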

These first two code samples are each fairly simple, but it will likely take more work to argue successfully that the second sample is easier to develop a clear, comprehensive reading of without additional analysis (and interpretation).

Conventionally, it seems like a lot of discussions of code readability focus on elements of formal readability, like how pointers, methods, and other pieces of code get named, and how structures and syntax get used to convey the how as a means of suggesting the what. If developers could just use the code (for example, follow a style guide) to encode all of the detail we expect them to access when reading through specifications and test support code, the result should be code that is more readable.

Meanwhile, I've found that test code that is highly readable also tends to be highly relatable: it makes its own opportunities to suggest to the reader how it handles each of the three types of how it is responsible for and how each relates to the what and the why. The code opens itself to reading, along any of the lines provided above, and may even hint at a likely interpretation.

The medium we have to convey all of these understandings into a format we can have confidence will likely be readable later (aside from comments and documentation) is most often the same set of programming statements (among them, calls to lower-level APIs like the test runner and assertion libraries) we use to define the specification itself and any code that supports it.

Humanity has a tool that we've been using for (at least) tens of thousands of years to help convey understandings by making series of events relatable -- to use an accounting of what happens to convey a clear understanding of why it happened and (if all goes well) provide an opportunity for the audience to extract from that accounting why any of it may have been valuable. For its age, it's still surprisingly useful for making the what of nearly any phenomenon, occurrence, or process easy to digest.

The name of that tool is narrative.

Narrative is something we can find throughout our history, stretching from cave paintings to modern sculpture. We can hear it in music and poetry, see it on our screens, and visit places where it plays out before us on a stage. It's often the tool of choice we use to deliver and access the evening news.

We also use it to describe complex technical processes, like how pistons, valves, spark plugs, and other components interoperate at each stage of the four-stroke engine cycle.

Despite its wide and extended use throughout human history (itself, arguably, a form of narrative or a collection of narratives), narrative remains a durable and effective tool that virtually anybody can use to make series of events (technical or nontechnical) accessible and transferrable by making them relatable.

This utility also applies to test specifications, where three distinct series of events thread together to link functionality exhibited by work product as a system, an attempt by stakeholders to access value from work product, and the ways in which the set of operations defined for a specification conduct a test that serves as a predictable, reliable source of data and feedback.

As anybody who has spent time refining their craft telling stories likely understands, it's one thing to tell somebody a story, and it's another to show them. This post will explore each of the three narratives describing how in greater detail, including why each narrative is valuable, how the narrative works, and showing (at a high level) some of the places it might be possible to find (or hint at) these sorts of narrative within and between test specifications.

An Outline of How, as a System, Work Product Exhibits Behaviors That Are Functionally Correct

Every well-written test specification is tasked somehow with confirming that a feature set/ solution has either behaved in a manner that is acceptable or that it has not behaved in a manner that is unacceptable. When testing software behaviorally, this is generally the main goal: isolate a behavior, extract evidence of that behavior having taken place (or not having taken place) by way of output produced by the system, and compare the character (or easily-comparable aspects) of that output to a set of expectations responsible for defining a set of conditions test runtime must meet in order to exit with a status of [PASS].

As a scenario, a test specification shines a spotlight on a subset (or, in some cases, the full set) of functionality that produces this output. If an API client attempts to submit a POST request to an HTTP endpoint, for example, without having authenticated, maybe we expect the endpoint to return a response like HTTP/1.1 403 Forbidden.

A behavior like this (and the output it produces) is easily testable. The status code for this response is what each of the two code snippets provided in the introduction (above) evaluates for.

There is likely a separate set of reasons that explain why the behaviors exhibited within functionality are valuable in the context of stakeholder concerns (this is developed in the next section, below), but within the narrative described here, what is most valuable is functional correctness -- that, in essence, the solution produces the expected behavior given a set of conditions. These conditions may include the execution of an action that triggers production of the output that gets evaluated.

Programmers encode instructions for the solution into the program code used to define the behaviors that produce this response (i.e. the response to an HTTP request). Within this, the chances are relatively good that, to accommodate any HTTP request received by the POST endpoint (at /endpoint), there is some set of code parts within the solution responsible for guiding (or routing -- maybe even handling) the request through the system and processing any data it presents (within headers, for example, or within the POST request body/ payload) in order to produce a response appropriate for the request.
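Purely as a hypothetical illustration (a Flask-style sketch of my own; nothing here comes from the original post or from any particular system under test), a code part like that might look as simple as:

from flask import Flask, abort, request

app = Flask(__name__)


@app.route("/endpoint", methods=["POST"])
def handle_post():
    # Reject the request before any data it presents gets processed
    if "Authorization" not in request.headers:
        abort(403)  # short-circuits into an HTTP 403 (Forbidden) response
    return {"accepted": True}, 201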

Some might describe the paths that data or other input take within the system under test as codepaths, especially where every piece of code along the path handles the request according to the sort of logic that defines things like control flow (for example, conditional branching). While this is generally recognizable as a term, I try to avoid using it because it fails to describe completely what defines functionality; it describes how the solution is composed.

Within a four-stroke vehicle engine, it's not only the composition of a piston, a valve, or a spark plug that helps marshal and process air, fuel, and spark through the four-stroke cycle: their functionality (including their current functional state) plays a role that is arguably equally as important.

Within software, although code plays a key role in defining how functionality can expect to behave, it is not the only influence on behavior.

To reiterate a little, the questions I find myself starting with most often when developing a reading of any code (test code included) involve what the code does that's important and why what the code does is important. One of the specific things well-written specifications do that is valuable is to confirm that the system under test has behaved in a manner that is functionally correct. The ways in which a test specification can help make it clear what is expected to be happening within the system under test at the time (including how the way output is evaluated relates to this set of expectations) are what make this narrative relatable.

Where to Find a Reflection of This Outline

Naturally, we can see evidence of this narrative in the assertion statements used to define how a test will compare one or more sets of output to a relevant set of expectations. The clearer the relationship can be made (and in definitive terms) between what gets asserted and how this relates to a clear understanding of functional correctness, the more clearly a specification can help make this narrative relatable.

Also, I often see this reflected in whatever name is used to describe the specification. Whether the test runner allows a natural-language description to name the specification or only a method name, either provides an opportunity: if the assertion statement can be thought of as a sort of moment of truth within test runtime, the specification name is a chance to advertise what is expected to happen when that moment arrives.
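As a small, hypothetical illustration (the failure message below is my own addition, not part of the original snippets), a descriptive specification name and an assertion that explains itself can advertise the same expectation from two directions:

def test_confirm_unauthenticated_post_request_returns_http_403(unauthenticated_client):
    response = unauthenticated_client.do_POST(path="/endpoint", payload=None)

    # The name above advertises the expectation; the message below restates it at the moment of truth.
    assert response.status == 403, (
        "expected an unauthenticated POST to /endpoint to be rejected with HTTP 403"
    )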

This narrative is also reflected in the aperture, or approach, the test defines to isolate and evaluate data. This is one reason why specifications benefit from a limited number of assertion statements.

Within the snippets provided in the introduction (above), the assertion statement in each specification evaluates for HTTP/1.1 403 within the status code of the response to the HTTP request that specification places. The first sample, though, makes it clear that what's being evaluated is response.status, which is a little more descriptive than what is provided in the second code sample (result). An attentive reader will likely be able to extract from context that 403 is likely a status code commonly used by HTTP. This still requires some analysis and interpretation, though, where the first example makes this relationship clear more proactively.

Also, the first snippet makes it clearer how the status of unauthenticated_client relates to the output that gets evaluated within the assertion statement (advertised as returns_http_403 in the specification name) and to the type of HTTP request involved (post).

Additionally, the first snippet also initializes response as the value returned by unauthenticated_client.do_POST(path=path, payload=request_body). The name of the pointer unauthenticated_client reflects the state of the client account (currently no user/ login session) within the system under test.

Within tests I write, I've been known to boil everything down to a simple value named result, which is what ends up getting compared against within the assertion statement. Where I can, though, I still try to shape the set of operations leading to the value stored within result so that it remains clear which output from the system under test result is responsible for storing as it gets evaluated to confirm functional correctness.
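A hedged sketch of what that can look like (reusing the hypothetical client and endpoint from the introduction): even when the comparison boils down to a value named result, the statements leading to it can still spell out which output result stores:

def test_confirm_unauthenticated_post_request_returns_http_403(unauthenticated_client):
    response = unauthenticated_client.do_POST(path="/endpoint", payload=None)

    # result stores the HTTP status code returned for the unauthenticated POST request
    result = response.status

    assert result == 403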

A Summary of How Stakeholders Can Expect to Access Value from Functionality as a Solution

In addition to evaluating for functional correctness, test specifications provide data and feedback that organizations that depend on them can use to determine whether a current iteration of work product will likely satisfy stakeholder concerns. To return to the engine/ motor vehicle analogy: regardless of whether a driver understands how a motor vehicle works, what's generally most important to most consumers is that the motor vehicle conveys the driver (and possibly also passengers) from a starting point to a specific destination. Also that the vehicle turns in the ways the driver expects when he/ she/ they expect it to. Also that the vehicle is capable either of stopping or of signaling the driver's activity to other drivers.

These aren't functional descriptions; they are value statements.

From a certain point of view, these two things (functional correctness and capability to satisfy stakeholder concerns) can easily seem like the same thing; in testing, though, that's not necessarily the case. The two generally go hand-in-hand, and because of this relationship it can be tempting to conflate them, especially for the sake of brevity.

Along similar lines, let's return to the HTTP/1.1 403 Forbidden response evaluated in the code snippets provided in the introduction and explored briefly in the previous section (above). While on one hand it's valuable to test for functional correctness, there are also reasons, valuable to stakeholders, to confirm that an endpoint within a system where access to an HTTP API is controlled by authentication returns a response like this to a request from a client that has not yet authenticated:

  • By returning only HTTP/1.1 403 Forbidden (and not either accepting any data submitted via the POST request or returning a response containing any data that might otherwise be expected as a response to the original HTTP request), the API endpoint restricts access to data and other system elements for which authentication was required to gain access.
  • By returning a status code of 403 and a reason message of Forbidden specifically, the API endpoint provides clear signal to users about why it is not possible either to submit data or (if the endpoint makes the option available) to view the status of any data expected to be returned.

The first reason is likely valuable to anybody who has entrusted data to a system that requires some sort of authentication for access (especially restricting access based on something like login or access roles): the system is going to need to successfully allow all users granted access (and only users granted access) to access the data. At the same time, the second reason (clear signal to users) is likely valuable at least to users developing client code for the API, in that it provides usable feedback to help make activities like troubleshooting easy (also readable as efficient).

Within this, what gets rendered within a single header line of an HTTP/1.1 response can, across one or more anticipated use scenarios, be expected to meet the needs of two different sets of stakeholders. Along similar lines, the brakes or turn signals on a car serve secondary stakeholders at the same time as (and often, coincidentally, in the service of) primary stakeholders.

If (to reiterate) our primary interest when developing a reading of test code involves understanding what test code does that is important and why what it does is important, the same way verifying functional correctness serves as a form of how, validating capability to satisfy concerns held by stakeholders also serves as a form of how.

Where to Find a Reflection of This Summary

When we structure tests in terms of Given-When-Then, we assign a formal structure to a narrative related to stakeholder value:

  • Given generally summarizes a starting state somewhere between the stakeholder and the system (or maybe strictly within the system as relevant to the stakeholder). Maybe an HTTP client has not yet authenticated (and established a user session). Maybe a motorist has the vehicle in motion.
  • When describes a point of no return, often where the stakeholder undertakes some sort of action that triggers the behavior that produces the data that gets evaluated within the assertion statement. Within the HTTP scenario expanded on above, this is when the client submits a request while not yet authenticated (specifically, the When step is the request itself). If a driver is going to try to stop a motor vehicle, it's the point where the driver engages the brakes (brake pedal in a car, brake lever or pedal on a motorcycle, etc.).
  • Then usually links the evidence that gets evaluated to some sort of outcome that follows Given and When if a stakeholder gets what they want from the system under test. Imagine a server returning HTTP/1.1 403 Forbidden in response to the request or the motor vehicle stopping once the driver engages the brake.

With this in mind, BDD testing tools (especially test runners that drive tests defined using natural language) can be helpful in that they allow testers to define tests by describing what's important.
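For a hedged sketch of how that structure can surface even without a BDD runner (the name and docstring here are mine, not from the original snippets), the scenario from the introduction might be annotated in Given-When-Then terms:

def test_unauthenticated_post_request_is_rejected_with_http_403(unauthenticated_client):
    """Given a client that has not authenticated,
    when it submits a POST request to /endpoint,
    then the endpoint responds with HTTP 403 Forbidden."""
    # Given: the unauthenticated_client fixture represents a client with no login session

    # When: the client submits the POST request
    response = unauthenticated_client.do_POST(path="/endpoint", payload=None)

    # Then: the endpoint rejects the request
    assert response.status == 403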

Arguably, neither of the snippets provided in the introduction do a great job of conveying on their own how they serve to support stakeholder concerns like those outlined above for the HTTP POST endpoint at /endpoint. This owes in part to the fact that they mainly seek to confirm functional correctness (see the previous section, above) and that they limit the evaluations they perform to a minimum number of assertion statements.

Despite this, it should be possible for a number of tests with limited scope to cover stakeholder concerns like these collectively. For example, three (or four) specifications should generally be sufficient to cover the concerns outlined above, related to /endpoint returning HTTP/1.1 403 Forbidden in response to an unauthenticated POST request:

  • One specification to confirm an HTTP status code of 403 (i.e. the first snippet provided in the introduction).
  • One or two specifications confirming that the response body contains the expected description and/ or that a reason message of Forbidden was provided for the response.
  • One specification to confirm that no data restricted to authenticated users was provided in the response.

When colocated (or otherwise executed together), specifications like these help paint a clear picture of the stakeholder concerns they evaluate on behalf of (as a means of producing signal that informs data and feedback).

In JavaScript and Groovy, I've gotten a lot of mileage out of collating specifications like these, so that from one API call I can evaluate for each concern (while at the same time limiting I/O).
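A hedged Python sketch of the same idea (the fixture, its module scope, and the response attributes status, reason, and body are all assumptions of mine, and unauthenticated_client would need to be available at module scope or broader): place the POST request once in a shared fixture and let several narrow specifications evaluate the concerns it produces:

import pytest


@pytest.fixture(scope="module")
def unauthenticated_post_response(unauthenticated_client):
    # One POST request per module; each specification below evaluates a single concern from it.
    return unauthenticated_client.do_POST(path="/endpoint", payload=None)


def test_confirm_response_status_is_403(unauthenticated_post_response):
    assert unauthenticated_post_response.status == 403


def test_confirm_response_reason_is_forbidden(unauthenticated_post_response):
    assert unauthenticated_post_response.reason == "Forbidden"


def test_confirm_response_contains_no_restricted_data(unauthenticated_post_response):
    # No data restricted to authenticated users should leak into the response body.
    assert not unauthenticated_post_response.body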

In a real-world scenario on a team I worked with, this sort of approach led to the reevaluation of a formal specification (for an HTTP API, dealing with general stakeholder concerns related to what has been outlined in this section). Because the tests (and the test plan) made it clear to the delivery team both what tests evaluated and why (in terms of stakeholder concerns) what tests evaluated was valuable, the team was able to determine with ease that parts of the specification as currently written did not address all stakeholder concerns addressed within the test plan.

An Accounting of How Test Code Will Produce Signal by Extracting and Evaluating Output

Beyond the ways in which we expect to understand how a test will confirm a solution exhibits functional correctness and beyond the ways we expect to understand how a test will confirm support for stakeholder concerns, there is also a third reading of how that helps us understand what test code does and why what test code does is valuable. That reading responds to a narrative describing how the specification conducts runtime within a test to (ultimately) process any assertions it is responsible for.

This should likely be relatively intuitive to anybody who writes code (of any type): if we accept that program code allows software developers to encode instructions for program runtime, test code allows developers to encode instructions for test runtime. At the same time a specification serves as a set of instructions, it also serves as a test, where test runtime leads to a point where one or more assertion statements (hopefully as few as necessary) evaluate output extracted from the system under test.

Most often, test code does one of five things in test runtime:

  1. Gets and sets (possibly unsets) values (either local or external to the test).
  2. Defines and executes assertions.
  3. Interacts with external/ asynchronous systems or functionality.
  4. Interacts with test runtime itself (consider the static wait).
  5. Reports on state or status produced within test runtime.

Within the code used to define operations, every specification reflects a narrative describing how these activities are defined and ultimately executed within test runtime. Test specifications that are readable also make this story relatable.
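As a hedged, hypothetical sketch (the endpoint, client methods, and response attributes below are stand-ins of my own, not anything from the original snippets), a single specification might touch all five of the activities listed above:

import time


def test_confirm_submitted_job_eventually_reports_complete(authenticated_client):
    # 1. gets and sets a value local to the test
    payload = {"job": "export"}

    # 3. interacts with an external system (here, the system under test)
    created = authenticated_client.do_POST(path="/jobs", payload=payload)

    # 4. interacts with test runtime itself (a static wait -- used sparingly)
    time.sleep(1)

    # 3. interacts with the external system again to retrieve output
    status = authenticated_client.do_GET(path=f"/jobs/{created.body['id']}")

    # 5. reports on state produced within test runtime
    print(f"job state after one second: {status.body}")

    # 2. defines and executes an assertion
    assert status.body["state"] == "complete"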

Where to Find a Reflection of This Accounting

When we structure tests in terms of Arrange-Act-Assert, we assign a formal structure to this narrative:

  • Arrange denotes (or defines) the point in the specification where actions are taken and values are set/ accessed in order to be able to proceed with act and assert (below).
  • Act denotes the point in the specification where an action is taken that either triggers the behavior that produces test results (which will ultimately be evaluated within assert) or where output is retrieved from the system under test.
  • Assert denotes the point in the test where output extracted from the system under test is compared to a set of expectations defined for the specification as a test.

When zoomed out far enough, Arrange-Act-Assert can easily resemble Given-When-Then: there are three steps, within which step two is an action that serves as a pivot point leading to step three, where actual results are compared to expectations. There are also two noteworthy differences here, though:

  • While Given-When-Then deals specifically with a workflow (presumably from the fixed perspective of a stakeholder for whom a presumed link between When and Then is valuable), Arrange-Act-Assert does not deal with a workflow.
  • While Arrange-Act-Assert deals specifically with the mechanics of arranging and executing parts of a test, Given-When-Then does not deal at all directly with the mechanics. Taken literally, they frame different retellings of what happens within a test specification.

Beyond Arrange-Act-Assert, I often find that the way the structure of a specification leads (clearly) to an assertion statement helps clarify any progression through a narrative fitting this general description. If I can take for granted that the assertion statement is assert response.status == 403, then the question becomes how we got there. If code above the assertion statement makes it easy to understand how we got to response (within the first snippet provided in the intro: unauthenticated_client.do_POST(path=path, payload=request_body)), now we have the beginnings of a path leading to the assertion statement. If we can make it clear earlier within the specification how we got to the unauthenticated_client.do_POST() statement, then the specification effectively draws a straight line (through each statement to the assertion statement) showing how test runtime leads to the point where the output being evaluated either conforms to expectations or it does not.

When I work with Software Developers learning how to write test automation, I tend to coach them to use simple, declarative statements. This evokes a similar axiom Ernest Hemingway used to coach beginning writers; although I am not personally a fan of Hemingway (in a number of different respects), I tend to share the axiom because it is easy to remember and follow as a rule of thumb. The simpler the statements are (and the more they declare, simply, what is happening when they execute), the easier they tend to be to read, and the easier it becomes to relate, from the beginning of the specification through to the end, both what the test code is doing and why what it does is valuable.
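As a hedged illustration of the difference (both versions are hypothetical), compare a chained one-liner to the same steps written as simple, declarative statements:

# harder to relate: several operations collapsed into one statement
def test_unauthenticated_post_is_rejected_compact(unauthenticated_client):
    assert unauthenticated_client.do_POST(path="/endpoint", payload=None).status == 403


# easier to relate: each statement declares one thing that happens in test runtime
def test_unauthenticated_post_is_rejected_declarative(unauthenticated_client):
    response = unauthenticated_client.do_POST(path="/endpoint", payload=None)
    status = response.status
    assert status == 403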

Conclusion

The same way there are many different ways to write (or relate, or tell) a story, there are many different ways to read (or relate to) a story. As an insight, the former explains at least in part why the human experience has known so many talented storytellers who, through a series of events and the relation of them (i.e. what the story is and how it gets told), reveal not just what happened but also what the events and their retelling can reveal about the audience, the storyteller, and the realities any of them engage in. At the same time, the latter (that there are many ways to read a story) helps as an insight to inform the many contexts within which audiences whom we may never meet may find things like reassurance, confidence, and possibly even well-reasoned challenge in a narrative, depending on who the audience is and what value that particular audience might find within the narrative -- including how it is presented.

Successful storytellers tend to understand effectively how to play both sides of the ball here at the same time: with clear and well-trained understandings of what is likely to be relatable, storytellers tend to try to find a medium they can use to relate series of events (or maybe, absent of events, a discussion between two characters waiting for a third to arrive) in terms of those understandings.

And to be clear, this doesn't necessarily need to be as in-depth and detailed (yet illuminating) as how every part in a vehicle engine works or as esoteric as, for example, whether the person Vladimir and Estragon are waiting for in the revival of Waiting for Godot is or is not actually Death (I personally feel like it would be most excellent if it wasn't -- also, as I understand it, William Sadler is not listed as part of the cast -- but, either way, it seems like a brilliant production idea and likely also a great show). It can be as simple and down-to-earth as explaining why it would appear any particular football team lost the game on Sunday (or Monday, or even Thursday, or sometimes Saturday) -- whether in terms of a high-level post-game analysis or a play-by-play. One can potentially get a lot across here with just one sentence.

The phenomena we experience between making a series of events relatable and finding a way to relate to a series of events set out before us are, as some of the evidence we have shows, some of the oldest we know in humanity. As the cave art in Sulawesi (linked to in the introduction, above) shows us, there's a good chance we don't even need words to experience them.

Within our day-to-day lives, we find that narrative isn't just a nice-to-have; it's actually a tool we use almost everywhere. Aside from its value for entertainment, narrative helps us get things done when those things rely on relatability.

Narrative's value as a tool generally breaks down to its ability to answer two questions:

  • What has happened?
  • Why is what has happened valuable?

When we evaluate (i.e. try to read) test code, I find we are (arguably) trying to do the same thing we might do in the present day with the cave art from Sulawesi: we are trying to extract the series of events related within the artifact into a clear reading (or assembly of readings) that we can then relate to in terms of value -- whether that relating occurs on our own or whether we share how we relate with others. With test code we ultimately want to understand what the test code does that is valuable and why it is that whatever the test code does is valuable.

There are other subtle clues we can provide to readers that align with general best (or at least very good) practices for code readability (for any code). Method names for test support code that help clarify the action they perform (or the output they provide, or the question a boolean value might answer) are also helpful. Variable- and class names (and the like) help along similar lines. At the same time, though, while on one hand these help make it clear what code does functionally, I find they are often limited (on their own) in their ability to make it relatable what the code does that is of value and what the value is that the code delivers in the context of an automated test.

Within nearly every test specification, whether it tells a story or not, there are three narratives that help us explain answers to these questions:

  • How the solution/ system under test currently functions or behaves with respect to technical expectations.
  • How the way the solution functions currently satisfies stakeholder expectations.
  • How the ways in which any specification conducts its runtime serve to arrive at and execute one or more assertion statements with as high a degree of runtime integrity as possible.

The more easily the code used to define a test specification can make these series of events relatable to readers (perhaps even to the writer -- I've been there) of that code, the easier the code will likely be to work with (including to read), because the series of events defined within the code will be that much easier to relate to.

Like with any story worth telling, the question of how to do this lies between what is likely going to be most relatable for the audience and what the storyteller understands about how to give a series of events the best chance of being relatable -- whether now or at some unanticipated point in future. The same way it's up to a programmer to encode instructions, a programmer also has an opportunity to find ways to make the code open to reading for any audience (anticipated or unanticipated) that may attempt to relate to test code in the future, in terms of what the code is doing, how it does it, and why any of the above is valuable within the practice of automating software testing.