Making Acceptance Testing as Boring as Possible: Exploring How a Team Shifted Testing Left Without Need for a Testable Build
Photo by Hafidh Satyanto on Unsplash
When we discuss what we generally characterize as shift-left testing (as well as its understood benefits), most often we relate it to the following hypothetical clause:
If testing activities could somehow be shifted to a point as early in the development process as possible...
If one of the most important benefits of software testing involves data and feedback related to the current functional state of work product, then the sooner a team can take advantage of actionable feedback, the sooner the team can start fixing potential issues discovered through that feedback. And the sooner a team can start fixing issues, conventional wisdom proposes (as do the independent studies cited in this NIST planning report from 2002, if the reader is willing to overlook how dated the life cycle stages listed in Table 1-5 may seem), the cheaper any such issues will likely be to fix.
This generally seems workable until it becomes clear that by testing activities, conventional definitions of shift-left testing refer specifically to test execution as a primary source of feedback. The sooner a testable build or deployment can be engaged with or interacted with, the sooner the team can see, tangibly, what it has already built, and the sooner it can start adjusting and reworking that tangible result as desired. Even if a particular description hints at other testing activities (requirements gathering often gets mentioned), the main focus remains on test execution as a primary source of objective feedback for tangible work product.
Problems generally surface when earlier test execution becomes a necessary condition for shift-left testing: in order to take advantage of feedback as early as possible, test execution needs to be shifted as far to the left (that is to say, as early in the development process) as possible.
Recently I found a blog post on IBM's website that outlines the same general approach, where test execution (including continuous testing) is made a key source of early feedback. To a certain degree the post also describes requirements gathering and communication as important, but the majority of the in-depth descriptions deal specifically with test execution rather than anything else. Rather than explaining how the comprehensive strategy worked, many of the sections go into depth on the benefits of continuous automated testing and how unit and integration testing (as well as BDD) can play a role in delivering those benefits reactively.
The same blog post notes (at multiple points), though, that those benefits (and the associated feedback) required input from testers and other stakeholders. Communication, the post notes, as well as requirements gathering, was important in addition to test execution. So it seems reasonable to conclude that the early feedback the team benefitted from wasn't just the result of automated tests. Again, though, the post doesn't provide much depth beyond a brief mention (mostly within the introduction and the section at the end listing what it describes as best practices). Perhaps this was helpful as part of a content strategy promoting SEO, which seems plausible given that a link to a product description appears in the introduction.
And to be clear, I didn't link to the post specifically to pick on IBM or the author. In fact, I believe this is one of the most comprehensive descriptions of shift-left testing that I've read.
One of the most significant problems involved in shifting test execution to the left is the risk of operational churn relating to premature test execution. Churn here may include the need for rework when bugs or design issues surface. If time needs to be taken to assess testable or deployed work product, the team may need to navigate unexpected discovery or unstructured exploration in order to regain its bearings. And if testing results in disagreement somewhere between developers, testers, product leadership, and other stakeholders, the team may need to navigate reconciliation or realignment of mission. This is especially a problem with manual testing, but it can also affect automated testing, especially when it is developed in parallel with work product.
One of the main reasons we plan testing in the first place is to avoid these things proactively. None of them is easy to estimate reactively, and any of them can prove fatal to delivering on a sprint commitment.
Although not by name, the IBM post actually touches on the risk of churn briefly (especially within the introduction); again, though, it doesn't provide much detail. Most other conventional descriptions of shift-left testing (focused squarely on moving test execution left as a means for getting feedback sooner) overlook this completely.
Also, maybe it's worth qualifying early on that the point here is not to suggest that all early testing is likely premature. I have personally gotten a lot of value out of both automated regression testing and Test-Driven Development (TDD). I also describe briefly (below) how testing to establish a baseline was helpful on longer projects incorporating multiple user stories. When planned correctly, the right sorts of early test execution can be very helpful as a reactive evaluation of work product.
If a team needs humans to review work product it's developed, though, and if not everybody shares a similar understanding (if not as close as possible) of what is being implemented, how the design and implementation of work product is expected to be significant to stakeholders, and how the implementation can be expected to be evaluated in support of the above, the team can expect feedback from testing to run the risk of producing churn until the team is on the same page about all of the above. All the while, as new testable builds and deployments become available, the conventional expectation is that test engineers continue testing (full regressions if needed) in order to produce the desired feedback (or risk missing something) -- all as soon as possible, and usually at the same time they are responsible for updating the test plans (in support of changes to these conditions) that they are currently executing. After all, isn't test execution what test engineers are there for?
Experience shows time and again that, in addition to any waste, this is a great way to produce burnt out, resentful testing practitioners. And, among other problems, this can also produce churn.
There is a way to produce feedback sooner that is less risk-prone and often more efficient. For about eighteen months I worked with an agile delivery team that made shift-left testing (although we didn't call it that) part of an overall team goal of (again, not by name) limiting operational churn. What I called the approach that ultimately worked was making acceptance testing as boring as possible. By committing to this as a goal, the team experienced success both in improving feedback delivered at earlier points in the development process and in limiting the amount (not just the risk) of operational churn that accompanies both premature testing and the sort of testing absent of shared understanding described above.
In short, we succeeded by keeping things moving forward consistently in a number of different ways. Ultimately for the team, limiting churn and moving testing left worked hand-in-hand.
Specifically, there were two testing activities that had the most influence on success:
- Test Planning: Testers provided comprehensive test plans focused on concerns that work product is expected to satisfy for stakeholders (at any level, within the organization or outside of it). Eventually the team made review of test plans by developers a key component of its Definition of Done (or, at least, "Development Done").
- Test Automation: Software Engineers developed automated tests to qualify builds as testable. This limited the churn that resulted when builds were presented for testing in which either the feature set being evaluated or a closely related feature set was not accessible or fully functional due to a regression of some sort. To (briefly) make an apparent connection obvious, this was something the IBM post noted they were doing, as well.
Testing activities, however, were not sufficient on their own to shift testing left.
Success followed the team embracing both of these activities in unique ways and ultimately connecting them to how the team delivered value in a manner that also played a vital role in delivering both efficiently and with high quality. This post will explore how the team embraced these activities and what the general results were that the team observed after embracing them. Also, the conclusion will provide some analysis on what seems to have made these connections vital -- not just to the success shifting testing left, but also more generally to what helped propel delivery of high-quality software at a consistent pace.
In The Beginning, When Acceptance Testing Was More Exciting
When I joined the team, it was in the middle of retesting (and partially reworking) a solution that depended on integrations between the company's main offering and an external SaaS solution as a means of pay-gating access to storage space provisioned through a cloud hosting platform.
At a couple points, we ran into trouble when software developers on the team discovered mid-sprint that they needed to make some sort of adjustment to their implementation due to a new discovery that had not been evident in planning or the initial implementation. These adjustments often followed some sort of ambiguity within either of the following:
- The connection between the value framed within a user story and how the team intended to implement functionality in response was not initially clear. The proposed adjustments to the design and/or implementation would accommodate a revised understanding of this connection.
- Some unexpected nuance presented by an external system or element of the runtime that the solution integrated with. Long story short, we were often confronted by unanticipated unknowns discovered mid-sprint. The proposed adjustments would allow the team to accommodate these unknowns.
Once the developers and product owner adjusted, it was up to the test engineers to keep up with the team in order to help it meet its delivery commitment: test engineers would need to execute whatever testing was necessary to qualify the current testable build, hopefully by the end of the current sprint. Even if developers were able to successfully implement the adjustments within the same sprint, there was no way to guarantee (even to the degree guarantees in software development are valid) that the changes could clear testing.
After a sprint or two like this it seemed clear to me that, if the team were willing to look ahead, it should be possible to anticipate the issues prompting these adjustments (and ultimately adjust for them) proactively. By being more proactive, it also seemed like we should be able to limit the amount of churn and waste we were experiencing as a result of managing these discoveries reactively.
At a couple of sprint retrospectives (if I recall correctly, in response to suggestions that testing had become a bottleneck to meeting commitments) I remember suggesting that part of the team's problem was that acceptance testing was too exciting: the fact that we were discovering enough in testing to put story passage at risk seemed to result from our willingness to proceed without a clear understanding either of what we'd committed to or of what risks we might encounter to those commitments while attempting to deliver on them. In essence, I suggested, it was the amount of unexpected discovery and exploration that made work unpredictable; it was this unpredictability that any bottlenecks seemed to result from.
To make the work we did more predictable, I remember suggesting that the team should be willing to commit to making acceptance testing "as boring as possible."
Adjustments the Team Made to Shift Testing Left
Initially, I invested a lot of the resources I had access to into attempting to get as much clarity as I could (mainly within backlog refinement and sprint planning) on what exactly we planned on building and how. The more clearly I could understand not just the general design of the implementation but also how the finished solution was expected to deliver value to consumers (in essence, how design and implementation worked together to the benefit of stakeholders), the more clearly (I understood, and still do) I should be able to model which parts of the application should be tested and which stakeholder concerns make them worth testing.
On longer projects (usually a long-lived epic incorporating several discrete user stories), we initially focused on executing testing as comprehensively as possible on early stories to establish a baseline. As the team iterated on those stories, we compared the functionality exhibited by new change sets to (and performed regression testing against) the original baseline. Here (as suggested by conventional definitions of shift-left testing), early execution ended up being particularly helpful as a source of feedback that did not incur the risk of being executed prematurely.
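To give a rough picture of the baseline idea, here is a minimal sketch, not the team's actual code: it assumes a pytest-style check against a stored "golden" summary of the baseline behavior, and the base URL, endpoint, and file names are hypothetical.

```python
# Minimal sketch of a baseline ("golden") regression check. The endpoint,
# base URL, and baseline file below are illustrative assumptions, not the
# team's actual implementation.
import json
from pathlib import Path

import requests

BASE_URL = "http://localhost:8080"  # assumed location of the system under test
BASELINE = Path("baselines/provisioning_summary.json")


def capture_provisioning_summary():
    """Exercise the feature under test and return a comparable summary."""
    response = requests.get(f"{BASE_URL}/api/storage/provisioning", timeout=10)
    return {
        "status": response.status_code,
        "fields": sorted(response.json().keys()),
    }


def test_provisioning_matches_baseline():
    observed = capture_provisioning_summary()
    expected = json.loads(BASELINE.read_text())
    # Any drift from the established baseline is surfaced explicitly for the
    # team to disposition rather than silently absorbed into the change set.
    assert observed == expected
```

The point of a check like this isn't coverage; it's that later change sets get compared against something the team has already agreed represents known-good behavior.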
Regardless of the size of a given project, commitment to limiting unexpected discovery and to coordinating on what would be evaluated (and why) ended up being vital to making acceptance testing predictably and consistently boring.
As the team focused more on consistency and predictability, we found that both our functional baseline and any plans we set between backlog refinement and sprint planning generally avoided the need for major adjustment. We also found that acceptance testing seemed to execute more smoothly overall. The two activities that contributed the most to this were test planning and automated regression testing.
Test Planning
I developed test plans much the same way I typically do, by creating inspection checklists composed of inquiries that target stakeholder concerns.
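To make the shape of these checklists concrete, here is a minimal sketch of how such a plan might be structured. The concerns and inquiries below are invented for illustration (loosely themed on the pay-gated storage solution mentioned above) and are not the team's actual test plan content.

```python
# Illustrative structure only: each entry names a stakeholder concern and the
# inquiries used to evaluate work product against it. The specific entries
# are invented examples, not the team's real plan.
TEST_PLAN = [
    {
        "stakeholder_concern": (
            "An account that has not paid should never gain access to "
            "provisioned storage."
        ),
        "inquiries": [
            "What does the application present when an unpaid account "
            "requests storage it has not been provisioned?",
            "Does revoking payment status revoke access within the window "
            "stakeholders expect?",
        ],
    },
    {
        "stakeholder_concern": (
            "Support staff need enough context to diagnose failed "
            "provisioning attempts."
        ),
        "inquiries": [
            "What is recorded (and where) when provisioning fails mid-flight?",
        ],
    },
]
```

The format matters less than the framing: every inquiry is anchored to a concern somebody holds, which is what made the plans reviewable by people who weren't testers.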
At one point, at the suggestion of the team's project manager, I requested (during a retrospective) agreement from the team on a requirement that, in order for test engineers to execute testing, we would need approval of the test plan I'd composed from at least the software developer assigned to the user story (at one point we also required approval from the product owner). In essence, in order for us to commit to test execution, we would need to concur on (or at least discuss) what would be getting evaluated and (very importantly) why.
Although not unanimously at first (I believe we needed to test-drive the changes with a subset of interested parties before a second vote), the team agreed to this requirement.
Eventually the team started to engage in discussions around the test plans that seemed valuable.
Shortly after requiring test plan reviews, software developers on the team would report occasionally during standup that they needed to talk to me afterwards to clarify something I'd written in a test plan. Sometimes they would report that they had taken the test plan to the product owner for clarification and now needed to adjust what they were currently building. Following any of this, the developers occasionally extended their work estimates.
And, to be fair, sometimes I found out that I was mistaken in my understanding of either the design, the implementation, or the stakeholder concerns I referenced within the test plan.
Sometimes, too, we would disagree on the fundamentals of what should be tested and why. In the face of this, sometimes I would tell developers that I agreed not to execute a suite of tests and then execute it anyway (which sometimes led to discovering product issues), because the execution fit into my original time estimate. Sometimes I found things to test that we hadn't discussed. Sometimes I didn't execute testing called for in the plan once it became clear that the testing was unnecessary.
As I recall explaining more than once, a test plan is not necessarily a test guarantee. But as any sort of commitment, it seemed like the test plans prompted conversations about what, tangibly, we understood we were committing to.
Automated Regression Testing
Early on after I joined the team, we finished a couple of sprints in which we were not able to complete testing we had committed to because of functional issues in test builds: the issues either prevented us from accessing the functionality our test plans required us to interact with, or degraded or disabled functionality in a related feature set in a way that blocked access to the functionality we meant to test. After two or three rounds of returning stories to the developer for a new build, we were still blocked for lack of a testable build. Again, this happened for a couple of sprints.
Before the retrospective on one of these sprints, I recall approaching the tech lead on the team and requesting that some amount of unit and integration tests be implemented to confirm that the functionality currently blocking testing was working as expected (and, bigger picture, that any build submitted for testing had been validated by the software developers as ready for testing). It would be up to the developers to confirm that, if functionality were to be handed off to test engineers, the test engineers would have everything they needed in order to test.
As I recall, the tech lead agreed.
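In case it helps to picture what "qualifying a build as testable" might look like, here is a minimal smoke-level sketch under assumed conditions; the base URL, endpoints, and seeded account are hypothetical, and the checks the team actually relied on would have targeted whatever functionality was blocking testing at the time.

```python
# Hypothetical smoke checks a developer might run (or wire into CI) before
# handing a build to test engineers. The URL, endpoints, and credentials
# below are invented for illustration.
import pytest
import requests

BASE_URL = "http://localhost:8080"  # assumed location of the test deployment


@pytest.mark.smoke
def test_login_page_is_reachable():
    # If testers cannot even reach the entry point, the build is not testable.
    response = requests.get(f"{BASE_URL}/login", timeout=10)
    assert response.status_code == 200


@pytest.mark.smoke
def test_storage_dashboard_renders_for_seeded_account():
    # The feature set under evaluation (and anything it depends on) should be
    # accessible before acceptance testing starts.
    session = requests.Session()
    login = session.post(
        f"{BASE_URL}/login",
        data={"user": "qa-seed", "password": "example-only"},
        timeout=10,
    )
    assert login.status_code == 200
    response = session.get(f"{BASE_URL}/storage/dashboard", timeout=10)
    assert response.status_code == 200
```

The value of a gate like this is less about catching subtle bugs and more about refusing to hand testers a build they cannot meaningfully test.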
The Results the Team Observed with Testing Activities Shifted Left
Overall, the team observed three main benefits as a result of these adjustments:
- The team encountered the sorts of issues that generally produced churn (as the team adjusted mid-sprint to a new understanding of what was being tested, or even built, and why) less frequently than before.
- The team grew much more comfortable identifying and processing sources of risk for operational churn. That is, for things it did not immediately understand or was not immediately prepared to work with, the team was much more comfortable identifying and dispositioning those things. And because the team had committed to limiting the impact of these things, it was also prepared to discuss them in a retrospective if they ended up having an impact on team throughput (often because they had gone unmitigated or unidentified).
- The team was able to estimate work with greater accuracy. When it committed to work it had estimated, the team was more often able to complete it as originally committed to, and when it projected how long the work would take (including on the team's burndown chart), more often than not it was able to deliver in the time it had estimated.
In short, the team was now making informed decisions on how to execute as a result of gathering feedback proactively. The benefits of this approach made the work the team committed to more consistent from several different perspectives.
Consistent Test Execution
In short, what we observed in test execution was what we'd requested: acceptance testing generally executed very predictably and without much excitement. When executing acceptance tests, we found that the system under test either produced the expected output or it did not; although we did discover product issues, there wasn't often much mystery to them. In the rare case that we did discover something unexpected, it typically wasn't very complex. When work product did not produce the expected output, the variance was likely of low impact/severity or easy to disposition (either a quick fix before release or easy to fix afterward), because we'd already identified the overwhelming majority of complex or high-impact potential issues somewhere between backlog refinement, sprint planning, test planning, and test plan review.
Not only did we execute testing consistently; we were also able to predict somewhat accurately how long testing would take. At more than one daily standup I remember being asked whether a particular story was in danger of not being completed due to incomplete acceptance testing. I was able to report with confidence that if we received a testable build (or, additionally if needed, sign-off on a test plan from relevant team members), we should be able to complete testing per the approved test plan in a day, give or take. Because I understood the test plan (which had been approved), and because I was clear on the nature of work product (given a successful plan review), now I could assert with confidence that, barring any sort of major surprise (we did our best, but you never really know until you test), testing would likely take a day (or two days, or whatever was necessary).
In other cases, we might still need test plan approval: even then I could estimate somewhat accurately on top of the approval, given my understanding of the current test plan. Add half a day for review (or however long the developer expected review to take) to however long testing was expected to take based on our best understanding at the time.
More often than not, I recall that we were able to meet our estimates.
Consistent Team Planning
Eventually we found that planning testing ended up playing a role in how we approached backlog refinement and sprint planning. I would ask questions related to what I planned on including in test plans; the questions often prompted consideration of opportunities to cover unknowns before committing to work. Sometimes they also provided me with an opportunity to clarify my understanding early on. Sometimes we discussed within these pre-commitment meetings what would be tested and how (or even why). At the point we did, we were no longer discussing just what would be delivered but also how we would assess tangible work product, both to confirm we had delivered successfully and in anticipation of potential product issues we expected we might uncover.
In addition, I recall some of the questions leading to discovery of unmet dependencies. Identifying and dispositioning unmet dependencies (whether they be internal or external dependencies) became part of the team's Definition of Ready.
Either way, I found that developers (and even product owners) tended to engage with the questions I asked more directly than they did before we started requiring test plan reviews. Because we all understood that a concern raised in backlog refinement could end up having an impact once we started executing (as a team, attempting to deliver value within a sprint), I understand that they agreed (as I did) that an ounce of prevention was worth a pound of cure.
Consistent Work Quality
Over the eighteen months I worked with the team, it was assigned three different must-win projects of high strategic importance to the company. The projects were in different customer domains and made use of different solution architectures: one served as middleware between Epic and the company's main offering (providing updates over a TCP push connection in response to changes to hospital records), one was the pay-gated solution I mention above, and the third integrated with student information systems and new services under development at Apple at the time to push MDM profiles responsible for configuring shared iPads in classrooms. No two projects were the same. Despite this, though, we were able to pivot and maintain velocity at the same time we were generally recognized for the quality of our work.
Regarding support for shared iPad, it was reported at an end-of-year meeting after we'd shipped (the previous April) that company market share in the segment that shared iPad related to most directly had grown 33%. It was also noted (if I recall correctly) that Apple had designated the company as a preferred provider for MDM support of shared iPad.
Regarding integration with Epic, the feature set we implemented ended up becoming part of a pilot program that helped establish a new product vertical for the company.
Consistent Team Velocity
To reiterate what's already been outlined here: because the team adjusted (reactively) less mid-flight during sprints, we were much more likely to complete work within the same sprint we committed to it for originally.
At one point (for about four months) we completed as many as 40 story points (consistently) per one-week sprint. This was not sustainable (our normal velocity was 40 points give-or-take over a two week sprint), but for anybody doing the math, that was eight months of high-quality work completed in four months. During that time we were able to deliver fairly consistently and with a clear understanding of what we were capable of not just in terms of how we built, but also in terms of how we'd verify and validate what we'd built. And because our general approach effectively minimized the risk for operational churn proactively, we accomplished this (generally) with what seemed like a minimum of waste.
What's more, we found that we didn't need to fix product issues with the frequency that other teams seemed to need to; at one point this included another delivery team implementing new features in the same general functional area we were. Among other things, this lowered operational overhead for us to be able to continue new work at high velocity.
This was all over the course of several different high-priority projects, working with different technologies and different general solution architectures.
Conclusion
To any team or leadership interested in shifting testing left, this should hopefully serve as a helpful example of what shift-left testing can look like when it succeeds.
As I understand it, there were four things that ultimately contributed to the successes this team achieved:
- Test Planning. The distinct way I write test plans doesn't just assume a technical medium that needs to be analyzed or inspected; instead it assumes a product that needs to be evaluated as a means of satisfying stakeholder concerns. Rather than listing codepaths or working from a requirements matrix, I find a way to define the testing problem and assemble a checklist of inquiries into the value that should satisfy stakeholder concerns as a means of solving that problem. This helped the team divorce its understanding of what it meant to produce from a strictly technical understanding, to the point where the common language the team used around testing encompassed not just what would be tested (and how) but also why it was important to test.
- Communication. When the team committed to reviewing test plans, it also committed to reviewing each inspection checklist against both the technical implementation and the stakeholder concerns the plan's inquiries were defined in terms of. The plan served as a common medium (using what became a common language, for both technology and value at the same time) for the team to use to discuss how tangible outputs would be evaluated. And as the test plans (and reviews) became more integrated into how the team operated, questions during sprint planning and backlog refinement were (usually) recognizable as more relevant because those questions were understood as contributing to a test plan that would serve as that common medium for the team. Through communication, testing (and the scope of what we tested -- see the description of testing understanding, below) eventually became more than a discrete set of activities handled by a couple of individuals; it became a practice supported by the whole team.
- Commitment to Limiting Unanticipated Discovery/Exploration. The team's commitment to (and follow-through in) limiting unanticipated discovery became a vital part of succeeding at its goal of making acceptance testing as boring as possible. There was no magic formula (sometimes we needed to remove a story from an active sprint because we hadn't reexamined our assumptions closely enough before committing, or because we discovered an unsatisfied dependency), but generally speaking, the more consistently we identified unsatisfied external dependencies, created and completed spike tickets defining structured investigation of unknowns, and kept ourselves open to asking and responding to constructive questions in backlog refinement about what we were confident we understood (or perhaps didn't), the more success we experienced in keeping things boring. Automated regression testing played a critical role in making acceptance testing boring here, as well. If we missed one of these things, we generally showed the accountability to acknowledge the miss in a retrospective.
- Support from Leadership. There were at least one engineering manager and a project manager, as well as a couple of others I could name, who I understand helped keep the team (myself included) honest in how we approached working together and made sure we had any resources we needed on the team to be effective. It can be as simple as setting clear expectations and asking questions in support of those expectations. It can be as simple as floating a helpful suggestion. Or it could involve rotating a member off of the team who either didn't gel with the team dynamic or would rather be working on something else. From experience, I understand that I could have attempted to promote the things that worked here with leadership that was less engaged or supportive and experienced very different results. Leadership was as important a part of what made this work as any individual contributor on the team.
In a biweekly one-on-one meeting, the manager I reported to at the time remarked, "it looks like you got developers to stop throwing code over the wall." At the time I felt (and still do) that this characterization at least partially missed the point. Similarly, I don't believe that what worked for the team could necessarily be summed up as "requirements gathering" or "involving the whole team in communicating." And although the team satisfied the necessary condition of pushing testing activities as far to the left in the development process as possible, I'm personally not satisfied that any chronological placement of testing activities was sufficient to deliver the success that the team experienced for its efforts.
When I zoom out on what I understand in hindsight, the team's original goal was to limit the operational churn experienced within acceptance testing by proactively identifying and resolving sources of potential churn. In order to deliver on this goal, the team ultimately needed to adopt strategies for (and mature its processes around) producing inputs for testing, as well as tactics involving test automation, that limited the number of surprises and potential discoveries that made acceptance testing anything other than "as boring as possible." This on its own, I believe (in addition to promoting testing to first-class citizen status on the team), helped shift testing left in practice.
Where I believe the team experienced success beyond this, though, takes the benefits of strategy and mature process a step further: what I believe worked was that the team successfully found a way to introspect proactively as a means of mitigating other sources of potential churn. Rather than limiting testing to change sets that might serve as a tangible representation of a release candidate, the team also committed to testing its understanding of what it was committing to building. At the same time, it made how we understood we'd evaluate completed work product (in essence socializing the definition of acceptance testing within the team as a legitimate engineering problem) part of how we planned the delivery of value. All together, this was what produced the feedback that conventional advocates of shift-left testing generally promote as the result of early test execution; we produced it without incurring the risk of the sorts of operational churn (and accompanying frustration) that generally result from testing something prematurely or without sufficient context.
In essence, the team found a way to measure twice and cut once in software development that resulted in multiple high-level pay-offs.
Again, this seems to contradict many of the blog posts I've read that generally harmonize with a conventional definition of shift-left testing: one that relies primarily on shifting testing activities (especially test execution) to the left and perhaps mentions other activities as a footnote (and likely doesn't engage with those other activities very directly). For the team I worked with, the details buried in the footnotes worked much like the cone on a rocket nozzle directing thrust: they ensured the burn was controlled in such a way that it propelled the rocket in the expected direction.
What it does seem to harmonize with (for me), though, is the sort of work I remember delivering with a high degree of quality (at speed) when I worked as QC in a print shop. Software is (of course) a product. It takes many sets of eyes to assure quality. And it's leadership (at whatever level) that makes the difference between a successful mission to reach a particular destination and a mission that's likely a lot less predictable and (as a result) much more exciting.