On the importance of manually maintained isolation for unit tests written in impure languages

Background

In my experience, testing is one of the most difficult aspects of being a developer. While the implementation of a complex algorithm might be intellectually demanding, it deals with a known quantity (the computer.)

Testing, on the other hand, is the delicate art of coaxing correct, bug-free code out of the haphazard and often contrarian organic entities we call programmers. It gets even more challenging when testing involves a team of these people.

Why am I so caught up on testing?

I’m not a computer scientist or engineer of any sort — I’m a hack without even a tertiary qualification to my name. At no point in my career has anybody ever explained to me the principles of unit testing — so I’ve had to learn everything as I go.

I suppose I could be upset about this. Instead, I’ve found it an illuminating experience. Rather than simply being told to test, I’ve learnt first-hand why testing is important.

In January 2014, I joined News Corp Australia to work on a little node app.

At the time, it was not much more than a proof-of-concept demonstration of the utility of stateless embedding of API-driven web page fragments, assembled “on the edge” in Akamai using ESI — the pet project of my immediate colleague.

Over the next two years, it evolved into a mission critical service, rendering and delivering all the online article pages for the organisation’s various news brands, first party advertising, popular lists, all kinds of static assets, delivering content to distribution services such as Apple News and Google AMP, as well as performing realtime analytics and content personalisation. If it went down, seven major Australian news outlets (and many more local publications) and their editorial staff would be on the blower filing incident reports.

It was therefore imperative that my small team (which never grew beyond four) get things right. And we did — but we got a lot of things wrong too.

Every one of the times we deployed bad code and broke the service for millions upon millions of readers could have been avoided had we covered that case with an effective test.

From the very beginning we recognised the importance of tests. We wrote them. But we had poor test discipline and we didn’t really know how to write a good test. Countless times we revisited code only to discover that tests were confusing, or weren’t testing what we thought they were, or didn’t actually fail, even when the code they were supposed to be testing was deleted.

We eventually found ourselves with an effective strategy. Our test system as it stood when I left the company was extremely comprehensive, pretty damn fast, and battle proven. It let us deploy with confidence and release on a whim. It allowed us to move enormously faster than more traditional areas of the business which had to seek the approval of a manual tester for even trivial releases. Our feature delivery time was an order of magnitude better than the next fastest team.

Hence, this screed. It’s a rough compilation of all the things I wish I knew when I started — and a plea to everybody else using an impure language who isn’t testing properly. Most of my examples come from JavaScript, but I think they should be largely applicable to any impure language.

Much of this might seem obvious. But sometimes the intellectually obvious is not the practically understood, and so it’s worth going through the basics.

What is a unit test?

It might sound patronising to ask a question like this. However, the vast majority of the JS developers I know either don’t test properly, or don’t write proper unit tests. I attribute this to the fact that they don’t actually know what a unit test is.

I’ve developed a pretty strict philosophy for what makes a good unit test over the years.

What is a unit?

It’s pretty clear from the name that a unit test is a test for a unit. However, it’s not at all clear from the name what a unit actually is — yet it’s very important to writing effective unit tests that this is well understood.

Wikipedia has a very straightforward definition that I like:

“Intuitively, one can view a unit as the smallest testable part of an application.”

Unit Testing - Wikipedia

Generally speaking, this is a single concern, exposed in a single function or subroutine, with a very minimal set of discrete behaviours.

I’ve found that a good rule of thumb is that a code unit should have a maximum of two or three behaviours. Anything more complex is more than a unit.
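As a concrete (if contrived) sketch — a hypothetical formatPrice unit with exactly two behaviours:

// A unit with exactly two behaviours: round the amount to two decimal
// places, and prefix the currency symbol.
function formatPrice(amount, symbol) {
    return symbol + amount.toFixed(2);
}

// formatPrice(9.999, "$") === "$10.00"

The moment you’re tempted to add locale-aware separators or negative-number handling to the same function, you’re looking at more than one unit.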

The problems unit tests solve

Unit tests are very far from a formal proof that code works. However, well written and comprehensive unit tests provide an extremely good practical indication that code does what it says on the tin.

Without unit tests, making changes to something people rely on is a frightening proposition. Even when you’ve developed a keen intuition for the behaviour of software, any non-trivial code will have millions of hidden and unknown behaviours. These sinister quirks are lurking in every piece of software — waiting for the right input conditions to emerge.

It’s a common refrain that it’s impossible to unit test every behaviour, every possible state, every operating condition. That’s very true, but I believe it misses the point of testing.

Good unit tests aren’t trying to test every conceivable operating condition of code. Aside from validating the explicitly intended behaviours, they’re forcing you to write your code in a way which significantly reduces the number of discrete behaviours for a unit — and therefore the likelihood of unexpected operation.

The importance of repeatability

Another important aspect of unit tests is that they’re completely repeatable.

Imagine trying to test a scientific hypothesis without being able to engineer a repeatable test! Even a hypothesis as straightforward as “Paper catches fire when exposed to heat” would be useless if every time the test was run, the temperature of the heat source was random and uncontrolled, the dampness and thickness of the paper varied wildly, or the oxygen or combustible material spontaneously disappeared from the test chamber.

A scientist observing the results of such a scenario would rightfully criticise the testing setup — but if you pinned them down and forced them to draw a conclusion from the test, they’d only be able to conclude that exposure to heat was entirely uncorrelated with the combustion of paper.

It’s a ludicrous hypothetical, but it serves to demonstrate that if you can’t rely on the test completely controlling for external conditions, and therefore reliably delivering the same result under the same input conditions every time — you’ve got an utterly worthless test.

More times than I’d care to admit, I’ve discovered tests that I or one of my colleagues had written which exhibited different behaviour if run by itself, or in a different order to the one defined in the test file.

Another classic example is the “works on my machine!” scenario. In this case, the developer writing the test accidentally relies on some aspect of their local environment to deliver the desired test outcome, which means the test fails or enters a different mode of operation (but still passes) when executed elsewhere, or on a CI server. It’s actually the latter scenario which is the most pernicious, dangerous state — there’s nothing more precarious to the ongoing stability of a software product than a test suite which inspires false confidence.

Every time a unit test is run, regardless of the operating conditions, it should give exactly the same result for a given code unit.

The only way to achieve this is through complete and total isolation.

Difference from integration and other kinds of tests

It’s a very good idea to establish very strict operational definitions for all your test tiers, and maintain that separation religiously.

Some might be thinking — “How can I test whether my database connection works if I have to isolate my tests from their environment?”

While a developer particularly au fait with modern test methodology might roll their eyes at such a question, it speaks to a very broad misunderstanding of where the demarcation between unit and integration tests lies — a misunderstanding I constantly encounter.

Testing whether the database connection works is undoubtedly a great test — but it’s not a unit test, it’s an integration test of some form. You should have both an integration test to verify the database connection (in this instance) — as well as a set of completely isolated unit tests to validate the behaviour of the code managing the connection.

If a unit test corresponds to a single code unit (as described above) — an integration test deals with anything larger than that.

If your unit test does any kind of I/O, or you can’t completely control the operating environment for the code unit under examination — not only is it likely to be a bad test, it’s also not a unit test.

Quoting Michael Feathers:

A test is not a unit test if:

  • It talks to the database
  • It communicates across the network
  • It touches the file system
  • It can't run at the same time as any of your other unit tests
  • You have to do special things to your environment (such as editing config files) to run it.

Tests that do these things aren't bad. Often they are worth writing, and they can be written in a unit test harness. However, it is important to be able to separate them from true unit tests so that we can keep a set of tests that we can run fast whenever we make our changes.

There is considerably more overlap between integration and functional/external-integration/smoke/acceptance tests than between unit and integration tests.

Why does this distinction matter?

I found that as our workflow and testing methodology evolved, the purpose of integration tests switched from our primary test tier, to a sort of parallel validation of our unit tests. The unit tests were assessing the integrity of the code, and the various levels of integration tests were largely present to ensure that the assumptions made at the unit level held true at progressively broader levels of integration.

Integration tests were rarely useful in isolation because of the trouble involved in maintaining test repeatability. The more components involved in a single test, the more that needs to be controlled. Some things are impossible to control inside the scope of a test — whether a database is available, for instance.

Furthermore, because of the difficulty of maintaining isolation — and therefore repeatability and accuracy — at greater levels of integration (and because of integration tests’ tendency to stomp all over your application, changing operating state left, right, and centre), they should be completely separated from unit tests. Unit tests should never be executed in the same memory space as, or concurrently with, integration tests. Ever.

The challenge posed by impure languages

For the purpose of this article, a ‘pure’ language is any language which guarantees referential transparency for functions or functional units. That is — that every time a function is executed with a defined set of parameters, regardless of the operating environment, it will always return the same result. (Sound familiar?) An impure language is therefore the opposite — it does not make any such guarantees.
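To make that concrete, here’s a minimal sketch in JavaScript (where nothing enforces purity):

// Pure: given the same inputs, always returns the same result —
// referentially transparent
function add(a, b) {
    return a + b;
}

// Impure: the result depends on (and mutates) state outside the
// function, so two identical calls can return different values
let counter = 0;
function addAndCount(a, b) {
    counter += 1;
    return a + b + counter;
}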

While referential transparency itself is unequivocally a binary attribute when it comes to individual functions, I’d argue that in practice language purity is much more of a spectrum. Haskell is purer than Erlang is purer than Python. Some languages allow discrete code units to flag themselves as not referentially transparent, despite enforcing referential transparency or symbol integrity for the remainder of the units.

It’s very easy to play fast and loose with an impure language — and the more impure the language, the faster and looser you can play. Many of the conveniences afforded by impure languages come directly at the expense of purity. Node, for instance, has pretty anaemic out-of-the-box systems for in-process messaging (let alone its abysmal IPC), so developers are encouraged by the platform to store loads of state in globally available contexts.

The more that implicit global state affects the operation of code units, the less it’s possible to reason about their behaviour (and therefore the risk of encountering unexpected and erroneous behaviour substantially increases.)

Once you recognise that referential transparency — avoiding shared state and side effects — is a desirable architectural goal, it becomes the programmer’s responsibility to maintain these properties where the compiler or virtual machine will not. It is theoretically possible to write pure code in a language like JavaScript, but it is not at all easy, and complete purity is essentially impossible to achieve in practice.

Being human, the programmer is likely to make frequent mistakes, breaking isolation, and therefore referential transparency. So tooling, and in particular a good testing discipline, can fill some of the gap — bringing the in-practice stability of systems built in impure languages closer to the pure ideal.

Approach

Especially in the web development and JavaScript communities, conversations regularly fixate on frameworks. Test efficacy is often conflated with choice of test runner and assertion library. I feel this derails the real conversation that should be happening.

While for JS projects I’m more than happy to recommend mocha and either Node’s inbuilt assert library (simple is good) or chai (chai pulls some funky hacks, but it’s friendly) — I think obsessing over framework choice is a distraction.

The key attributes of good test discipline are largely unrelated to the technology.

Focus on speed

The test suite should be a tool to assist development, not to slow everybody down. Unit tests should run in the blink of an eye, even if you’ve got hundreds (or thousands.) A single unit test shouldn’t take more than 100ms to complete (and ideally much less.)

Having the test suite watch for changes and re-run the tests when one is detected significantly speeds refactoring and TDD efforts. The only way this is possible is if the tests are fast.

TDD (Write the bloody tests first)

TDD has become a buzzword of the first order, and when practices become buzzwords they often lose their meaning (see: Agile.) Additionally, there are a number of flavours of TDD with their own specific methodologies (see: London School, Detroit School.)

Therefore it’s worth describing what I’m talking about when I say TDD:

  • Before any code is written, all of the desired behaviours of a given code unit (each and every distinct thing it should do) are itemised and understood.
  • The public interface for the code unit is documented.
  • A series of tests — one for every behaviour — is written to test the public interface of the code unit. Private interfaces for the purposes of testing are never created.
  • Only then is the code unit written. It is not considered complete until it passes all of the previously written tests.

There’s some nuance to what makes something TDD versus BDD, but for all the hand-wringing I don’t think it changes much about how tests are written.

The most important thing is to write the tests before you write the code. This matters because it forces you to ignore how the code is implemented, and to focus purely on the behaviour of the code unit.
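As a sketch of what this might look like in practice — assuming mocha, Node’s assert, and a hypothetical slugify unit whose implementation file doesn’t exist yet:

const assert = require("assert");

// Each itemised behaviour gets exactly one test. Note the tests know
// nothing about how slugify will eventually be implemented.
describe("slugify", function() {
    it("should lowercase all characters", function() {
        const slugify = require("./slugify_implementation")();
        assert.equal(slugify("Hello"), "hello",
            "slugify should lowercase all characters");
    });

    it("should replace runs of whitespace with a single hyphen", function() {
        const slugify = require("./slugify_implementation")();
        assert.equal(slugify("a   b"), "a-b",
            "slugify should replace whitespace runs with a single hyphen");
    });
});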

If the tests are not highly coupled to the code, you’ll immediately see the following benefits:

  • Test-assisted refactoring later becomes easy. Conversely, highly coupled tests are really fragile, and likely to break if somebody changes even a slight detail inside the code unit.
  • Tests become clearer, and a readable specification for the behaviour of the code unit. On the flip side, highly coupled tests are verbose and difficult to understand.
  • Test failures become easier to action, because the failure is directly tied to a behaviour of the code unit interface, rather than a trivial implementation detail.
  • You’re forced to completely understand the behaviour of code before writing it, meaning you’ll get documentation (inline or external) for free.
  • You’ll have a test-guaranteed interface specification to your code unit, meaning that anybody writing code that consumes it can be a lot more confident in how it works — and this means it’s far easier for them to mock your dependency in their code.

Every time I’ve written tests after the code unit itself, I’ve ended up with bad, highly coupled, verbose tests (even if I’m deliberately trying to keep the tests decoupled.) Just knowing how the implementation works seems to restrict how I think about testing.

Hiding the implementation from myself by doing it later is a neat mental hack which completely prevents this from happening. After all, I can’t possibly lock my tests to the implementation if it doesn’t exist yet!

I’ve spent countless hours rewriting highly coupled tests because I needed to make a trivial change to the code under test. It’s a criminal waste of time, and you’re doing your future self a disservice if you don’t adopt TDD.

Dependency Injection (Library-free!)

It’s not possible to maintain isolation of a code unit if its dependencies are not isolated from the code unit itself.

Dependencies can store or trigger all kinds of unexpected operational state, and every dependency should be mocked. This should not be done via monkey patching — libraries which promise to replace, patch, or intercept calls to methods on dependencies and return test data are a bad idea.

Why? Because stuff breaks.

For the first year or so of our application’s development, the test suite made heavy use of sinon and nock. It seemed sensible initially — nock, for instance, allowed us to ignore the implementation of our internal HTTP agent for making upstream API requests, and to intercept and mock those requests based on the URL or other parameters.

In practice, we were creating fragile tests. Not only does allowing real HTTP calls to be made (even if intercepted) indicate that too many concerns are included in a single code unit — we found that when the implementation changed, and the parameters of upstream calls were no longer what the test was looking for, the requests would fly right through to the origin server, and nock would either not notice, or throw a really unhelpful error when it hadn’t seen the expected requests.

Sinon has the same problem — by encouraging real dependencies to be used, it fails to catch scenarios where the code unit under test begins to use dependency methods which haven’t been mocked. Ideally, it’s the programmer’s responsibility to keep the test up to date, but in practice the developer might not notice or remember to make the change. They’re human — the test strategy should anticipate that they’ll make mistakes.

Furthermore, sinon and other libraries introduced additional complexity and behaviour to tests in practice — tests failed unexpectedly, or passed silently (sinon in particular has a number of methods which, when used improperly, silently swallow errors.)

This is bad design. Any kind of library designed for testing should not allow mistakes to go unnoticed.

Failure modes of test systems should prevent the test suite passing. There are no exceptions to this.

It became clear that we needed to control our dependencies more tightly. A colleague suggested dependency injection, but we were initially highly resistant to the idea. Systems like RequireJS were extremely heavy-handed, and usually broke the CommonJS interface model — which we found encouraged highly modular, composable code, and was easy to use. In particular, I didn’t want to rewrite our whole application to require a complex graph of runtime dependencies.

Eventually, this same colleague suggested we try a slightly quirky method: a factory based “implementation” with a second “interface” file which required dependencies, injected them, and exported the constructed dependency.


Example

tweet-after-delay_implementation.js
module.exports =  
    function tweetAfterDelayFactory(setTimeout, sendTweet) {
        return function tweetAfterDelay(message, delay) {
            setTimeout(sendTweet.bind(null, message), delay);
        };
    };
tweet-after-delay.js
const sendTweet = require("./post-to-twitter");
const implementation = require("./tweet-after-delay_implementation");

module.exports = implementation(setTimeout, sendTweet);

In this trivial example, we’ve got a code unit which sends a tweet after a customisable delay. Using the code would be as simple as: tweetAfterDelay("Oh, hello there!", 60000) — which would send the text “Oh, hello there!” to Twitter one minute after execution.

Note that every single external behaviour is controlled — most notably, setTimeout. Most people wouldn’t consider setTimeout a dependency as it’s part of the standard library in JavaScript and available globally. But if it introduces behaviour external to your code unit — it’s a dependency. It should be injected and mocked. (This also affords the tremendous convenience of not having to wait a minute for the test to pass, as your mock can simply test the delay/interval, and execute the callback immediately.)

In the tests, require the implementation — never the CommonJS interface, nor any of the dependencies. This forces you to mock properly, enforcing the isolation of the code unit.
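Here’s a sketch of what an isolated test for the unit above might look like (assuming mocha and Node’s assert):

const assert = require("assert");
const factory = require("./tweet-after-delay_implementation");

describe("tweetAfterDelay", function() {
    it("should send the message after the requested delay", function() {
        let recordedDelay = null;
        let sentMessage = null;

        // Mock setTimeout: record the delay, then run the callback
        // immediately — no waiting a minute for the test to pass
        const fakeSetTimeout = function(callback, delay) {
            recordedDelay = delay;
            callback();
        };

        // Mock sendTweet: capture the message instead of hitting Twitter
        const fakeSendTweet = function(message) {
            sentMessage = message;
        };

        const tweetAfterDelay = factory(fakeSetTimeout, fakeSendTweet);
        tweetAfterDelay("Oh, hello there!", 60000);

        assert.equal(recordedDelay, 60000,
            "tweetAfterDelay should pass the requested delay to setTimeout");
        assert.equal(sentMessage, "Oh, hello there!",
            "tweetAfterDelay should send the supplied message");
    });
});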

Initially people turned their noses up at the double-file thing. Nobody likes boilerplate, after all. But this method has several key advantages:

  • Failure modes break the tests, instead of silently swallowing errors.
  • Isolation is enforced and easy to test/lint for: anything not defined in the body of the function or exposed as a factory argument is breaking isolation.
  • No additional libraries are required.
  • An existing CommonJS-style app can easily transition to this style, module by module — without breaking anything or requiring extensive refactoring.
  • The two-file thing means it’s possible to test that all dependencies are injected and properly controlled for — the presence of require in the implementation file can be made a lint error. Not using a separate interface file immediately breaks most of these guarantees.

Writing good assertions

Understanding what’s gone wrong when your tests fail is important. There’s nothing worse than sifting through an inscrutable test to work out which one of seventy individually mocked callbacks is the one throwing the unhelpful Error: expected undefined to equal true.

Assertions should be written in such a way that their failure exactly explains the problem.

Eric Elliott explains the approach relatively succinctly — suggesting tests should be a “good bug report”.

What do I mean by “a good bug report?”

I mean that whatever test runner and assertion library you use, a failing unit test should tell you at a glance:

  • Which component is under test?
  • What is the expected behaviour?
  • What was the actual result?
  • What is the expected result? How is the behaviour reproduced?

Eric Elliott: JavaScript Testing: Unit vs Functional vs Integration Tests

In my experience, the title of the test should satisfy the first criterion. Explicit is better than concise.

Test runners (such as mocha) which use method names like it encourage you to describe the behaviour in the test name. Do this — and repeat the desired behaviour in your assertion. Again, it’s better to be explicit. Often the presentation of test failures is different in CI tools to when you run the tests locally, so it’s good to keep behaviour context in as many places as possible.

The expected behaviour, actual result, and expected result are the domain of the assertion. A good assertion library can make the actual result and expected result clear, but you should always label assertions clearly.

In order to properly describe the “expected behaviour”, the assertion message should be a positive linguistic assertion (i.e. “The component should execute the handler three times…”) — not a negative statement of failure (i.e. “The component did not execute the handler…”.)

This seems simple, but keeping it consistent means failing tests are readable — and the suite as a whole reads more readily as a specification.
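A minimal sketch of the difference (the isPublished value standing in for a real result from the unit under test):

const assert = require("assert");
const isPublished = false; // stand-in for a result from the unit under test

// Unlabelled, the failure would read only: "expected false to equal true".
// Labelled with a positive behavioural statement, the failure is a bug report:
assert.equal(isPublished, true,
    "The article should be marked as published after saving");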

I’ve written my fair share of tests with hundreds of assertions per test — this is a red flag that the code or the test should be broken up. Just as a code unit shouldn’t include any more than two or three behaviours, a test shouldn’t include any more than two or three assertions (Though I think you can make exceptions for property based testing or multiple assertions examining the characteristics of data resolved by a single behaviour.)

Workflow and automation

Your unit test suite should be easy to run, both from CI systems and from local machines. It should be executable in a single command after dependency installation (which should itself be minimal.)

You should use a pull-request model, and your CI systems should require the entire unit test suite to pass before allowing integration into master/trunk. Never skip this. The master branch should always have production ready, deployable code.

When something is deployed to production and found to have problems, the following should be performed, in order:

  • The relevant merge commit should be immediately reverted on master branch (using git revert or similar with a good message for traceability.)
  • Create a new branch for reintegrating the feature, from master.
  • Revert the reversion commit on the new branch (this sounds like a mouthful, but it ensures the SCM history clearly reflects what has actually happened.)
  • Write tests to validate the problem.
  • Write code to satisfy these tests, fixing the problem.
  • …then, and only then, file a new PR from this branch against master.

When testing a branch, the CI system should do the following:

  • Run the entire unit test suite against the new branch.
  • Run each new test in isolation. (If tests fail in isolation but work together, there’s a cross-dependency, and you don’t have pure tests.)

Since reading about the “Not Rocket Science” rule late last year, I’ve tried to also do the following in my CI system:

  • Run the newly created tests against master branch and verify that they fail.
  • Run each new test in isolation against master branch and verify individual failure.
  • Integrate the new code into the code from master on a new branch, and verify the integrated whole before merging.

This automatically provides a basic level of validation that the tests are doing what they say they are. If you don’t do this, you can’t be sure that the new tests aren’t just meaningless checkmarks designed to make you feel better.

In the absence of an automated system, you should manage this yourself — remove the changed code and verify the tests fail.

How I’ve managed this in the past

At News Corp, this workflow was run from a makefile wrapping mocha, outputting tests in TAP format, executed and managed by Bamboo. (I quite like make, if used sparingly. It’s very simple.)

Much as I love to hate Atlassian stuff — Bamboo provided a very clean way to concurrently run our various test tiers and integration checks. Each tier could be described to Bamboo as a “job”, and run simultaneously on one or more “agents” responsible for executing the job. The way Bamboo exposes the execution of these tasks is actually very clear.

Screenshot of jobs in Bamboo (example taken from the Atlassian Bamboo blog)

We used plan “stages” to split our builds into source retrieval and setup (git checkout, dependency installation); validation (unit tests, linting, style checking, integration tests, view rendering tests, and our ‘regression’ suite — essentially a big external acceptance/functional testing tier); and asset compilation (so that the built feature could be immediately and automatically deployed to staging and user acceptance testing environments, should automated validation pass.)

Unfortunately given the proprietary nature of the system, I’m unable to share it. But I’m working on a similar solution for one of my open source projects which desperately needs some testing attention — and I’ll publish that when I’m finished.

Refactoring

An effective test suite will support refactoring. Run the tests constantly as you work. If you identify that you need to add a new code unit, stop immediately and write the test first — defining and documenting the interface and behaviours as described earlier.

When you’re following TDD and have a suite of comprehensive, loosely coupled tests, writing or refactoring the implementation for a given set of behaviours is like colouring in — you’ve got all the lines there for you and you’re just filling in the blanks. It’s not mindless — there’s a lot of creativity in how you decide to satisfy the test suite. But you’ll know pretty quickly when you’ve strayed outside the boundaries and done something wrong, and you’ll have ample opportunity to correct it upfront.

The hard part is when you’ve got a whole pile of existing code which doesn’t have pure tests (or tests at all!) I find the best way to tackle this is to reverse engineer the desired behaviours from the code, and formalise them. Then write tests for each one of these behaviours in a strict, isolated fashion. Once the interfaces for the code are completely described by the test suite, begin work on the refactored version.

Avoid APIs which encourage poor isolation

Many test suites have hooks which provide an opportunity to establish mocks and load fixtures for tests. These hooks might operate before or after each test, or before or after the whole suite. They exist outside the scope of individual tests, and therefore introduce sharing of state between tests.

For example, the mocha test framework provides the before, after, beforeEach and afterEach hooks.

Unless your test framework explicitly isolates test state, you should not use these in unit tests (despite their utility in integration tests.)

Every variable retrieved from outside the scope of a test and pulled in to enable its operation is shared state — an easy way to break isolation and end up with tests that won’t behave the same in different contexts.

As nasty and boilerplate-ridden as it might seem, create all your mocks and load all your fixtures using completely unshared or scope-hidden variables within the context of a single test. Do not use APIs which set up state for tests in a scope accessible from multiple tests.
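A sketch of the difference, using mocha — shared code is fine; it’s shared state that breaks isolation:

const assert = require("assert");

// The unit under test: code can safely be shared between tests
const increment = (state) => ({ count: state.count + 1 });

// Avoid: a hook placing the fixture in a scope every test can see and mutate
//
//     let fixture;
//     beforeEach(function() { fixture = { count: 0 }; });

// Prefer: each test builds its own fixture from scratch
describe("increment", function() {
    it("should raise the count by exactly one", function() {
        const fixture = { count: 0 }; // owned by this test alone
        assert.equal(increment(fixture).count, 1,
            "increment should raise the count by exactly one");
    });

    it("should not mutate the input state", function() {
        const fixture = { count: 0 }; // rebuilt, never shared
        increment(fixture);
        assert.equal(fixture.count, 0,
            "increment should leave the input state untouched");
    });
});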

If you find yourself writing tests containing mocks that are six-hundred lines long, this is not an indication that your test strategy is incorrect — it’s an indication that your code unit is far too large and you need to split it up.

Property based testing

Property based testing is a relatively new (~5-10 years) formal strategy derived from something we’ve all been doing for some time — taking a range of data satisfying a set of defined conditions and firing it successively into our code units until we’re sure the code unit handles the entire class of data — or it fails.

Primitive approximations of this technique might’ve been, for example, to loop through a range of integers and try them all.

Since the debut of QuickCheck for Haskell though, we’ve had a more defined idea of how this process should look.

QuickCheck and its ilk take a statistically random sampling of data matching specific criteria (for example, unsigned integers below 25,000) and fire off a litany of concurrent or successive tests to validate a property of the unit under test (such as commutativity — that the output of an operation is the same regardless of the order of the input parameters.)

This is an excellent technique to integrate into your testing strategy. Wherever you find yourself plucking a number or figure out of thin air to test with — you should consider running a property based test, with at least 100 discrete iterations.

There are some caveats.

Because property based testing tools use a statistically random sample, improper setup or immature tools can result in unrepeatable tests. Some property testing tools can record the random seed values in use when failures were encountered, and reuse those values in future runs. This is better, but it’s still not perfect. Be aware that you might end up in a “works on my machine” scenario.

Additionally, the property based testing tools for impure languages are far less mature than QuickCheck. (This is especially frustrating because I think projects written in impure languages need these sorts of systems more.)

I have not yet found one I like for JavaScript. The tools I’ve tried often generate confusing assertion messages — if you use one, I recommend catching the assertion errors emitted by these tools and adding some context to the message before re-throwing. When reviewing this article, my esteemed App.Net friend Jeremy W. Sherman brought to my attention the quite mature framework Hypothesis. If you’re using Python, it certainly looks like a good choice (unfortunately, JavaScripters are out of luck.)

The tools will inevitably mature, and once they cross that threshold we would be foolish not to use them.

In the meantime, it is possible to run simple property based tests without a property testing tool:


Example

Code unit
// Add two numbers
function add(a, b) { return a + b; }
Test
const assert = require("assert");
const seed = 2359871.234985;

// A simple deterministic pseudo-random generator: the same seed always
// produces the same value, keeping the test completely repeatable
function randomFromSeed(seed) {
    const x = Math.sin(seed) * 10000;
    return x - Math.floor(x);
}

describe("adding two numbers", function() {
    it("has a verifiably commutative result", function() {
        // Assuming we have a reference to add already, for brevity
        let iteration = 100;
        while (iteration--) {
            // Derive two different values per iteration — if a and b were
            // identical, commutativity would hold trivially
            let a = randomFromSeed(seed + iteration);
            let b = randomFromSeed(seed - iteration);
            assert.equal(add(a, b), add(b, a),
                "add(a, b) should be equal to add(b, a) (commutative)");
        }
    });
});

Quantifying progress

Quantifying the scope of testing is a very difficult problem. Many of the common metrics thrown around by bragging developers are quantitative rather than qualitative statements, free of context, and therefore largely useless (ratio of test SLOC to code SLOC, number of tests, code branches covered.)

We could all throw up our hands and admit defeat, but there are very valid reasons why we would want to understand how tested our code is.

We might want to know which areas need the most work — or where the greatest deployment risk might be. We might want to assess our testing strategy or the progress of our team.

The key consideration when it comes to test metrics is that, because all of them are meaningless without the context of the project from which they were generated, they can only usefully be used to compare a single project against itself over time — not to compare one project with another.

Why coverage sucks

Coverage is simultaneously the most useful and the most useless (even dangerous!) metric in common use.

In an ideal world, where every test deals directly with a very small code unit and has complete isolation, branch by branch coverage is actually a very good way of assessing how many behaviours each code unit has, and whether the tests are touching it.

Even with a horrible and undisciplined test suite, the data from a traditional test coverage analysis tool can be useful — code branches which have never been executed clearly represent behaviours which are untested. This is an easy thing to measure and is very easy to action.

The trouble lies with the inverse metric: where test coverage tools assert that a branch is tested. These coverage tools do not account for test isolation (or lack thereof) and therefore the branches that they record as having been “tested” could very well be entirely erroneous.

These tools typically work by reading in application code and instrumenting it — taking each code branch and wrapping it in an expression which records that the branch wrapped by that expression was executed. Rarely is any additional processing or logic performed to understand the context for the execution.

The result of this is that poorly isolated tests are rewarded, and completely isolated tests are penalised. If a test requires a huge swathe of the application, or fails to mock the dependencies of the code under test so that large amounts of code are erroneously and unintentionally executed, the test coverage score will be higher than had that same test been written correctly, in an isolated fashion.

From where I stand, this pretty clearly incentivises poor test discipline (even if the developers using these coverage metrics don’t know it.) After all, who wants to spend more time writing properly isolated code only to see their favourite success metric go down?

Other ways of measuring coverage

This seems like such an obvious issue that it’s staggering to me more people aren’t talking about it (well, it’s commonly understood that test coverage is a troubled metric, but nobody is talking about how to fix it — realign that incentive.)

I’ve done some rough research, but I’ve so far found no coverage tools for JavaScript which have sought to address this problem.

Back in 2014 (!!!) this frustration got the better of me, and I wrote a proof of concept instrumentor and analyser called SteamShovel. The intention behind this code was to create an alternative method for measuring coverage which would more accurately assess whether a code branch was intended to be tested or was accidentally invoked through a failure of isolation.

The SteamShovel instrumentor tags each code branch with a function call instead of just assigning a value to a key in a huge map — and when it does so, it records the stack depth at the time of execution (using a rather horrible hack.)

Stack depth of branch execution over time

SteamShovel compares the stack depth of each executed code branch to the stack depth of the test, and uses an inverse weighted logarithm to compute a “testedness” score for the branch. Accidentally invoked code found far from the test itself receives a branch score very close to zero. It also exposes this visually in its HTML reporter, allowing poorly isolated tests to be identified — as well as where erroneous execution has leaked.

Perhaps most crucially, SteamShovel records a “test milestone” against each instance of branch execution (essentially the name of the test currently being run) — which exposes which tests are touching a given piece of code.

SteamShovel demonstrating just how poorly tested its own code is

Because of this, the SteamShovel methodology incentivises both test completeness and test isolation.

Unfortunately I can’t recommend using SteamShovel, as it never evolved beyond the proof-of-concept phase. I’m really hoping somebody smarter than me and with much more willpower for this sort of challenge will have a stab at solving this problem, or take my idea and run with it.

But in my not-so-humble opinion, this serves to demonstrate that coverage can be a useful metric, as long as you understand how it is calculated and carefully align it to the kind of test strategy you want to follow.

Controlling code complexity

One of the biggest impediments to writing fast, repeatable, isolated tests is the complexity of the code under test.

Roughly ten years after JavaScript became acceptable as a “real” language, the community still hasn’t agreed on a single methodology for writing long chains of successive asynchronous operations. In the absence of a ubiquitous solution (especially one available to those of us who might not be the type to endlessly research new methods in our own time), many developers are writing colossal pyramids of doom in single top-level functions.

Pyramids of doom are a great indicator that the code you’re looking at will:

  • Be difficult to mock
  • Have an enormous number of side effects
  • As a result, be inadequately tested
  • Have an astronomical number of behaviours in a single unit of code.
  • Be exquisitely complex and therefore difficult to understand

Any tests for code that looks like this are almost guaranteed to be bad tests.

Thankfully, pyramids of doom are easy to refactor. Callback functions can be trivially lifted out and made discretely testable.
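A sketch of the lift, with hypothetical getArticle and render steps stubbed so the example stands alone:

// Hypothetical async steps, stubbed for the example
function getArticle(id, callback) {
    process.nextTick(callback, null, { id: id, body: "text" });
}
function render(article, callback) {
    process.nextTick(callback, null, "<p>" + article.body + "</p>");
}

// Before: a pyramid — anonymous callbacks, untestable in isolation
function publishArticle(id, callback) {
    getArticle(id, function(err, article) {
        if (err) { return callback(err); }
        render(article, function(err, html) {
            if (err) { return callback(err); }
            callback(null, html);
        });
    });
}

// After: the inner callback is lifted into a named, individually
// testable unit, with its dependencies injectable
function handleArticle(render, callback, err, article) {
    if (err) { return callback(err); }
    render(article, callback);
}

function publishArticleFlat(getArticle, render, id, callback) {
    getArticle(id, handleArticle.bind(null, render, callback));
}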

Furthermore, these pyramids are obvious in code. There’s no firm rule defining the nesting level which is “too deep” — but I usually put the threshold at around four or five levels of indentation (I’m giving myself some leeway because I write my code in the aforementioned factory function style, which uses up one of my available levels.)

You could enforce this limit directly, but my team found that hard-limiting line length to 80 characters achieved the same result while allowing a little more flexibility for those difficult cases where an additional level or two might be required. It also prevents people from writing enormous one-liners which require scrolling and look nasty in diffs.

Tooling depends on the language in use, of course, but my team used jscs and jshint as part of our linting stage in Bamboo to automatically fail releases with any lines of code longer than 100 characters, while concurrently maintaining a human-enforced limit of 80.

Cyclomatic Complexity

Cyclomatic complexity is a formal measure of the complexity of code — the number of linearly independent paths through it, computed from its control-flow graph.

Tools exist to track cyclomatic complexity over your code. You should use this as a metric to understand whether code is increasing or decreasing in complexity. It should go without saying that you should always favour reducing complexity, as long as functional requirements are satisfied.

You should not use cyclomatic complexity to automatically flag a given piece of code as too complex, as the metric has nuances which can push even behaviourally simple pieces of code over the edge.

Furthermore, traditional implementations of the formula weight edge nodes very heavily, penalising the early-return structure common to JavaScript code (and widely recognised to visually simplify the code itself.)
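For instance — a sketch only, since the exact scores depend on the tool and the variant of the formula — these two functions behave identically, but an implementation that counts every return as an extra exit edge will score the early-return form as more complex:

// Early-return style: flatter and easier to read, but each return
// adds an exit point some implementations weight heavily
function clamp(n) {
    if (n < 0) { return 0; }
    if (n > 100) { return 100; }
    return n;
}

// Single-exit style: scores "simpler" under those implementations,
// despite the extra nesting
function clampSingleExit(n) {
    let result = n;
    if (n < 0) {
        result = 0;
    } else if (n > 100) {
        result = 100;
    }
    return result;
}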

But you should definitely track cyclomatic complexity. Have your CI system record increases in complexity against merged code, so you can see whether styles of development are having a positive or detrimental effect on code complexity over time, and to associate complexity cost with feature development.

My team used Plato to track complexity. This ran as a separate stage in Bamboo, saving the complexity reports as a downloadable asset we could review once the build was complete.

Have an evolving strategy

The most important thing I’ve learnt over the years is that we’re always wrong (in subtle as well as major ways) about test strategy, regardless of how sophisticated or nuanced that strategy is.

As we learnt more about the specific requirements of our application, how best to work together as a team, and how swiftly (or not) we were able to operate under certain processes, we evolved them. Added new requirements to our process. Pulled old redundant ones out.

I believe what I’ve presented here is a good general set of advice, but you will undoubtedly develop much deeper understanding of your unique set of problems as time goes on.

Therefore, be open to change.

Final Words

Writing good tests is much harder than writing good code. On a bad day, testing can feel like telesurgery — you’re manipulating things inside an enclosed object from afar, without being able to properly see inside.

Especially with respect to impure languages, where guarantees of safety are not provided by your immediate working environment, you have to work out ways to provide as many of those guarantees yourself as you possibly can.

However, the outcome of maintaining, evolving, and exploring strong test discipline is that you can work faster, and have much more confidence that the code you’re writing is stable, reliable, reusable, and safe.

In summation, the following practices will stand you in good stead:

  • Isolate your various test suites from one another;
  • Isolate your code and tests using Dependency Injection, and mock every external dependency;
  • Use TDD, writing individual tests for every behaviour before writing the code;
  • Focus on speed, ensuring your whole unit test suite can run in seconds;
  • Write clear assertions, ensuring your failing tests read like a bug report;
  • Use property based testing where you can;
  • Have a CI workflow to run your tests;
  • Avoid APIs which encourage you to break test isolation;
  • Track complexity and coverage — but understand the context from which the metrics were derived before acting on them;
  • And finally, have an evolving strategy.

It’s worth the pain. A thousand times over.

Happy testing!

Here’s to not bringing down the production stack. 🎉

Thanks

I’m very grateful to both my internet superfriends Jeremy W. Sherman and Keitaroh Kobayashi for checking over this post, making great suggestions, and finding my silly errors before I went and embarrassed myself in public!

The title image for this post is used under a Creative Commons Attribution-NonCommercial licence, and was taken by Corey Holms.