
The True Meaning of Code Coverage

Whether you’re auditing third-party libraries or trying to toughen up your own test suites, you’ve likely come across a project boasting “100% code coverage”. It just sounds so official, right? 100% of this code is covered by automated tests, so it must be good!

I have good news and bad news. The good news is that a high percentage of code coverage does generally reflect an above-average effort to test the code—I’d much rather use a library with 60% code coverage than 15% (or zero!), wouldn’t you?

The bad news is that a high percentage of code coverage doesn’t mean that the code is necessarily working the way it’s supposed to!

In this post, we’re going to look at what code coverage really means as a metric (specifically within the context of PHP, but these lessons should be broadly applicable), as well as several ways that code coverage can give you a false sense of confidence.

What is code coverage?

In its purest sense, “code coverage” is a measurement of how much of a codebase is actually tested by a test suite.

The most common (and easiest) form of code coverage is line coverage: a calculation of the number of lines executed during a test run divided by the total number of executable lines. To demonstrate, imagine the following:
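A minimal sketch of such a function, matching the description that follows (the exact body and return values are assumptions; only the shape matters):

```php
function catDog(bool $isCat): string
{
    if ($isCat) {
        return 'Meow';
    }

    return 'Woof';
}
```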

In total, that function has 8 lines of code. We might write a test for it, which could look like this:
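Something along these lines, using PHPUnit (the class and method names are illustrative):

```php
use PHPUnit\Framework\TestCase;

class CatDogTest extends TestCase
{
    public function testReturnsMeowForCats(): void
    {
        $this->assertSame('Meow', catDog(true));
    }
}
```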

When we run the test and tell PHPUnit to report code coverage, it’s going to tell us that only two of the lines were executed. Generating an HTML report (via --coverage-html) reveals what it’s seeing:
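The report itself is an HTML page, but annotated by hand it amounts to something like this:

```php
function catDog(bool $isCat): string
{
    if ($isCat) {          // executed ✓
        return 'Meow';     // executed ✓
    }

    return 'Woof';         // never executed ✗
}
```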

Of the 8 lines of code, only 3 are really executable (the conditional and each line with a return statement), so we’ve covered 2 of the 3 executable lines in the function.

If we call catDog(true) in multiple tests, our coverage won’t go up (we’re still only hitting 2 of the 3 executable lines); multiple tests can cover the same logical branch.

Other forms of code coverage

Line coverage is not the only form of code coverage, however, and PHPUnit is capable of calculating all of the following:

  1. Branch coverage: for each conditional, are we covering both the true and false paths?
  2. Path coverage: what percentage of the total possible paths through a codebase is covered?
  3. Function/method coverage: is every function/method covered?
  4. Class/trait coverage: are all of the methods within a class (or trait) covered?

Code can also be given a Change Risk Anti-Pattern (CRAP) score, which combines code coverage with cyclomatic complexity: highly complex code with low test coverage will lead to a higher CRAP score. On the opposite end, clear code with good coverage will have a low CRAP rating.
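For reference, the usual formula is CRAP(m) = comp(m)² × (1 − cov(m)/100)³ + comp(m), where comp(m) is the cyclomatic complexity of method m and cov(m) is its coverage percentage: a fully covered method scores its complexity, while a completely uncovered one scores roughly its complexity squared.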

Code coverage does not necessarily mean code quality!

Now we get into the nasty truth of code coverage: code coverage tells us that a certain percentage of lines were executed, but it does not tell us whether the tests are useful or correct!

Here are a few common ways in which code coverage can hide real issues:

Only testing the happy path

My friend Tim recently said that a test has two jobs:

  1. Verify that the thing does what it’s supposed to do
  2. Make sure something doesn’t do what it shouldn’t do

We often describe the former as the “happy path”: assuming everything works as expected, a user should be able to do this thing and see these results.

Writing tests for happy paths is great, and helps ensure that our applications behave the way we expect. However, we must also account for the non-happy paths, as that’s where most issues will arise.

Imagine the following HappyPath service, which uses Guzzle to make HTTP requests:
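A minimal sketch of such a service (the exact get() signature, including the $headers and $query parameters mentioned below, is an assumption):

```php
use GuzzleHttp\Client;
use Psr\Http\Message\ResponseInterface;

class HappyPath
{
    public function __construct(private Client $client)
    {
    }

    public function get(string $path, array $headers = [], array $query = []): ResponseInterface
    {
        // By default, Guzzle converts 4xx/5xx responses into exceptions.
        return $this->client->get($path, [
            'headers' => $headers,
            'query'   => $query,
        ]);
    }
}
```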

This shouldn’t be anything groundbreaking, but we’re injecting an instance of GuzzleHttp\Client via our class constructor (using the constructor property promotion available in PHP 8.0+). The get() method then uses the client to make an HTTP GET request to the given path.

A test for this service might look something like this:
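A sketch using Guzzle’s MockHandler (names illustrative):

```php
use GuzzleHttp\Client;
use GuzzleHttp\Handler\MockHandler;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Psr7\Response;
use PHPUnit\Framework\TestCase;

class HappyPathTest extends TestCase
{
    public function testGetReturnsResponse(): void
    {
        // Queue a single canned 200 response for the client to return.
        $mock   = new MockHandler([new Response(200)]);
        $client = new Client(['handler' => HandlerStack::create($mock)]);

        $response = (new HappyPath($client))->get('/some/path');

        $this->assertSame(200, $response->getStatusCode());
    }
}
```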

Here we’re constructing a client (using Guzzle’s built-in request mocking) and asserting that when we call get() we receive a 200 HTTP response code.

Technically, that test would produce 100% code coverage for our HappyPath class as we’ve written it: we execute every line of code, satisfy all typed arguments, and our assertions pass.

In reality, our test method fails to account for the following conditions:

  • What happens if we get a non-2xx response code? (By default, Guzzle will convert 4xx and 5xx response codes into exceptions)
  • Are the $path, $headers, and/or $query arguments actually being handled properly?

If we were solely focused on code coverage, we might leave the test as it is, but we’d be scratching our heads the first time we get unexpected input or the service we’re talking to becomes unavailable. 100% code coverage, but no protection against the things most likely to cause issues.

Accidental coverage

In software testing, we have the concept of a “System Under Test” (SUT), which is the actual thing we’re trying to test; if I’m writing unit tests for my User model, then that model is the SUT.

Of course, software isn’t always divided into neat little boxes, so we may have dependencies between them: maybe a User can have many SocialAccount records, for instance. It’s entirely likely that our UserTest class—unit tests for the User model—will contain some number of references to SocialAccount (e.g. “does getSocialAccounts() return a collection of social accounts?”, “Can I create a new social account and assign it to the user model?”, etc.).

In these circumstances, we’re not intending to test the SocialAccount class (we likely have a whole separate test class for those unit tests) but those lines are still being executed. As a result, UserTest may be contributing code coverage to SocialAccount totally unintentionally!

Within PHPUnit, we can use the #[CoversClass] attribute (or the @covers annotation for PHPUnit 9.x and earlier) to indicate that the given test class is only intended to test the given class(es), and anything else is incidental.
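For example (the App\Models namespace is an assumption):

```php
use App\Models\User;
use PHPUnit\Framework\Attributes\CoversClass;
use PHPUnit\Framework\TestCase;

#[CoversClass(User::class)]
class UserTest extends TestCase
{
    // Lines executed in SocialAccount while these tests run are treated as
    // incidental and will not count toward its coverage.
}
```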

With this attribute in place, PHPUnit will not consider any code within SocialAccount to be covered by UserTest because that’s not what it’s meant to test!

Explicitly declaring covered code may cause code coverage to “go down” (read: cease to be artificially inflated), but we get more accurate coverage reports as a result.

The tests are only asserting that nothing blows up

A really common pattern I see when reviewing tests is what I like to call “Minimally Viable Tests”: tests that are only responsible for executing the code and alerting if anything blows up.

This is especially common with tests around failure states; imagine we have a controller method that looks something like this:
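A hypothetical, Laravel-flavored sketch (every name here is illustrative), where persistence, notification, and everything in between funnel into a single catch block:

```php
use App\Models\User;
use App\Notifications\WelcomeNotification;
use Illuminate\Http\Request;

class UserController
{
    public function store(Request $request)
    {
        try {
            $user = User::create($request->only(['name', 'email']));
            $user->notify(new WelcomeNotification());

            return redirect()->route('users.show', $user);
        } catch (\Throwable $e) {
            // Anything that goes wrong above ends up here.
            return back()->withErrors(['error' => 'Something went wrong.']);
        }
    }
}
```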

Without going too deep into the code, it would be very easy for a lot of things to end up in that try/catch block; if our tests are only asserting that we did (or did not) receive an error message, there’s a lot of room for our tests to break in pretty spectacular ways.

For example, PHP 8.0 promoted a number of warnings to TypeError exceptions: where it may have previously coerced variables, something like the following will work (albeit with a warning) on PHP 7.4 but fail on PHP 8.0:
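One such case is passing the wrong argument type to an internal function:

```php
// PHP 7.4: emits a warning ("strlen() expects parameter 1 to be string, array given"),
// strlen() returns null, and execution continues.
// PHP 8.0: throws a TypeError and execution stops (unless it's caught).
$length = strlen(['not', 'a', 'string']);
```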

If that was inside the try/catch block, your “assert that we caught an exception” test might still pass, but for the completely wrong exception!

Testing language/framework features

Modern versions of PHP allow us to not only type our arguments, but also the return types of our functions and methods.

When we add return types, PHP will automatically throw a TypeError if we attempt to return the wrong type of value. For example:
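A sketch (the body is assumed):

```php
function returnInteger(): int
{
    // The declared return type is int, but we're returning a string.
    return 'definitely not an integer';
}
```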

When we attempt to call returnInteger(), we’ll get an error like the following:
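On PHP 8 the message reads roughly like this (the exact wording varies between versions):

```
Fatal error: Uncaught TypeError: returnInteger(): Return value must be of type int, string returned
```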

The obvious upside is that we get language-level type enforcement: the function/method literally cannot return the wrong type. The downside is that tests like this become useless:
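A sketch, assuming a hypothetical returnString() function declared with a string return type:

```php
use PHPUnit\Framework\TestCase;

class ReturnStringTest extends TestCase
{
    public function testReturnsString(): void
    {
        // The return type declaration already guarantees this.
        $this->assertIsString(returnString());
    }
}
```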

This is literally asking “does the function that we just said can only return a string, in fact, return a string?”. This would be equivalent to asserting that an object dropped off a building will fall towards the ground: the laws of gravity tell us that it will, and if it doesn’t, you have much bigger problems!

Over-reliance on test doubles

In software testing, “test doubles” generally fall into five buckets (shamelessly stolen from Jessica Mauerhan’s excellent The Five Types of Test Doubles and How to Create Them in PHPUnit):

  1. Dummy: Used only as a placeholder when an argument needs to be filled in.
  2. Stub: Provides fake data to the System Under Test.
  3. Spy: Records information about how it is used, and can provide that information back to the test.
  4. Mock: Defines an expectation on how it will be used, and with what parameters. Will cause a test to fail automatically if the expectation isn’t met.
  5. Fake: An actual implementation of the contract, but one that is unsuitable for production.

The semantics aren’t super-important right now, but in general we use test doubles to satisfy constraints while maintaining control over our test environment.

This becomes problematic when we have too many dependencies and end up mocking everything; instead of testing our customer-facing code, we’re testing a rough facsimile!

Imagine we have an Alarm class that includes the following methods:
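A sketch of what that class might look like (the exception and method bodies are assumptions):

```php
class AlarmSoundedException extends \RuntimeException
{
}

class Alarm
{
    private bool $sounding = false;

    public function sound(string $reason): void
    {
        $this->sounding = true;

        // Sounding the alarm interrupts whatever triggered it.
        throw new AlarmSoundedException($reason);
    }

    public function silence(): void
    {
        $this->sounding = false;
    }
}
```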

We might inject this as a dependency for our Jailbreak class:
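Something like this:

```php
class Jailbreak
{
    public function __construct(private Alarm $alarm)
    {
    }

    public function attempt(): void
    {
        // Any escape attempt should trip the alarm.
        $this->alarm->sound('Jailbreak');
    }
}
```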

Given an understanding of how both Alarm and Jailbreak work, we could reasonably write a test like this:
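A sketch of such a test (names illustrative):

```php
use PHPUnit\Framework\TestCase;

class JailbreakTest extends TestCase
{
    public function testEscapeAttemptsTripTheAlarm(): void
    {
        $alarm = $this->createMock(Alarm::class);

        $this->expectException(AlarmSoundedException::class);

        (new Jailbreak($alarm))->attempt();
    }
}
```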

However, this test would fail because $alarm is a mock (albeit with no expectations defined). By default, PHPUnit will treat any method calls here as no-ops, so calling $this->alarm->sound('Jailbreak') doesn’t actually do anything—no exception will be thrown!

In this case, we can pretty easily work around it by simply using a real (i.e. non-mocked) instance of Alarm—after all, we want it to behave as close to production as possible.

In the event that you want to maintain most of the functionality of Alarm but have the ability to override individual methods as needed, you might consider a partial mock (i.e. “act like you would normally, except when I tell you to do otherwise”), but that’s a topic for a separate post.

Code is 100% covered…60% of the time

At a previous company, I was contributing some code to a Django (Python) application. I’m not really a Python developer, but was able to reason about the syntax enough to figure out how to make the change I needed. I opened a PR to the team that owned the app, only to have the CI pipeline reject it immediately.

The code worked when I tested it (manually), so what was the problem?

The project required 100% code coverage, and my PR didn’t include tests.

It was a frustrating setback, but I went looking through the application to see how similar changes had been tested. Suddenly, I was noticing an awful lot of comments like this: # pragma: no cover

I did a quick search and discovered what this meant: “exclude this block from code coverage calculations.” (PHPUnit has a similar mechanism)
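PHPUnit’s equivalent is the @codeCoverageIgnore annotation (with @codeCoverageIgnoreStart / @codeCoverageIgnoreEnd variants for excluding arbitrary blocks):

```php
/**
 * @codeCoverageIgnore
 */
function untestedLegacyHelper(): void
{
    // Nothing in this function counts toward (or against) code coverage.
}
```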

The maintainers of this app were so proud of their “100% code coverage”, despite the fact that there were hundreds of parts that were completely excluded from code coverage! 🤦‍♂️

Clip from Elf (2003) where Buddy the Elf (Will Ferrell) threatens a mall Santa, claiming "you sit on a throne of lies"

In this case, it would have been far preferable to have ~60% code coverage and know which areas of the app were uncovered; instead, that team was chasing a metric (100% code coverage) and ended up with a useless code coverage report.

So code coverage’s a bunch of bullshit?

Once you understand what code coverage really represents, it’s easy to become disillusioned: why are we putting so much emphasis on a metric that’s so flawed?

Clip from Avengers: Endgame (2019) wherein Scott Lang (Paul Rudd) questions "So Back to the Future's a bunch of bullshit?!"

While yes, code coverage is not necessarily an accurate measurement of code quality, it can still be useful when used in aggregate: are there areas not being touched by your test suites? Are bugs occurring in code that (allegedly) has 100% coverage?

To get the most out of code coverage, it’s important to be intentional about what and where we cover things: if you’re writing a unit test class, it should only contribute coverage towards the System Under Test (note that this may be multiple classes when writing integration/feature tests!). Look at the HTML reports generated by PHPUnit to see where coverage is coming from, and make sure that you’re not accidentally picking up coverage from elsewhere.

As you write your tests, don’t just think about the happy path: think through the corner-cases, failure conditions, and anything else that might happen. If a bug still slips through, be sure to write a corresponding regression test to make sure that bug can never happen again.

Add types in as many places as possible, as those give you language-level assurances that you’re dealing with the types you expect.

Most of all, don’t live or die by a percentage of code coverage: it’s not totally without value, but high coverage does not necessarily equal high-quality code!
