I fixed bug last night.

The bug was difficult to fix because the code was so complex. I wasn’t really able to fix it until I simplified the code. Once the code was simplified the bug was easy to fix. While doing this I found an anti-pattern that I now call “One loop trying too hard”.

TL;DR version

If A never changes, then…

1
2
3
    for ... {
        if A { B } else { C }
    }

…is probably best refactored as…

1
2
3
4
5
    if A {
        for ... { B }
    else
        for ... { C }
    }

Why? First of all, you are evaluating A once for each iteration. If A doesn’t change, that’s a waste of CPU. But more importantly, this is really just two loops that just happened to have the same for .... Splitting them out makes the code more readable.


The full story…

I was trying to fix a bug in DNSControl related to AWS Route53 DNS updates.

The code had been hacked on my many people over the years to add support for Route53-style aliases and other things. It was now very complex and difficult to understand.

The refactoring didn’t just simplify the code to make it easier to maintain, I was able to fix some other problems along the way. For example, the report of what updates are being performed is now sorted, which makes it easier for the user to notice irregularities.

Sidenote: The joy of automated integration tests

Reproducing the bug was easy because DNSControl has an embedded language for describing test cases. Any bug is first reproduced in this language. We can run the tests frequently while doing development so that we know when the bug is fixed.

We never delete the old use-cases. This was we get instant feedback if there are regressions (i.e. the bug creeping back in).

My favorite aspect of this is that I can make aggressive changes with confidence: if all the use-cases pass, I’m pretty sure I haven’t broken anything, even some obscured edge-case that was observed years ago.

I took advantage of this many times while working on this bug. In fact, there was one point where I wasn’t sure if a change should be done one way or another. I was sleepy and feeling lazy so I just ran all the tests both ways. One made nearly ever test fail; the other worked. I was too sleepy (and lazy) to investigate why, but it didn’t matter because… I was sleepy (and lazy).

By the way, these tests are fully automated and accessible to everyone. Until a week ago I was the only person that could run the full suite of tests. However last week my awesome coworker Max configured Github Actions so that all the tests run on each PR, no matter who submitted the code.

This makes it easy for others to contribute to the open source project because it reduces the anxiety over making changes. Now any contributor to the project gets the peace of mind that their code works and doesn’t break other people’s code. Previously contributors would manually run the test on the DNS provider they use. Now the automated tests run on the 7 most important providers; and more soon!

As the project maintainer it makes my job easier because I spend less time testing, more time reviewing the code. This is particularly important because this is an open source project, not my full-time job.

Ok, enough about that… let’s look at the refactoring!

The refactor itself

While diagnosing the bug, the code looked like this: (pseudocode)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
Xran = false ; Yran = false
for range {
    if invariant {
        X ; Xran = true
    } else {
        Y ; Yran = true
    }
}
if Xran { Xfollowup }
if Yran { Yfollowup }

This was basically a single loop trying to be two different things. If invariant was true, it was doing task X, which required follow-up work (Xfollowup). If the invariant was false, it was doing task Y, which required different follow-up work. It was using Xran/Yran flags to record which variation of the loop was used so that the proper follow-up code could be executed.

This was refactored into two separate loops by moving the inner if to be outside, and repeating the loop code for X and again for Y:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
if invariant {
    for range {
        X
    }
    Xfollowup
} else {
    for range {
        Y
    }
    Yfollowup          <<< spoiler alert: this is the bug
}

The bug was triggered in some tests but not others. By thinking about what was different about the failing tests I was able to find the problem: the Yfollowup task needed to be done for every Y, not just at the end.

In other words, Yfollowup was to be moved up into the for loop.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
if invariant {
    for range {
        X
    }
    Xfollowup
} else {
    for range {
        Y
        Yfollowup      <<< moved into the loop
    }
}

After the code was refactored it was easy to fix the bug. Prior to the refactoring I kept adding more flags and conditionals and… ugh, it was getting worse and worse.

Once the code was refactored, I could fix the bug much easier.

The anti-pattern

We should name this the “One loop trying too hard” anti-pattern.

The complexity was due to a loop being used for two entirely different purposes. It was nice to not have to repeat the loop iterator, but it was making everything else more complex.

This is a common anti-pattern: one loop doing two things for no good reason. The code smell is that the entire body of the loop was contained in a big if/then/else. If there was no common code between the two invariants, why not have two loops?

More info

You can view the refactoring PR in Github. The loop is on around line 300 of providers/route53/route53Provider.go in StackExchange/dnscontrol#938.

If you would like to maintain your DNS zone files in a “infrastructure as code” manner, check out DNSControl. It now supports 28 DNS providers such as Route53, Google DNS, and Gandi. It has some interesting features such as the ability to optimize your SPF records, and use macros to assure consistency. Writing new providers is so easy that many people have written them as their first experience writing Go!

FunFact: At Stack Overflow we integrated it into our CI/CD system so that devs can send PRs to request DNS changes, and SREs don’t have to look at the PR until it passes all our tests (See? More automated testing!).