Unsigned Subtraction in Old C++ Code

I'm porting some old C++ code to Java. How old? So old that it doesn't just assume a long is 32 bits, it actually checks to make sure and bails if not. That's 1997 or so, by my guess.

I can already hear the programmers in the audience asking, "Why not just interface with the library using JNI?" We decided to port the code because we must support Unix and 64-bit environments, and the library specifically fails on all our Unix machines. We do have the source, so we could modify it... but then we'd be forced to deliver and maintain it, which carries its own set of headaches. So I'm working on 20,000 SLOC.

(Anybody wondered why the site has been so quiet for so long? It's because I'm porting 20,000 SLOC of OLD C++ to Java.)

Most of the code has been pretty straightforward. There have been the usual problems with C++ unsigned types, which don't have any corresponding Java type, as well as a bit of pointer arithmetic confusion. When I get confused, I just write a quick test program and make sure the C++ does what I expect.

The thing that blew my mind today was the weird use of the "magic number" 0x80000000.

Here's the code I found. It relates to a sliding bitmask, and verifying that blocks of data are received from the network in the proper order (names have been changed for my protection):

[code] // "Windowed" delta calculation unsigned long diff = block_id - lowest_pending_id; if ((diff > 0x80000000) || ((diff == 0x80000000) && (block_id > lowest_pending_id))) { if (num_mask_bits < (lowest_pending_id - block_id)) { return INVALID_ID; } else { return COMPLETED; } } [/code]

I figured the "window" the comment talked about was 0x80000000 blocks you had to hold on to in case an earlier block arrived in the wrong order. I started mindlessly converting everything into Java, carefully applying bit operations and casts to ensure that the arithmetic worked as expected.

In my defense, I was a little stunned from twenty thousand lines of underscores and braces. It's amazing how much you can get accustomed to your own conventions. (My project specifies that we shall put opening braces on the same line and create variable namesInCamelCase.)

Eventually, I took another swig of water, and my blood pressure increased enough to circulate oxygen back into my starving brain. 0x80000000 is a LOT of blocks. About 2 billion of them, in fact. Each block had been defined as 4K; that meant I was looking at holding 8 trillion bytes of stale data around, just hoping for an out-of-sequence block to arrive on the network.

The machines these guys were working with in 1997 didn't even have 8 GB of memory, let alone a spare 8 GB to use as a buffer. Heck, I don't have 8 GB of memory, and my machine is 15 years newer than theirs!

Well... 10 years newer. I do work for the Army, after all. You might think we'd get all the newest gadgets and latest technology, especially in wartime, but no. In fact, I'm lucky to be close enough to see the technological edge. Most Army guys have to work with tech so old their grandparents wouldn't have blinked at it.

Anyway.

I started looking at those subtractions again. Surely they must be checking for an edge case somewhere.

Well, surely diff > 0x80000000 shouldn't be any problem. That would happen if the block_id was MUCH bigger than lowest_pending_id. But then, the revelation: it would also happen when block_id was less than lowest_pending_id!

An example, in case one of my non-programming friends is reading. Suppose the block_id is 1, and the lowest_pending_id is 2. Then block_id - lowest_pending_id is -1. But these are unsigned long integers; they can't hold -1! What's a computer to do?

It returns 0xFFFFFFFF. (Or as many FFs as your long integer is... long.) The biggest possible number. That's because... well, because that's what computers do. I could explain base 2, twos complement representation, and all the weird mathematics that goes into making computers work, or you could just take my word for it. This article's long enough; just trust me and let's move on.

The big conclusion is that ANY TIME a negative number is needed, the "leftmost" bit of the number is set. For a 32-bit number, that's 0x80000000.

So if block_id < lowest_pending_id, then diff > 0x80000000. They were checking for a negative difference!

But what about the second half? When could the diff exactly equal 0x80000000, while block_id > lowest_pending_id? Nothing sprang directly to mind, so... I wrote a program to check.

[code] #include #include #include int main(int argc, char**argv) { uint32_t a = 0x01; uint32_t b = 0x02; uint32_t diff = a - b; // Quick sanity test, prints: // 0x00000001 - 0x00000002 == 0xffffffff fprintf(stderr, "%08x - %08x == %08x\n", a, b, diff); // Search for a case where a > b and a - b = 0x80000000 bool first = true; for (a = 0x01; a <= 0xFFFFFFFF; a++) { if (a == 0x01) { if (first) first = false; else break; } for (b = a - 1; b < a; b--) { diff = a - b; if (diff == 0x80000000) { fprintf(stdout, "\n%08x - %08x == %08x\n", a, b, diff); } } } } [/code]

IT RAN FOR HOURS. Of course. It's checking around 8,000,000,000,000,000,000 pairs of numbers; it's gonna take a while.

I left that running, just in case it was about to finish, and wrote a copy that only checked 32,768 values (if my calculations are correct). THAT ran a LOT faster.

And it told me that the only time the second half happens, under C99 unsigned subtraction rules, is when the block_id is 0x80000000 greater than the lowest_pending_id.

Whiskey Tango Foxtrot? (As they say around here.) What's so special about that number that it needs to be checked specifically?

Well, the older ANSI C / C89 standard doesn't really have anything to say about subtracting unsigned numbers. And they were supposed to be writing portable code (despite the travesty with the byte length of their variables). Maybe it's a special condition that only applies to their particular compiler. In that case, I could just ignore that part of their code.

That was a little too convenient, so I searched their code for the magic number, and found it all over the place. Many were checked for (a < b) instead of (a > b).

I modified my tiny checker to look at all 65,536 pairs of values. And I discovered that you hit 0x80000000 any time the two values are that far apart, regardless of which one is bigger! For instance, (0x90,000,000 - 0x10,000,000) and (0x10,000,000 - 0x90,000,000) are both 0x80,000,000! (Commas added for readability.)

So, to check if (a < b), you could check if (a - b >= 0x80000000)... but then you'd also catch 0x90000000 - 0x10000000. So you actually need to check (a - b >= 0x80000000) || (a - b == 0x80000000 && a < b).

The whole thing is one giant comparison check! I can just use if (block_id < lowest_pending_id)!

I'm still confused over the instances where they check (a > b). I'm betting it's a bug, caused by an incorrect copy of the earlier (a < b) code. They never get block IDs anywhere near that high, anyway, so they'd never hit that code.