Absolute value of a float

I did a job interview a while ago where the first question of the phone interview was how I would calculate the absolute value of a float. I didn’t answer this very well at all, partly because I hadn’t realised the interview was going to be a technical one, but mostly because I didn’t have the greatest understanding of how a float actually works. I knew vaguely that it stored a mantissa and an exponent, but had no idea how many bits each took up or where those bits were located. And I just assumed that negative numbers would be handled in two’s complement, the same as integers.

I did so horrendously badly at that interview in general that I went out and bought a book called Write Great Code: Volume 1: Understanding the Machine. I’m only up to chapter 4 so far, but wow, this book is great. I feel like it should be required reading for all computer science undergrads, and I wish I’d been taught a lot of this stuff much earlier.

So chapter 4 deals with floating point representation, and the question of finding the absolute value of a float that has been haunting me since the interview now seems a lot easier to tackle.

I’ve learned that the single precision IEEE floating point format does indeed store a mantissa and an exponent. Counting up from the least significant end, the mantissa uses the low 23 bits, followed by the exponent with 8 bits, and the one remaining bit at the top simply handles the sign:

[Figure: IEEE 754 single-precision floating point format]
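To make that layout concrete, here is a rough sketch of pulling the three fields out of a float. The names FloatFields and splitFloat are made up for illustration, and it assumes float is a 32-bit IEEE 754 single, which is true on most platforms but not guaranteed by C++ itself.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a 32-bit IEEE 754 float.
struct FloatFields {
    std::uint32_t sign;      // bit 31
    std::uint32_t exponent;  // bits 30..23, stored with a bias of 127
    std::uint32_t mantissa;  // bits 22..0
};

// Hypothetical helper: copy the float's bytes into an integer, then
// shift and mask out each field.
FloatFields splitFloat(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
}
```

For example, splitFloat(1.0f) has a sign of 0, a stored exponent of 127 (the bias, meaning 2 to the power 0) and a mantissa of 0.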

So finding the absolute value now seems super easy. We just need to zero that sign bit:

float myFloat = -1.45;
float absFloat = myFloat & 0x7FFFFFFF;

Ok, so that actually doesn’t work. Bitwise operators such as & aren’t defined for floating point operands in C++, so this doesn’t even compile. I guess that makes sense, really; the interview question would have been way too easy otherwise.

Casting to an int isn’t going to work either, because a cast converts the value rather than reinterpreting the bits. The result is stored as a plain integer (no mantissa or exponent there), not to mention that everything after the decimal point gets thrown away.
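A small sketch of the difference between converting the value and looking at the raw bits. The function names convertValue and copyBits are my own, and the bit pattern in the comment assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

// A cast converts the numeric value: the fraction is truncated toward zero.
int convertValue(float value) {
    return static_cast<int>(value);
}

// Copying the bytes keeps the sign/exponent/mantissa pattern untouched.
std::uint32_t copyBits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

With -1.45f, convertValue gives -1, while copyBits gives 0xBFB9999A, which is the pattern we actually want to mask.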

After a quick Google I saw someone mention using a union. That sounds like a nice solution, let’s try that.

union hackCasting{
    int myInt;
    float myFloat;
};

hackCasting myUnion;
myUnion.myFloat = -1.45;
float absFloat = myUnion.myInt & 0x7FFFFFFF;
std::cout << absFloat << std::endl;

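Which gave me the output of 1.06913e+09. Hmm. Clearly something is going wrong there. Oh, there’s an implicit conversion: the mask gives the right bit pattern, but as an int, and assigning that int to a new float (absFloat) converts its numeric value instead of keeping the bits.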

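union hackCasting{
    uint32_t myInt;   // the same 32 bits, viewed as an unsigned integer
    float myFloat;    // ...and viewed as a float
};

hackCasting myUnion;
myUnion.myFloat = -1.45;
// Clear the sign bit (bit 31) and keep the result inside the union,
// so no int-to-float conversion ever happens.
myUnion.myInt = myUnion.myInt & 0x7FFFFFFF;
std::cout << myUnion.myFloat << std::endl;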

Hurrah, a correct result of 1.45. (And a change to using an unsigned int after finding a comment by Tim Schaeffer on StackOverflow. Not sure if it’s actually Mr Schaeffer, but helpful all the same!)
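One caveat worth knowing: reading a union member other than the one last written is allowed in C, but in standard C++ it is technically undefined behaviour, even though the common compilers tend to do what you’d expect. A sketch of the same sign bit trick using std::memcpy instead, which is well defined, might look like this. The name absViaBits is made up, and it again assumes a 32-bit IEEE 754 float.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: absolute value by clearing the sign bit.
float absViaBits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);    // float bytes -> integer
    bits &= 0x7FFFFFFFu;                        // zero bit 31, the sign
    float result;
    std::memcpy(&result, &bits, sizeof result); // integer bytes -> float
    return result;
}
```

Calling absViaBits(-1.45f) gives back 1.45f. In everyday code std::fabs from <cmath> is the thing to reach for; this is only here to show the bit manipulation.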

So, question complete. Take that, interview question.
