A difficult bug to trace..!

Posted: 18th June 2012 by Ammar Qammaz in Post
Tags: ,

During the last 2 weeks some major things happened in Greece including the second round of elections etc.
I spent most of the time stuck and trying to trace a bug in RemoteVariables ( which I am currently developing )..

For testing the under construction library I wrote a test suite which performed 100 variable changes from peer to peer and finally exited when all of the handshakes where complete..
The test suite was then debugged using valgrind and standard gdb and despite some early mixups and architecture problems that were very evident with the debugger output after they were gradually fixed some very strange behaviour ensued as the test suite would “randomly” stop at some point and the program “seemed” to stop acknowledging and transmitting changes on the variables without an apparent reason..

Adding to the complexity of identifying the problem was the fact that a total of 6 threads spanning in two procedures could potentially cause a deadlock somewhere ( despite the mutex protection ) ..

After 5 or 6 days of trying to fix the problem I started to get desperate going line by line and almost rewriting all the critical segments until I figured it out..

The problematic code was the following :

unsigned long GetVariableHash(struct VariableShare * vsh,void * ptr,unsigned int size_of_ptr)
{
if (size_of_ptr unsigned long * stacklongptr = ptr;
unsigned long stacklong = *stacklongptr;
return stacklong;
}else
{
return rvhash(ptr,size_of_ptr);
}
return 0;
}

Each pointer (void * ptr) points to a memory block of variable size (unsigned int size_of_ptr ) and the hash values get stored on an unsigned long ..
All is good for variables that are longer than an unsigned long since they are casted to a char * "string" and then a hash function converts them to a reasonably unique unsigned long.. But .. for smaller sizes of pointers such as unsigned ints my code directly casted the 4byte payload of the pointer to the 8byte accomodation of the unsigned long leaving 4 bytes uninitialized..

What was very funny was that in all the fprintf(stderr,"hash %u",(unsigned int) hash); calls I had added to review the function the results appeared completely normal but when two hashes where compared they failed the equal if operation even if they seemed to carry the same value when fprintf'ed

To better illustrate a simplified version , what I saw was an instance of

fprintf(stderr,"old hash %u , new hash %u\n",(unsigned int) oldhash,(unsigned int) newhash);
if (oldhash==newhash) { fprintf(stderr,"They are the same\n"); } else
{ fprintf(stderr,"They are not the same\n"); }

where I got output like :
old hash 5 , new hash 5
They are not the same

Needless to say valgrind ( which is a great tool and this wasnt its fault :) ) has an initialization check that did not report the unitialized bytes , the possible problem of unitialized bytes didnt even cross my mind while hastly writting the function @ 3 AM 2 years ago ( the problematic commit ) and 2 weeks of my life were "kind of" thrown away ..

It is true that you can sometimes shoot yourself in the foot when using tools that give you this kind of freedom , and this might be just such a case , but I frankly found it enlighting..
C is an awesome language!

The fix commit

Comments are closed.