Introduction
Recently, I worked on a small new feature that needed to be added to an application. I estimated this work at two hours tops, as I was familiar with the code base, knew how things worked, and had the necessary environment on standby. Part of this feature was to extend the protocol with a new enum that clients would use to advertise their state. We had a custom serializer/deserializer for historical reasons, but adding a new POD type was easy.
I updated all the relevant components, expanded the tests for the feature, and began testing on actual hardware. While testing, I noticed that it was not working: updated states from clients were being ignored. I reran the tests to see if I could identify the issue, but everything was green, without any failures. As it was unclear where the problem was coming from, I recompiled the application with debug symbols and a lower optimization level to reduce the chance of variables being optimized away. As often happens, the issue no longer reproduced under gdb, and correct states were reflected on the server side.
So, I switched back to the default settings (-O2) but focused solely on the server side to find the cause. I noticed that requests were coming in and payloads were correct, but deserialization, for some reason, was not working and instead yielded a default value.
This snippet shows how deserialization is handled:
bool unpack_to(T &value) {
    ...
    unpack((std::underlying_type_t<T>&)value, payload);
    ...
    return true;
}
Here, T = AppState and payload is a const char *:
enum class AppState : uint32_t {
    invalid = 0,
    start = 1,
    ...
};
This odd std::underlying_type_t cast is in place to reuse the already implemented deserializer for integer types.
I disassembled the binary and found that this function call had been optimized away. Why? What's wrong with it? Is it a compiler bug?
Analysis
I disassembled the binary again and found that -O1 preserved the call to deserialize data, while -O2 eliminated it. This explains why it works with -O1.
Even after using __attribute__((noinline)) to prevent inlining, the code still broke with -O2, meaning the issue is specifically related to GCC's -O2 optimizations.
So there must be some clever optimization pass in -O2 that breaks my code.
After a bit of searching, I came up with this command to dump the optimization flags that -O2 enables:
for opt in "O2" "O1" ; do gcc -Q --help=optimizers -$opt > /tmp/pass-$opt ; done
diff --color /tmp/pass-O1 /tmp/pass-O2 | grep "^>" | sed 's/^> //'
In my case, -O2 enabled 45 additional optimizer options compared to -O1.
With a bit of bash, I quickly identified the flag that triggered the issue:
for opt in "O2" "O1" ; do gcc -Q --help=optimizers -$opt > /tmp/pass-$opt ; done
for option in $(diff --color /tmp/pass-O1 /tmp/pass-O2 | grep "^>" | sed 's/^> //' | grep enabled | awk '{print $1}') ; do g++ -std=c++11 -O2 "-fno-${option:2}" testapp-2.cpp -o /tmp/test; /tmp/test | grep -q "\- 0" || { echo $option; break; } ; done
It is -fstrict-aliasing.
I concluded that my code contained undefined behavior, particularly due to this shortcut of reusing the existing implementation via the std::underlying_type_t cast. This realization came as I learned more about the implications of -fstrict-aliasing.
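For completeness, here is a minimal sketch of the fix, not the project's actual code: the unpack helper and the payload layout are stand-ins. The idea is to deserialize into a temporary of the underlying type and convert with static_cast, so the enum object is never accessed through a reference to a different type.
#include <cstdint>
#include <cstring>
#include <type_traits>

enum class AppState : uint32_t {
    invalid = 0,
    start = 1,
};

// Stand-in for the existing integer deserializer.
static void unpack(uint32_t &out, const char *payload) {
    std::memcpy(&out, payload, sizeof(out));
}

// Aliasing-safe variant: deserialize into a temporary of the underlying
// type, then convert to the enum with static_cast.
template <typename T>
bool unpack_to(T &value, const char *payload) {
    std::underlying_type_t<T> raw{};
    unpack(raw, payload);
    value = static_cast<T>(raw);
    return true;
}

int main() {
    const char payload[4] = {1, 0, 0, 0};  // little-endian encoding of 1
    AppState state = AppState::invalid;
    unpack_to(state, payload);
    return state == AppState::start ? 0 : 1;  // exits with 0 on success
}
The extra temporary is typically kept in a register, so this costs nothing in practice.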
Furthermore, enabling the -Wall option would have triggered a warning about this issue.
What a mistake! I'm still puzzled as to why -Wall wasn't enabled. It has become a standard default option in all our projects, yet this component didn't have it.
Conclusion
Always enable -Wall unless you want to spend hours solving a five-minute task.
Another lesson I've learned is that if you notice odd behavior, check whether -Wall is enabled. If it is, pay attention to any warnings.
-fstrict-aliasing is tricky, and it's easy to make mistakes and still assume your code is correct after reviewing it multiple times. Can you spot the major bug here?
#include <stdio.h>
#include <stdint.h>

__attribute__((noinline)) void printer(uint64_t value) {
    printf("0x%lX\n", value);
}

__attribute__((noinline)) void myfunc(uint32_t &value) {
    value = 0xFFFFF;
}

int main() {
    uint64_t value = 0xAAAAA;
    myfunc((uint32_t&)value);
    printer(value);
    return 0;
}
If you compile it with -O2 and run it, you will get 0xAAAAA instead of 0xFFFFF. Under strict aliasing, the compiler may assume that a write through a uint32_t reference cannot modify a uint64_t object, so it removes the call to myfunc and leaves you with the original value.
There is no warning unless -Wall is enabled.
Note: you can turn off this behavior by passing -fno-strict-aliasing to gcc.
The last part I needed to address was why my unit tests worked well while the app itself did not.
Well, all of our tests are compiled with clang, and it appears that clang does not perform the same type of optimizations as gcc.