A good war story: debugging a low-probability problem that caused Chrome builds on Windows to fail, but only a few percent of the time. There is a lot to learn from this sort of story: a deep dive, a methodical approach, and communication with peers and with upstream providers.
Flaky failures are the worst. In this particular investigation, which spanned twenty months, we suspected hardware failure, compiler bugs, linker bugs, and other possibilities. Jumping too quickly to blaming hardware or build tools is a classic mistake, but in this case the mistake was that we weren’t thinking big enough. Yes, there was a linker bug, but we were also lucky enough to have hit a Windows kernel bug which is triggered by linkers!