On Data and Being Honest
As I stared at my 300th co-occurrence number, I let out a groan. Why was I spending hours late into the night staring at countless static numbers? I could’ve been building something, or writing something, or brainstorming. Instead I was doing what felt like grunt work: double-checking number after number. My co-worker had been asked to independently reproduce my model results, and we had to make sure that the thousands of metrics and results we produced for the model were all exactly the same. I felt bad for the guy, having to spend hours into the night rewriting code I had already written, code that did boring, predictable things, just to make sure we got the same numbers. This wasn’t a chance to be creative. It wasn’t even really a chance for reward, because the better we did our job, the more likely we were to have to deliver bad news and stress everyone out. I felt guilty. Were we doing this because I had made evaluation mistakes earlier? Did my friend have to spend a late night just because the noob new grad’s code couldn’t be trusted? It isn’t easy being positive when you’re tired and staring at a monolithic table of numbers.
As an ML engineer, your success is driven by the success of the model you build. But in most cases, especially when you’re not at a big company, the same person who builds the model is also the person who writes the code to evaluate its success. People don’t need to be malicious and actively lie about their results to make mistakes in their process that lead to a misunderstanding of the data. They just have to be a little less skeptical, a little less curious about the ways in which what they’ve built might fall apart. By the time it’s clear that the analysis was misleading and something was wrong, it’s often way too late. The reality as a data scientist is that you know your data better than pretty much anyone else, and the questions you ask about the data, and especially the ones you don’t ask, drive the story that other people hear about it. With the wrong incentives and attitude, this can be dangerous.
I remember telling my grandfather years earlier, with wide eyes, that I had gotten into data science, imagining that he would see it as alien work from an alien time. To my surprise, he recommended a book to me: “How to Lie with Statistics”. He told me he had read it while in the Navy’s training program. His job was to interpret radar data for an aircraft carrier at sea. I imagined him as a sort of prehistoric data scientist, having to calculate countless numbers not with Python scripts but with pen and paper. No for-loops, no standard library sorting functions, just good old-fashioned grey matter getting the job done. God, it sounded like hell. And yet here he was, an ancient data scientist trying to impart the wisdom of data honesty and integrity to me with this book recommendation.
It’s a trope that it’s easy to lie to people with statistics. Who doesn’t love the feeling of presenting the slide with a pretty graph that tells everyone you’re a genius? Sure, I’m a data scientist, but I’m also human, and so every now and then I feel a devil on my shoulder yelling, “Just fudge the metrics, no one will check!” Of course I never consciously act on it; what’s the point of chasing the dream of helping people make decisions with data if you ignore the data? What’s much more insidious, however, is not the ways we can lie to others with data, but the ways in which we can use data to lie to ourselves. My grandfather’s book recommendation reminded me that the task of painstakingly double-checking results was a deeply important lesson, not grunt work, and that though these lessons are new to me as I build machine learning systems deployed into the real world for the first few times, they are as old as data itself.