Rules for Interviewing: Don’t Discuss the ML Prototype Project

A common question in a data scientist interview is to walk through a project you’ve done in the past. Many candidates choose to discuss an ML prototype project, often something they built for a data science bootcamp.

Don’t use that ML prototype project for your data science interview. It won’t impress the interviewer.

When an interviewer asks this question, they are looking for at least two things:

  1. Competence
  2. Business intuition

Competence

To demonstrate competence in an interview, the candidate needs to be able to speak deeply about their work. That means that when the interviewer probes the candidate with follow-up questions about how and why they made certain decisions, they need to have thoughtful answers.

An ML prototype project generally gives someone only a superficial understanding of how to do ML. These projects are meant to be completed in a short time, and to make that possible, many of the most important decisions about the project are made before you even begin: the evaluation metric, the label, the set of allowable features, and what counts as a training example. These are important and difficult decisions, and they have already been made for you.

Because the tough decisions are settled in advance, an ML prototype project doesn’t give you the chance to show the interviewer that you can make those decisions competently in real life.

Business Intuition

Most data scientists work for companies that want to make money, or at least, organizations that want to make an impact. This means that data science projects need to add value, and when interviewers ask you to describe a project, they are looking for whether you have a good sense of the value that you can add to a business or organization.

ML prototype projects generally do not have very specific goals, which makes it hard to explain their business value. If you try to explain your image classification model to the interviewer, you may be able to explain some techniques, but you will have virtually nothing interesting to say about why your decisions added business value.

The 4 Branches of Data Science

All data science projects fall within one or more of the following four areas:

  • Statistical Inference
  • Causal Inference
  • Machine Learning
  • Descriptive Statistics

Below are short descriptions of each, followed by examples.

Statistical Inference

Goal: measurement, typically optimizing for an unbiased, low-variance estimator.

Examples:

  • Metrics based on samples, e.g. conducting human reviews of samples of content to measure the rate of violative content being seen by users on the platform
  • Hypothesis testing, e.g. testing whether user retention differs across demographics
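The first example above can be sketched in a few lines. This is a hypothetical illustration: the sample counts are invented, and the normal-approximation confidence interval is just one reasonable choice for estimating a rate from a reviewed sample.

```python
import math

# Hypothetical example: human reviewers label a random sample of content
# as violative (1) or not (0); we estimate the platform-wide rate.
# The counts below are made up for illustration.
n_reviewed = 2000
n_violative = 36

p_hat = n_violative / n_reviewed                        # unbiased point estimate
se = math.sqrt(p_hat * (1 - p_hat) / n_reviewed)        # standard error
ci_low, ci_high = p_hat - 1.96 * se, p_hat + 1.96 * se  # 95% normal-approx CI

print(f"violative rate: {p_hat:.3%} (95% CI {ci_low:.3%} to {ci_high:.3%})")
```

The interval width falls as the review sample grows, which is the trade-off a data scientist reasons about when sizing the review queue.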

Causal Inference

Goal: inferring the causal connection between two events

Examples:

  • A/B tests, e.g. does the new UX design outperform the old one on conversion rate?
  • Observational studies, i.e. absent a randomized experiment, can we tell if one event caused another?
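The A/B test example can be made concrete with a two-proportion z-test. The conversion counts below are invented for illustration, and this is only one of several standard ways to analyze such an experiment.

```python
import math

# Hypothetical A/B test: did the new UX design beat the old one on conversion?
# All counts are invented for illustration.
conv_a, n_a = 480, 10_000   # old design
conv_b, n_b = 540, 10_000   # new design

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se                                   # two-proportion z-statistic

# Two-sided p-value from the normal CDF
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(f"lift: {p_b - p_a:.2%}, z = {z:.2f}, p = {p_value:.3f}")
```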

Machine Learning

Goal: learning prediction functions using data and algorithms

Examples:

  • Classifying violative content, e.g. videos, images, comments
  • Predicting the product that a customer will purchase next
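A minimal sketch of "learning a prediction function using data and algorithms": logistic regression trained by gradient descent on synthetic data. Everything here is simulated and purely illustrative.

```python
import numpy as np

# Learn a binary classifier (logistic regression) via gradient descent.
# The data is synthetic: labels come from a noisy linear rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([2.0, -1.0])
y = (X @ true_w + rng.normal(scale=0.5, size=500) > 0).astype(float)

w = np.zeros(2)
for _ in range(500):                     # gradient descent on log-loss
    p = 1 / (1 + np.exp(-(X @ w)))       # predicted probabilities
    w -= 0.1 * (X.T @ (p - y)) / len(y)  # average-gradient step

accuracy = ((1 / (1 + np.exp(-(X @ w))) > 0.5) == y).mean()
print(f"train accuracy: {accuracy:.2f}")
```

The learned weights recover the signs of the true rule, which is the kind of check worth doing even on toy problems.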

Descriptive Statistics

Goal: summarize the data, primarily to tell the story of what is happening, or to help generate hypotheses. This is what is commonly meant by the term “analytics”.

Examples:

  • Creating a North Star metric for teams to optimize towards
  • Calculating the growth rates in different customer segments
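The second example above needs nothing more than a group-wise calculation. The revenue figures and segment names below are invented for illustration.

```python
# Sketch: period-over-period growth rate by customer segment.
# Revenue figures are made up for illustration.
revenue = {
    "enterprise": {"q1": 120_000, "q2": 138_000},
    "smb":        {"q1": 45_000,  "q2": 49_500},
    "consumer":   {"q1": 80_000,  "q2": 76_000},
}

growth = {
    seg: (q["q2"] - q["q1"]) / q["q1"]
    for seg, q in revenue.items()
}
for seg, g in sorted(growth.items(), key=lambda kv: -kv[1]):
    print(f"{seg:>10}: {g:+.1%}")
```

Simple as it is, a table like this often drives more decisions than a model does; that is the point of descriptive statistics.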

Label Engineering vs. Feature Engineering

It’s conventional wisdom that feature engineering is an important part of model development. A less appreciated point is that label engineering can be just as important, and in many settings there is much more room to improve the quality of your labels than of your features.

When the labels for your model are not actually ground truth but a noisy proxy for it, you have noisy labels. This is an extremely common scenario when working with human-generated labels. In general, noisy labels are not a problem for your model as long as you have many of them and they are unbiased measurements of your ground truth.

In practice, noisy labels are rarely unbiased and training solely on the raw labels will cause your model to learn to predict outputs that are systematically different from the ground truth. In this case, counterintuitively, having more training labels can quickly become useless and can even become harmful for predicting the ground truth target.
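The unbiased-vs-biased distinction is easy to see in simulation. The numbers below are invented: zero-mean noise washes out in aggregate, while a systematic shift does not, so a model fit to the shifted labels learns the wrong target.

```python
import numpy as np

# Simulated illustration: unbiased label noise leaves the average label
# centered on the ground truth; biased noise shifts it systematically.
rng = np.random.default_rng(0)
truth = rng.uniform(0, 1, size=100_000)              # ground-truth values

unbiased = truth + rng.normal(0.0, 0.2, truth.size)  # noisy but unbiased
biased = truth + rng.normal(0.1, 0.2, truth.size)    # systematic +0.1 shift

print(f"truth mean:    {truth.mean():.3f}")
print(f"unbiased mean: {unbiased.mean():.3f}")       # close to truth mean
print(f"biased mean:   {biased.mean():.3f}")         # shifted by about 0.1
```

More unbiased-noisy labels help; more biased labels just reinforce the systematic error.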

Label engineering is the process of adjusting raw labels to make them more useful for predicting the ground truth target. There are now dozens of research papers that have proposed solutions to this problem, but I will describe one approach that is very useful in practice.

One method for dealing with noisy labels is to use a label correction model. This is an upstream model to your primary task that adjusts your labels to make them more accurate, or at least less biased. The cleaned labels are then passed to the primary model in the same way the raw labels were. This follows the teacher-student model paradigm, where a teacher model with access to more information or a richer model architecture produces predictions for a student model to learn from.

The label correction model approach requires the collection of a small ground truth set to learn the function that maps noisy labels to clean ones. Collection of this set can be expensive, but in most settings there is an existing process for collecting ground truth labels for the model eval set and this process can be used to collect a label correction model train set.
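Here is a minimal sketch of the label correction idea, with everything simulated: a small ground-truth set is used to fit a correction (here just a linear map, fit by least squares) from noisy labels to clean ones, and the corrected labels are what the primary model would train on. Real correction models are usually richer than a line.

```python
import numpy as np

# Sketch of a label correction model (all data simulated). A small set with
# ground truth is used to learn a map from noisy labels to corrected labels;
# the corrected labels then feed the primary model in place of the raw ones.
rng = np.random.default_rng(1)
n, n_gt = 10_000, 500                    # total examples, small ground-truth set
truth = rng.uniform(0, 1, n)
noisy = 0.7 * truth + 0.2 + rng.normal(0, 0.05, n)   # biased, rescaled labels

# Fit the correction on the ground-truth subset: clean ~ a * noisy + b
A = np.stack([noisy[:n_gt], np.ones(n_gt)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, truth[:n_gt], rcond=None)

corrected = a * noisy + b                # cleaned labels for the full set
print(f"label bias before: {abs((noisy - truth).mean()):.3f}, "
      f"after: {abs((corrected - truth).mean()):.3f}")
```

Note that only 500 ground-truth labels were needed to de-bias all 10,000, which is why reusing the eval-set collection process is usually affordable.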

Tactical vs. Strategic Decisions in Data Science

On the Jocko Podcast, Jocko Willink describes the difference between tactical and strategic battles. According to Jocko, if the battle you are fighting is one that you must win, you are not fighting a tactical battle. As a decision maker, you shouldn’t always be trying to win every scrap and argument. Instead, you should focus on how your actions contribute over time to your ability to defeat the enemy. In other words, always be thinking strategically.

In data science, there are tactical decisions and strategic decisions, usually about which data science methodologies to use. It can be hard to distinguish between what’s a tactical or strategic decision without careful thought, but the investment in doing so can save you a lot of stress. Per Jocko, you can use a simple litmus test for whether you’re in a tactical or strategic battle: ask whether this is a battle you must win to achieve your goal.

Let’s consider an example: you want to measure how well your classification model performs at detecting a rare event, but constructing an eval set with enough examples of the positive class is difficult or expensive. A difficult colleague suggests using importance sampling with the model to gather more examples of the positive class. You think this is unwise, since the positive examples found will, by construction, be the ones your model can find, and the sampling will miss the ones the model is bad at finding. You engage in a disagreement with your difficult colleague.
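The substance of the objection can be shown with a small simulation. Everything below is invented: two subtypes of the rare positive class, where the model scores one subtype high and is blind to the other, so sampling by model score hides the blind spot.

```python
import numpy as np

# Simulated illustration of the concern: sampling candidates using the
# model's own scores over-represents the positives the model already finds,
# so the resulting eval set hides the model's blind spots.
rng = np.random.default_rng(2)

# Two subtypes of the rare positive class: the model scores subtype A high
# and subtype B (its blind spot) low.
subtype = rng.choice(["A", "B"], size=5_000, p=[0.5, 0.5])
score = np.where(subtype == "A",
                 rng.beta(8, 2, subtype.size),   # high model scores
                 rng.beta(2, 8, subtype.size))   # low model scores

# Model-guided sampling: send the highest-scoring items for human review.
reviewed = np.argsort(-score)[:500]
frac_b = (subtype[reviewed] == "B").mean()
print(f"subtype B share overall: 50%, in model-guided sample: {frac_b:.0%}")
```

The model-guided sample is almost entirely subtype A, so any performance measured on it will look better than reality.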

Is this an example of a tactical or strategic battle? 9 times out of 10, this is a tactical battle. Both of you are trying to get to the same outcome: a reliable measure of model performance. The worst case outcome is that you end up collecting more eval data than you needed, i.e. wasted time. It won’t lead to ultimate failure. You can dig in and argue your position for 10 hours, or you can go along with your colleague’s plan; if executing it costs at worst 4 extra hours, you come out 6 hours ahead.

When you can’t help yourself from battling it out over some methodology, ask yourself whether winning that battle is really going to get you closer to your goal. Otherwise, you might be fighting a lot of battles just to satisfy your ego.

Data Science Tactics vs. Strategy

In general, tactics refer to specific actions that can be taken and strategy refers to how those actions are assembled towards the end goal.

In the data science context, data science tactics primarily map to statistical or machine learning methods. Data visualizations, analysis writeups, and code documentation fall under tactics too. Less commonly discussed, persuasion skills can also be considered part of data science tactics. Data science strategy maps to the culture of how those methods are used to solve data science problems.

Data Science Tactics

All data scientists know that they need to train up on statistics and machine learning tactics. These are tools like logistic regression, deep neural nets, sampling theory, hypothesis testing, and mixed effects models. Being competent with these tools is table stakes for data scientists. It’s common for data scientists to want to focus 80% of their time on this area; it’s the fun stuff.

Most data scientists also eventually come to understand that 80% of what others perceive about your work is in the form of documents and presentations. That means that to be an effective data scientist, it’s important to develop your own principles for making docs and presentations that communicate what you want effectively. The principles serve as an algorithm, your own human algorithm for producing solid docs and presentations consistently and quickly.

Some data scientists also eventually realize that there are tactics of persuasion that can make your life easier. Example: suppose you develop a model that outperforms someone else’s. What do you say to them? One way is simply: “hey, my model is better than the one we have, let’s launch it!” That can work among close colleagues but can backfire, especially when you’re new to the team. For example, your model can be killed with responses like “Oh, your model is overly complicated and the gain isn’t that big, so we prefer the simpler one” or “Your model is good, but we’re working on a new version that will subsume what your model does, thanks anyway.”

One issue with the direct approach of saying your model is better is that it can feel threatening to others. A less threatening alternative is something I call the “collateral damage” principle. The idea is to start by recognizing the strength of the current approach (example: the current approach is simple and works pretty well), then to show the collateral damage of that approach (example: here are some misclassifications from the current model), and then to unveil how your approach eliminates or mitigates the weakness. By recognizing the merits of their current approach, you earn a better chance at persuading them.

Data Science As Chess: Don’t Blunder

As many others have during the Covid-19 pandemic, I picked up chess as a new hobby. I had always wanted to be good at chess, but had never understood how to be good at it. The only learning strategies that I knew of were 1) improvement through random play and 2) memorization of openings. 

I think most data scientists follow similar strategies for learning how to be “good” at Data Science. They tend to emphasize 1) improvement through accumulation of experience and 2) memorization of ML or statistical tactics. Both of these learning strategies have their place but today I will argue for a third path: developing principles that are generalizable across situations yet specific enough to be useful. 

To illustrate, I’ll take one of the chess principles I picked up from John Bartholomew and map it to a data science principle that I’ve found to be useful. One of the most important and difficult principles to follow in chess is: “Don’t blunder.” As John shows in his YouTube chess fundamentals videos, you can defeat a ton of players at lower ranks by simply avoiding blundering your own pieces.

In data science, the “Don’t blunder” principle does a lot of work too. It sounds simple and obvious but, as with chess, it’s really hard to do but can help you progress a very long way. To blunder in data science is to present your analysis findings with errors in them. Example: you accidentally overfit a ML model to your eval set and present performance metrics that are better than they will actually be in the wild. Another example: you forget to apply the correct filters to your data to get the right treatment and control populations when calculating experiment effects. Such mistakes are not only personally embarrassing but can damage your credibility. 

It can be really hard (i.e. impossible) to completely avoid blunders from happening, but there are principles that can help. In chess, one way of avoiding blunders is to make sure you don’t have undefended pieces on the board. In data science, you can avoid blunders by sanity checking your work. Example: calculate a simpler metric that should closely approximate you more complicated one and check that they are close. After a while, sanity checking becomes second-nature and the extra work feels less cumbersome. Over time, you will earn the trust of your team and have greater effectiveness as a data scientist.