How to Make Progress As A Data Scientist

This post is about what it takes to make progress for your team/company as a data scientist.

Making progress in any professional field, including in data science, is not the same thing as completing tasks. Anyone who has spent at least a couple of years working has probably experienced the feeling of being really busy and yet always just barely keeping up. I would argue that by definition, this state of affairs indicates progress has not been made.

To make progress, one must be able to solve problems in a permanent way. That means that old problems are no longer problems, not for you, and ideally not for anyone else either.

One reason why the tech industry is so different is that software is particularly focused on making progress. Code changes not only fix problems for one person in one situation but also have the potential to solve problems permanently for everyone. When Data Science is done well, it inherits this potential for making progress.

Data Science Management Principle 1: Do What’s Right For Your People

Many first-time Data Science managers have the impression that proving their technical leadership is the most important aspect of the job. This view is wrong.

The most important thing to focus on as a Data Science manager is this: do what’s right for your people. This begins before someone is even hired for your team. You must do what’s right for candidates too, e.g. by being honest with them about the job and the team, by not exaggerating the positive aspects of the job, and by showing your honest self to them. This requires a lot of vulnerability and confidence, but will pay off in the long run.

How to Scope a New Data Science Hire’s Role

  1. Make a list of things you would like to do, but haven’t had time to do. Don’t just include things you didn’t want to do, because this new hire probably won’t want to do those things either
  2. Talk to teammates about what they have long wanted to see done
  3. Look ahead to the needs of the next year—where is the company heading?
  4. Identify the projects that require deep investment but could produce step-function increases in results

How to Write a Data Scientist Perf Self-Assessment

Self-assessments during Perf cycles are necessary but not fun. Here is a breakdown of steps for data scientists to write effective self-assessments.

Step 0: Keep track of accomplishments and artifacts during the cycle

This is the most important step in preparing a self-assessment, and it begins as soon as the cycle starts. Throughout a cycle, I like to keep a running list of accomplishments with links to artifacts like change lists, docs, and colabs. Having the content to show your productivity through the cycle is critical, and this list is an easy way to keep track of that content. When perf comes, nobody wants to spend hours chasing down that colab or doc from 6 months ago.
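
A running list like this can live in a doc or spreadsheet, but it can also be a few lines of code. The sketch below is one hypothetical way to keep it in Python; the `Accomplishment` class, its fields, and the example URLs are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Accomplishment:
    """One entry in a running perf log (hypothetical structure)."""
    when: date
    summary: str
    artifacts: list[str] = field(default_factory=list)  # links to CLs, docs, colabs


def render_log(entries: list[Accomplishment]) -> str:
    """Format entries oldest-first so they can be pasted into a self-assessment."""
    lines = []
    for entry in sorted(entries, key=lambda e: e.when):
        lines.append(f"{entry.when.isoformat()}: {entry.summary}")
        for link in entry.artifacts:
            lines.append(f"  - {link}")
    return "\n".join(lines)


# Example entries; the summaries and URLs are made up for illustration.
log = [
    Accomplishment(date(2024, 3, 4), "Launched retrained model", ["https://example.com/cl/123"]),
    Accomplishment(date(2024, 1, 15), "Wrote metric proposal", ["https://example.com/doc/abc"]),
]
print(render_log(log))
```

Adding an entry takes seconds when the work is fresh, which is the point: the log is written during the cycle, not reconstructed at the end of it.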

Step 1: Summarize the project context and your contributions

Explain what the project is, why it’s important, and your specific contributions.

Step 2: Leadership

Indicate how your contributions exemplify leadership. Here are some specific actions that show leadership:

  • Identifying a critical problem
  • Bringing clarity to a complex problem
  • Owning the design of a solution
  • Advocating for the solution to other parts of the organization

Step 3: Difficulty

Difficulty is about the problem, not necessarily about the solution. Problems are difficult when they are 1) ambiguous and/or 2) technically challenging.

Step 4: Impact

Impact can either be measurable or immeasurable. Always include measurements of impact if it can be measured. Otherwise, emphasize how your work moved the project forward or changed how decision-makers think about a problem.

Applying Occam’s Razor to Developing Business and Model Metrics

Metrics are used to align an organization or team towards a shared goal and to track progress towards that goal. A successful metric is one that helps to inspire and coordinate efforts across many people. This means that, at a minimum, a metric should be easily understood by people across the organization. Understandability becomes even more important as the number of people or the seniority of people who rely on it increases.

Consequently, when defining a new metric, an extremely effective tool for metric selection is Occam’s razor. When there are multiple possible metrics that could be used to track the same notion of business or model performance, the simplest metric is generally the most useful. This tends to be true for a few reasons:

  • Simple metrics are usually pretty “accurate”. That is, they often achieve a good balance between bias and variance with respect to the true notion of performance
  • Simple metrics feel transparent and thus, more trustworthy
  • Simple metrics are robust: modifying them or complicating them slightly tends not to make much of a difference
  • Simple metrics tend to have clear statistical properties

A common mistake, especially among new data scientists, is to favor complicated metrics that may have theoretical advantages or motivations but are impenetrable to people outside data science. Complications are best hidden under the hood of a neural net, where they are expected to be.

Operationally, “simple” means that to the extent possible, the metric should have the following properties:

  1. Uses only counting or basic arithmetic
  2. Has no parameters that need to be chosen
  3. Does not rely on an underlying statistical or theoretical model for its interpretation
  4. Has an understandable name or quick description

There are other desirable properties, but simplicity is the most effective filter.
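
To make the four properties concrete, here is a minimal sketch contrasting a metric that satisfies them with one that does not. The choice of precision and F-beta as the two examples is an assumption made for illustration, as are the function names and the toy data.

```python
def precision(predicted: list[bool], actual: list[bool]) -> float:
    """Share of flagged items that were truly positive: two counts and one division."""
    flagged = sum(predicted)
    correct = sum(p and a for p, a in zip(predicted, actual))
    return correct / flagged if flagged else 0.0


def f_beta(predicted: list[bool], actual: list[bool], beta: float) -> float:
    """F-beta score: someone has to choose beta, so it fails property 2."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


# Toy data, made up for illustration.
predicted = [True, True, True, False, False]
actual = [True, False, True, True, True]

print(precision(predicted, actual))        # one number, no choices to defend
print(f_beta(predicted, actual, beta=1))   # changes with beta
print(f_beta(predicted, actual, beta=2))
```

The first function can be described in one sentence ("of the items we flagged, what fraction were right?"). The second needs an explanation of what beta is and why a particular value was picked before anyone can interpret the number.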

Data Science and The Booth at the End

A while back there was a show called “The Booth at the End”. The premise of the show is that there is a man who sits in a booth at a diner who can grant you anything you want if you complete the tasks that he requires. The show is about what people want, why they want it, what they’ll do to get it, and what they discover about themselves and what they truly want after attempting to complete their task.

I realized from watching this show that as a data scientist I often feel like this man in the booth. People come to me asking for something, and I often spend a lot of time deciphering with them what they really want and why they want it. I then offer them a task, sometimes a difficult one, that makes them question how badly they really wanted what they asked for. Like the man in the booth, it’s not really up to me whether they get what they want—my job is simply to make the choices clear.

Data Science Presentation Tips

The First Two Slides Are 80% of Your Talk

Summarized from https://www.brown.edu/Research/Shapiro/pdfs/applied_micro_slides.pdf

  1. Assume that your audience does not care about your topic. Spend the first 2 slides convincing them otherwise.
  2. Present the key findings and just enough of the methodology that the results don’t feel like magic.
  3. Use the talk to tell your story. If your talk had to fit in two slides, that’s what your first two slides are for. You can even adopt the “once upon a time” and “one day, as our hero was going about their normal day…” tones in your talk.

How to Practice

  1. Write down your entire talk. Go through it over and over again until it is simple and there isn’t anything left that sounds clunky or that is too subtle to remember. This is especially useful if you have to give your talk multiple times.
  2. Practice frequently at first and then at longer and longer intervals between practices.
  3. Use a safe person for a dry run to clear the obvious land mines away.

Overcoming Fear

  1. Imagine that someone, a safe person to you, will be in the room with you at your talk. Visualize them being at the talk instead of all the other people who cause you anxiety.
  2. Remind yourself that once the time arrives for your presentation, you will think to yourself, “so this is the way it is”, and then the moment passes, and life goes on.
  3. Present your work as if it were done by someone else, not you. This will help you detach from your topic and remember that it is the work that is in front of the audience, not you or your weaknesses or anything personal about you.
  4. Practice the physiological sigh: double inhale through the nose and long exhale through the mouth, repeat for 20-30 seconds at least.
  5. Lateral eye movements have been shown to help manage anxiety via amygdala deactivation.

How to Stay Sharp in Data Science: Establish a Training Regimen

Data science is an enormous and growing field. Over the course of a data science career, it becomes hard to stay sharp in all areas. Everyone has their own reasons for not training, but here are some of the excuses that I have told myself:

  • Life is busy enough. I don’t have extra time to train
  • I’m doing pretty well in my job, so I must be good to go
  • If I want to get ahead, it’s not going to be from more technical skills, but from leadership skills

Why Training is Not Optional

If you do not stay sharp, you expose yourself to two risks:

  1. Burnout
  2. Obsolescence

To avoid either of these two outcomes, a data scientist must periodically invest in training and retraining in their skills.

How to Develop a Data Science Training Regimen

  1. Identify the skills
  2. Identify the resources
  3. Block off time

Identify the Skills

To begin training, a useful starting point is to develop a “reading list” of skills or topics to cover. In data science, at a high level your reading list will generally cover statistics, machine learning, programming, and maybe some random topics like optimization. In a later blog post, I’ll share my own list.

Identify the Resources

Your “reading list” obviously doesn’t have to be just text sources. I often prefer to cover the same topic with multiple types of media: academic articles, Wikipedia, YouTube, GitHub. Getting the same topic from different angles helps me feel more deeply engaged on the topic.

Block Off Time

Arguably the hardest part is finding time to actually do your training. I personally like to wake up before my family and do some reading or watch some YouTube lectures.

How to Present a Technical Data Science Topic to an Exec-level Audience

Data scientists are often called on to present their work to non-technical exec-level audiences. For example, you may be asked to present at a weekly VP meeting that hosts speakers to give deep dives about new projects in the company.

Assuming you already have a technical presentation prepared, how do you turn it into something that can be presented to an exec-level audience?

1. Add an Executive Summary Slide

An executive summary slide describes:

  • Basic context
  • Problem being solved
  • Proposed Solution, with costs and benefits

If it doesn’t all fit into one slide, put the context on its own slide and the other points on a separate executive summary slide.

In your talk, you can say that “first, I’ll preview the key points, which are on this slide, and then I’ll go into detail about exactly what we’ve found and why it makes sense.”

2. Expand on the Context

An exec audience needs to know a certain minimal set of facts to understand what you’re doing and why it’s important. These facts describe the current state of the world: specifically the objects and their operation.

Objects need to be defined: “Cobra is a machine learning model that predicts fraudulent transactions…”

Operations need to be described concisely and may link other objects together: “Cobra is trained from human-generated examples collected from a team called Bigfoot.”

3. Explain the Problem

Given the context, what is the problem? Two possibilities:

  1. Objects: there may be objects at play that shouldn’t be, or there may need to be additional objects that aren’t yet there
  2. Operation: how the objects perform may be suboptimal

After identifying the problem, emphasize why it’s a significant problem. Why should the execs care? Translate the technical problem into the big picture business problem.

4. Simplify Your Formulas

Complicated formulas should be avoided—they can take a long time to explain to an exec audience and can lead to a lot of confusion. Replace a complicated formula with an extremely simple one or with a diagram.

If you’ve expressed the problem in terms of the objects and their operation, then you can also express your solution as a change in the set of objects or how they operate.

5. Show the Costs and Benefits

The easiest way to convince leaders of something is to show them quantitative proof that your proposal is a good idea. If the benefits hugely outweigh the costs, then supporting you becomes a no-brainer.
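
As a back-of-the-envelope illustration, the comparison can be reduced to two numbers: net benefit and benefit-to-cost ratio. The sketch below uses a hypothetical helper named `cost_benefit` and made-up dollar figures; none of them come from a real project.

```python
def cost_benefit(
    annual_benefit: float,
    one_time_cost: float,
    annual_cost: float,
    years: int = 1,
) -> tuple[float, float]:
    """Return (net benefit, benefit-to-cost ratio) over the given horizon."""
    total_benefit = annual_benefit * years
    total_cost = one_time_cost + annual_cost * years
    if total_cost <= 0:
        raise ValueError("total cost must be positive to compute a ratio")
    return total_benefit - total_cost, total_benefit / total_cost


# Hypothetical proposal: all figures below are assumptions for illustration.
net, ratio = cost_benefit(
    annual_benefit=500_000,  # e.g. estimated fraud losses prevented per year
    one_time_cost=60_000,    # e.g. engineering time to build the change
    annual_cost=40_000,      # e.g. compute and labeling to keep it running
)
print(f"Net benefit: ${net:,.0f}, benefit-to-cost ratio: {ratio:.1f}x")
```

On a slide, a single line such as "5x return in year one" usually lands better than the full table of inputs, which can go in an appendix.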

6. Reiterate the Main Takeaway

End with “if there’s one thing I’d like you to walk away with after this presentation, it is that…”

Finally, when giving the talk, don’t feel the need to improvise. Stick to your slides, and if in practicing you find yourself needing to improvise, update the slides to fill in what you will need to say.

How to Choose Great Data Science Projects: Focus on the Obvious

There are many factors that could be used to help decide which data science projects to prioritize. In this post, I share a simple principle that I have seen work many times over:

Focus on the obvious. The ideas with the best chance of making a big impact are those that address obvious problems and/or propose obviously better solutions.

Examples

Example of an obviously good idea: suppose you are working on improving a production machine learning model. If the prod model was trained using noisy labels and you now have access to higher quality labels, training a new model that incorporates the new clean labels is obviously a good idea. The idea is obvious because it’s a common sense thing to do and has an obviously good chance of succeeding: 9 times out of 10, there is a way to improve the original model using the clean labels.

Example of a not so obviously good idea: suppose you are still trying to improve that machine learning model. You want to try out some new techniques involving ensembling multiple models that have been trained separately but from different random initial weights. This idea is not an obviously good idea: it’s complicated, involves way more compute, makes the production setup more fragile, and has very little chance of making a big impact. Maybe in some rare circumstances it can help, where the loss surface has a lot of

Advantages from Focusing on the Obvious

  • Obvious ideas work: they tend to be more robust and have a higher chance of succeeding
  • Easy to explain: it will take much less effort to persuade others that your idea is good
  • Prevents self-delusion: the human brain has an amazing ability to delude itself into believing almost anything, especially that an idea it creates is a good one. Focusing on the obvious prevents your own fanciful thinking from hijacking your time