Take-home messages from The Data Science Handbook


The Data Science Handbook is an awesome book based on interviews with 25 amazing data scientists, including DJ Patil (Chief Data Scientist of the White House from 2015 to 2017). It is a quick way to pick up really useful advice and insights from them.

The question-answer style is easy to follow, and I personally quite enjoyed reading this book. Its advice and insights cover a wide range of aspects of being a data scientist: as a beginner, practitioner, and leader.

In this post, I summarize the take-home messages, organized under the following 5 questions:
  • How to define a data scientist or data science?
  • How to develop and what kinds of skills to have to be a good data scientist?
  • What things to work on as a data scientist?
  • How to choose a good data science position?
  • What will be the future of data science in organizations?


How to define a data scientist or data science?


A data scientist is someone who sits down with a question and gathers some data to answer it, or someone who starts with a data set and asks questions to learn more about it. In addition, they need to take what they've learned and communicate it to people who were not involved in the analytical process. To explain it to someone who doesn't come from the same background, you have to know how they think so you can translate it into something they can understand.

Data science splits into two fields, and I believe a lot of hiring companies are starting to reflect this. Data science is starting to break off into descriptive analytics and predictive analytics.

Part of the job is really use-case discovery.

Good data science can't be 100% theoretical or 100% practical; there has to be a mix.


How to develop and what kinds of skills to have to be a good data scientist?

Presentation, storytelling, communication:
So my argument is that people right now don't know how to make things. And once you make it, you must also be able to tell the story, to create a narrative around why you made it.

Everything in science is about a fully detailed presentation of an idea, but then the opposite is true in business.

I think communication is the difference between the good scientists and the great.

Besides the usual skills, the other thing that's really important is the ability to make, storytell, and create narratives. Also, never losing the feeling of passion and curiosity.

In many cases, when you can't explain something clearly, it's a sign that you haven't thought it through fully yourself.

I'm still working on this (presentation) today. Despite how silly it seems, I totally recommend that my fellow introverts try the videotape technique. Andrew Ng recently shared a great post on how he used a similar technique (deliberate practice) to become a better teacher and presenter.

A core skill that any data scientist should possess is the ability to communicate with the business.

Effective data scientists are the ones who can communicate effectively.

The subtext of "I'm smart" doesn't matter anywhere outside of graduate school. You have to start with: Here's what I found and why you should care about it.

Teamwork:
People make a mistake by forgetting that data science is a team sport.

Working with other people is essential to working with more complex concepts and systems. Rome wasn't built by some guy, and probably not at a weekend hackathon.

They need to be best friends with whoever is running the infrastructure that holds the data, and they need to be able to work with the product and business side as well.

Learning:
Because of the pace at which the world changes, the only way to prepare yourself is by having that dynamic range.

One of the things I tell new data scientists when they get into the organization is that they better be the first ones in the building and the last ones out.

When giving advice on undergraduate coursework... Take as many physics and math classes as you can, but also learn computer science.

So a rudimentary understanding of mathematics and statistics will get you 85% of the way there, while the last 15% will come from basic coding skills. A statistical background and intuition will get you a long way.

Visualization: As soon as someone hands you a dataset or gives you access to a stream, the very first thing to do is find an interesting variable in the dataset and plot it.
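To make that first look concrete, here is a minimal sketch of my own (it is not from the book). It assumes you have no plotting library at hand and simply want a rough text histogram of one numeric variable. The helper name text_histogram and the sample numbers are hypothetical, made up purely for illustration.

```python
from collections import Counter


def text_histogram(values, bins=5, width=40):
    """Return text lines forming a rough histogram of numeric values."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    # Map each value to a bin index; the maximum value goes into the last bin.
    counts = Counter(min(int((v - lo) / span * bins), bins - 1) for v in values)
    peak = max(counts.values())
    lines = []
    for b in range(bins):
        left = lo + span * b / bins
        right = lo + span * (b + 1) / bins
        n = counts.get(b, 0)
        bar = "#" * round(n / peak * width)  # bar length relative to the fullest bin
        lines.append(f"{left:8.2f} to {right:8.2f} | {bar} {n}")
    return lines


# Hypothetical sample: session lengths in minutes (made-up numbers).
session_minutes = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 21, 34]
for line in text_histogram(session_minutes):
    print(line)
```

Even a crude picture like this shows the shape of the variable (here, a long right tail), which is usually enough to suggest the next question to ask. In practice you would probably reach for a real plotting library instead.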

The best advice is just to think hard about what you want your audience to take away from the visualization. 

More data beats better models; better data beats more data; and the 80/20 rule.

Just understanding theory is not enough. You need data sense.

The types of questions they ask are as important, or more, than the methodology behind solving them.

I ate up Khan Academy and Coursera videos.

I took ML with Andrew Ng, Probabilistic Graphical Models with Daphne Koller, Data Visualization with Jeff Heer, and Mining Massive Data Sets with Jure Leskovec.

The lesson is the following: If you take initiative and acquire skills that increment your value, the market is able and willing to reward you.

My approach to tools is: is the cost-benefit of me taking the time to learn the tool going to have a significant impact on getting my work done more efficiently or effectively?

The coding side of it pervades all of this work. The faster you can code, the faster you can implement ideas. If you have a good sense of building systems, you can scale what started out as a research project into something operational.

I think that is powerful and pertinent for people who feel they can't get started doing the things they want because they haven't checked all the technical boxes. It seems like another way to do it is to actually go into what the problem is that you want to solve.

You should approach things in the T-shaped model, where you accumulate a great deal of breadth and a concentration in one skill that gives you depth.

I'm a data scientist and I'm also an engineer. At the end of the day I want to solve problems. So if I can solve a problem today better than yesterday, then that's a success.

Characteristics:
I think like any scientist the thing that drives me the most and really compels me to work late into the night until the sun comes up is curiosity.

I think one of the most important things is to learn to be curious.

Good data science is more about the questions you pose of the data rather than data munging and analysis.

For example, it took me a long time to grasp that improving the efficiency of a business process might actually be perceived as threatening to someone's job, and the natural reaction of that person might be to consciously or unconsciously undermine any progress. So you have to develop empathy for people involved in business processes, and create solutions that help those people transition to higher-value work.

A better mindset is to think of yourself as the business owner who is responsible for changing how the business works. That's a whole different mindset.

The challenge for a lot of people is the ability to turn these insights into value. Not all interesting problems can produce insights, and not all interesting insights can inspire action that causes change.

If we are not changing the product and changing it in a way that delivers better outcomes to users, then we're not doing our job.

You have to be energetic and work really hard, but not get discouraged just because you don't know everything.

I always tell students that I think the most useful skill you learn in grad school is how to teach yourself stuff and how to figure out things that you don't know. That's one thing. The second thing is to be stubborn and beat your head on a problem until you make progress. It's really those two things.

Work on a hard problem for a long time and figure out how to push through and not be frustrated when something doesn't work.






What things to work on as a data scientist?


Only work on simple things; simple things become hard, hard things become intractable.

You've got an infinite list of questions you can look into -- how do you pick the ones that are going to have the biggest impact?

You need to ask yourself these questions: What am I working on? How will I know when it's done? What does it impact?

Assuming everything works perfectly and everyone in the world uses our solution, how does it change human behavior?

The really hard problems are ones for which we don't yet have good, well-defined definitions. Or we recognize the problem but it's not obvious how to find the relevant data that goes with it.

Don't fall in love with your own ideas. Market feedback is the only thing that matters.

There are too many challenges with data. We know the saying: All models are wrong, but some models are useful.

Whenever possible, ask fundamental questions like, who cares?

Before you build a model, you need to know what data sources are available to you within the company and what techniques are available to you, and you have to define the problem appropriately and engineer the features.

There are so many steps before you get to modeling that are crucial. Can I ever ask of a Kaggle competition: is this the competition this company should actually be having?

A big component of data science is questioning why you are doing what you are doing -- choosing problems to solve while rejecting other problems that are irrelevant to the business.

They are simple problems. They are simple but not easy. Losing weight is simple but not easy. Most industrial problems are simple but not easy.

A good solution applied with vigor now is better than a perfect solution applied ten minutes later.

Prioritize things that would have the most impact for the company.

Fast, interpretable, and reliable instead of theoretically perfect.


How to choose a good data science position?


The most exciting data opportunity is when you have the flexibility to collect data.

The way I think about it is, wherever you go, make sure you're around the best people in the world. 

Startups are the perfect place where you have a product that's gathering its own data.

Go to one where you think: This is somewhere I can learn from for a year. I think I will be happy here for about that long.

Talk with people who can recognize hustle and grit, and not necessarily those who are looking to match a pattern drawn from your previous experience. Often, these kinds of people run startups.

Whenever I talk to other data teams, I always ask where the data team is on the organization chart, and that will tell you a huge amount about the implicit skillsets they expect you to have. Software engineering is a non-trivial part of our job as data scientists.


What will be the future of data science in organizations?


Culture is also a big part of the practice. I think data culture will continue to grow, even among people who aren't data scientists.

So we've invested a lot in the structure of our data warehouse and the tools used for accessing it so that it's intuitive to people with less experience working with data.

The terms come in and out of style, but if you are good at understanding problems and communicating with people, and answering their questions with data, the need for you in particular will never go away. You will never be automated. You will have plenty of job security.

Data scientists and data teams do a variety of things beyond just business intelligence. They also do algorithmic engineering, build new features, collect new data sets, and open up potential futures for the project or business. I don't think data scientists will be out of work anytime soon.


