How to be an efficient data scientist

Bijula Ratheesh
3 min read · Oct 11, 2021

This article may sound like a cliché to anyone striving to be a data scientist who has read numerous articles with similar titles. What I want to share here is simply my experience of who could be the right fit for a data scientist role.

Data Science is a vast and ever-growing field. It is difficult for one person to know everything, but if you are clear about the few basics I have listed here, you can increase the odds of being a good data scientist.

Passion for Data

Data Science is truly about data. The more you explore the data, the more insight you gain about it and the better you can explain it to others. Understand that you can draw most of your conclusions from this data analysis phase itself. What you need is a thorough understanding of:

  1. Data types - essentially numeric (discrete, continuous) and categorical, and how those types are handled in the language you use for coding (Python, Scala, PySpark).
  2. Univariate analysis - analyzing the distribution of each feature in your data independently. Most redundant features can be removed at this step, and the important features identified as well. It also surfaces missing values and outliers, whose treatment can be decided based on the feature's distribution and business judgement.
  3. Bivariate analysis - understanding the relationship of each identified important feature with your target variable. Depending on the type of the target variable, use Pearson's correlation, Spearman's correlation, or ANCOVA (see the sketch after this list).
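As an illustration, here is a minimal sketch of these univariate and bivariate checks using pandas and SciPy; the file name and columns (`income`, `spend`) are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical dataset: numeric features plus a numeric target.
df = pd.read_csv("customers.csv")  # assumed file

# Univariate analysis: distribution, missing values, outliers per feature.
print(df["income"].describe())        # spread and central tendency
print(df["income"].isna().mean())     # share of missing values
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(len(outliers), "IQR outliers")

# Bivariate analysis: association of a feature with the target.
clean = df[["income", "spend"]].dropna()
print(stats.pearsonr(clean["income"], clean["spend"]))   # linear relationship
print(stats.spearmanr(clean["income"], clean["spend"]))  # monotonic relationship
```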

Class Imbalance

It is a must to understand how to treat imbalanced data. In the real world, most problems have an imbalanced target class.

It is important to understand the math behind under-sampling, over-sampling, SMOTE, and GANs, and apply them according to the problem statement.
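As a sketch of one of these techniques, here is SMOTE oversampling with the imbalanced-learn library (assuming it is installed; the dataset is synthetic for illustration):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Hypothetical imbalanced dataset: roughly 5% positive class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# SMOTE synthesizes new minority samples by interpolating between
# a minority point and its nearest minority-class neighbours.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```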

Train, test and validation sets

The first thing that comes to mind is the 80/20 split. This may be useful in some competitions, but selecting the train, test, and validation sets is a highly important step in a data science project.

The ground rule in selecting the split is that train and test should come from a similar distribution. Always check the distributions of the train and test splits, for example with the Kolmogorov-Smirnov test, and use k-fold cross-validation.
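For instance, a minimal check with SciPy's two-sample Kolmogorov-Smirnov test (the file and the `income` column are hypothetical):

```python
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")  # assumed file
train, test = train_test_split(df, test_size=0.2, random_state=42)

# Compare the distribution of one feature across the two splits.
stat, p_value = ks_2samp(train["income"], test["income"])
# A small p-value suggests train and test differ on this feature,
# so consider re-shuffling or a stratified split.
print(f"KS statistic={stat:.3f}, p-value={p_value:.3f}")
```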

There are many instances where the validation set turns out to have a different distribution, especially in computer vision problems; in that case, try shuffling and re-sampling the data, using data augmentation, etc.

However, in time-series/panel data, I have not seen this happen too often, unless triggered by an unexpected event.

Feature Engineering

In multiple interviews I have conducted, I have seen candidates create as many features as their imagination permits. It is great that we can think of so many, but can you explain these features and their relationship with your target variable? If yes, then those are the features we need for the model.

The fewer the features, the better the model, is what I believe; it keeps the model simple and self-explanatory.

Log transforms and ratios have a lot of advantages over using the raw data.
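A short sketch of both ideas with pandas and NumPy (the columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [30_000, 55_000, 1_200_000],  # heavily right-skewed
    "spend": [5_000, 9_000, 80_000],
})

# A log transform compresses a skewed scale; log1p handles zeros safely.
df["log_income"] = np.log1p(df["income"])

# A ratio is often more explainable than either raw column alone.
df["spend_to_income"] = df["spend"] / df["income"]
print(df)
```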

Choosing model algorithm

We have a variety of models available for problems such as regression, classification, and clustering, ranging from simple traditional methods such as regression to transformers. Choose a model that:

  1. Is simple and can be explained to any non-technical person. Always remember that the end consumer of your product is the business.
  2. Does not compromise on performance; find a trade-off between model performance and interpretability.

A thorough understanding of the model architecture is really important. It is impossible to know all of them, but a few models should be very clear to you. Pick the ones you have used in your projects and understand their architecture well.
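As a toy illustration of that trade-off, here is a comparison of a simple, explainable model against a more complex ensemble on the same data (the dataset and model choices are hypothetical, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# A simple model you can explain coefficient by coefficient...
simple = LogisticRegression(max_iter=5000)
# ...versus a more complex, less interpretable ensemble.
boosted = GradientBoostingClassifier(random_state=42)

for name, model in [("logistic", simple), ("boosting", boosted)]:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold CV accuracy
    print(name, scores.mean().round(3))
```

If the complex model only buys a marginal gain, the explainable one is usually the better choice for the business.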

Model Performance

There are multiple metrics to consider when evaluating model performance.

  1. Error analysis - MAPE, MAE, weighted MAPE, F1, precision, recall, accuracy, etc. Carefully select the one suitable for your project (see the sketch after this list).
  2. Data drift - understanding the shift in the underlying data relative to historical data.
  3. Covariate and concept drift - how far the distributions of the independent features, and their relationship with the target, have changed.
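As an illustration of the error-analysis metrics, here is how a few of them can be computed with scikit-learn (version 0.24 or later for MAPE; the predictions are made-up values):

```python
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score,
    mean_absolute_error, mean_absolute_percentage_error,
)

# Hypothetical classification results.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

# Hypothetical regression results.
y_true_r = [100.0, 220.0, 150.0]
y_pred_r = [110.0, 200.0, 160.0]
print("MAE:", mean_absolute_error(y_true_r, y_pred_r))
print("MAPE:", mean_absolute_percentage_error(y_true_r, y_pred_r))
```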

I hope this helps you in your journey towards becoming an efficient data scientist. Please do add comments if any valuable point is missing.
