According to Harvard Business Review, “Data scientist is a high-ranking professional with the training and curiosity to make discoveries in the world of Big Data”. In a world where 2.5 quintillion bytes of data are produced every day, data scientist has been called the sexiest job of the 21st century.

Despite the many opportunities in this field, the best way to kick-start your career is through data science projects. Projects let you apply what you have learned in practice and give you work to showcase on your CV. Nowadays, companies evaluate a candidate’s potential by their work and put less emphasis on certifications.

There are many data science competitions on platforms like Kaggle, Analytics Vidhya and the International Data Analysis Olympiad (IDAO) that give exposure to solving real-life problems. These platforms are an opportunity to push your limits, and they encourage creativity among the best and brightest in a variety of data-related fields. The more time you spend practicing, the better you become!

Now the question that arises is “Which dataset should a newbie begin with?”

Here I am assuming two things:

  1. You possess some knowledge of machine learning
  2. You know how to use machine learning libraries/packages in R, Python, Java, etc.

For a beginner, choose data sets that are fairly easy to work with and don’t require complex data science techniques. You can solve these using regression or classification algorithms. The following data sets are suitable for beginners.

1. Amazon Employee Access Challenge competition

Since you have basic machine learning/data mining knowledge, the Amazon Employee
Access Challenge competition hosted on Kaggle is a good starting point. It is a good
place to start for the following reasons:

  • No missing data
  • Randomly labelled data
  • All categorical features
  • Size of data is relatively small

2. Iris Data Set

If you are totally new to data science, this is your starting line. The objective is to predict the
class of the flower based on the available attributes. This is one of the easiest and most versatile
data sets for pattern recognition. The Iris data set is also a good choice for the following reasons:

  • Easy to understand
  • Small data set with 4 feature columns and 150 rows
  • No missing values
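
To make the prediction task concrete, here is a minimal sketch of a 1-nearest-neighbour classifier in plain Python. The six training rows are only a small illustrative sample in the Iris format, and the names TRAINING_ROWS and predict_species are my own, not part of any library. In practice you would load all 150 rows and use a machine learning package.

```python
import math

# A small illustrative sample in the Iris format:
# (sepal length, sepal width, petal length, petal width) in cm, then the class.
TRAINING_ROWS = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def euclidean(a, b):
    """Straight-line distance between two measurement tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_species(measurements, training_rows=TRAINING_ROWS):
    """Return the class of the training row closest to the given measurements."""
    _, label = min(training_rows, key=lambda row: euclidean(measurements, row[0]))
    return label


if __name__ == "__main__":
    # A flower with short, narrow petals sits closest to the setosa rows.
    print(predict_species((5.0, 3.4, 1.5, 0.2)))
```

With only six rows this is a toy, but the same idea (compare a new flower to labelled examples) is what a library classifier does at scale.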

3. Titanic: Machine Learning from Disaster

This is the best first challenge for diving into ML competitions and familiarizing yourself
with how the Kaggle platform works. The task is simple: use machine learning
to create a model that predicts which passengers survived the Titanic shipwreck. With
the Titanic data set you will get exposure to the following:

  • Basic feature engineering
  • Exploratory data analysis
  • Well-labelled data, which means you can ask interesting questions, explore the
    data to test your hypotheses and come up with new features to better classify
    the data
  • Interval and categorical features
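
As a rough illustration of basic feature engineering, the sketch below derives three new features from a passenger record. I am assuming the Kaggle column names Name, SibSp and Parch. The passenger shown is made up, and the function name add_features and the derived names Title, FamilySize and IsAlone are my own choices.

```python
import re

# Names are assumed to look like "Surname, Title. Given names".
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def add_features(passenger):
    """Return a copy of the record with Title, FamilySize and IsAlone added."""
    enriched = dict(passenger)
    match = TITLE_PATTERN.search(passenger["Name"])
    enriched["Title"] = match.group(1).strip() if match else "Unknown"
    # Siblings/spouses plus parents/children aboard, plus the passenger.
    enriched["FamilySize"] = passenger["SibSp"] + passenger["Parch"] + 1
    enriched["IsAlone"] = int(enriched["FamilySize"] == 1)
    return enriched


if __name__ == "__main__":
    # A made-up passenger record, not a row from the real data set.
    example = {"Name": "Doe, Mrs. Jane", "SibSp": 1, "Parch": 2}
    print(add_features(example))
```

Features like these are hypotheses: you add them, check whether survival rates differ across their values, and keep the ones that help the model.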

4. Loan Prediction Dataset

The insurance and banking industries are among the largest users of analytics and data
science methods. This competition was launched by Analytics Vidhya. It is a classification
problem where the objective is to predict whether a loan will be approved or not. You will
learn what real data sets look like and what challenges they pose. You will also learn
how to judge how impactful a variable is and how to deal with such variables.

5. Forest Cover Type Prediction

This is a multiclass classification problem from a competition hosted on Kaggle. The
objective is to predict the forest cover type. The data consists of 56
columns, and the training set contains both continuous and categorical data types. This
dataset gives good exposure to multiclass classification.

6. Bike Sharing Demand

According to Global Market Insights, the bike-sharing market is predicted to grow by 15%
between 2019 and 2025. Such insights are derived using data science techniques. The
dataset on bike-sharing demand is available on Kaggle, where the objective is to forecast the
demand for a city bike-share system. This is a regression problem, so the dataset
gives you good exposure to regression.
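
To show what a regression problem looks like in code, here is a minimal sketch of ordinary least squares with a single feature, written in plain Python. The temperature and rental numbers are hypothetical, not taken from the Kaggle data, and fit_line and predict are names I made up for the example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted demand for a given feature value."""
    return slope * x + intercept


if __name__ == "__main__":
    # Hypothetical observations: temperature in Celsius and rentals in that hour.
    temperatures = [10, 14, 18, 22, 26, 30]
    rentals = [110, 170, 205, 270, 300, 370]
    slope, intercept = fit_line(temperatures, rentals)
    print(round(predict(20, slope, intercept)))
```

The real data set has more than one feature, so you would move on to multiple regression or tree-based models, but the goal stays the same: predict a number rather than a class.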

7. Time Series Analysis Dataset

Time series analysis is an important area of machine learning and one of the most commonly used
techniques in data science. It has a wide range of applications, including statistics, earthquake
prediction and weather forecasting. Kaggle hosts several good time series data
sets to practice on.

8. House Prices: Advanced Regression Techniques

With some experience in R or Python and machine learning, you can apply advanced
regression techniques like random forest and gradient boosting to the house price prediction
data set provided by Kaggle. The objective is to predict SalePrice. You will enjoy doing
creative feature engineering.

9. Mall Customer Segmentation Data

This is one of the easiest datasets, created for learning the customer
segmentation concept. You can start by implementing an unsupervised learning algorithm
such as K-means clustering. The objective is to understand customer behavior and group
customers who have a high likelihood of converting.
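
Here is a minimal sketch of K-means clustering in plain Python, to give a feel for how the grouping works. The six customers are hypothetical, and the two features (annual income and spending score) are an assumption about what you would cluster on. The function names are my own.

```python
import random

# Hypothetical customers: (annual income in $1000s, spending score from 1 to 100).
CUSTOMERS = [(15, 80), (16, 77), (17, 82), (80, 15), (82, 18), (85, 12)]


def squared_distance(a, b):
    """Squared Euclidean distance, enough for finding the nearest centroid."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster points into k groups and return (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(max_iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest_index(point, centroids)].append(point)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(values) / len(cluster) for values in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable, so we have converged
        centroids = new_centroids
    labels = [nearest_index(point, centroids) for point in points]
    return centroids, labels


if __name__ == "__main__":
    centroids, labels = kmeans(CUSTOMERS, k=2)
    print(centroids, labels)
```

On this toy data the low-income, high-spending customers end up in one group and the high-income, low-spending customers in the other. On the real data set you would also need to choose k, which is part of the learning exercise.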

10. Convolutional NN (Dogs vs. Cats)

This dataset is a good starting point for CNNs (convolutional neural networks). In this
competition, you will write an algorithm to classify whether an image contains a dog or a
cat. This is easy for humans, but a bit more difficult for computers.

Out of the 10 projects listed above, you can start by finding one that matches your skills. I
hope this article has given you some direction for choosing a few good projects as a
beginner in data science. Projects add value to your resume.

Happy learning!
