How to Become a Good Data Scientist?

Written by Aayush Saini · 9 minute read · May 26, 2020 . Data Science, 66

Data scientist has been called the sexiest job of the 21st century. Research published by the McKinsey Global Institute back in 2013 projected that there would be approximately 425,000 to 475,000 unfilled data analytics positions in North America by 2018. The take-home message is that every industry in which companies collect and use data for competitive advantage will need a constant stream of analytic talent.

In the business analysis domain, Data Science is a progressive step that combines the methodologies and practices of computer science, modeling, statistics, analytics, and mathematics to drive business growth. Data Science involves using automated methods to analyze vast amounts of data and extract insights from them. Many Data Scientists began their careers as statisticians or data analysts, but as big data began to develop and flourish, the roles surrounding data analysis evolved as well. To become a Data Scientist, the foremost skills you need are a solid grounding in statistics and an analytical aptitude, along with the ability to think in business, or rather economic, terms.

What exactly is a data scientist?

A data scientist is a professional who can work with large amounts of data and extract analytical insights. They communicate their findings to stakeholders (e.g., senior leadership, management, and clients). Companies can then make better-informed decisions to drive growth and profitability, although what that looks like depends on the industry.

Why is it so hard to become a data scientist?

Data science is a hybrid of many disciplines. It draws on subject areas such as math (e.g., statistics and calculus), database management, data visualization, programming/software engineering, and domain knowledge. In my opinion, this may be the primary reason why people interested in an entry-level data science career often feel completely lost. Most people don’t know where to start because, depending on their educational background and work experience, they may be missing one or several of these areas entirely.

However, the good news is that you don’t need to worry too much about it. These days, the problem is the opposite one: there are simply too many resources to pick from, so you don’t necessarily know which one will work best for you. This article looks at how to become a data scientist from three perspectives: where to learn, what to learn, and how to learn.

Section 1: Where to learn data science?

There are three major pathways to a data science education: Massive Open Online Courses (MOOCs), a university degree or certificate, and boot camp training.

Each option has pros and cons with regard to cost, flexibility, and program length. The best way to make the right decision is to ask yourself what matters most to you. For example, you may have the luxury of time and want to minimize the cost. Or you may want to land a job as soon as possible, even if the initial investment is high.

Section 2: What to learn in data science?

There is certainly a lot to learn as a data scientist. Let’s look at the data science education pathway as five major steps.

  • Step 1: Catching up on the basic math behind statistics, calculus, and linear algebra is a good start. A data scientist needs it to understand how different algorithms work, and it builds intuition about how to tweak or modify algorithms to solve unique business problems. Knowing statistics also helps you convert findings from experimental design tests (e.g., A/B testing) into key business metrics.
  • Step 2: Data scientists must be familiar with a toolset for working with data in various environments. The toolset combines SQL, the command line, coding, and cloud tools. For extracting and manipulating data in relational databases, SQL is the fundamental language and is used almost everywhere. For general programming (e.g., functions, loops, and iteration), Python is a good choice since it has a rich ecosystem of libraries (e.g., for visualization and machine learning). For an additional boost, knowing the command line provides extra benefits, especially for running jobs within cloud environments.
  • Step 3: This is the best time to pick up a language for building your data science foundation. For commercial software, you have a choice between SAS and SPSS. Among open-source platforms, many people choose either R or Python. From here, you can pick up the concepts of data munging/wrangling (e.g., importing data, aggregation, pivoting, and missing value treatment). After this comes the most fun part: learning about your data through data visualization (e.g., bar charts, histograms, pie charts, heat maps, and map visualizations).
  • Step 4: You can choose between the applied machine learning pathway and the big data ecosystem pathway, and you can always come back to master the other one later. Applied machine learning covers building a machine learning model end to end (i.e., from data exploration to model deployment). For big data, Section 3 lists where to obtain that education (books and courses).
  • Step 5: This is the most crucial step for showcasing your potential as a data scientist candidate. Once you are familiar with doing data science, you must have a project portfolio. A project portfolio is your best opportunity to show what you have done through learning and work experience. Start with data collection (i.e., picking or scraping data on your own), come up with a hypothesis, perform exploratory analysis (i.e., extract some interesting insights), build your machine learning model(s), and finally share your findings through write-ups or presentations.
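
As a small illustration of the statistics mentioned in Step 1, here is a minimal sketch of how an A/B test result might be turned into a business metric (the lift in conversion rate) together with a significance check. It uses a standard two-proportion z-test and only the Python standard library. The function name and the visitor and conversion counts are made up for illustration; they do not come from any real experiment.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B vs. A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: 5,000 visitors per variant.
lift, z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}")
```

The lift is the number a business stakeholder usually cares about, while the p-value indicates how likely a difference at least this large would be if the two variants actually performed the same.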

Section 3: How to learn data science?

In this section, you are going to learn how to pick the best resources for becoming a data scientist.

For SQL education, the DAT201x course offered by Microsoft on edX is one of the best choices. The course covers data types, filtering, joins, aggregation (group by), window functions, and advanced concepts (e.g., stored procedures). It gives you plenty of practice using a well-known sample data warehouse (i.e., AdventureWorks). Alternatively, you can use the Mode Analytics platform to practice and enhance your SQL skills. The best thing about Mode Analytics is that you don’t need a SQL server or a sample data warehouse installed on your machine. All you need is a free account and an Internet connection.
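
To make the SQL concepts listed above more concrete, here is a minimal sketch that runs a join, a filter, a GROUP BY aggregation, and a window function against an in-memory SQLite database using Python's built-in sqlite3 module. The tables and rows are invented for illustration and are not the AdventureWorks sample. Note also that SQLite's dialect differs from the Transact-SQL taught in the course, and the window function assumes SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and rows, created only for this example.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West'), (3, 'East');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 75.0), (4, 3, 25.0);
""")

# Join + filter + aggregation: total order amount per region.
totals = conn.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.amount > 0
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()

# Window function: rank each order by amount within its customer.
ranked = conn.execute("""
    SELECT customer_id, amount,
           RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS rnk
    FROM orders
    ORDER BY customer_id, rnk
""").fetchall()

print(totals)
print(ranked)
```

The key difference to notice is that GROUP BY collapses rows into one row per group, whereas the window function keeps every order row and adds a rank computed within each customer's partition.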

For machine learning education, two options are recommended. The first is well known to data science practitioners: Andrew Ng’s Machine Learning course on Coursera. It is the best course for understanding the basic concepts and for tips on how to tune machine learning models. For coding experience, a highly recommended book is Python Machine Learning (2nd Edition) by Sebastian Raschka. The book covers the basic mechanism of each algorithm, with many coding examples and supplemental references (e.g., research articles). The best thing about it is that the author walks through the implementation of each machine learning algorithm line by line with thorough explanations. This is important because, as many data scientists point out, you should be able to write an algorithm from scratch and know how to implement it. These days, there are many complex problems that you cannot solve directly with existing Python libraries.
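
To give a flavor of what writing an algorithm from scratch can look like, here is a minimal sketch of simple linear regression fitted by gradient descent in plain Python. It is not taken from the course or the book mentioned above; the function name, learning rate, epoch count, and toy data are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every training point at the current w and b.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should approach w=2, b=1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
```

Writing the update rule by hand like this makes it easier to see what a library call is doing for you, and what the learning rate and number of epochs actually control.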

Here is a full list of resources you can reference for learning each building block of a data science education.

1. Math:

  • Khan Academy Math Track
  • MIT Open Courseware: linear algebra and calculus
  • Udacity: Intro and Inferential Statistics

2. Data Science Toolkit:

  • SQL
    • edX: DAT201x, Querying with Transact-SQL (*)
    • Mode Analytics: SQL Tutorial (Intro to Advanced)
    • WiseOwl: SQL Tutorial (Intro to Advanced) (*)
  • Command Line
    • Book: Data Science at the Command Line
  • Python Coding
    • Udemy: Complete Python Bootcamp
    • Book: Learn Python the Hard Way (3rd Edition)
    • Book: Automate the Boring Stuff with Python

3. Machine Learning:

  • Coursera: Machine Learning by Andrew Ng (*)
  • Coursera: Applied Machine Learning (U Michigan)
  • Harvard: CS109, Intro to Data Science (*)
  • Book: Python Machine Learning (2nd Edition) by Sebastian Raschka (*)
  • Book: Python Machine Learning by Example
  • Book: Intro to Machine Learning with Python

4. Big Data:

  • Hadoop
    • Book: Hadoop: The Definitive Guide
    • Udacity: Intro to Hadoop and MapReduce
    • IBM: Hadoop Fundamentals Learning Badge
  • Spark
    • edX: UC Berkeley Spark Courses (CS105, CS120)
    • Datacamp: Intro to PySpark, Building Recommendation Engine in PySpark
    • Books: Learning PySpark; Advanced Analytics with Spark

Most Important: Ask for Help and Network

As a newcomer to this field, you won’t necessarily have a mentor who can guide your learning. You therefore need a place to ask the data science community for opinions and feedback. Several forums and websites, such as Stack Overflow and Quora, let you post a question and receive replies.

Another tip is related to networking. It applies to anyone who is looking for new opportunities and wants to build connections. There are many local meetups and big conferences related to data science. Try to attend as many events as you can and introduce yourself (i.e., your motivation, objectives, and passion). If you have the opportunity to connect with speakers and event organizers, work to establish meaningful connections with them. Also seek opportunities to present your project portfolio through whatever medium is available, whether at a local meetup or in a video webcast such as a remote data science office hour. From this experience, you will learn from your mistakes and improve from one presentation to the next. Being able to deliver an effective presentation and clearly communicate analytical insights adds a lot of value to a data scientist candidate.

We hope this article has been enjoyable and useful, and that it adds value to your journey toward becoming a data scientist.

 

Thanks for Reading