Course Objective

This course teaches the analytical mindset and programming skills relevant to data science. Students will continue to polish the basics of the Python programming language and will work with a set of Python tools for data science, including the Jupyter (IPython) Notebook, NumPy, Pandas, Matplotlib, and scikit-learn. The skills taught cover the various phases of exploratory data analysis:

  • Importing data
  • Cleaning and transforming data
  • Algorithmic thinking
  • Grouping and aggregation
  • Visualization
  • Statistical modeling/prediction
  • Communication of results

The course will use data from a wide range of sources and will culminate in a final project and presentation.

Course Outline

Part 1: The Pandas DataFrame Library

  • Pandas & DataFrames
      • Pandas basics
      • Interaction with DataFrames
  • Importing Data
      • Importing data from a list or dictionary
      • Importing data from a flat file
      • Importing data from a database
      • Importing data from a JSON file
  • Data Exploration
      • describe()
      • Unique counts
      • Basic Pandas charting to see the distribution of data
  • Data Cleaning
      • Grouping and replacing values
      • Data types
      • String cleaning
      • Handling null values
      • Removing duplicates
      • Renaming columns
      • Dropping columns
      • In-line lambda functions
      • Lambdas with functions for complicated logic
  • Data Filtering
      • loc, iloc, and slicing functions
      • Categorical and distinct filters using boolean indexing
      • Numeric and range filters
      • Date filters
      • Multi-level filters
  • Data Joining
      • Inner joins
      • Left joins
      • Difference between the join and merge functions
      • Concatenating (unions)
  • Aggregating Data
      • Rolling data up to a higher level (equivalent of the SQL GROUP BY clause)
      • Multi-level aggregations
      • Understanding the reset_index function
  • Outputting Data
      • Exporting as CSV
      • Exporting as Excel
      • Export options
      • Exporting to a database
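
To give a flavor of the Part 1 topics, here is a minimal sketch that strings several of them together: importing from a dictionary, exploring, cleaning, filtering, joining, aggregating, and exporting. The data, the column names, and the variable names are all invented for illustration and are not course material.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration only.
orders = pd.DataFrame({
    "Order ID": [1, 2, 2, 3, 4],
    "region": [" east", "West ", "West ", "east", None],
    "amount": [100.0, 250.0, 250.0, None, 75.0],
})
managers = pd.DataFrame({
    "region": ["east", "west"],
    "manager": ["Ana", "Raj"],
})

# Data exploration: summary statistics and counts per distinct value.
summary = orders.describe()
region_counts = orders["region"].value_counts()

# Data cleaning: rename a column, drop exact duplicate rows, tidy strings
# with an in-line lambda, and handle null values.
clean = (
    orders
    .rename(columns={"Order ID": "order_id"})
    .drop_duplicates()
    .assign(region=lambda df: df["region"].str.strip().str.lower())
    .dropna(subset=["region"])   # discard rows with no region
    .fillna({"amount": 0.0})     # treat a missing amount as zero
)

# Data filtering: boolean indexing with loc (a numeric filter).
large = clean.loc[clean["amount"] >= 100, ["order_id", "amount"]]

# Data joining: a left join keeps every row of the cleaned orders.
joined = clean.merge(managers, on="region", how="left")

# Aggregating: roll up to one row per region (like SQL GROUP BY), then
# reset_index turns the group key back into an ordinary column.
totals = joined.groupby("region")["amount"].sum().reset_index()

# Outputting: with no path given, to_csv returns the CSV text as a string.
csv_text = totals.to_csv(index=False)
print(csv_text)
```

Passing a file path to to_csv would write the result to disk instead of returning it as a string.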

Part 2: Python and Data Science Applications

  • Machine learning I: Classification
      • Introduction to the package scikit-learn
      • Classification for data exploration using decision trees
      • Classification for prediction
      • Measuring classification performance
  • Machine learning II: Regression
      • Regression for prediction: linear and logistic
      • Difference between regression and classification
      • Measuring regression performance
  • Machine learning III: Clustering (if time allows at the end)
      • What is clustering and how is it used?
      • Clustering algorithms: k-means
      • Describing clusters and goodness of fit
  • Advanced Data Cleaning (if time allows)
      • Pivoting data
      • Adding information from aggregated data back down to row-level items
      • Using where functions
      • Using map functions
      • Data formats for different output use cases such as BI reporting, machine learning, relational databases, data warehouses, etc.
      • Scheduling and automation using the command line
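
The classification topics in Part 2 can be sketched roughly as follows: fit a decision tree with scikit-learn, predict on held-out data, and measure performance. The choice of the bundled iris dataset, the 25% test split, and the tree depth are assumptions made for this illustration, not settings taken from the course.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so performance is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree stays easy to inspect, which suits data exploration.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Classification for prediction, then two common performance measures.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(matrix)
```

The confusion matrix shows which classes are mistaken for which, a detail that a single accuracy figure hides.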


Duration 3 to 5
