Python for Data Science
Course Objective
This course teaches the analytical mindset and programming skills relevant to data science. Students will continue to build on the basics of the Python programming language and will work with a set of Python tools for data science, including the Jupyter (IPython) Notebook, NumPy, Pandas, Matplotlib, and Scikit-learn. Students will learn skills that cover the phases of exploratory data analysis:
- Importing data
- Cleaning and transforming data
- Algorithmic thinking
- Grouping and aggregation
- Visualization
- Statistical modeling/prediction
- Communication of results
The course will utilize data from a wide range of sources and will culminate with a final project and presentation.
Course Outline
Part 1: The Pandas DataFrame Library
- Pandas & DataFrames • Pandas Basics • Interaction with DataFrames
- Importing Data • Importing data from a list or dictionary • Importing data from a flat file • Importing data from a database • Importing data from a JSON file
- Data Exploration • Summary statistics with describe() • Unique counts • Basic Pandas charting to see the distribution of data
- Data Cleaning • Grouping and replacing values • Data types • String cleaning • Handling null values • Removing duplicates • Renaming columns • Dropping columns • In-line lambda functions • Lambdas with functions for complicated logic
- Data Filtering • loc, iloc, and slicing functions • Categorical and distinct filters using boolean indexing • Numeric and range filters • Date filters • Multi-level filters
- Data Joining • Inner Joins • Left Joins • Difference between join and merge functions • Concatenating (Unions)
- Aggregating Data • Rolling data up to a higher level (equivalent of SQL group by clause) • Multi-level aggregations • Understanding the reset_index function
- Outputting Data • Exporting as CSV • Exporting as Excel • Export options • Exporting to a database
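As a rough illustration of how the Part 1 topics fit together, the sketch below walks a tiny dataset through importing, cleaning, filtering, joining, aggregating, and exporting. The data, the column names (order_id, customer_id, amount, region), and the in-memory CSV are all hypothetical and stand in for whatever sources the course actually uses.

```python
import io

import pandas as pd

# Hypothetical data, made up for illustration only.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,10,25.0\n"
    "2,11,\n"
    "3,10,40.0\n"
    "3,10,40.0\n"
    "4,12,15.5\n"
)
customers = pd.DataFrame(
    {"customer_id": [10, 11, 12], "Region": ["east", "west", "east"]}
)

# Importing data from a flat file (an in-memory stand-in for a CSV on disk).
orders = pd.read_csv(orders_csv)

# Data cleaning: remove duplicates, handle null values, rename a column.
orders = orders.drop_duplicates()
orders["amount"] = orders["amount"].fillna(0.0)
customers = customers.rename(columns={"Region": "region"})

# Data filtering: a numeric filter using boolean indexing with loc.
large_orders = orders.loc[orders["amount"] >= 20.0]

# Data joining: a left join keeps every order, matched or not.
joined = orders.merge(customers, on="customer_id", how="left")

# Aggregating: roll up to region level (like SQL GROUP BY); reset_index
# turns the group labels back into an ordinary column.
totals = joined.groupby("region")["amount"].sum().reset_index()

# Outputting: to_csv returns text when no path is given; pass a file path
# to write to disk instead.
csv_text = totals.to_csv(index=False)
print(totals)
```

Filling missing amounts with 0.0 is only one possible choice; depending on the data, dropping the rows or imputing a value may be more appropriate.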
Part 2: Python and Data Science Applications
- Machine learning I: Classification • Introduction to the package scikit-learn • Classification for data exploration using decision trees • Classification for prediction • Measuring classification performance
- Machine learning II: Regression • Regression for prediction • Linear • Logistic • Difference between regression and classification • Measuring regression performance
- Machine learning III: Clustering (if time allows at the end) • What is clustering and how is it used? • Clustering algorithms: K-means • Describing clusters and goodness of fit
- Advanced Data Cleaning (if time allows) • Pivoting data • Adding information from aggregated data back down to row-level items • Using where functions • Using map functions • Data formats for different output use cases such as BI reporting, machine learning, relational databases, data warehouses, etc. • Scheduling and automation using the command line
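To give a sense of the classification topics in Part 2, here is a minimal sketch that fits a decision tree with scikit-learn and measures its performance on held-out data. The choice of the bundled iris dataset, the 25% test split, and the max_depth value are assumptions made for illustration, not settings taken from the course.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset that ships with scikit-learn (an assumption;
# the course may use different data).
X, y = load_iris(return_X_y=True)

# Hold out part of the data so performance is measured on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree stays easy to inspect, which suits data exploration.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Classification for prediction, then a simple performance measure.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

Accuracy is only one way to measure classification performance; on imbalanced data, measures such as precision and recall are often more informative.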
Duration
3 to 5