Data Science Curriculum

Data Science Boot Camp

Prerequisites: Students should have some previous exposure to basic probability, statistics, linear regression, and matrix and vector notation, as well as some familiarity with the Python programming language.

Monday, August 12, 2019

AM Session: Intro to Python for Data Analytics with hands-on lab

Python is a commonly used scripting language that has become increasingly popular as a data analysis tool.  This half-day session will focus on how Python can be used efficiently to develop data analysis applications.  There will be special focus on the NumPy and SciPy packages, which are commonly used for numerical, statistical, and scientific computing.  Students should have some familiarity with the Python language.

Instructor: Chris Jermaine, Ph.D., Professor, Computer Science, Rice University

______________________________________________________________

PM Session: The Cloud, AWS, Hadoop and Spark with hands-on lab

We will introduce the idea of cloud computing.  We will focus on tools that can be used for distributed analysis of very large data sets on a set of machines such as those in a compute cloud.  We will cover MapReduce, a popular programming paradigm for processing very large data sets using a “shared nothing” cluster of loosely-coupled nodes, and Spark, a popular open-source Big Data analysis tool that provides the ability to analyze data using MapReduce, as well as using other computational paradigms.  

Students will gain experience using Spark to analyze data stored in the Hadoop distributed file system, using Amazon’s EC2 public cloud infrastructure. Students should have a reasonable familiarity with the Python programming language.
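As a rough illustration of the MapReduce paradigm described above, the sketch below counts words in plain Python on a single machine. The function names (map_phase, shuffle_phase, reduce_phase) and the sample documents are made up for this illustration; this is not Spark's API, and a real job would distribute each phase across the nodes of a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here we do it serially.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three tiny "documents".
documents = ["spark runs on hadoop", "hadoop stores data", "spark analyzes data"]
counts = word_count(documents)
print(counts)
```

Because the map and reduce functions share no state, each can run independently on any node, which is what makes the approach suit a "shared nothing" cluster.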

Instructor: Chris Jermaine, Ph.D., Professor, Computer Science, Rice University

______________________________________________________________

Tuesday, August 13, 2019

AM Session: The Cloud, AWS, Hadoop and Spark with hands-on lab

Continued from previous day.

Instructor: Chris Jermaine, Ph.D., Professor, Computer Science, Rice University

______________________________________________________________

PM Session: Intro to Modern Regression & Cross Validation with hands-on lab

This module will cover various regression techniques using the scikit-learn and pandas Python libraries. We will focus on central concepts such as training and test error, the bias-variance trade-off, model selection, and cross-validation.
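To make the idea of cross-validation concrete, here is a minimal, dependency-free sketch of k-fold cross-validation for a one-variable least-squares fit. The helper names (fit_line, mse, k_fold_cv) and the synthetic data are assumptions made for this illustration; the session itself works with scikit-learn and pandas, which provide their own tools for this.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted line on (xs, ys)."""
    a, b = model
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_cv(xs, ys, k=5, seed=0):
    """Average held-out MSE over k folds (an estimate of test error)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the training folds only, then score on the held-out fold.
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mse(model, [xs[i] for i in held_out], [ys[i] for i in held_out]))
    return sum(scores) / k


# Synthetic data (assumed for the example): y = 2 + 3x plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]
cv_error = k_fold_cv(xs, ys, k=5)
print(f"5-fold CV estimate of test MSE: {cv_error:.3f}")
```

Since every point is held out exactly once, the averaged score reflects error on data the model did not see during fitting, unlike the training error.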

Instructor: Natalie Berestovsky, Ph.D., Anadarko Petroleum/ Rice Data Science Conference Program Committee

______________________________________________________________

Wednesday, August 14, 2019

AM Session: Intro to Modern Regression & Cross Validation with hands-on lab

Continued from previous day.

PM Session: Introduction to Unsupervised Learning with hands-on lab

In this module, we will present a number of unsupervised learning techniques for finding patterns and associations in Big Data. These include dimension reduction and pattern recognition techniques such as principal components analysis, as well as clustering techniques such as k-means and hierarchical clustering.

Additionally, we will present methods for determining the fit of models and quality of results via cross-validation and consensus methods. The main emphasis will be on applying methods to analyze real data sets and interpreting results. The techniques discussed will be demonstrated in Python.
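As a small taste of the clustering material, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The function names, the six sample points, and the choice of initial centers are all assumptions made for this illustration; in practice a library implementation with a more careful initialization would normally be used.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [centroid(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical data: two visually separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, initial_centers=[points[0], points[3]])
print(centers)
```

The result depends on the initial centers, which is one reason the module pairs clustering with methods for assessing the quality and stability of results.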

This course assumes some previous exposure to basic probability, statistics, linear regression, matrix and vector notation, as well as some familiarity with Python or another programming language.

Instructor: Natalie Berestovsky, Ph.D., Anadarko Petroleum/ Rice Data Science Conference Program Committee

______________________________________________________________

Thursday, August 15, 2019

PM Session: Introduction to Supervised Learning with hands-on lab

This module serves as an introduction to solving big data analytics problems using supervised statistical machine learning methods. We will cover the theory behind effective methods for regression and classification including regularized linear and logistic regression, generative models (discriminant analysis, Naive Bayes), large margin classifiers (support vector machines), adaptive basis function models (decision trees and neural networks), and ensemble methods (bagging and boosting).

We will also cover the basics of deep learning including generative networks. The emphasis will be on formulating real-world modeling and prediction tasks as supervised machine learning problems, evaluating the quality of models using cross-validation, and comparing different algorithms in terms of practical efficacy and scalability.

Students will learn to fit and evaluate such models using tools in Python, with examples drawn from image recognition, text classification, and making recommendations. This module assumes a background in basic probability and statistics, calculus, and linear algebra, as well as some experience in Python programming.

Instructor: Devika Subramanian, Ph.D., Professor, Computer Science, Rice University

______________________________________________________________

Friday, August 16, 2019

AM+PM Session: Introduction to Supervised Learning with hands-on lab

Continued from previous day.

Instructor: Devika Subramanian, Ph.D., Professor, Computer Science, Rice University

 
