Data Science Curriculum

Prerequisites: Students should have some previous exposure to basic probability, statistics, linear regression, and matrix and vector notation, as well as some familiarity with the R and Python programming languages.

Monday, August 6
AM Session: Introduction to R for Data Analytics with hands-on lab
This module will introduce R, a popular open-source statistical computing environment. The module begins with a brief introduction to the basic syntax and usage of R, followed by data import and export, data manipulation and transformation, and data visualization. The module will conclude with an introduction to R Markdown and the use of R for communicating the results of data analyses. Throughout the module, data analysis best practices and use of modern “tidyverse” tools will be emphasized. No prior familiarity with R will be assumed; use of the RStudio IDE will be encouraged.
Instructor: Michael Weylandt, Graduate Student, Statistics

PM Session: Intro to Modern Regression & Cross-Validation with hands-on lab
This module will cover least squares regression as well as regularized regression methods such as ridge and lasso.  Through these techniques, we will focus on central concepts like training and test errors, the bias-variance trade-off, model selection and cross-validation.

Recommended Reading: James et al. (2013) An Introduction to Statistical Learning. Springer Texts in Statistics. Available for free download at www.statlearning.com.

Recommended Programming Language: R.  Available for free download from www.r-project.org/ or www.rstudio.com/
Instructor: Genevera Allen, Associate Professor, Statistics
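
The regression session above pairs ridge regularization with cross-validation for model selection. As a rough illustration only (not course material), the sketch below picks a ridge penalty by 5-fold cross-validation for a one-feature model without an intercept. The synthetic data, the penalty grid, and the helper names `ridge_slope` and `cross_validate` are all assumptions made for this example. The session itself recommends R; Python is used here only to keep the example short and dependency-free.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate for a one-feature, no-intercept model."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mean_squared_error(xs, ys, slope):
    """Average squared prediction error of the fitted line on (xs, ys)."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, lam, k=5):
    """Average held-out error over k folds for a given penalty lam."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Every k-th observation, starting at `fold`, is held out as test data.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        slope = ridge_slope(train_x, train_y, lam)
        fold_errors.append(mean_squared_error(test_x, test_y, slope))
    return sum(fold_errors) / k


# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(50)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

# Hypothetical penalty grid; the penalty with the lowest CV error is selected.
penalties = [0.0, 0.1, 1.0, 10.0, 100.0]
cv_errors = {lam: cross_validate(xs, ys, lam) for lam in penalties}
best_lam = min(cv_errors, key=cv_errors.get)
print("selected penalty:", best_lam)
```

Larger penalties shrink the fitted slope toward zero, trading variance for bias; the cross-validated error is what arbitrates between them.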

Tuesday, August 7
Introduction to Unsupervised Learning with hands-on lab
In this module, we will present a number of unsupervised learning techniques for finding patterns and associations in Big Data. These include dimension reduction and pattern recognition techniques such as principal components analysis and non-negative matrix factorization, as well as clustering techniques such as k-means clustering, hierarchical clustering and biclustering. Additionally, we will present methods for determining the fit of models and quality of results via cross-validation and consensus methods. The main emphasis will be on applying methods to analyze real data sets and interpreting results. The techniques discussed will be demonstrated in R. This course assumes some previous exposure to basic probability, statistics, linear regression, and matrix and vector notation, as well as some familiarity with R or another programming language.

Recommended Reading: James et al. (2013) An Introduction to Statistical Learning. Springer Texts in Statistics. Available for free download at www.statlearning.com.

Recommended Programming Language: R.  Available for free download from www.r-project.org/ or www.rstudio.com/
Instructor: Genevera Allen, Associate Professor, Statistics

Wednesday, August 8
AM Session: Intro to Python for Data Analytics with hands-on lab
Python is a commonly used scripting language that has become increasingly popular as a data analysis tool. This half-day session will focus on how Python can be used efficiently to develop data analysis applications. There will be special focus on the NumPy and SciPy packages, which are commonly used for numerical, statistical, and scientific computing. Students should have some familiarity with the Python language.
Instructor: Chris Jermaine, Professor, Rice University

PM Session: The Cloud, AWS, Hadoop and Spark with hands-on lab
We will introduce the idea of cloud computing, focusing on tools for distributed analysis of very large data sets across a set of machines such as those in a compute cloud. We will cover MapReduce, a popular programming paradigm for processing very large data sets using a “shared nothing” cluster of loosely coupled nodes, and Spark, a popular open-source Big Data analysis tool that supports MapReduce as well as other computational paradigms. Students will gain experience using Spark to analyze data stored in the Hadoop Distributed File System (HDFS) on Amazon’s EC2 public cloud infrastructure. Students should have a reasonable familiarity with the Python programming language.
Instructor: Chris Jermaine, Professor, Rice University
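
For readers new to the MapReduce paradigm named in the session above, the following is a minimal single-machine sketch of its three stages (map, shuffle, reduce) applied to word counting. It is an illustration only, not course material: it uses plain Python rather than Hadoop or Spark, and the function names and sample documents are assumptions made for this example. On a real cluster, the map and reduce calls would run in parallel on separate nodes and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of documents."""
    pairs = []
    for doc in documents:
        pairs.extend(map_phase(doc))
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())


# Hypothetical input: each string stands in for one document in a large corpus.
documents = ["big data on big clusters", "data on the cloud"]
counts = word_count(documents)
print(counts)
```

Because each map call touches only its own document and each reduce call touches only its own key, neither stage needs shared state, which is what lets the work be spread over a "shared nothing" cluster.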

Thursday, August 9
AM Session: The Cloud, AWS, Hadoop and Spark with hands-on lab
Continued from previous day.
Instructor: Chris Jermaine, Professor, Rice University

PM Session: Introduction to Supervised Learning with hands-on lab
This module serves as an introduction to solving big data analytics problems using supervised statistical machine learning methods. We will cover the theory behind simple but effective methods for regression and classification, including regularized linear and logistic regression, generative models (discriminant analysis), large margin classifiers (support vector machines), adaptive basis function models (decision trees and neural networks), and ensemble methods (bagging and boosting). The emphasis will be on formulating real-world modeling and prediction tasks as supervised machine learning problems, evaluating the quality of models using cross-validation, and comparing different algorithms in terms of practical efficacy and scalability. Students will learn to fit and evaluate such models using tools in Python, with examples drawn from image recognition, text classification, and analysis of microarray/proteomic data. This module assumes a background in basic probability, statistics, and linear algebra, as well as some programming experience.
Instructor: Devika Subramanian, Professor, Computer Science

Friday, August 10
AM Session: Introduction to Supervised Learning with hands-on lab
Continued from previous day.
Instructor: Devika Subramanian, Professor, Computer Science

PM Session: Special Topics Overview (1-4pm)
Presenter:  Various Speakers TBA

 
