TAs: Benwei Shi (email) | Office hours: Tuesdays and Thursdays 2-3pm in MEB 3423.

Alok Jadhav (email) | Office hours: Mondays 1-3pm in MEB 3423.

Fall 2019 | Mondays, Wednesdays 3:00 pm - 4:20 pm

WEB 2230

Catalog number: CS 3190 01

This class will be an introduction to computational data analysis, focusing on the mathematical foundations. The goal will be to carefully develop and explore several core topics that form the backbone of modern data analysis, including Machine Learning, Data Mining, Artificial Intelligence, and Visualization. This will include some background in probability and linear algebra, and then various topics including Bayes Rule and its connection to inference, linear regression and its polynomial and high-dimensional extensions, principal component analysis and dimensionality reduction, as well as classification and clustering. We will also focus on modern PAC (probably approximately correct) and cross-validation models for algorithm evaluation.

These topics are often very briefly covered at the end of a probability or linear algebra class, and then are often assumed knowledge in advanced data mining or machine learning classes. This class will fill that gap. The planned pace will be closer to CS 3130 or Math 2270 than the 5000-level advanced data analysis courses.

We will use Python in the class to demonstrate and explore basic concepts, but programming will not be the main focus.

This is a draft of a book I started writing in Fall 2016 for this course.

More outside

The official pre-requisites are CS 2100, CS 2420, and Math 2270. These are to ensure a certain very basic mathematical maturity (CS 2100), a basic understanding of how to store and manipulate data with some efficiency (CS 2420), and basics of linear algebra and high dimensions (Math 2270).

We have as a co-requisite CS 3130 (or Math 3070) to ensure some familiarity with probability.

A few lectures will be devoted to reviewing linear algebra and probability, but at a fast pace and with a focus on the data interpretation of these domains.

This class will soon become a pre-requisite for CS 5350 (Machine Learning) and CS 5140 (Data Mining), as part of a new Data Science pipeline.

Date | Chapter | Topic | Assignment |
---|---|---|---|
Mon 8.19 | | Class Overview | |
Wed 8.21 | Ch 1 - 1.2 | Probability Review : Sample Space, Random Variables, Independence | |
Mon 8.26 | Ch 1.3 - 1.6 | Probability Review : PDFs, CDFs, Expectation, Variance, Joint and Marginal Distributions | HW 1 out |
Wed 8.28 | Ch 1.7 | Bayes Rule | |
Mon 9.02 | | | |
Wed 9.04 | Ch 1.8 | Bayes Rule : Bayesian Reasoning | |
Mon 9.09 | Ch 2.1 | Convergence : Central Limit Theorem and Estimation | |
Wed 9.11 | Ch 2.2 - 2.3 | Convergence : PAC Algorithms and Concentration of Measure | HW 1 due |
Mon 9.16 | Ch 3.1 - 3.2 | Linear Algebra Review : Vectors, Matrices, Multiplication and Scaling | Quiz 1 |
Wed 9.18 | Ch 3.3 - 3.5 | Linear Algebra Review : Norms, Linear Independence, Rank | HW 2 out |
Mon 9.23 | Ch 3.6 - 3.8 | Linear Algebra Review : Inverse, Orthogonality, numpy | |
Wed 9.25 | Ch 5.1 | Linear Regression : dependent, independent variables | |
Mon 9.30 | Ch 5.2 - 5.3 | Linear Regression : multiple regression, polynomial regression | HW 2 due |
Wed 10.02 | Ch 5 | Linear Regression : mini review + slack | Quiz 2 |
Mon 10.07 | | | |
Wed 10.09 | | | |
Mon 10.14 | Ch 5.4 | Linear Regression : overfitting and cross-validation | HW 3 out |
Wed 10.16 | Ch 6.1 | Gradient Descent : functions, minimum, maximum, convexity | |
Mon 10.21 | Ch 6.2 - 6.3 | Gradient Descent : gradients and algorithmic variants | |
Wed 10.23 | Ch 6.4 | Gradient Descent : fitting models to data and stochastic gradient descent | |
Mon 10.28 | Ch 7.1 - 7.2 | PCA : SVD | |
Wed 10.30 | Ch 7.2 - 7.3 | PCA : rank-k approximation and eigenvalues | HW 3 due |
Mon 11.04 | Ch 7.4 | PCA : power method | HW 4 out |
Wed 11.06 | Ch 7.5 - 7.6 | PCA : centering, MDS, and dimensionality reduction | |
Mon 11.11 | Ch 8.1 | Clustering : Voronoi Diagrams | Quiz 3 |
Wed 11.13 | Ch 8.3 | Clustering : k-means | |
Mon 11.18 | Ch 8.4, 8.7 | Clustering : EM, Mixture of Gaussians, Mean-Shift | |
Wed 11.20 | Ch 9.1 | Classification : Linear prediction | HW 4 due |
Mon 11.25 | Ch 9.2 | Classification : Perceptron Algorithm | HW 5 out |
Wed 11.27 | Ch 9.3 | Classification : Kernels and SVMs | |
Mon 12.02 | Ch 9.4 - 9.5 | Classification : Neural Nets | Quiz 4 |
Wed 12.04 | | In-class review | |
Fri 12.06 | | | HW 5 due |
Thu 12.12 | | FINAL EXAM (3:30pm - 5:30pm) | (practice) |

The homeworks will usually consist of an analytical problem set, and sometimes light programming exercises in Python. When Python is used, we will typically work through examples in class first.

This class has the following collaboration policy:

For assignments, students may discuss answers with anyone, including problem approach, proofs, and code. But all students must write their own code, proofs, and write-ups. If you collaborated with another student on homeworks to the extent that you expect your answers may start to look similar, you must explicitly explain the extent of the collaboration on the homework. Students whose homeworks appear too similar, and who did not explain the collaboration, will get a 0 on that assignment.

For quizzes and the final exam, talking to anyone (other than instructors/TAs) during the examination period is not allowed and will result in a 0 on that test or quiz.

Here are a few books that cover some of the material, but at a more advanced level:

Understanding ML | Foundations of Data Science | Introduction to Statistical Learning

Here is a list of nice resources I believe may be useful, with relevant parts at roughly the right level for this course: