Jeff M. Phillips

This ``book'' consists of a series of lecture notes I have built for an introductory class on the Mathematical Foundations of Data Analysis at the University of Utah. It is designed for a sophomore or junior undergraduate who has had some programming and some basic exposure to probability and linear algebra.

These notes also overlap some with another ``book'' on Data Mining: Algorithms, Geometry, and Probability, which builds on some of these topics for more advanced students.

You can either download the complete book or individual chapters below.

0.

This book is meant for a self-contained course introducing many basic principals and techniques needed for modern data analysis. In particular, this course was designed as preparation for students planning to take rigorous Machine Learning and Data Mining courses. With this goal in mind, it both introduces key conceptual tools which are often helpful to see multiple times, and the most basic techniques to begin a basic familiarity with the backbone of modern data analysis.

Although we touch on Bayesian Inference, we do not cover most of classical statistics: neither frequentist hypothesis testing nor the similar Bayesian perspectives. Most universities have well-developed classes on these topics which, while also very useful, provide a complementary view of data analysis. Classical statistical modeling approaches are often essential when a practitioner needs to provide some modeling assumptions to harness maximum power from limited data. But in the era of big data this is not always necessary. Rather, the topics in this course provide tools for using some of the data to help choose the model.

We also survey basic techniques in supervised (regression and classification) and unsupervised (principal component analysis and clustering) learning. We make an effort to keep the presentation and concepts on these topics simple. We stick to those which attempt to minimize the sum of squared errors, and do not go into regularization. We stick to classic but magical algorithms like Lloyd's algorithm for k-means, the power method for eigenvectors, and the perceptron for linear classification. For many students (especially those in a computer science program), these are the first numerical algorithms they will have encountered.

But as important as writing the code itself is a discussion on when and when not to use techniques from the immense toolbox available. This is one of the main ongoing questions a data scientist must ask. And so, the text attempts to introduce the readers to this ongoing discussion.

1.

Probability is a critical tool for modern data analysis. It arises in dealing with uncertainty, in randomized algorithms, and in Bayesian analysis. To understand any of these concepts correctly, it is then paramount to have a solid and rigorous statistical foundation. Hence we will review some key definitions.

2.

This topic is on Bayes' Rule and Bayesian reasoning. Bayes' Rule is the key component in how to build likelihood functions, which are key to evaluating models based on data. Bayesian reasoning is surprisingly different; it is much more about modeling uncertainty.
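As a small concrete sketch of Bayes' Rule in action, consider a hypothetical diagnostic test (all of the numbers below are assumptions for illustration, not from the text):

```python
# Hypothetical diagnostic-test example of Bayes' Rule:
#   P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
prior = 0.01         # P(disease): assumed base rate
sensitivity = 0.95   # P(positive | disease)
false_pos = 0.05     # P(positive | no disease)

# Law of total probability: overall chance of a positive test.
p_positive = sensitivity * prior + false_pos * (1 - prior)

# Posterior via Bayes' Rule.
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # 0.161
```

Even with a fairly accurate test, the posterior is only about 16% because the prior (the base rate) is so small; this interplay between prior and likelihood is the heart of Bayesian reasoning.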

3.

This topic will overview a variety of extremely powerful analysis results that span statistics, estimation theory, and big data. It provides a framework to think about how to aggregate more and more data to get better and better estimates. It will cover the Central Limit Theorem (CLT), Chernoff-Hoeffding bounds, and Probably Approximately Correct (PAC) algorithms.
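The "more data, better estimates" idea can be seen in a small simulation (a sketch, not from the text): averages of larger samples concentrate more tightly around the true mean, as the CLT and Chernoff-Hoeffding bounds quantify.

```python
import random

random.seed(0)  # fixed seed so the experiment is repeatable

# The mean of n i.i.d. uniform [0,1] draws concentrates around 0.5,
# with spread shrinking roughly like 1/sqrt(n).
def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n

small = [sample_mean(10) for _ in range(200)]    # 200 trials, n = 10
large = [sample_mean(1000) for _ in range(200)]  # 200 trials, n = 1000

def max_dev(means):
    # Worst deviation from the true mean 0.5 across all trials.
    return max(abs(m - 0.5) for m in means)

print(max_dev(small) > max_dev(large))  # larger samples deviate less
```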

4.

In this chapter we briefly review many key aspects of linear algebra that will be necessary for the remainder of the course.

5.

We introduce the basic model of linear regression. It builds a linear model to predict one variable from one other variable or from a set of other variables. We will demonstrate how this simple technique can extend to building potentially much more complex polynomial models. Then we will introduce the central and extremely powerful idea of cross-validation.
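A minimal sketch of the least-squares fit behind linear regression, using NumPy and toy data (both are assumptions for illustration):

```python
import numpy as np

# Toy data generated exactly by y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Build the design matrix [x, 1] and minimize the sum of squared errors.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)
print(round(slope, 6), round(intercept, 6))  # 2.0 1.0
```

The polynomial extension mentioned above amounts to adding columns x^2, x^3, ... to the design matrix and solving the same least-squares problem.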

6.

In this chapter we discuss optimizing over general functions f. Typically the function is defined f : R^d -> R; that is, its domain is multi-dimensional (in this case d-dimensional) and its output is a real scalar (R). This often arises to describe the "cost" of a model with d parameters (e.g., degree (d-1)-polynomial regression), where the goal is to find the parameters with minimum cost. Although there are special cases where we can solve for these optimal parameters exactly, there are many cases where we cannot. What remains in these cases is to analyze the function f, and try to find its minimum point. The most common solution for this is gradient descent, where we try to "walk" in a direction so the function decreases until we no longer can.
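The "walk downhill" idea can be sketched on a simple convex cost function (the function, step size, and iteration count below are all assumptions for illustration):

```python
# Gradient descent on f(x, y) = (x - 3)^2 + (y + 1)^2,
# whose minimum is at (3, -1).
def grad(x, y):
    # Gradient of f: the direction of steepest increase.
    return 2 * (x - 3), 2 * (y + 1)

x, y = 0.0, 0.0   # starting point
step = 0.1        # assumed fixed step size
for _ in range(200):
    gx, gy = grad(x, y)
    # Step against the gradient, so f decreases.
    x, y = x - step * gx, y - step * gy

print(round(x, 4), round(y, 4))  # 3.0 -1.0
```

On this quadratic the iterates converge geometrically; choosing the step size well (and knowing when a fixed step suffices) is a central topic of the chapter.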

7.

This chapter builds a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate); it is to understand the "shape" of the data, for instance a low-dimensional representation that captures most of the meaning of the high-dimensional data. This is sometimes referred to as unsupervised learning (as opposed to regression and classification, where the data has labels, known as supervised learning). Like most unsupervised settings, it can be a lot of fun, but it's easy to get yourself into trouble if you are not careful.

We will cover many interconnected tools, including the singular value decomposition (SVD), eigenvectors and eigenvalues, the power method, principal component analysis, and multidimensional scaling.
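As a taste of how these tools interconnect, here is a sketch of the power method (toy matrix and iteration count are assumptions): repeatedly multiplying by a symmetric matrix pulls any starting vector toward its top eigenvector.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric; eigenvalues are 3 and 1

v = np.array([1.0, 0.0])     # arbitrary starting vector
for _ in range(100):
    v = A @ v
    v = v / np.linalg.norm(v)   # renormalize to avoid overflow

# Rayleigh quotient estimates the corresponding (top) eigenvalue.
top_eigenvalue = v @ A @ v
print(round(top_eigenvalue, 6))  # 3.0
```

The same top eigenvector is, for a covariance matrix, the first principal component; this link between the power method, eigenvectors, and PCA is developed in the chapter.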

8.

This chapter focuses on automatically grouping data points into subsets of similar points. There are numerous ways to define this problem, and most of them are quite messy. And many techniques for clustering actually lack a mathematical formulation. We will focus on what is probably the cleanest and most used formulation: k-means clustering. But, for background, we will begin with a mathematical detour in Voronoi diagrams.
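A minimal sketch of Lloyd's algorithm for k-means, on one-dimensional toy data with k = 2 (the points, initial centers, and iteration count are assumptions for illustration):

```python
# Two well-separated groups of 1-d points.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.8]
centers = [0.0, 10.0]   # assumed initial centers

for _ in range(10):
    # Assignment step: each point joins its nearest center's cluster.
    clusters = [[], []]
    for p in points:
        i = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[i].append(p)
    # Update step: each center moves to the mean of its cluster.
    centers = [sum(c) / len(c) for c in clusters]

print([round(c, 3) for c in centers])  # [1.0, 8.0]
```

The assignment step is exactly a Voronoi partition of the points by the current centers, which is why the chapter begins with Voronoi diagrams.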

9.

This chapter returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive or negative. Ideas from linear regression can be applied here, but we will instead overview a different, but still beautiful family of techniques based on linear classification.
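A minimal sketch of the perceptron on linearly separable toy data (the data and pass count are assumptions for illustration): whenever a point is misclassified, the linear separator is nudged toward it.

```python
# Each example is ((x1, x2), label) with label +1 or -1.
data = [((2.0, 1.0), 1), ((1.5, 2.0), 1),
        ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]

w = [0.0, 0.0]   # weight vector
b = 0.0          # bias
for _ in range(100):               # passes over the data
    mistakes = 0
    for (x1, x2), label in data:
        if label * (w[0] * x1 + w[1] * x2 + b) <= 0:   # misclassified
            # Perceptron update: move the separator toward this point.
            w[0] += label * x1
            w[1] += label * x2
            b += label
            mistakes += 1
    if mistakes == 0:              # a full clean pass: converged
        break

correct = all(label * (w[0] * x1 + w[1] * x2 + b) > 0
              for (x1, x2), label in data)
print(correct)  # True
```

On separable data the perceptron is guaranteed to converge after finitely many mistakes, a classic result revisited in this chapter.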