Data Mining: Algorithms, Geometry, and Probability
Jeff M. Phillips
This ``book'' consists of a series of lecture notes I have built over many years teaching a Data Mining Course at the University of Utah. It is designed for senior undergraduates or first-year graduate students in a computing program. It assumes basic programming and basic knowledge of probability, linear algebra, and algorithms.
When writing these notes, I was heavily influenced by the following two books, which were developed partially in parallel and cover similar material, but from a different perspective, IMO.
MMDS: Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec, and Jeff Ullman.
FoDS: Foundations of Data Science by Avrim Blum, John Hopcroft and Ravindran Kannan.
These notes also overlap somewhat with another ``book'', The Foundations of Data Analysis, which I have created on similar topics and aimed at less advanced students.
Many of my lectures on this material appear on Utah's School of Computing's YouTube Channel.
1. Introduction
2. Statistical Principles, Hashing, and Concentration of Measure
more on Chernoff-Hoeffding Bounds
Similarities and Distances
3. Jaccard Distance and nGrams
4. MinHashing
5. Locality Sensitive Hashing (LSH)
6. Distances
7. Approximate Nearest Neighbors
Clustering
8. Hierarchical Agglomerative Clustering
9. Assignment-based Clustering (k-means, etc.)
10. Spectral Clustering
Streaming and High Frequency Items
11. Deterministic Heavy-Hitters and Quantiles
12. Count-Min Sketch and Frequent Itemsets
Regression and Dimensionality Reduction
13. Types of Regression in 2 Dimensions
14. Singular Value Decomposition (SVD)
15. Metric Learning
16. Matrix Sketching
17. Random Projections
18. Orthogonal Matching Pursuit and Compressed Sensing
19. Ridge Regression and Lasso
Managing and Using Noise
20. Outliers and Cross-Validation
21. Privacy
Graph Analysis
22. Markov Chains
23. PageRank and Search Engines
24. MapReduce and the Big Data Revolution
25. Detecting Communities
26. Graph Sparsification