Hari Sundar


#Big Data Computer Systems

Fall 2015

Tue, Thu 3:40pm-5:00pm
WEB L102

Catalog number: CS 5965/6965

Office Hours: Tue, Thu 2-3pm, MEB 3454

Lectures

Assignment 0
Assignment 1 - Solutions
Assignment 2 - Solutions
Assignment 3

##Overview

The exponential increase in the quantity and quality of measurements and data holds tremendous promise for data-driven scientific discovery. However, much of this data remains unused because the algorithms that infer knowledge from it cannot scale to the amount of data being generated. This course will discuss and explore scalability issues in the big data era. We shall be using the elephant cluster run by CHPC. Please fill out this form as soon as possible to request an account.

We shall be using Spark for all assignments in this course. Please install Spark on a machine you have access to, e.g. your laptop. The first few assignments will be small enough to test and run on your laptop. This is the first assignment.
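
Once Spark is installed, a small local job is a quick way to confirm that the installation works. The sketch below is only an illustration, not part of any assignment: it assumes the Python API (PySpark) is available, and the names `tokenize`, `local_word_count`, and the application name `install-check` are made up for this example.

```python
def tokenize(line):
    """Split a line into lowercase, whitespace-separated words."""
    return line.lower().split()


def local_word_count(lines):
    """Count words with Spark running in local mode (no cluster needed).

    Assumes PySpark is installed. The import lives inside the function so
    that tokenize() remains usable on a machine without Spark.
    """
    from pyspark import SparkContext

    # "local[*]" runs Spark inside this process, using all available cores.
    sc = SparkContext("local[*]", "install-check")
    try:
        pairs = sc.parallelize(lines).flatMap(tokenize).map(lambda w: (w, 1))
        counts = pairs.reduceByKey(lambda a, b: a + b)
        return dict(counts.collect())
    finally:
        # Always release the context so a later run can create a new one.
        sc.stop()
```

Calling `local_word_count(["big data", "Big systems"])` from a Python session should return a dictionary of word counts if the installation is healthy. The same code can later be pointed at a cluster by changing the master string, although the exact setting for elephant is not covered here.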

We shall be using Mining of Massive Datasets as the textbook for this course. Additional reading materials will be provided for topics not covered by the book.

Prerequisites

There is no formal prerequisite for this course, but CS 3505 or equivalent programming experience is desired. Please contact the instructor if you are not sure whether you have the necessary programming experience. Some projects might require knowledge of Numerical Methods or Linear Algebra (CS 3200). Students are not required to have this background, but please let the instructor know if you do, so that relevant projects can be assigned.

The course is open to graduate as well as undergraduate students.

Adherence to the CoE and SoC academic guidelines is expected. Please read the following.

Assignments and Grading:

There will be a few programming assignments, a midterm exam, and a final project. The assignments will be based on materials discussed in class. The first two assignments (after assignment 0) will be simple Spark programming problems meant to familiarize you with Spark. They will also get you used to running jobs on elephant. Assignments 3 and 4 will be significantly harder, and you will need to run them on the cluster.

You will choose your own final project, and the instructor will help you define a concrete scope for it. You will be asked to submit a project proposal before the start of the fall break. You should also be prepared to give a presentation on your final project.

A tentative grade division is listed below.

0% - assignment #0

5% - assignment #1

5% - assignment #2

15% - assignment #3

15% - assignment #4

20% - midterm exam

40% - final project
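
To see how these weights combine, here is a minimal sketch of the arithmetic. It assumes each component is scored as a percentage between 0 and 100 and that a missing component counts as zero. Neither assumption is stated in the course policy, and the names `WEIGHTS` and `weighted_total` are illustrative only.

```python
# Tentative weights from the list above, in percent of the final grade.
WEIGHTS = {
    "assignment 0": 0,
    "assignment 1": 5,
    "assignment 2": 5,
    "assignment 3": 15,
    "assignment 4": 15,
    "midterm": 20,
    "final project": 40,
}


def weighted_total(scores):
    """Combine per-component scores (0-100) into an overall percentage.

    Assumption: a component missing from `scores` contributes zero.
    """
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must add up to 100%")
    return sum(w * scores.get(name, 0) for name, w in WEIGHTS.items()) / 100


example = {
    "assignment 1": 100,
    "assignment 2": 100,
    "assignment 3": 90,
    "assignment 4": 90,
    "midterm": 70,
    "final project": 85,
}
print(weighted_total(example))  # 85.0
```

Because the final project carries 40% of the weight, a ten-point change in its score moves the total by four points, as much as a twenty-point change on the midterm.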

Syllabus

##Resources

I shall keep adding resources during the semester. Here are some that are useful.