Download Data

To complete this lab, download the data from here. Create a folder named sample_data inside your current directory and upload all the data files (data0.csv, etc.) to it. You are then set up to complete this lab!

Part 1. KMeans with CPU

Q1. Read the five provided datasets

Read the five provided datasets (data{0, 1, 2, 3, 4}.csv) using pandas.read_csv() and save each one to a pandas.DataFrame variable. Also measure the time taken to read each dataset for later use. You can use the time() function from Python's time module to measure running time.
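One possible way to organize the reading and timing is sketched below. The helper name read_csvs_pandas and the paths list are hypothetical, and the sketch assumes the files sit in the sample_data folder described above.

```python
import time

import pandas as pd


def read_csvs_pandas(paths):
    """Read each CSV with pandas; return (frames, seconds), one entry per file."""
    frames, seconds = [], []
    for path in paths:
        start = time.time()              # wall-clock time before the read
        frames.append(pd.read_csv(path))
        seconds.append(time.time() - start)
    return frames, seconds


# Usage, assuming the folder layout from the download step (not executed here):
#   paths = [f"sample_data/data{i}.csv" for i in range(5)]
#   cpu_frames, cpu_read_seconds = read_csvs_pandas(paths)
```

Keeping the timings in a list in the same order as the files makes them easy to reuse when comparing against the GPU results later.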

Q2. Perform KMeans implemented in sklearn

Run KMeans clustering with sklearn.cluster.KMeans. Measure the running time for KMeans on each dataset.
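A minimal sketch of the timing loop follows. The lab text does not state how many clusters to use, so n_clusters=3 is only an assumed placeholder, as are the helper name time_sklearn_kmeans and the assumption that every column is a numeric feature.

```python
import time

from sklearn.cluster import KMeans


def time_sklearn_kmeans(frames, n_clusters=3, random_state=0):
    """Fit sklearn KMeans on each DataFrame; return (models, seconds)."""
    models, seconds = [], []
    for df in frames:
        # n_clusters=3 is an assumption; the lab does not specify a value.
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
        start = time.time()
        model.fit(df)                    # only the fit itself is timed
        seconds.append(time.time() - start)
        models.append(model)
    return models, seconds


# Usage, with cpu_frames being the pandas DataFrames from Q1 (not executed here):
#   cpu_models, cpu_fit_seconds = time_sklearn_kmeans(cpu_frames)
```

Constructing the model outside the timed region keeps the measurement focused on the clustering work rather than on object setup.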

Part 2. KMeans with GPU

Q3. Read datasets with cuDF

Read the five datasets with the cudf.read_csv() function from the cuDF library and save each one to a cudf.DataFrame variable. Also measure the time taken to read each dataset for later use.
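Because cuDF needs an NVIDIA GPU and a RAPIDS installation, the sketch below takes the reader function as an argument, so the same timing loop can be checked on any machine and then pointed at cudf.read_csv. The name timed_read is hypothetical. Note that the first GPU call in a session may include one-time initialization cost, which can inflate the first measurement.

```python
import time


def timed_read(read_csv, paths):
    """Call read_csv(path) for each path; return (frames, seconds)."""
    frames, seconds = [], []
    for path in paths:
        start = time.time()
        frames.append(read_csv(path))    # e.g. cudf.read_csv on a GPU machine
        seconds.append(time.time() - start)
    return frames, seconds


# Usage on a machine with RAPIDS and an NVIDIA GPU (not executed here):
#   import cudf
#   paths = [f"sample_data/data{i}.csv" for i in range(5)]
#   gpu_frames, gpu_read_seconds = timed_read(cudf.read_csv, paths)
```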

Q4. Perform KMeans implemented in cuML

Perform KMeans clustering implemented in cuML. Measure the running time for KMeans on each dataset.
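The GPU timing can follow the same pattern as the CPU run. In this sketch the model is built by a caller-supplied factory, so the loop itself does not import cuML and can be tested without a GPU. The name timed_fit is hypothetical, and n_clusters=3 in the usage comment is an assumed value that the lab text does not specify.

```python
import time


def timed_fit(make_model, frames):
    """Build a fresh model per dataset, time its fit(); return (models, seconds)."""
    models, seconds = [], []
    for df in frames:
        model = make_model()             # construction is kept outside the timer
        start = time.time()
        model.fit(df)
        seconds.append(time.time() - start)
        models.append(model)
    return models, seconds


# Usage on a machine with RAPIDS and an NVIDIA GPU (not executed here),
# where gpu_frames holds the cudf.DataFrame objects from Q3:
#   from cuml.cluster import KMeans
#   gpu_models, gpu_fit_seconds = timed_fit(
#       lambda: KMeans(n_clusters=3, random_state=0), gpu_frames)
```

For a fair comparison, use the same number of clusters as in the sklearn run.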

Part 3. Plot the values

Now that we have performed KMeans with both sklearn and cuML, we can compare the time each one took on every dataset. Let us plot these values to see if we can find a trend!
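One way to draw the comparison is sketched below with matplotlib. The function name plot_timings and its arguments are hypothetical; the timing lists are whatever you measured in the earlier questions. A logarithmic y-axis is an assumed design choice that keeps both curves readable when the CPU and GPU times differ by a large factor.

```python
import matplotlib.pyplot as plt


def plot_timings(labels, cpu_seconds, gpu_seconds, title):
    """Plot CPU and GPU timings per dataset on a shared log-scale axis."""
    fig, ax = plt.subplots()
    ax.plot(labels, cpu_seconds, marker="o", label="CPU (pandas / sklearn)")
    ax.plot(labels, gpu_seconds, marker="o", label="GPU (cuDF / cuML)")
    ax.set_yscale("log")                 # assumption: times span several magnitudes
    ax.set_xlabel("Dataset")
    ax.set_ylabel("Time (seconds)")
    ax.set_title(title)
    ax.legend()
    return fig, ax


# Usage with your own measurements (not executed here):
#   labels = [f"data{i}" for i in range(5)]
#   plot_timings(labels, cpu_fit_seconds, gpu_fit_seconds, "KMeans fit time")
#   plt.show()
```

Calling the function once for the read times and once for the fit times gives two plots that can be compared side by side.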