Lab: Introduction to GPU Accelerated Pandas¶
Table of Contents¶
This lab notebook explores the fundamentals of data acquisition and manipulation using the DataFrame APIs of the pandas library, covering essential techniques for handling and processing datasets. This notebook covers the following sections:
- Data Background
- pandas and cuDF
- Data Acquisition
- Initial Data Exploration
- Indexing and Data Selection with the .loc Accessor
- Basic Operations
- Aggregation
- Applying User-Defined Functions (UDFs) with .map() and .apply()
- Filtering with .loc and Boolean Masks
- Creating New Columns
- The Relationship Between pandas and cuDF
- Conclusion
As this lab is an introduction, we will use only pandas and its GPU-accelerated extension, cudf.pandas. Usage of native cuDF for GPU processing will be covered in a different lab.
Data Background¶
For this workshop, we will be reading almost 60 million records (corresponding to the entire population of England and Wales) which were synthesized from official UK census data.
pandas and cuDF¶
pandas¶
pandas is a widely-used open-source library for data manipulation and analysis in Python. It provides high-performance, easy-to-use data structures and tools for working with structured data. It popularized the term DataFrame as a data structure for statistical computing. In data science, pandas is used for:
- Data loading and writing: reads from and writes to various file formats like CSV, Excel, JSON, and SQL databases
- Data cleaning and processing/preprocessing: helps users with handling missing data, merging datasets, and reshaping data
- Data analysis: performs grouping, aggregating, and statistical operations
Note: Data preprocessing refers to the process of transforming raw data into a format that is more suitable for analysis and other downstream tasks.
cuDF¶
Similarly, cuDF is a Python GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data. cuDF is designed to accelerate data science workflows by utilizing the parallel processing power of GPUs, potentially offering significant speed improvements over CPU-based alternatives for large datasets. However, native cuDF covers only a portion (roughly 60%) of the pandas API. The key features of cuDF include:
- GPU Acceleration: leverages NVIDIA GPUs for fast data processing and analysis
- pandas-like API: provides users a familiar interface and transition to GPU-based computing
- Integration with other RAPIDS libraries: works seamlessly with other GPU-accelerated tools in the RAPIDS ecosystem
cuDF is scalable and can be used in environments that have
- single node, single GPU, like a laptop, standard desktop, or the basic level of many popular GPU cloud instances.
- single node, multi-GPU, like a professional workstation or high end gaming PC.
- multi-node, multi-GPU, like in data centers or large enterprise clusters, whether on prem or in the cloud.
This allows you to process even petabytes of data quickly and easily.
Note: Both pandas and cuDF serve similar purposes in data manipulation and analysis, but cuDF is specifically optimized for GPU acceleration, making it particularly useful for working with large datasets where performance is critical.
cuDF pandas¶
cuDF, starting with version 23.10.01, introduced a pandas accelerator mode (cudf.pandas) that supports 100% of the pandas API. This mode allows users to accelerate pandas code on the GPU without requiring any code changes. Not all operations can be performed on the GPU: when using cudf.pandas, operations that can be accelerated run on the GPU, while unsupported operations automatically fall back to pandas on the CPU. For example, .read_sql() first reads the data with pandas on the CPU and then moves the result to GPU memory.
There are two ways to activate cuDF pandas:
- Jupyter Magic Command
%load_ext cudf.pandas
import pandas
...
- Python Import
import cudf.pandas
cudf.pandas.install()
import pandas as pd
...
Note: There are no other changes required - this is useful to quickly accelerate existing workloads with minimum code change. More information about cuDF pandas can be found here.
cuDF pandas is a no-code-change accelerator for pandas, automatically accelerating any supported pandas call.
cuDF pandas supports only one environment: single node, single GPU, like a laptop, standard desktop, or the basic tier of many popular GPU cloud instances.
Below we run some basic DataFrame operations that are accelerated by cuDF.
Data Acquisition¶
In our context, data acquisition refers to the process of collecting and importing data from various sources into a Python environment for analysis, processing, and manipulation. Data can come from a variety of sources:
- Local file in various formats
- Databases
- APIs
- Web scraping
It's worth noting that Python's rich ecosystem of libraries makes it versatile for acquiring data from various sources, allowing data scientists to work with diverse datasets efficiently. CPU processing will be responsible for acquiring data from APIs or web scraping; in most cases, network bandwidth will likely be the bottleneck. Furthermore, cuDF doesn't have a way to load transactions from SQL databases directly into GPU memory. The recommended approach for reading data from a database is to first use CPU-based methods (i.e. pandas), then convert to cuDF for GPU-accelerated processing. Previously, you would have had to run something like:
import pandas as pd
import cudf
df = pd.read_sql(<data>) # read SQL data using CPU
gdf = cudf.from_pandas(df) # sends data to GPU(s)
But now, with cudf.pandas, we can simply run it as if it were all pandas code and get the benefits of GPU acceleration for subsequent operations on the single GPU.
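As a minimal sketch of this pattern, the cell below reads a SQL query result with plain pandas code; under cudf.pandas, the same call would fall back to the CPU for the read and hand the result to the GPU automatically. The in-memory SQLite table and its contents are purely illustrative, not part of the lab's dataset.

```python
import sqlite3
import pandas as pd

# Toy in-memory database standing in for a real SQL source
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (name TEXT, age INTEGER)')
conn.executemany('INSERT INTO people VALUES (?, ?)', [('Ann', 25), ('Ben', 31)])

# With cudf.pandas loaded, this read runs on the CPU (fallback),
# and later operations on df can run on the GPU
df = pd.read_sql('SELECT * FROM people', conn)
print(df['age'].mean())  # 28.0
```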
Below we use the head linux command to display the beginning of the data file. This allows us to understand how to read the data correctly.
Download Data¶
# Download and unzip files if they do not exist
!if [ ! -f "./uk_pop.zip" ]; then curl "https://data.rapids.ai/teaching-kit/uk_pop.zip" -o ./uk_pop.zip; else echo "Population dataset already downloaded"; fi
!if [ ! -f "./uk_pop.csv" ]; then unzip -d ./ ./uk_pop.zip ; else echo "Population dataset found and ready"; fi
# DO NOT CHANGE THIS CELL
!head -n 5 ./uk_pop.csv
One row will represent one person. We have information about their age, sex, county, location, and name. Using cuDF, the RAPIDS API providing a GPU-accelerated DataFrame, we can read data from a variety of formats, including csv, json, parquet, feather, orc, and pandas DataFrames, among others.
# DO NOT CHANGE THIS CELL
%load_ext cudf.pandas
import pandas as pd
# DO NOT CHANGE THIS CELL
import cupy as cp
import numpy as np
from datetime import datetime
import random
import time
Below we read the data from a local csv file directly into GPU memory with the read_csv() function.
# DO NOT CHANGE THIS CELL
start=time.time()
df=pd.read_csv('./uk_pop.csv')
print(f'Duration: {round(time.time()-start, 2)} seconds')
Note: Because of the sophisticated GPU memory management behind the scenes in cuDF, the first data load into a fresh RAPIDS memory environment is sometimes substantially slower than subsequent loads. The RAPIDS Memory Manager is preparing additional memory to accommodate the array of data science operations that we may be interested in using on the data, rather than allocating and deallocating the memory repeatedly throughout the workflow.
Below we get the general information about the DataFrame with the DataFrame.info() method.
# DO NOT CHANGE THIS CELL
df.info(memory_usage='deep')
The DataFrame is a two-dimensional labeled data structure. It's organized in rows and columns, similar to a spreadsheet or SQL table. Both rows and columns have labels. Rows are typically labeled with an index, while columns have column names. Data is aligned based on row and column labels when performing operations. This is useful for enabling highly efficient vectorized operations across columns or rows. A Series refers to a one-dimensional array and is typically associated with a single column of data with an index.
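The DataFrame/Series relationship can be illustrated with a tiny toy example (not the census data): selecting one column from a DataFrame yields a Series that shares the DataFrame's row index.

```python
import pandas as pd

# Toy DataFrame: 2-D, with row index and column labels
df = pd.DataFrame({'age': [25, 31, 47], 'name': ['Ann', 'Ben', 'Cai']})

# A single column is a 1-D Series that keeps the same row labels
ages = df['age']
print(type(df).__name__)            # DataFrame
print(type(ages).__name__)          # Series
print(ages.index.equals(df.index))  # True
```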
There are ~60 million records across 6 columns. cuDF is able to read data from local files directly into the GPU very efficiently. By default, cuDF samples the dataset to infer the most appropriate data type for each column.
Note: The DataFrame has .dtypes and .columns attributes that can be used to get similar information.
Initial Data Exploration¶
Now that we have some data loaded, let's do some initial exploration.
Below we preview the DataFrame with the DataFrame.head() method.
# DO NOT CHANGE THIS CELL
df.head()
Indexing and Data Selection with .loc Accessor¶
The .loc accessor in cuDF DataFrames is used for label-based indexing and selection of data. It allows us to access and manipulate data in a DataFrame based on row and column labels. We can use DataFrame.loc[row_label(s), column_label(s)] to access a group of rows and columns. When selecting multiple labels, a list ([]) is used. Furthermore, we can use the slicing operator (:, i.e. start:end) to specify a range of elements.
# DO NOT CHANGE THIS CELL
# get first cell
display(df.loc[0, 'age'])
print('-'*40)
# get multiple rows and columns
display(df.loc[[0, 1, 2], ['age', 'sex', 'county']])
print('-'*40)
# slice a range of rows and columns
display(df.loc[0:5, 'age':'county'])
print('-'*40)
# slice a range of rows and columns
display(df.loc[:10, :'name'])
Note: df[column_label(s)] is another way to access specific columns, similar to df.loc[:, column_labels].
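The equivalence in the note above can be checked on a toy DataFrame (illustrative data only):

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 31], 'sex': ['f', 'm'], 'county': ['Kent', 'Essex']})

# Bracket selection and .loc column selection return the same result
a = df[['age', 'county']]
b = df.loc[:, ['age', 'county']]
print(a.equals(b))  # True
```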
Basic Operations¶
cuDF supports a wide range of operations for numerical data. Although strings are not a data type traditionally associated with GPUs, cuDF supports powerful accelerated string operations.
- Numerical operations:
  - Arithmetic operations: addition, subtraction, multiplication, division
- String operations:
  - Case conversion: .upper(), .lower(), .title()
  - String manipulation: concatenation, substring extraction, padding
  - Pattern matching: .contains()
  - Splitting: .split()
- Comparison operations: greater than, less than, equal to, etc.
These operations are performed element-wise on each row, enabling efficient vectorized computation across entire columns. Vector operations are applied to whole arrays of data at once rather than iterating through each element individually, which is significantly faster, especially for large datasets. When operating on multiple columns, operations are aligned by index, ensuring that calculations are performed on the corresponding elements across columns. We can get the underlying array of data with the .values attribute, which is useful when we want to operate on the raw data directly.
Note: Iterating over a cuDF Series, DataFrame or Index is not supported. This is because iterating over data that resides on the GPU will yield extremely poor performance, as GPUs are optimized for highly parallel operations rather than sequential operations.
Below we calculate the birth year for each person.
# DO NOT CHANGE THIS CELL
# get current year
current_year=datetime.now().year
# derive the birth year
display(current_year-df.loc[:, 'age'].head())
# get the age array (CuPy for cuDF)
age_ary=df.loc[:, 'age'].values
# derive the birth year
current_year-age_ary
When performing operations between a DataFrame and a scalar value, the scalar is "broadcast" to match the shape of the DataFrame, effectively applying it to each element.
current_year  -  df.loc[:, 'age']
  (scalar)          (array)
    2024      -        0
    2024      -        0
    2024      -        0
    ...       -       ...
This partially explains why cuDF provides significant performance improvements over pandas, especially for large datasets. GPUs are designed with thousands of small, specialized cores that can execute many operations simultaneously. This architecture is ideal for vectorized operations, which perform the same instruction on multiple data elements in parallel.
Exercise #1 - Convert county Column to Title Case¶
As it stands, all of the counties are UPPERCASE. We want to convert the county column to title case.
Instructions:
- Modify the <FIXME> only and execute the cell below to convert the county column to title case.
df['county'].str.<<<<FIXME>>>>
Performing comparison operations or applying conditions creates boolean values (True/False) that correspond element-wise to the original data.
Below we check if each person is an adult.
# DO NOT CHANGE THIS CELL
(df['age']>=18).head()
Aggregation¶
Aggregation is an important operation for data science tasks, allowing us to summarize and analyze grouped data. It's commonly used for tasks like calculating totals, averages, counts, etc. cuDF supports common aggregations like .sum(), .mean(), .min(), .max(), .count(), .std()(standard deviation), etc. It also supports more advanced aggregations like .quantile() and .corr() (correlation). With the axis parameter, aggregation operations can be applied column-wise (0) or row-wise (1).
When the aggregation is implemented as a vector operation, specifically a reduction operation, it is very efficient on the GPU because a large number of data elements can be processed simultaneously and in parallel. Column-wise operations also benefit from the Apache Arrow columnar memory format.
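The axis parameter described above can be sketched on a toy DataFrame (illustrative coordinates, not the census data):

```python
import pandas as pd

df = pd.DataFrame({'lat': [51.5, 53.4], 'long': [-0.1, -2.2]})

# axis=0 aggregates down each column (one value per column)
print(df.mean(axis=0))

# axis=1 aggregates across each row (one value per row)
print(df.mean(axis=1))

# more advanced aggregations, e.g. the median latitude
print(df['lat'].quantile(0.5))
```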

Below we calculate the arithmetic mean of lat and long to get an approximate center.
# DO NOT CHANGE THIS CELL
df[['lat', 'long']].mean()
Applying User-Defined Functions (UDFs) with .map() and .apply()¶
The .map() and .apply() methods are the primary ways of applying user-defined functions element-wise, and row or column-wise, respectively. We can pass a callable function (built-in or user-defined) as the argument, which is then applied to the entire data structure. Not all operations can be vectorized, especially complex custom logic. In such cases, methods like .apply() or custom UDFs might be necessary.
Below we use .apply() to check if each person is an adult.
# DO NOT CHANGE THIS CELL
# define a function to check if age is greater than or equal to 18
start=time.time()
def is_adult(row):
if row['age']>=18:
return 1
else:
return 0
# check whether each person is an adult
display(df.apply(is_adult, axis=1).head())
print(f'Duration: {round(time.time()-start, 2)} seconds')
We can also use a lambda function when the function is simple. Lambda functions are limited to a single expression but can include a conditional expression and multiple arguments.
# DO NOT CHANGE THIS CELL
# check whether each person is an adult using a lambda
start=time.time()
display(df.apply(lambda x: 1 if x['age']>=18 else 0, axis=1).head())
print(f'Duration: {round(time.time()-start, 2)} seconds')
Note: The .apply() function in pandas accepts any user-defined function that can include arbitrary operations that are applied to each value of a Series and DataFrame. cuDF also supports .apply(), but it relies on Numba to JIT compile the UDF (not in scope) and execute it on the GPU. This can be extremely fast, but imposes a few limitations on what operations are allowed in the UDF. See the docs on UDFs for details.
# DO NOT CHANGE THIS CELL
# vectorized alternative: compare, then cast to int
start=time.time()
display((df['age']>=18).astype('int').head())
print(f'Duration: {round(time.time()-start, 2)} seconds')
Below we use Series.map() to determine the number of characters in each person's name.
# DO NOT CHANGE THIS CELL
df['name'].map(lambda x: len(x)).head()
Filtering with .loc and Boolean Mask¶
A boolean mask is an array of True/False values that corresponds element-wise to another array or data structure. It's used for filtering and selecting data based on certain conditions. In this context, the mask can be used to index or filter a DataFrame with .loc, selecting only the elements where the mask is True.
Note: Boolean masking is often more efficient than iterative approaches, especially for large datasets, as it leverages vectorized operations.
Below we use the .loc accessor and a boolean mask to filter people whose names start with an E.
# DO NOT CHANGE THIS CELL
boolean_mask=df['name'].str.startswith('E')
df.loc[boolean_mask]
Multiple conditions can be combined using logical operators (& and |).
Note: When using multiple conditions, it's important to wrap each condition in parentheses (( and )) to ensure correct order of operations.
Below we use the .loc accessor and multiple conditions to filter adults whose names start with an E.
# DO NOT CHANGE THIS CELL
df[(df['age']>=18) & (df['name'].str.startswith('E'))]
Exercise #2 - Counties North of Sunderland¶
This exercise will require you to use the .loc accessor and several of the techniques described above. We want to identify the latitude of the northernmost resident of Sunderland county (the person with the maximum lat value), and then determine which counties have any residents north of this resident. Use the Series.unique() method to de-duplicate the result.
Instructions:
- Modify the <FIXME> only and execute the cell below to identify counties north of Sunderland.
sunderland_residents=df.loc[<<<<FIXME>>>>]
northmost_sunderland_lat=sunderland_residents['lat'].max()
df.loc[df['lat'] > northmost_sunderland_lat]['county'].unique()
Creating New Columns¶
We can create new columns by assigning values to the column label. The new column should have the same number of rows as the existing DataFrame. Typically, we create new columns by performing operations on existing columns.
Below we create a few additional columns.
# DO NOT CHANGE THIS CELL
# get current year
current_year=datetime.now().year
# numerical operations
df['birth_year']=current_year-df['age']
# string operations
df['sex_normalize']=df['sex'].str.upper()
df['county_normalize']=df['county'].str.title().str.replace(' ', '_')
df['name']=df['name'].str.title()
# preview
df.head()
# DO NOT CHANGE THIS CELL
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
The Relationship Between pandas and cuDF¶
It's important to note that the performance benefits of cuDF can vary depending on the specific operation, data size, and hardware configuration. For smaller datasets or simpler operations, the overhead of GPU initialization might make pandas on CPU faster.
Compared to pandas, cuDF tends to perform better for large datasets because of the following features:
- GPUs excel at parallel computation, which is advantageous for many data science and machine learning tasks.
- GPUs typically have much higher memory bandwidth than CPUs, allowing for faster data access in memory-bound operations.
- cuDF leverages GPU's ability to perform vectorized operations efficiently, which is particularly beneficial for large datasets.
- cuDF uses a columnar data format, which can lead to more efficient memory access patterns on GPUs. When performing data operations on cuDF DataFrames, column operations are typically much more performant than row-wise operations.
pandas tends to perform better when your dataset has the following features:
- the dataset is small
- the dataset is extremely wide (has lots of columns, think thousands)
- rows contain large string values
- the data is deeply nested
- the resulting DataFrame won't fit in your GPU memory, but you have enough host system memory (example: an 8 GB GPU but 32 GB of host memory); exceeding GPU memory would cause an Out Of Memory (OOM) error on your GPU. You can also try chunking strategies or chunk-friendly file formats, like Parquet files instead of CSVs.
You will want to spec your system appropriately for your task.
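The chunking strategy mentioned above can be sketched with pandas' chunksize parameter, which processes a large file in fixed-size pieces so no single piece has to fit in memory at once. The io.StringIO buffer here is just a stand-in for a large file on disk.

```python
import io
import pandas as pd

# Stand-in for a large CSV on disk: a single 'age' column with values 0..9
csv = io.StringIO('age\n' + '\n'.join(str(a) for a in range(10)))

# Read and aggregate in chunks of up to 4 rows each
total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['age'].sum()
print(total)  # 45
```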
Below are the results for a simple data processing pipeline with and without cuDF acceleration.

Exercise #3 - Line Profiler¶
Instructions:
- Execute the cell below to import the dependencies.
- Uncomment the %%cudf.pandas.line_profile magic command to use the line profiler, then execute the cell below it.
- Optionally, comment out the %load_ext and %%cudf.pandas.line_profile magic commands and execute the pipeline without cuDF acceleration. Be sure to restart the kernel first.
# DO NOT CHANGE THIS CELL
%load_ext cudf.pandas
import pandas as pd
import time
from datetime import datetime
# %%cudf.pandas.line_profile
# DO NOT CHANGE THIS CELL
start=time.time()
df=pd.read_csv('./uk_pop.csv')
current_year=datetime.now().year
df['birth_year']=current_year-df['age']
df['sex_normalize']=df['sex'].str.upper()
df['county_normalize']=df['county'].str.title().str.replace(' ', '_')
df['name']=df['name'].str.title()
print(f'Duration: {round(time.time()-start, 2)} seconds')
display(df.head())
# DO NOT CHANGE THIS CELL
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
Conclusion¶
Well done! You now know the fundamentals of data manipulation using pandas DataFrames and how to GPU-accelerate their processing. In this lab, we covered:
- the differences between pandas, cuDF, and cudf.pandas
- data acquisition and initial data exploration
- basic operations, including aggregation, UDFs, filtering, and creating new columns
Continue your GPU accelerated data science journey by going to https://github.com/rapidsai-community/showcase/tree/main/accelerated_data_processing_examples