Lab: Introduction to NetworkX accelerated by NVIDIA cuGraph¶

Table of Contents¶

This lab notebook explores the fundamentals of data acquisition and manipulation using the graph analytics APIs of the NetworkX library, covering essential techniques for creating, manipulating, and studying graph-based relationships. It contains the following sections:

  • Introduction to NetworkX and NVIDIA cuGraph
  • Data Background
  • 1. Environment Setup
  • 2. Form the Data into a NetworkX Graph
  • 3. Identify Which Cluster Contains the Most Important Patent
  • 4. Load the Enrichment Data Containing the Patent Titles
  • Conclusion

Introduction to NetworkX and NVIDIA cuGraph¶

NetworkX is the most popular Python graph analytics library. Among other capabilities, it allows for the:

  • Creation of directed, undirected, weighted, bipartite, and multigraphs
  • Manipulation of graphs by adding or removing edges, nodes, metadata, and other modifications to the graph structure
  • Study of the relationships between edges and nodes using many graph-based algorithms

The ability to create, manipulate, and study these graphs allows researchers and professionals to model many types of relationships and processes in physical, biological, social, and information systems. A graph consists of nodes or vertices (representing the entities in the system) that are connected by edges (representing relationships between those entities). By navigating the edges and nodes, you can discover and understand complex relationships and/or optimize paths between linked data in a network.
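As a minimal sketch of the node and edge model described above, the snippet below builds a tiny directed graph with NetworkX. The node names ("A" to "D") and the `title` attribute are made up for illustration; an edge from "B" to "A" is read here as "B cites A".

```python
import networkx as nx

# Build a small directed graph; each edge points from the citing node to the cited node.
G = nx.DiGraph()
G.add_edge("B", "A")  # B cites A
G.add_edge("C", "A")  # C cites A
G.add_edge("C", "B")  # C cites B

# Nodes can carry metadata and do not need any edges.
G.add_node("D", title="hypothetical isolated node")

# Graphs can be modified after creation.
G.remove_edge("C", "B")

print(G.number_of_nodes(), G.number_of_edges())  # 4 2
print(G.in_degree("A"))                          # 2 (cited by B and C)
```

The in-degree of a node is the simplest measure of how often it is referenced; the algorithms used later in this lab build on the same structure.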

The NetworkX open source project is led by community maintainers, and install instructions are here. It is readily accessible and free to download.

NVIDIA cuGraph provides GPU acceleration for popular graph algorithms such as PageRank, Louvain, and betweenness centrality. Depending on the algorithm and graph size, it can accelerate NetworkX workflows significantly, by up to 50x and in some cases up to 500x compared with NetworkX on CPU.

NetworkX now includes a GPU backend powered by NVIDIA cuGraph that allows you to seamlessly handle large graphs - exceeding 100,000 nodes and 1 million edges - on a single GPU. This allows you to maintain the flexibility of NetworkX while dramatically improving performance.

Data Background¶

For this lab, we'll be working with two datasets containing patents and citations from PatentsView. Both files are used under the Creative Commons license https://creativecommons.org/licenses/by/4.0/

The first file, g_patent.tsv.zip, contains summary data for each patent such as id, title and the location of the original patent document. The table description is available on the PatentsView site.

The second file, g_us_patent_citation.tsv.zip, contains a record for every citation between US patents. The description of this table is also available on the PatentsView site.

Citation: U.S. Patent and Trademark Office. “Data Download Tables.” PatentsView. Accessed [10/06/2024]. https://patentsview.org/download/data-download-tables.

1. Environment Setup¶

This notebook will demonstrate NetworkX both with and without acceleration by NVIDIA cuGraph.

Importing and installing NetworkX and its GPU accelerator backend¶

To install both NetworkX and its accelerator, you can just run this command:

> pip install nx-cugraph-cu12 --extra-index-url https://pypi.nvidia.com

Users can access the GPU backend using an environment variable. For more details, visit the NetworkX and cuGraph documentation.

To begin, let's install NetworkX, nx-cugraph, and cuDF, which provides the GPU accelerator for pandas.

In [ ]:
!pip install nx-cugraph-cu12 cudf-cu12 --extra-index-url=https://pypi.nvidia.com

This notebook will be using features added in NetworkX version 3.3+, so we'll import it here to verify we have a compatible version.

In [ ]:
import networkx as nx
nx.__version__

WARNING: If your NetworkX version is below 3.3, uncomment the cell below to upgrade NetworkX and restart the Jupyter notebook kernel. Then check the NetworkX version again by rerunning the cell above before proceeding with the rest of the lab.

In [ ]:
# !pip install networkx --upgrade
# get_ipython().kernel.do_shutdown(restart=True)
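The version check above can also be done programmatically rather than by eye. The helper below is a minimal sketch; `meets_minimum` is a hypothetical name, not part of NetworkX. It compares version numbers as integer tuples, which avoids the pitfall that "3.10" sorts before "3.3" when compared as strings.

```python
import re


def meets_minimum(version, minimum=(3, 3)):
    """Return True if a dotted version string is at least `minimum`."""
    parsed = []
    for piece in version.split(".")[: len(minimum)]:
        # Keep only the leading digits so that "4rc0" is read as 4.
        match = re.match(r"\d+", piece)
        parsed.append(int(match.group()) if match else 0)
    # Pad short versions so that "3" is compared as (3, 0).
    parsed += [0] * (len(minimum) - len(parsed))
    return tuple(parsed) >= tuple(minimum)


for sample in ["3.2.1", "3.3", "3.4rc0", "3.10"]:
    print(sample, meets_minimum(sample))
```

In the notebook, the same helper could be called as `meets_minimum(nx.__version__)` after importing NetworkX.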

Now, let's configure the NetworkX backend to use cuGraph for its GPU acceleration.

In [ ]:
nx.config.backend_priority=["cugraph"]  # NETWORKX_BACKEND_PRIORITY=cugraph
nx.config.cache_converted_graphs=True   # NETWORKX_CACHE_CONVERTED_GRAPHS=True

To make the output cleaner, we'll ignore warnings about using a cached graph.

In [ ]:
import warnings
warnings.filterwarnings("ignore", message="Using cached graph for 'cugraph' backend")

Download the Data¶

To complete this lab, download the data from here, or run the cell below, which downloads and unzips both files if they are not already present. After that, you are all set up for completing this lab!

In [ ]:
# Download and unzip files if they do not exist
!if [ ! -f "./g_us_patent_citation.tsv.zip" ]; then curl "https://s3.amazonaws.com/data.patentsview.org/download/g_us_patent_citation.tsv.zip" -o ./g_us_patent_citation.tsv.zip; else echo "Citation dataset archive found"; fi
!if [ ! -f "./g_us_patent_citation.tsv" ]; then unzip -d ./ ./g_us_patent_citation.tsv.zip; fi

!if [ ! -f "./g_patent.tsv.zip" ]; then curl "https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip" -o ./g_patent.tsv.zip; else echo "Patent dataset archive found"; fi
!if [ ! -f "./g_patent.tsv" ]; then unzip -d ./ ./g_patent.tsv.zip; fi

2. Form the Data into a NetworkX Graph¶

Q1. Load the citation data into a pandas DataFrame¶

Read the patent citation dataset (g_us_patent_citation.tsv) using pandas.read_csv() and save the data to a pandas.DataFrame variable. Additionally, measure the time taken to read the dataset for later use. Use cudf.pandas to accelerate data loading with the GPU.

In [ ]:
%load_ext cudf.pandas
import pandas as pd

# TODO: Read the citation dataset using pd.read_csv() and find the length of the resulting dataframe

citation_df = []

Because the DataFrame uses the cuDF pandas accelerator mode, accessing it is fast.

In [ ]:
len(citation_df)
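One possible shape for the Q1 step is sketched below on a tiny in-memory TSV, so it runs without the real download. The column names (`patent_id`, `citation_sequence`, `citation_patent_id`) are assumptions about the PatentsView file layout and should be checked against the table description; the toy values are made up. The sketch uses plain pandas; with `%load_ext cudf.pandas` run before `import pandas`, the same calls are expected to be accelerated on the GPU.

```python
import io
import time

import pandas as pd

# Toy stand-in for g_us_patent_citation.tsv (assumed columns, made-up values).
toy_tsv = (
    "patent_id\tcitation_sequence\tcitation_patent_id\n"
    "10000000\t0\t9000000\n"
    "10000000\t1\t9000001\n"
    "10000001\t0\t9000000\n"
)

start = time.perf_counter()
toy_citation_df = pd.read_csv(
    io.StringIO(toy_tsv),        # for the real file, pass the path to the .tsv instead
    sep="\t",                    # the file is tab-separated
    usecols=["patent_id", "citation_patent_id"],  # only the two edge endpoints are needed
    dtype=str,                   # keep ids as strings in case some are not purely numeric
)
read_seconds = time.perf_counter() - start

print(len(toy_citation_df), f"rows read in {read_seconds:.4f}s")
```

Reading only the two id columns keeps memory use down, which matters for a file with well over a hundred million rows.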

Q2. Create a NetworkX Graph from the dataset and count the edges in the graph¶

This will take a few minutes, because NetworkX creates a 142 million edge graph on the CPU. This is a necessary overhead: once loaded, the graph is converted into a GPU-resident cuGraph graph that is reused by each algorithm we call, accelerating those algorithms dramatically.

In [ ]:
# TODO: Create the graph using NetworkX from_pandas_edgelist function

# measure the time taken to build the graph for later comparison
G = []
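A minimal sketch of the Q2 step on a toy edge list follows. The column names are the same assumed ones as in the file layout (`patent_id` for the citing patent, `citation_patent_id` for the cited one), and the choice of a directed graph is an assumption of this sketch, not something the lab prescribes.

```python
import time

import networkx as nx
import pandas as pd

# Toy edge list: each row says "patent_id cites citation_patent_id".
toy_edges = pd.DataFrame(
    {
        "patent_id": ["B", "C", "C", "D"],
        "citation_patent_id": ["A", "A", "B", "A"],
    }
)

start = time.perf_counter()
toy_G = nx.from_pandas_edgelist(
    toy_edges,
    source="patent_id",
    target="citation_patent_id",
    create_using=nx.DiGraph,  # assumption: keep the direction of each citation
)
build_seconds = time.perf_counter() - start

print(toy_G.number_of_nodes(), toy_G.number_of_edges())  # 4 4
print(f"graph built in {build_seconds:.4f}s")
```

On the full dataset the same call is what takes several minutes, so recording `build_seconds` makes the later comparison with algorithm run times concrete.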

3. Identify Which Cluster Contains the Most Important Patent¶

Run PageRank to find the most important patent in the citation graph and then cluster the graph using Louvain.

Q3. What is the most important patent?¶

Run PageRank on the citation graph and save the most important patent.

In [ ]:
# TODO: run pagerank on the graph using the cuGraph backend

pr_results = []
# rerun PageRank to see how prebuilding the graph affects the algorithm run time
# sort the results to find the most important patent and save it
mip = []
most_important_patent = mip[0]
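The ranking step can be sketched on a toy directed graph as follows. This runs on the default NetworkX implementation (which relies on SciPy); with nx-cugraph installed and the backend priority configured as earlier in this notebook, the same `nx.pagerank` call is expected to be dispatched to the GPU. The node names are made up.

```python
import networkx as nx

# Toy citation graph: an edge (u, v) means "u cites v".
toy_G = nx.DiGraph([("B", "A"), ("C", "A"), ("C", "B"), ("D", "A")])

# PageRank returns a dict mapping each node to its score.
scores = nx.pagerank(toy_G)

# Sort nodes by score, highest first, and keep the top one.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
top_node, top_score = ranked[0]

print(top_node)  # A, the node cited by every other node
```

Calling `nx.pagerank` a second time on the same graph and timing both calls is one way to observe the effect of the cached, already converted graph mentioned above.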

Q4. Call the Louvain algorithm to create clusters, then find and save the cluster that contains the most important patent¶

Use the cuGraph backend to execute Louvain on the citation graph.

In [ ]:
clusters = []

# This cluster contains the most important patent
save_cluster = []
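A minimal sketch of the clustering step is shown below on a toy graph made of two tightly connected groups joined by a single edge. It uses `nx.community.louvain_communities`, which returns a list of node sets. The undirected toy graph, the seed value, and the target node are assumptions of this sketch; whether a given backend accepts a directed graph for Louvain should be checked in its documentation.

```python
from itertools import combinations

import networkx as nx

# Two fully connected groups of four nodes, linked by one bridge edge.
toy_G = nx.Graph()
toy_G.add_edges_from(combinations(range(0, 4), 2))
toy_G.add_edges_from(combinations(range(4, 8), 2))
toy_G.add_edge(3, 4)

# Louvain returns a list of sets, one set of nodes per community.
communities = nx.community.louvain_communities(toy_G, seed=42)

# Find the community that contains a node of interest
# (in the lab this would be the most important patent).
target = 0
cluster_with_target = next(c for c in communities if target in c)

print(len(communities), sorted(cluster_with_target))
```

Scanning the list of sets for the target node is linear in the number of communities, which is small compared with the cost of the clustering itself.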

4. Load the Enrichment Data Containing the Patent Titles¶

Q5. How can we use more data to enrich the graph clusters?¶

Create a dataframe with the patent id and title using the read_csv function. Enrich the cluster by merging the cluster ids with the dataframe containing the ids and titles.

In [ ]:
# TODO: Read the g_patent.tsv file into a dataframe and merge it with save_cluster using merge with the how="inner" parameter
cluster_df = []
enriched_df = []
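The enrichment step can be sketched with toy data as follows. The column names `patent_id` and `patent_title` are assumptions about the g_patent.tsv layout, and the ids and titles are made up. The cluster, a set of ids, is first turned into a one-column dataframe so that it can be merged.

```python
import pandas as pd

# Toy stand-in for the saved cluster (a set of patent ids).
toy_cluster = {"A", "B", "D"}

# Toy stand-in for the id and title columns of g_patent.tsv.
toy_patents = pd.DataFrame(
    {
        "patent_id": ["A", "B", "C", "D"],
        "patent_title": ["Widget", "Improved widget", "Gadget", "Widget holder"],
    }
)

# One row per cluster member, then an inner join to attach the titles.
toy_cluster_df = pd.DataFrame({"patent_id": sorted(toy_cluster)})
toy_enriched_df = toy_cluster_df.merge(toy_patents, on="patent_id", how="inner")

print(toy_enriched_df)
```

An inner join keeps only ids that appear in both frames, so cluster members without a title record would be dropped silently; comparing `len(toy_enriched_df)` with `len(toy_cluster)` is a quick way to detect that.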

Conclusion¶

Well done! In this lab, you have learned the basic usage of the NetworkX library with its GPU-accelerated backend and how to study relationships through graphs. You learned how to:

  1. Set up the NetworkX environment
  2. Form tabular data into a NetworkX graph
  3. Use NetworkX to understand the relationships in the graph, such as:
    • Finding the most important patents
    • Understanding communities of patents
    • Identifying the most important clusters
    • Improving a graph's power using data enrichment

Continue your GPU accelerated data science journey by going to https://github.com/rapidsai-community/showcase/tree/main/accelerated_data_processing_examples