Dataset Visualization and Statistics¶
This tutorial shows how to use TGX to visualize temporal graphs and obtain dataset statistics.
For a comprehensive API reference, visit the TGX website.
import tgx
from tgx.utils.plotting_utils import plot_for_snapshots
Load TGB Dataset¶
For the examples used in this tutorial, first load a TGB dataset in TGX. TGB datasets come from Temporal Graph Benchmark for Machine Learning on Temporal Graphs (NeurIPS 2023 Datasets and Benchmarks Track).
data_name = "tgbl-wiki"
dataset = tgx.tgb_data(data_name)
Dataset tgbl-wiki version 2 not found.
Please download the latest version of the dataset.
Download started, this might take a while . . .
Dataset title: tgbl-wiki
Download completed
Dataset directory is /home/fpour/projects/def-rrabba/fpour/proj/TGX/ENV/lib/python3.11/site-packages/tgb/datasets/tgbl_wiki
file not processed, generating processed file
Load Built-in Dataset¶
Load the built-in datasets in TGX that come from Towards Better Evaluation for Dynamic Link Prediction (NeurIPS 2022 Datasets and Benchmarks Track).
dataset = tgx.builtin.uci()
Data missing, download recommended!
./data https://zenodo.org/record/7213796/files/uci.zip
Downloading uci dataset . . .
Download completed
Graph Discretization for Visualization¶
We can discretize a temporal graph into snapshots (i.e., consecutive time intervals of equal length) for visualization purposes.
ctdg = tgx.Graph(dataset) # retrieve the continuous time dynamic graph
time_scale = "weekly"  # other options: "minutely", "hourly", "daily", "monthly", "yearly", "biyearly"
dtdg, ts_list = ctdg.discretize(time_scale=time_scale, store_unix=True)
Number of loaded edges: 59835
Number of unique edges: 20296
Available timestamps: 58911
Discretizing data to 28 timestamps...
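To make the idea of discretization concrete, here is a minimal pure-Python sketch that buckets timestamped edges into fixed-width snapshots. It is not TGX's implementation: the helper name `discretize_edges`, the toy edge list, and the assumption that timestamps are Unix seconds are all illustrative.

```python
from collections import defaultdict

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def discretize_edges(edges, interval=SECONDS_PER_WEEK):
    """Group (src, dst, unix_time) edges into fixed-width snapshots.

    Hypothetical helper for illustration only. Snapshot 0 starts at the
    earliest timestamp found in the data.
    """
    if not edges:
        return {}
    start = min(t for _, _, t in edges)
    snapshots = defaultdict(list)
    for src, dst, t in edges:
        snapshots[(t - start) // interval].append((src, dst))
    return dict(snapshots)


# Toy data: three edges in the first week, one edge in the third week.
toy_edges = [
    ("a", "b", 0),
    ("b", "c", 3600),
    ("a", "b", 86400),
    ("c", "d", 2 * SECONDS_PER_WEEK + 10),
]
snapshots = discretize_edges(toy_edges)
print(snapshots)
```

Note that a coarser interval yields fewer, denser snapshots, while a finer one yields more snapshots, some of which may be empty (here, the second week has no edges and therefore no entry).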
TGX Features¶
TGX provides a suite of visualization tools for better analyzing the dynamics of temporal graphs.
Below is a list of visualization approaches offered by TGX.
Function | Description |
---|---|
tgx.degree_over_time | Plot the average node degree over time |
tgx.nodes_over_time | Plot the number of active nodes per snapshot |
tgx.edges_over_time | Plot the number of edges per snapshot |
tgx.nodes_and_edges_over_time | Plot the number of active nodes and edges in the same figure |
tgx.connected_components_per_ts | Plot the number of connected components per timestamp |
tgx.degree_density | Plot the density map of node degrees per time window |
tgx.TEA | Plot Temporal Edge Appearance (TEA) (from Poursafaei et al. 2022) |
tgx.TET | Plot Temporal Edge Traffic (TET) (from Poursafaei et al. 2022) |
For each visualization tool, you can specify the output path with filepath; otherwise, the output is saved in the current directory.
tgx.degree_over_time(dtdg, network_name=dataset.name, filepath=filepath)
In what follows, we cover some of the visualizations offered by TGX.
Average Node Degree Over Time¶
The goal is to plot the average node degree of each snapshot.
In this plot, the x-axis is the snapshot index (or timestamps), while the y-axis is the average node degree.
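As a rough illustration of the quantity on the y-axis, the following sketch computes an average degree per snapshot in plain Python. It assumes each snapshot is a list of `(src, dst)` pairs and that every edge adds one to the degree of both endpoints; TGX may use a different convention, and the function name `average_degree` is hypothetical.

```python
from collections import Counter


def average_degree(snapshot_edges):
    """Mean degree over the nodes that appear in this snapshot.

    Assumption: each edge adds one to the degree of both endpoints.
    Other conventions (e.g. counting only distinct neighbors) exist.
    """
    degree = Counter()
    for src, dst in snapshot_edges:
        degree[src] += 1
        degree[dst] += 1
    return sum(degree.values()) / len(degree) if degree else 0.0


# Toy snapshots: one list of edges per snapshot index.
toy_snapshots = [
    [("a", "b"), ("b", "c")],               # degrees: a=1, b=2, c=1
    [("a", "b"), ("a", "c"), ("a", "d")],   # degrees: a=3, b=1, c=1, d=1
    [],                                     # empty snapshot
]
avg_degrees = [average_degree(edges) for edges in toy_snapshots]
print(avg_degrees)
```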
Number of Nodes Over Time¶
The goal is to plot the number of active nodes per snapshot.
In this plot, the x-axis is the snapshot index (or timestamps), while the y-axis denotes the number of active nodes.
Number of Edges Over Time¶
The goal is to plot the number of edges per snapshot.
The x-axis is the snapshot index (or timestamps), while the y-axis denotes the number of edges.
Number of Nodes and Edges Over Time¶
The goal is to plot the number of active nodes and edges per snapshot in the same figure.
The x-axis is the snapshot index (or timestamps), while the y-axis denotes the number of active nodes and edges.
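The two counts plotted here can be sketched in a few lines of Python. The sketch assumes that a node is "active" in a snapshot if it is an endpoint of at least one edge in that snapshot, and that repeated edges are counted each time they occur; both are assumptions about the convention, and `count_nodes_and_edges` is a hypothetical helper rather than a TGX function.

```python
def count_nodes_and_edges(snapshot_edges):
    """Return (number of active nodes, number of edges) for one snapshot."""
    active_nodes = {node for edge in snapshot_edges for node in edge}
    return len(active_nodes), len(snapshot_edges)


# Toy snapshots: one list of (src, dst) edges per snapshot index.
toy_snapshots = [
    [("a", "b"), ("b", "c")],
    [("a", "b")],
    [("c", "d"), ("d", "e"), ("e", "c"), ("a", "c")],
]
counts = [count_nodes_and_edges(edges) for edges in toy_snapshots]
print(counts)
```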
Number of Connected Components¶
The goal is to plot the number of connected components per snapshot.
The x-axis is the snapshot index (or timestamps), while the y-axis denotes the number of connected components.
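For readers who want to see what is being counted, here is a small union-find sketch that returns the component sizes of a single snapshot. It treats edges as undirected (i.e., it finds weakly connected components), which is an assumption; the helper name `connected_component_sizes` is illustrative and not part of TGX.

```python
def connected_component_sizes(snapshot_edges):
    """Sizes of the connected components of one snapshot, largest first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the components of the two endpoints of every edge.
    for src, dst in snapshot_edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    # Count how many nodes share each root.
    sizes = {}
    for node in parent:
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


toy_snapshots = [
    [("a", "b"), ("b", "c"), ("d", "e")],
    [("a", "b"), ("c", "d"), ("e", "f")],
]
sizes_per_snapshot = [connected_component_sizes(s) for s in toy_snapshots]
num_components = [len(sizes) for sizes in sizes_per_snapshot]
print(sizes_per_snapshot, num_components)
```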
Degree Density¶
The goal is to plot the heatmap of node degrees per snapshot.
The x-axis is the snapshot index (or timestamps), while the y-axis denotes the node degree.
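A heatmap of this kind can be thought of as one degree histogram per snapshot. The sketch below builds such histograms in plain Python, under the assumption that each edge contributes one to the degree of both endpoints; `degree_histogram` is a hypothetical helper, and TGX's own binning may differ.

```python
from collections import Counter


def degree_histogram(snapshot_edges):
    """Map each degree value to the number of nodes having that degree."""
    degree = Counter()
    for src, dst in snapshot_edges:
        degree[src] += 1
        degree[dst] += 1
    return dict(Counter(degree.values()))


toy_snapshots = [
    [("a", "b"), ("b", "c")],               # degrees: a=1, b=2, c=1
    [("a", "b"), ("a", "c"), ("a", "d")],   # degrees: a=3, b=1, c=1, d=1
]
histograms = [degree_histogram(edges) for edges in toy_snapshots]
print(histograms)
```

Each histogram corresponds to one column of the heatmap: the keys are positions on the y-axis and the values determine the color intensity.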
Temporal Edge Appearance (TEA) Plot¶
A TEA plot illustrates the proportion of repeated edges versus newly observed edges for each timestamp in a dynamic graph.
This plot was proposed in Poursafaei et al. 2022.
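The quantities behind a TEA plot can be sketched as follows: for each snapshot, count the distinct edges that have already appeared in an earlier snapshot and those that appear for the first time. This is an illustrative reading of the description above, not TGX's code; edges are treated as ordered `(src, dst)` pairs, and `new_and_repeated_counts` is a hypothetical name.

```python
def new_and_repeated_counts(snapshots):
    """Return a (new, repeated) pair of distinct-edge counts per snapshot."""
    seen = set()
    counts = []
    for edges in snapshots:
        current = set(edges)
        repeated = len(current & seen)   # edges observed in an earlier snapshot
        new = len(current - seen)        # edges observed for the first time
        counts.append((new, repeated))
        seen |= current
    return counts


toy_snapshots = [
    [("a", "b"), ("b", "c")],
    [("a", "b"), ("c", "d")],
    [("a", "b"), ("b", "c"), ("c", "d")],
]
tea_counts = new_and_repeated_counts(toy_snapshots)
print(tea_counts)
```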
Temporal Edge Traffic (TET) Plot¶
A TET plot visualizes the reoccurrence pattern of edges in different dynamic networks over time.
This plot was proposed in Poursafaei et al. 2022.
tgx.TET(
    dtdg,
    network_name=dataset.name,
    figsize=(9, 5),
    axis_title_font_size=24,
    ticks_font_size=24,
)
Info: Number of distinct edges (from index-edge map): 20296
29it [00:00, 3438.15it/s]
Info: edge-presence-matrix shape: (29, 20296) First level processing: Detecting edges present in train & test sets
100%|██████████| 24/24 [00:01<00:00, 20.33it/s]
Detecting transductive edges (seen in train, repeating in test)
100%|██████████| 5/5 [00:00<00:00, 23.35it/s]
Second level processing: Detecting edges 1) Only in train set, 2) only in test (inductive)
100%|██████████| 29/29 [00:01<00:00, 24.76it/s]
Info: edge-presence-matrix shape: (29, 20296) Info: plotting edge presence heatmap for . ...
Info: plotting done!
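The log above refers to an edge-presence matrix and to edges that are train-only, transductive (seen in train, repeating in test), or inductive (only in test). The following sketch shows one way such a matrix and labels could be built in plain Python. It is an assumption-laden illustration rather than TGX's implementation: the helper names, the toy data, and the choice of `num_train` as a snapshot index are all hypothetical.

```python
def edge_presence_matrix(snapshots):
    """Build a snapshots-by-edges 0/1 matrix plus the edge-to-column map."""
    edge_index = {}
    for edges in snapshots:
        for edge in edges:
            edge_index.setdefault(edge, len(edge_index))
    matrix = [[0] * len(edge_index) for _ in snapshots]
    for t, edges in enumerate(snapshots):
        for edge in edges:
            matrix[t][edge_index[edge]] = 1
    return matrix, edge_index


def classify_edges(matrix, num_train):
    """Label each edge column by where it appears relative to the split."""
    labels = []
    num_edges = len(matrix[0]) if matrix else 0
    for col in range(num_edges):
        in_train = any(row[col] for row in matrix[:num_train])
        in_test = any(row[col] for row in matrix[num_train:])
        if in_train and in_test:
            labels.append("transductive")
        elif in_train:
            labels.append("train_only")
        else:
            labels.append("inductive")
    return labels


toy_snapshots = [
    [("a", "b"), ("b", "c")],
    [("a", "b")],
    [("c", "d"), ("b", "c")],
]
matrix, edge_index = edge_presence_matrix(toy_snapshots)
labels = classify_edges(matrix, num_train=2)  # first two snapshots are "train"
print(matrix, labels)
```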
Temporal Graph Statistics¶
TGX provides APIs to compute the statistics of temporal graphs.
Here, we cover some of the functionality TGX provides for obtaining temporal graph statistics.
Function | Description | Returns |
---|---|---|
tgx.get_reoccurrence | Calculate the recurrence index | float |
tgx.get_surprise | Calculate the surprise index | float |
tgx.get_novelty | Calculate the novelty index | float |
tgx.get_avg_node_activity | Calculate the average node activity | float |
tgx.size_connected_components | Calculate the sizes of connected components | List[List[float]] |
tgx.get_avg_node_engagement | Calculate the average node engagement | List[float] |
Since some of the measures require a distinct test split, we should first set test_ratio.
Please note that temporal graph data is generally split chronologically.
You can use plot_for_snapshots to visualize the resulting statistics.
test_ratio = 0.15
# compute reoccurrence
tgx.get_reoccurrence(ctdg, test_ratio=test_ratio)
# compute surprise
tgx.get_surprise(ctdg, test_ratio=test_ratio)
# compute novelty
tgx.get_novelty(dtdg)
# compute node activity
tgx.get_avg_node_activity(dtdg)
INFO: Reoccurrence: 0.03712061378765655
INFO: Surprise: 0.7961586121437423
INFO: Novelty: 0.6590964516995399
INFO: Node activity ratio: 0.16558624321330645
0.16558624321330645
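The reoccurrence and surprise indices above depend on a chronological train/test split. The sketch below illustrates both steps on toy data, using the definitions as I understand them from Poursafaei et al. 2022: reoccurrence is the fraction of distinct training edges that reappear in the test set, and surprise is the fraction of distinct test edges never seen during training. The helper names are hypothetical, and splitting by edge count (rather than, say, by a timestamp quantile) is an assumption that may not match TGX.

```python
def chronological_split(edges, test_ratio):
    """Split (src, dst, time) edges so that the latest edges form the test set.

    Assumption: the split point is chosen by edge count.
    """
    ordered = sorted(edges, key=lambda edge: edge[2])
    cut = int(round(len(ordered) * (1 - test_ratio)))
    return ordered[:cut], ordered[cut:]


def reoccurrence_and_surprise(train, test):
    """Compute both indices over distinct (src, dst) pairs."""
    train_set = {(src, dst) for src, dst, _ in train}
    test_set = {(src, dst) for src, dst, _ in test}
    reoccurrence = len(train_set & test_set) / len(train_set) if train_set else 0.0
    surprise = len(test_set - train_set) / len(test_set) if test_set else 0.0
    return reoccurrence, surprise


toy_edges = [
    ("a", "b", 1), ("b", "c", 2), ("a", "b", 3), ("c", "d", 4),
    ("a", "c", 5), ("b", "c", 6), ("a", "b", 7), ("d", "e", 8),
]
train, test = chronological_split(toy_edges, test_ratio=0.25)
reoccurrence, surprise = reoccurrence_and_surprise(train, test)
print(reoccurrence, surprise)
```

In this toy example, one of the four distinct training edges reappears in the test set, and one of the two distinct test edges is new.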
Size of Connected Components¶
You can also visualize some statistics such as how the size of the largest component changes over time.
component_sizes = tgx.size_connected_components(dtdg)
largest_component_sizes = [max(inner_list) if inner_list else 0 for inner_list in component_sizes]
filename = f"{dataset.name}_largest_connected_component_size"
plot_for_snapshots(largest_component_sizes, y_title="Size of Largest Connected Component", filename="./"+filename)
Average Node Engagement¶
The goal is to calculate the average node engagement over time. Node engagement represents the average number of distinct nodes that establish at least one new connection during a timestamp.
engagements = tgx.get_avg_node_engagement(dtdg)
filename = f"{dataset.name}_average_node_engagement"
plot_for_snapshots(engagements, y_title="Average Engagement", filename="./"+filename)
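To illustrate the definition of node engagement given above, the following sketch counts, for each snapshot, the distinct nodes that are an endpoint of at least one edge not seen in any earlier snapshot. This is one reading of that sentence and may differ from what TGX actually computes; `node_engagement` is a hypothetical helper, and edges are treated as ordered `(src, dst)` pairs.

```python
def node_engagement(snapshots):
    """Per snapshot, count the nodes that take part in at least one new edge."""
    seen = set()
    engaged_counts = []
    for edges in snapshots:
        current = set(edges)
        new_edges = current - seen
        engaged_nodes = {node for edge in new_edges for node in edge}
        engaged_counts.append(len(engaged_nodes))
        seen |= current
    return engaged_counts


toy_snapshots = [
    [("a", "b"), ("b", "c")],   # both edges are new: a, b, c engage
    [("a", "b"), ("c", "d")],   # only (c, d) is new: c, d engage
    [("a", "b")],               # nothing new
]
toy_engagements = node_engagement(toy_snapshots)
print(toy_engagements)
```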