Load built-in and ported datasets from TGB
This tutorial shows how to load the built-in TGX datasets, datasets ported from TGB, and custom datasets, and how to subsample a graph.
import tgx
Access TGB datasets
In order to load TGB datasets you should first install the TGB package:
pip install py-tgb
Then pass the name of the dataset as a string, using the same call as in the cell below:
tgx.tgb_data("name")
The dataset names are as follows:
tgbl-wiki, tgbl-review, tgbl-coin, tgbl-comment, tgbl-flight
tgbn-trade, tgbn-genre, tgbn-reddit
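The prefix of each name encodes the task family in TGB: `tgbl-` datasets target link property prediction and `tgbn-` datasets target node property prediction. A name could be checked before it is passed to the loader. The sketch below is not part of TGX or TGB; `TGB_NAMES` and `tgb_task` are hypothetical helpers built only from the list above.

```python
# Hypothetical helper (not part of TGX or TGB): validate a dataset name
# against the list in this tutorial and report its task family.
TGB_NAMES = (
    "tgbl-wiki", "tgbl-review", "tgbl-coin", "tgbl-comment", "tgbl-flight",
    "tgbn-trade", "tgbn-genre", "tgbn-reddit",
)


def tgb_task(name: str) -> str:
    """Return 'link' for tgbl-* names and 'node' for tgbn-* names."""
    if name not in TGB_NAMES:
        raise ValueError(f"unknown TGB dataset {name!r}; expected one of {TGB_NAMES}")
    return "link" if name.startswith("tgbl-") else "node"


print(tgb_task("tgbl-wiki"))   # link
print(tgb_task("tgbn-trade"))  # node
```

Failing early on a misspelled name avoids starting a download for a dataset that does not exist.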
data_name = "tgbl-wiki"
dataset = tgx.tgb_data(data_name) #tgb datasets
ctdg = tgx.Graph(dataset)
raw file found, skipping download
Dataset directory is /mnt/f/code/TGB/tgb/datasets/tgbl_wiki
loading processed file
Number of loaded edges: 157474
Number of unique edges:18257
Available timestamps: 152757
Access other datasets
To load a built-in TGX dataset (from Poursafaei et al. 2022), replace dataset_name below with the name of the dataset:
tgx.builtin.dataset_name()
The available dataset names are:
mooc, uci, uslegis, unvote, untrade, flight, wikipedia, reddit, lastfm, contact, canparl, socialevo, enron
dataset = tgx.builtin.uci()
ctdg = tgx.Graph(dataset)
Number of loaded edges: 59835
Number of unique edges:20296
Available timestamps: 58911
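When the dataset name comes from a configuration file or a command-line argument, the loader can be looked up by name. This is a minimal sketch that assumes every built-in dataset is exposed as a zero-argument callable on `tgx.builtin`, as `uci()` is above. `load_builtin` and `BUILTIN_NAMES` are hypothetical names, not TGX API, and the module is passed in as a parameter so the helper does not import TGX itself.

```python
# Hypothetical helper (not TGX API): resolve a built-in dataset loader by name.
BUILTIN_NAMES = (
    "mooc", "uci", "uslegis", "unvote", "untrade", "flight", "wikipedia",
    "reddit", "lastfm", "contact", "canparl", "socialevo", "enron",
)


def load_builtin(name, module):
    """Call module.<name>() after checking name against the tutorial's list.

    Assumes each dataset is a zero-argument callable on `module`
    (for example module=tgx.builtin), as tgx.builtin.uci() is in this tutorial.
    """
    if name not in BUILTIN_NAMES:
        raise ValueError(f"unknown built-in dataset: {name!r}")
    loader = getattr(module, name, None)
    if not callable(loader):
        raise AttributeError(f"{module!r} has no callable loader named {name!r}")
    return loader()
```

With TGX installed, usage would look like `dataset = load_builtin("uci", tgx.builtin)` followed by `ctdg = tgx.Graph(dataset)`.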
Custom Datasets
You can load your own custom dataset from a .csv file and read it into a tgx.Graph object.
Let's start by loading a toy dataset into pandas and displaying its rows:
import pandas as pd
toy_fname = 'toy_data.csv'
df = pd.read_csv(toy_fname)
df
| | time | source | destination |
|---|---|---|---|
| 0 | 0 | 1 | 2 |
| 1 | 0 | 2 | 1 |
| 2 | 0 | 3 | 1 |
| 3 | 1 | 2 | 2 |
| 4 | 1 | 1 | 2 |
| 5 | 1 | 3 | 1 |
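The toy file itself is not included in this tutorial. Assuming it contains exactly the six rows displayed above, with a header row and no index column, it could be recreated with the standard library as sketched below. The grouping step at the end only illustrates a timestamp-keyed edge list; it is not TGX's internal representation.

```python
import csv
from collections import defaultdict

# Rows copied from the table above: (time, source, destination).
rows = [(0, 1, 2), (0, 2, 1), (0, 3, 1), (1, 2, 2), (1, 1, 2), (1, 3, 1)]

# Write the CSV with a header row and no index column.
with open("toy_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "source", "destination"])
    writer.writerows(rows)

# Read it back and group (source, destination) pairs by timestamp.
edges_by_time = defaultdict(list)
with open("toy_data.csv", newline="") as f:
    for rec in csv.DictReader(f):
        edges_by_time[int(rec["time"])].append(
            (int(rec["source"]), int(rec["destination"]))
        )

print(dict(edges_by_time))
```

Because the timestamp is the first column of the file, `t_col=0` is the matching setting when the file is read with `read_csv`.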
from tgx.io.read import read_csv
# header indicates whether there is a header row at the top
# index indicates whether the first column holds row indices
# t_col indicates which column holds the timestamps
edgelist = read_csv(toy_fname,
                    header=True,
                    index=False,
                    t_col=0)
tgx.Graph(edgelist=edgelist)
Number of loaded edges: 5
Number of unique edges: 4
Available timestamps: 2
<tgx.classes.graph.Graph at 0x7fde4755aca0>
Subsampling graphs
To subsample a graph, follow these steps:
discretize the data
create a graph object (G) from the data
subsample the graph with tgx.utils.graph_utils.subsampling
create a new graph from the subsampled edges
from tgx.utils.graph_utils import subsampling
sub_edges = subsampling(ctdg, selection_strategy="random", N=1000)  # N is the number of nodes to sample
subgraph = tgx.Graph(edgelist=sub_edges)
Generate graph subsample...
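To make the idea concrete, random node subsampling can be sketched in plain Python. This is an illustration under stated assumptions, not TGX's implementation: it assumes that N nodes are drawn uniformly at random and that an edge is kept only when both of its endpoints were drawn. `subsample_edges` and its `seed` parameter are hypothetical.

```python
import random


def subsample_edges(edges, n_nodes, seed=None):
    """Keep edges whose endpoints both fall in a random sample of nodes.

    edges: iterable of (time, source, destination) tuples.
    n_nodes: number of nodes to draw (capped at the number of nodes present).
    seed: optional seed for a reproducible sample.
    """
    edges = list(edges)
    nodes = sorted({u for _, u, _ in edges} | {v for _, _, v in edges})
    rng = random.Random(seed)
    kept_nodes = set(rng.sample(nodes, min(n_nodes, len(nodes))))
    kept_edges = [
        (t, u, v) for t, u, v in edges if u in kept_nodes and v in kept_nodes
    ]
    return kept_edges, kept_nodes


# Toy edge list with the same rows as the custom-dataset example.
toy = [(0, 1, 2), (0, 2, 1), (0, 3, 1), (1, 2, 2), (1, 1, 2), (1, 3, 1)]
toy_sub_edges, toy_sub_nodes = subsample_edges(toy, n_nodes=2, seed=0)
print(toy_sub_nodes, toy_sub_edges)
```

Sampling nodes rather than edges keeps the induced subgraph among the chosen nodes intact, at the cost of dropping every edge that touches an unsampled node.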