Running a GNN with the DGL Library

뿅삥 2022. 1. 23. 19:43

In this post I run a GNN example using the DGL library.

https://www.dgl.ai/

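The homepage has installation instructions.

Follow them to install the build that matches your environment.

I went through the tutorials and looked for an example worth running.

As in the previous post, I'll follow along with graph classification on the MUTAG dataset.

First, import the dgl library and torch.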

import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

Load the dataset.

import dgl.data
dataset = dgl.data.GINDataset('MUTAG', self_loop=True)

DGL can also load the MUTAG dataset directly.

Next, pass the data to a DataLoader.

from dgl.dataloading import GraphDataLoader
from torch.utils.data.sampler import SubsetRandomSampler

num_examples = len(dataset)
num_train = int(num_examples * 0.8)

train_sampler = SubsetRandomSampler(torch.arange(num_train))
test_sampler = SubsetRandomSampler(torch.arange(num_train, num_examples))

train_dataloader = GraphDataLoader(
    dataset, sampler=train_sampler, batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(
    dataset, sampler=test_sampler, batch_size=5, drop_last=False)
    
it = iter(train_dataloader)
batch = next(it)

Unpack one batch and inspect it.

batched_graph, labels = batch
print('Number of nodes for each graph element in the batch:', batched_graph.batch_num_nodes())
print('Number of edges for each graph element in the batch:', batched_graph.batch_num_edges())

# Recover the original graph elements from the minibatch
graphs = dgl.unbatch(batched_graph)
print('The original graphs in the minibatch:')
print(graphs)
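
To make the batching step a bit more concrete, here is a minimal pure-Python sketch of the idea behind a batched graph: the graphs in a minibatch are merged into one larger graph made of disconnected components, with each graph's node IDs shifted by an offset. The `batch_graphs` helper and the toy edge lists are hypothetical and only illustrate the concept; they are not DGL's implementation.

```python
def batch_graphs(graphs):
    """Merge (num_nodes, edges) graphs into one disjoint-union graph.

    Hypothetical helper for illustration only, not part of DGL.
    """
    batched_edges = []
    nodes_per_graph = []
    edges_per_graph = []
    offset = 0  # number of nodes already placed in the batched graph
    for num_nodes, edges in graphs:
        # Shift node IDs so graphs do not collide with each other
        batched_edges.extend((u + offset, v + offset) for u, v in edges)
        nodes_per_graph.append(num_nodes)
        edges_per_graph.append(len(edges))
        offset += num_nodes
    return offset, batched_edges, nodes_per_graph, edges_per_graph


# Two toy graphs: a 3-node path and a single 2-node edge
toy_graphs = [
    (3, [(0, 1), (1, 2)]),
    (2, [(0, 1)]),
]
total_nodes, edges, batch_num_nodes, batch_num_edges = batch_graphs(toy_graphs)
print(total_nodes, edges)
print(batch_num_nodes, batch_num_edges)
```

The two per-graph lists play the same role as `batch_num_nodes()` and `batch_num_edges()` above: they record where each original graph sits inside the batch, which is what makes unbatching possible.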

Next, define the model for graph classification. GraphConv is already defined in dgl.nn, so we import it and define the model on top of torch. This part should look familiar to torch users.

from dgl.nn import GraphConv

class GCN(nn.Module):
    def __init__(self, in_feats, h_feats, num_classes):
        super(GCN, self).__init__()
        self.conv1 = GraphConv(in_feats, h_feats)
        self.conv2 = GraphConv(h_feats, num_classes)

    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        g.ndata['h'] = h
        return dgl.mean_nodes(g, 'h')
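
The last line of `forward` is the readout step: `dgl.mean_nodes(g, 'h')` averages the node feature `'h'` over the nodes of each graph in the batch, so the model outputs one vector per graph rather than one per node. A rough plain-Python sketch of that computation follows; `mean_readout` and the toy numbers are hypothetical and only meant to show the arithmetic.

```python
def mean_readout(node_feats, batch_num_nodes):
    """Average node feature rows per graph.

    node_feats: list of feature rows for all nodes in the batch, in order.
    batch_num_nodes: number of nodes belonging to each graph.
    Hypothetical helper for illustration only.
    """
    out = []
    start = 0
    for n in batch_num_nodes:
        rows = node_feats[start:start + n]  # nodes of the current graph
        dim = len(rows[0])
        out.append([sum(r[d] for r in rows) / n for d in range(dim)])
        start += n
    return out


# 5 nodes in total: the first 3 belong to graph 0, the last 2 to graph 1
node_feats = [
    [1.0, 0.0],
    [3.0, 2.0],
    [5.0, 4.0],
    [2.0, 2.0],
    [4.0, 6.0],
]
graph_feats = mean_readout(node_feats, [3, 2])
print(graph_feats)  # [[3.0, 2.0], [3.0, 4.0]]
```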

Create the model, set the training hyperparameters, and run training.

# Create the model with given dimensions
model = GCN(dataset.dim_nfeats, 16, dataset.gclasses)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(20):
    for batched_graph, labels in train_dataloader:
        pred = model(batched_graph, batched_graph.ndata['attr'].float())
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

num_correct = 0
num_tests = 0
for batched_graph, labels in test_dataloader:
    pred = model(batched_graph, batched_graph.ndata['attr'].float())
    num_correct += (pred.argmax(1) == labels).sum().item()
    num_tests += len(labels)

print('Test accuracy:', num_correct / num_tests)


# Test accuracy: 0.02631578947368421

The test accuracy comes out at about 2.6%.

Something is off.

Let's take a look at the data.

dataset.labels

#tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
#        0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

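Labels 0 and 1 are not mixed: the dataset is sorted by label, with all 125 label-0 graphs first, followed by the 63 label-1 graphs. The problem is probably in how the train/test split was made.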

train_sampler.indices

#tensor([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
#         14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,  27,
#         28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,  40,  41,
#         42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,  53,  54,  55,
#         56,  57,  58,  59,  60,  61,  62,  63,  64,  65,  66,  67,  68,  69,
#         70,  71,  72,  73,  74,  75,  76,  77,  78,  79,  80,  81,  82,  83,
#         84,  85,  86,  87,  88,  89,  90,  91,  92,  93,  94,  95,  96,  97,
#         98,  99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111,
#        112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
#        126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139,
#        140, 141, 142, 143, 144, 145, 146, 147, 148, 149])

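Looking at the indices of the train_sampler created above confirms it. The training set is simply the first 150 graphs, so the test set is the last 38 graphs, all of which have label 1.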
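
The class counts on each side of that split can be checked with a few lines of plain Python. This sketch assumes the label layout printed by `dataset.labels` above (125 zeros followed by 63 ones) and rebuilds it as a list instead of loading the dataset.

```python
from collections import Counter

# Label layout as printed by dataset.labels above: sorted by class
labels = [0] * 125 + [1] * 63

# Same sequential 80/20 split as in the first attempt
num_train = int(len(labels) * 0.8)
train_labels = labels[:num_train]
test_labels = labels[num_train:]

print(num_train)              # 150
print(Counter(train_labels))  # Counter({0: 125, 1: 25})
print(Counter(test_labels))   # Counter({1: 38})
```

With a training set dominated by label 0 and a test set made only of label 1, a model that mostly predicts label 0 gets almost every test graph wrong, which fits the 1/38 ≈ 0.026 accuracy above.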

Rebuild the split so that labels 0 and 1 are well mixed in the training data.

import random
ids = list(range(len(dataset)))  # 188 graphs in MUTAG
train_ids = random.sample(ids, 150)
train_id_set = set(train_ids)  # set for fast membership checks
test_ids = [x for x in ids if x not in train_id_set]
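
A plain random sample mixes the classes on average, but it does not guarantee the class proportions, and the split changes on every run unless the random generator is seeded. One possible alternative is a stratified split that shuffles and cuts each class separately. The sketch below is an assumption-laden illustration rather than part of the DGL tutorial: `stratified_split` is a hypothetical helper, the seed and the 0.8 ratio are arbitrary choices, and the labels are rebuilt from the counts printed earlier.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.8, seed=0):
    """Split indices so each class keeps roughly the same train/test ratio.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # local, seeded generator for reproducibility
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_ids, test_ids = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * train_frac)
        train_ids.extend(idxs[:cut])
        test_ids.extend(idxs[cut:])
    return train_ids, test_ids


labels = [0] * 125 + [1] * 63  # assumed layout, as printed by dataset.labels
train_ids, test_ids = stratified_split(labels)
print(len(train_ids), len(test_ids))  # 150 38
```

With the real dataset, `labels` could be taken from `dataset.labels.tolist()`, and the resulting index lists can be wrapped in `torch.tensor(...)` and passed to `SubsetRandomSampler` in the same way as the randomly sampled ones.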

Then rebuild the dataloaders and train again.

train_sampler = SubsetRandomSampler(torch.tensor(train_ids))
test_sampler = SubsetRandomSampler(torch.tensor(test_ids))

train_dataloader = GraphDataLoader(
    dataset, sampler=train_sampler, batch_size=5, drop_last=False)
test_dataloader = GraphDataLoader(
    dataset, sampler=test_sampler, batch_size=5, drop_last=False)
    
it = iter(train_dataloader)
batch = next(it)


batched_graph, labels = batch
print('Number of nodes for each graph element in the batch:', batched_graph.batch_num_nodes())
print('Number of edges for each graph element in the batch:', batched_graph.batch_num_edges())

# Recover the original graph elements from the minibatch
graphs = dgl.unbatch(batched_graph)
print('The original graphs in the minibatch:')
print(graphs)


# Create the model with given dimensions
model = GCN(dataset.dim_nfeats, 16, dataset.gclasses)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(20):
    for batched_graph, labels in train_dataloader:
        pred = model(batched_graph, batched_graph.ndata['attr'].float())
        loss = F.cross_entropy(pred, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

num_correct = 0
num_tests = 0
for batched_graph, labels in test_dataloader:
    pred = model(batched_graph, batched_graph.ndata['attr'].float())
    num_correct += (pred.argmax(1) == labels).sum().item()
    num_tests += len(labels)

print('Test accuracy:', num_correct / num_tests)

# Test accuracy: 0.6578947368421053

This looks much more reasonable.
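
As a rough reference point when reading this number, it can help to compare it against a majority-class baseline, that is, the accuracy of always predicting the most common label. The sketch below uses the label counts printed earlier and the accuracy reported above. The baseline is computed over the whole dataset, so the class balance of a particular random test split may differ somewhat.

```python
# Label counts as printed by dataset.labels above
labels = [0] * 125 + [1] * 63

# Accuracy of always predicting the most frequent class
majority_count = max(labels.count(c) for c in set(labels))
baseline_acc = majority_count / len(labels)

reported_acc = 0.6578947368421053  # test accuracy printed above
num_test = 38                      # size of the test split
num_correct = round(reported_acc * num_test)

print(f"majority-class baseline: {baseline_acc:.3f}")           # 0.665
print(f"correct test predictions: {num_correct} / {num_test}")  # 25 / 38
```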

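Conclusion

DGL lets you build models on whichever framework you already use: you can choose among PyTorch, TensorFlow, and MXNet. The install guide also has you pick a build for your CUDA version and check your OS and Python version, so dependency management seems to be handled very well. I haven't run it on a GPU yet, but multi-GPU usage is also well documented. On GitHub it has 8.8k stars at the time of writing, so it appears to be more popular than stellargraph.

The model structure and the training loop also look easy to customize, so it seems pleasant to work with.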