【原】DGL&RDKit|基于GCN与基于3D描述符的分子溶解度预测模型对比

DrugAI 2022-04-19

展开全文

GCN

图卷积神经网络（Graph Convolutional Networks, GCN ）

图卷积的原理

处理图形或网络的数据形式存在许多重要的实际问题，如社交网络、知识图形、蛋白质相互作用网络和分子图形等。然而，将深度学习应用于这些图形数据是非常重要的，因为它具有独特地图特征。人们非常关注神经网络模型对这种结构化图形数据的概括。过去的几年中，许多论文重新讨论推广神经网络以处理任意结构化图形的问题。下面的小节中给出了图的表示和图卷的两种方式，即空间卷积和谱卷积。空间卷积GCN是可区分的消息传递模式，其在局部图形邻域上操作到任意图形。对于社交网络，知识图和分子图等图形，它比谱卷积更受欢迎。谱卷积GCN的思想是利用光谱理论在拓扑图上实现卷积运算，通常用于处理数据，如图像和视频。

图形定义

图（graph）是一种数据格式，它可以用于表示社交网络、通信网络、蛋白分子网络等，图中的节点表示网络中的个体，连边表示个体之间的连接关系。许多机器学习任务例如社团发现、链路预测等都需要用到图结构数据，因此图卷积神经网络的出现为这些问题的解决提供了新的思路。

空间卷积

早期尝试推广结构化数据的判别嵌入中，Dai等人提出了structure2vec，一种用于嵌入图结构化数据的潜变量模型，在图形模型中使用近似推理算法。推理算法的解决方案意味着一个传播方程，其中节点的表示是邻域边缘和来自邻居消息的函数。后来大部分GCN都建立在这个概念之上，并进行了广泛的修改，称为空间卷积。

空间卷积旨在直接在顶点域中构造卷积。关键思想是通过聚合来自其相邻节点的信息来更新某个节点的表示。空间卷积与Weisfeiler-Lehman算法一致，通常用于测试两个图是否是同构，其中节点标签由相邻节点的有序标签集重复地增强。这种传播的基本机制是首先将邻域信息视为图子结构，然后通过将不同的子结构递归地投影到不同的特征空间中，通过可微函数对这种子结构进行建模。邻居和中心节点之间的信息也称为消息。消息传递到中心节点的方式产生表征网络体系结构的不同传播规则。

谱卷积

分子图

图形提供了一种自然的方式来表示化学结构。这种情况下，原子是图的节点，键是边。然后可以为原子类型的节点“着色”，并为键类型“权衡”边缘。该图包含分子图和不同特性的边。

分子图的元素是：

原子
键

任何两个原子之间仅允许一个连接/键，并且不允许自连接。

基于深度图学习框架DGL和GCN预测溶解度

导入库

from collections import namedtuple
import dgl
from dgl import DGLGraph
import dgl.function as fn
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, TensorDataset
from torch.utils.data import DataLoader
import torch.optim as optim
from rdkit import Chem
from rdkit.Chem import RDConfig
import os
import copy
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

定义元素列表

ELEM_LIST = ['C', 'N', 'O', 'S', 'F', 'Si', 'P', 'Cl', 'Br', 'Mg', 'Na', 'Ca', 'Fe', 'Al', 'I', 'B', 'K', 'Se', 'Zn', 'H', 'Cu', 'Mn', 'unknown']
ATOM_FDIM = len(ELEM_LIST) + 6 + 5 + 1
MAX_ATOMNUM =60
BOND_FDIM = 5 
MAX_NB = 10

代码来自dgl的 junction tree，生成分子结构图

PAPER = os.getenv('PAPER', False)
def onek_encoding_unk(x, allowable_set):
    if x not in allowable_set:
        x = allowable_set[-1]
    return [x == s for s in allowable_set]
# Note that during graph decoding they don't predict stereochemistry-related
# characteristics (i.e. Chiral Atoms, E-Z, Cis-Trans).  Instead, they decode
# the 2-D graph first, then enumerate all possible 3-D forms and find the
# one with highest score.
'''
def atom_features(atom):
    return (torch.Tensor(onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()]))
'''
def atom_features(atom):
    return (onek_encoding_unk(atom.GetSymbol(), ELEM_LIST)
            + onek_encoding_unk(atom.GetDegree(), [0,1,2,3,4,5])
            + onek_encoding_unk(atom.GetFormalCharge(), [-1,-2,1,2,0])
            + [atom.GetIsAromatic()])
def bond_features(bond):
    bt = bond.GetBondType()
    return (torch.Tensor([bt == Chem.rdchem.BondType.SINGLE, bt == Chem.rdchem.BondType.DOUBLE, bt == Chem.rdchem.BondType.TRIPLE, bt == Chem.rdchem.BondType.AROMATIC, bond.IsInRing()]))
def mol2dgl_single(mols):
    cand_graphs = []
    n_nodes = 0
    n_edges = 0
    bond_x = []
    for mol in mols:
        n_atoms = mol.GetNumAtoms()
        n_bonds = mol.GetNumBonds()
        g = DGLGraph()        
        nodeF = []
        for i, atom in enumerate(mol.GetAtoms()):
            assert i == atom.GetIdx()
            nodeF.append(atom_features(atom))
        g.add_nodes(n_atoms)
        bond_src = []
        bond_dst = []
        for i, bond in enumerate(mol.GetBonds()):
            a1 = bond.GetBeginAtom()
            a2 = bond.GetEndAtom()
            begin_idx = a1.GetIdx()
            end_idx = a2.GetIdx()
            features = bond_features(bond)
            bond_src.append(begin_idx)
            bond_dst.append(end_idx)
            bond_x.append(features)
            bond_src.append(end_idx)
            bond_dst.append(begin_idx)
            bond_x.append(features)
        g.add_edges(bond_src, bond_dst)
        g.ndata['h'] = torch.Tensor(nodeF)
        cand_graphs.append(g)
    return cand_graphs
msg = fn.copy_src(src="h", out="m")

定义GCN

class NodeApplyModule(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(NodeApplyModule, self).__init__()
        self.linear = nn.Linear(in_feats, out_feats)
        self.activation = activation
    def forward(self, node):
        h = self.linear(node.data['h'])
        h = self.activation(h)
        return {'h': h}
class GCN(nn.Module):
    def __init__(self, in_feats, out_feats, activation):
        super(GCN, self).__init__()
        self.apply_mod = NodeApplyModule(in_feats, out_feats, activation)
    def forward(self, g, feature):
        g.ndata['h'] = feature
        g.update_all(msg, reduce)
        g.apply_nodes(func=self.apply_mod)
        h =  g.ndata.pop('h')
        #print(h.shape)
        return h
class Classifier(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_classes):
        super(Classifier, self).__init__()
        self.layers = nn.ModuleList([GCN(in_dim, hidden_dim, F.relu),
                                    GCN(hidden_dim, hidden_dim, F.relu)])
        self.classify = nn.Linear(hidden_dim, n_classes)
    def forward(self, g):
        h = g.ndata['h']
        for conv in self.layers:
            h = conv(g, h)
        g.ndata['h'] = h
        hg = dgl.mean_nodes(g, 'h')
        return self.classify(hg)

模型训练

epoch_losses = []
for epoch in range(200):
    epoch_loss = 0
    for i, (bg, label) in enumerate(data_loader):
        bg.set_e_initializer(dgl.init.zero_initializer)
        bg.set_n_initializer(dgl.init.zero_initializer)        
        pred = model(bg)
        loss = loss_func(pred, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.detach().item()
    epoch_loss /= (i + 1)
    if (epoch+1) % 20 == 0:
        print('Epoch {}, loss {:.4f}'.format(epoch+1, epoch_loss))
    epoch_losses.append(epoch_loss)
plt.plot(epoch_losses, c='b')

模型评价

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
accuracy_score(test_y, pred_y)

0.7665369649805448

print(classification_report(test_y, pred_y))

基于3D分子描述符预测溶解度

导入包

from rdkit.Chem import AllChem
from rdkit.Chem.Descriptors import rdMolDescriptors
from sklearn.preprocessing import normalize
from sklearn.ensemble import RandomForestClassifier

定义3D描述符计算函数，并计算3D描述符

def calc_dragon_type_desc(mol):
    return rdMolDescriptors.CalcAUTOCORR3D(mol) + rdMolDescriptors.CalcMORSE(mol) + \
        rdMolDescriptors.CalcRDF(mol) + rdMolDescriptors.CalcWHIM(mol)
train_X = normalize([calc_dragon_type_desc(m) for m in train_mols2])
test_X = normalize([calc_dragon_type_desc(m) for m in test_mols2])

基于随机森林的分类

rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(train_X, train_y)

模型评价

参考资料

项目地址：https://github.com/dmlc/dgl

所有示例模型的详细从零教程：

https://docs./tutorials/models/index.htm

DGL | 基于深度学习框架DGL的分子图初探

作者 / 编辑：王建民

DrugAI

转藏分享

QQ空间 QQ好友新浪微博微信

献花（0） +1

来自： DrugAI > 《待分类》

举报/认领

0条评论

发表

请遵守用户评论公约

类似文章 更多

DrugAI

关注对话

TA的最新馆藏

J. Chem. Inf. Model. | 双通道异构图神经网络用于预测microRNA调控的药物敏感性
基因组深度学习模型很难很好地解释个体转录组的变异
深度学习在人类基因组学中的应用：下一代测序数据的综述
Soft-introspective VAEs：超越AlphaFold2，揭示K-Ras蛋白新视野
自我一致性可以提升语言模型中推理的质量
化学家现在可以将单个原子从分子核心中移入和移出

喜欢该文的人也喜欢更多

热门阅读换一换