PaddlePaddle
diff --git a/.gitignore b/.gitignore (+7 -1)
diff --git a/README.md b/README.md (+4)
diff --git a/examples/metapath2vec/README.md b/examples/metapath2vec/README.md (+98)
diff --git a/examples/metapath2vec/config.yaml b/examples/metapath2vec/config.yaml (+43)
diff --git a/examples/metapath2vec/data_preprocess.py b/examples/metapath2vec/data_preprocess.py (+157)
diff --git a/examples/metapath2vec/datasets/__init__.py b/examples/metapath2vec/datasets/__init__.py (+19)
@@ -57,8 +57,14 @@ coverage.xml
 
 # Sphinx documentation
 /docs/_build/
+docs/build
 
 # tutorials: jupyter log
 tutorials/.ipynb_checkpoints/
 .ipynb_checkpoints
-*.ipynb
+**checkpoints
+**logs
+**outputs
+pgl/tests/local*
+jupyter
+.gitignore.bak
@@ -23,6 +23,10 @@ Leaderboards can be found [here](https://ogb.stanford.edu/kddcup2021/results/).
 - Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification, to appear in **IJCAI2021**.
 - HGAMN: Heterogeneous Graph Attention Matching Network for Multilingual POI Retrieval at Baidu Maps, to appear in **KDD2021**.
 
+**PGL Distributed Graph Engine API released!!**
+
+- Our Distributed Graph Engine API has been released. We provide a [tutorial](./tutorials/working_with_distributed_graph_engine.ipynb) that shows how to launch a graph engine and a [demo](./examples/metapath2vec) that trains a model using the graph engine.
+
 
 PGL v2.1 2021.02.02
 
 
@@ -0,0 +1,98 @@
+# metapath2vec: Scalable Representation Learning for Heterogeneous Networks
+
+[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is an algorithm framework for representation learning in heterogeneous networks, which contain multiple types of nodes and links. Given a heterogeneous graph, the metapath2vec algorithm first generates meta-path-based random walks and then trains a skip-gram model on them, as in language modeling. We reproduce the metapath2vec algorithm on top of the PGL graph engine for scalable representation learning.
+
+
+## Dependencies
+
+- paddlepaddle>=2.1.0
+
+- pgl>=2.1.4
+
+- OpenMPI==1.4.1
+
+## Datasets
+
+You can download datasets from [here](https://ericdongyx.github.io/metapath2vec/m2v.html).
+
+We use the "aminer" data as an example. After downloading it, put the files in a directory such as `./data/net_aminer/`, and move the `label/` directory into `./data/`.
+
+## Data preprocessing
+
+After downloading the dataset, run the following command to preprocess the data:
+
+```
+python data_preprocess.py --config config.yaml
+```
+
+## Hyperparameters
+
+All hyperparameters are stored in `config.yaml`. Before training, edit that file to change them as needed.
+
+## PGL Graph Engine Launching
+
+Distributed loading of graph data is now supported through **PGL Graph Engine**. A short tutorial on launching a graph engine is available [here](../../tutorials/working_with_distributed_graph_engine.ipynb).
+
+To launch a distributed graph service, please follow the steps below.
+
+### IP address setting
+
+The first step is to set the IP list for the graph servers. Each `IP:port` entry represents one server. For this demo, `ip_list.txt` lists 4 addresses as follows:
+
+```
+127.0.0.1:8553
+127.0.0.1:8554
+127.0.0.1:8555
+127.0.0.1:8556
+```
+
+### Launching Graph Engine by OpenMPI
+
+Before launching the graph engine, set the following hyperparameters in `config.yaml`:
+
+```
+etype2files: "p2a:./graph_data/paper2author_edges.txt,p2c:./graph_data/paper2conf_edges.txt"
+ntype2files: "p:./graph_data/node_types.txt,a:./graph_data/node_types.txt,c:./graph_data/node_types.txt"
+symmetry: True
+shard_num: 100
+```
+
+Then, we can launch the graph engine with the help of OpenMPI.
+
+```
+mpirun -np 4 python -m pgl.distributed.launch --ip_config ./ip_list.txt --conf ./config.yaml --mode mpi --shard_num 100
+```
+
+### Launching Graph Engine manually
+
+If OpenMPI is not installed, you can launch the graph engine manually.
+
+For example, to use 4 servers, run each of the following commands in its own terminal.
+
+```
+# terminal 3
+python -m pgl.distributed.launch --ip_config ./ip_list.txt --conf ./config.yaml --shard_num 100 --server_id 3
+
+# terminal 2
+python -m pgl.distributed.launch --ip_config ./ip_list.txt --conf ./config.yaml --shard_num 100 --server_id 2
+
+# terminal 1
+python -m pgl.distributed.launch --ip_config ./ip_list.txt --conf ./config.yaml --shard_num 100 --server_id 1
+
+# terminal 0
+python -m pgl.distributed.launch --ip_config ./ip_list.txt --conf ./config.yaml --shard_num 100 --server_id 0
+```
+
+Note that the server with `server_id` 0 should be launched last.
+
+
+## Training
+
+After the graph engine has launched successfully, run the following command to train the model.
+
+```
+export CUDA_VISIBLE_DEVICES=0
+python train.py --config ./config.yaml --ip ./ip_list.txt
+```
+
+Note that the trained model will be saved to `./ckpt_custom/$task_name/`.
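The README in this hunk passes two kinds of packed strings to the graph engine: `host:port` lines in `ip_list.txt` and comma-separated `type:path` pairs in `etype2files` / `ntype2files`. The following is a minimal sketch of how such strings can be split, useful for sanity-checking a configuration before launching servers. The helper names `parse_type2files` and `parse_ip_list` are hypothetical and are not part of PGL; the engine's own parsing may differ.

```python
def parse_type2files(spec):
    """Split a "type:path,type:path" string into a {type: path} dict."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, so a path may contain later colons.
        type_name, path = item.split(":", 1)
        mapping[type_name.strip()] = path.strip()
    return mapping


def parse_ip_list(text):
    """Turn "host:port" lines into (host, port) tuples, skipping blank lines."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            # Split on the last colon: the port is always the final field.
            host, port = line.rsplit(":", 1)
            servers.append((host, int(port)))
    return servers


# Values copied from the README and config.yaml in this change.
etype2files = parse_type2files(
    "p2a:./graph_data/paper2author_edges.txt,"
    "p2c:./graph_data/paper2conf_edges.txt")
servers = parse_ip_list(
    "127.0.0.1:8553\n127.0.0.1:8554\n127.0.0.1:8555\n127.0.0.1:8556\n")

print(etype2files)
print(servers)
```

With four entries in the IP list, the `-np 4` passed to `mpirun` and the four `--server_id` values (0 to 3) in the manual launch line up with `len(servers)`.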
@@ -0,0 +1,43 @@
+task_name: distributed_metapath2vec
+
+# ---------------------------data configuration-------------------------------------------------#
+# for data preprocessing
+data_path: ./data/net_aminer
+author_label_file: ./data/label/googlescholar.8area.author.label.txt
+venue_label_file: ./data/label/googlescholar.8area.venue.label.txt
+processed_path: ./graph_data
+
+# for pgl graph engine
+etype2files: "p2a:./graph_data/paper2author_edges.txt,p2c:./graph_data/paper2conf_edges.txt"
+ntype2files: "p:./graph_data/node_types.txt,a:./graph_data/node_types.txt,c:./graph_data/node_types.txt"
+symmetry: True
+meta_path: "c2p-p2a-a2p-p2c"
+first_node_type: "c"
+
+shard_num: 100
+
+walk_len: 24
+win_size: 3
+neg_num: 5
+walk_times: 20
+
+
+# ---------------------------model parameter configuration---------------------------------------------#
+model_type: SkipGramModel
+warm_start_from: null
+num_nodes: 5000000
+embed_size: 64
+sparse_embed: False
+
+# ---------------------------training parameter configuration---------------------------------------------#
+epochs: 1
+num_workers: 4
+lr: 0.001
+lazy_mode: False
+batch_node_size: 200
+batch_pair_size: 1000
+pair_stream_shuffle_size: 100000
+log_dir: ./logs
+output_dir: ./outputs
+save_dir: ./checkpoints
+log_steps: 1000
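The `meta_path`, `first_node_type`, `walk_len`, and `win_size` settings in this config describe a meta-path-guided random walk followed by skip-gram pair extraction. The sketch below illustrates that idea on a toy in-memory graph. It is not the PGL graph engine's sampler: the helper names are hypothetical, it assumes that `symmetry: True` makes the reversed edge types (`a2p`, `c2p`) available, that edge type names have the form `src2dst`, and that `win_size` counts context nodes on each side of the center node.

```python
import random


def build_adjacency(edges_by_type, symmetry=True):
    """Build {edge_type: {src: [dst, ...]}}; optionally add reversed edge types."""
    adj = {}
    for etype, edges in edges_by_type.items():
        src_t, dst_t = etype.split("2", 1)  # assumes names like "p2a"
        forward = adj.setdefault(etype, {})
        for src, dst in edges:
            forward.setdefault(src, []).append(dst)
        if symmetry:
            backward = adj.setdefault(dst_t + "2" + src_t, {})
            for src, dst in edges:
                backward.setdefault(dst, []).append(src)
    return adj


def metapath_walk(adj, start, meta_path, walk_len, rng):
    """Walk from `start`, cycling through the edge types listed in `meta_path`."""
    etypes = meta_path.split("-")
    walk = [start]
    step = 0
    while len(walk) < walk_len:
        neighbors = adj[etypes[step % len(etypes)]].get(walk[-1])
        if not neighbors:
            break  # dead end: no neighbor of the required type
        walk.append(rng.choice(neighbors))
        step += 1
    return walk


def skipgram_pairs(walk, win_size):
    """Emit (center, context) pairs within `win_size` positions of each node."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - win_size)
        hi = min(len(walk), i + win_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


# Toy graph: one venue, two papers, two authors.
adj = build_adjacency({
    "p2a": [("p0", "a0"), ("p1", "a0"), ("p1", "a1")],
    "p2c": [("p0", "c0"), ("p1", "c0")],
})
walk = metapath_walk(adj, "c0", "c2p-p2a-a2p-p2c", walk_len=9,
                     rng=random.Random(0))
pairs = skipgram_pairs(walk, win_size=3)
print(walk)
print(len(pairs))
```

Because the meta path starts and ends at a venue node, it can be repeated cyclically, which is why the walk can exceed the four hops listed in `meta_path`.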
@@ -0,0 +1,157 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Data pre-processing for metapath2vec model.
+"""
+
+import os
+import sys
+import tqdm
+import time
+import logging
+import random
+import argparse
+import numpy as np
+import pickle as pkl
+
+from pgl.utils.logger import log
+from utils.config import prepare_config, make_dir
+
+# name  ID  g_index
+
+
+def remapping_id(file_, start_index, node_type, separator="\t"):
+    """Map node IDs and names to global indices.
+    """
+    node_types = []
+    id2index = {}
+    name2index = {}
+    index = start_index
+    with open(file_, encoding="ISO-8859-1") as reader:
+        for line in reader:
+            tokens = line.strip().split(separator)
+            id2index[tokens[0]] = str(index)
+            if len(tokens) == 2:
+                name2index[tokens[1]] = str(index)
+            node_types.append((str(index), node_type))
+            index += 1
+
+    return id2index, name2index, node_types
+
+
+def load_edges(file_, src2index, dst2index, symmetry=False):
+    """Load edges from file.
+    """
+    edges = []
+    with open(file_, 'r') as reader:
+        for line in reader:
+            items = line.strip().split()
+            src, dst = src2index[items[0]], dst2index[items[1]]
+            edges.append((src, dst))
+            if symmetry:
+                edges.append((dst, src))
+        edges = list(set(edges))
+    return edges
+
+
+def load_label(file_, name2index):
+    """Load (node index, zero-based label) pairs for names found in name2index.
+    """
+    index_label = []
+    with open(file_, encoding="ISO-8859-1") as reader:
+        for line in reader:
+            tokens = line.strip().split(' ')
+            name, label = tokens[0], int(tokens[1]) - 1
+            if name in name2index:
+                index_label.append((name2index[name], str(label)))
+
+    return index_label
+
+
+def main(config):
+    conf_id2index, conf_name2index, conf_node_type = remapping_id(
+        os.path.join(config.data_path, 'id_conf.txt'),
+        start_index=0,
+        node_type='c')
+    log.info('%d venues have been loaded.' % (len(conf_id2index)))
+
+    author_id2index, author_name2index, author_node_type = remapping_id(
+        os.path.join(config.data_path, 'id_author.txt'),
+        start_index=len(conf_id2index),
+        node_type='a')
+    log.info('%d authors have been loaded.' % (len(author_id2index)))
+
+    paper_id2index, paper_name2index, paper_node_type = remapping_id(
+        os.path.join(config.data_path, 'paper.txt'),
+        start_index=(len(conf_id2index) + len(author_id2index)),
+        node_type='p',
+        separator='\t')
+    log.info('%d papers have been loaded.' % (len(paper_id2index)))
+
+    node_types = conf_node_type + author_node_type + paper_node_type
+
+    paper2author_edges = load_edges(
+        os.path.join(config.data_path, 'paper_author.txt'), paper_id2index,
+        author_id2index)
+    log.info('%d paper2author edges have been loaded.' %
+             (len(paper2author_edges)))
+
+    paper2conf_edges = load_edges(
+        os.path.join(config.data_path, 'paper_conf.txt'), paper_id2index,
+        conf_id2index)
+    log.info('%d paper2conf edges have been loaded.' % (len(paper2conf_edges)))
+
+    author_label = load_label(config.author_label_file, author_name2index)
+    conf_label = load_label(config.venue_label_file, conf_name2index)
+
+    make_dir(config.processed_path)
+    node_types_file = os.path.join(config.processed_path, 'node_types.txt')
+    log.info("saving node_types to %s" % node_types_file)
+    with open(node_types_file, 'w') as writer:
+        for item in tqdm.tqdm(node_types):
+            writer.write("%s\t%s\n" % (item[1], item[0]))
+
+    p2a_edges_file = os.path.join(config.processed_path,
+                                  'paper2author_edges.txt')
+    log.info("saving paper2author edges to %s" % p2a_edges_file)
+    with open(p2a_edges_file, 'w') as writer:
+        for item in tqdm.tqdm(paper2author_edges):
+            writer.write("\t".join(item) + "\n")
+
+    p2c_edges_file = os.path.join(config.processed_path,
+                                  'paper2conf_edges.txt')
+    log.info("saving paper2conf edges to %s" % p2c_edges_file)
+    with open(p2c_edges_file, 'w') as writer:
+        for item in tqdm.tqdm(paper2conf_edges):
+            writer.write("\t".join(item) + "\n")
+
+    author_label_file = os.path.join(config.processed_path, 'author_label.txt')
+    log.info("saving author label to %s" % author_label_file)
+    with open(author_label_file, 'w') as writer:
+        for item in tqdm.tqdm(author_label):
+            writer.write("\t".join(item) + "\n")
+
+    conf_label_file = os.path.join(config.processed_path, 'conf_label.txt')
+    log.info("saving conf label to %s" % conf_label_file)
+    with open(conf_label_file, 'w') as writer:
+        for item in tqdm.tqdm(conf_label):
+            writer.write("\t".join(item) + "\n")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='metapath2vec')
+    parser.add_argument('--config', default="./config.yaml", type=str)
+    args = parser.parse_args()
+
+    config = prepare_config(args.config)
+
+    main(config)
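The preprocessing script above gives venues, authors, and papers one shared, contiguous index space by passing the running count of earlier node types as `start_index`. The sketch below replays that offset scheme on small in-memory lists instead of the aminer files; `remap_ids` and the raw IDs are hypothetical stand-ins used only for illustration.

```python
def remap_ids(raw_ids, start_index, node_type):
    """Assign consecutive string indices, starting at `start_index`, to raw IDs."""
    id2index = {}
    node_types = []
    index = start_index
    for raw_id in raw_ids:
        id2index[raw_id] = str(index)
        node_types.append((str(index), node_type))
        index += 1
    return id2index, node_types


# Made-up raw IDs standing in for the contents of the three input files.
conf_ids = ["v10", "v20"]
author_ids = ["a7", "a8", "a9"]
paper_ids = ["p100", "p200"]

conf2index, conf_types = remap_ids(conf_ids, 0, "c")
author2index, author_types = remap_ids(author_ids, len(conf2index), "a")
paper2index, paper_types = remap_ids(
    paper_ids, len(conf2index) + len(author2index), "p")

# Same "type<TAB>index" layout that the script writes to node_types.txt.
node_type_lines = [
    "%s\t%s" % (node_type, index)
    for index, node_type in conf_types + author_types + paper_types
]

# An edge between raw IDs becomes a pair of global indices.
edge = (paper2index["p100"], author2index["a8"])
print(node_type_lines)
print(edge)  # ('5', '3')
```

The offsets are taken from dictionary sizes, so this scheme relies on raw IDs being unique within each input list; a repeated ID would make a later node type start at an index that is already in use.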
@@ -0,0 +1,19 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from datasets import dataset
+from datasets.dataset import *
+
+__all__ = []
+__all__ += dataset.__all__