
Commit 15b3dad

tianxin and ZeyuChen authored
Add SimCSE (PaddlePaddle#854)
* add FewCLUE 9 datasets
* fix a bug for tnews
* Add CI for Ernie text matching
* Add CI for Ernie text matching
* Add CI for Ernie text matching
* fix encoding problem for windows
* update ernie_text_matching ci
* standard GPU id for CI
* standard GPU id for CI
* add simcse
* update train.py
* support multi-card training
* add README.md

Co-authored-by: Zeyu Chen <chenzeyu01@baidu.com>
1 parent 9d95922 commit 15b3dad

5 files changed: +705 −0 lines changed

examples/text_matching/simcse/README.md

+112
@@ -0,0 +1,112 @@
# Unsupervised Semantic Matching Model [SimCSE](https://arxiv.org/abs/2104.08821)

We implement the SimCSE model and evaluate its unsupervised matching performance on 4 widely used Chinese semantic matching datasets. SimCSE is well suited to matching and retrieval scenarios that lack labeled data but have plenty of unlabeled text.

## Evaluation

This project uses the training sets of 4 semantic matching datasets, LCQMC, BQ_Corpus, STS-B and ATEC, as unsupervised training data (only the text is used, not the labels), and evaluates on the dev set of each dataset. Following the SimCSE paper, the evaluation metric is the Spearman correlation coefficient; a higher Spearman correlation indicates a better model.

| Model | Infer_with_fc | LCQMC | BQ_Corpus | STS-B | ATEC |
| ------- | ------- | ------- | ----- | ------ | ----- |
| ERNIE-1.0 | yes | 52.33 | 43.75 | 66.66 | 29.78 |
| ERNIE-1.0 | no | 57.01 | 51.72 | 74.76 | 33.56 |

**Note**: Infer_with_fc indicates whether the forward pass goes through the final fc layer from training when computing text embeddings at inference time. As the table shows, skipping this fc layer at inference significantly improves unsupervised matching performance.
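
As a reference for the metric above, here is a minimal sketch of how a Spearman score can be computed from the model's predicted similarities and the dev-set labels. This is not the exact evaluation code in train.py (which is not shown in this excerpt); scipy is assumed to be available:

```python
import numpy as np
from scipy.stats import spearmanr

def eval_spearman(pred_sims, gold_labels):
    """Spearman rank correlation between predicted cosine similarities
    and human-annotated labels/scores for the same sentence pairs."""
    corr, _ = spearmanr(np.asarray(pred_sims), np.asarray(gold_labels))
    return corr

# e.g. eval_spearman([0.72, 0.90, 0.54], [1, 1, 0]) returns a value in [-1, 1];
# higher means the predicted ranking agrees better with the gold ranking.
print(eval_spearman([0.72, 0.90, 0.54], [1, 1, 0]))
```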

## Getting Started

### Code Structure

The main files in this project are:

```
simcse/
├── model.py   # SimCSE network definition
├── data.py    # Reading logic for unsupervised training data and test data
├── predict.py # Computes text-pair similarity with a trained unsupervised model
└── train.py   # SimCSE training and evaluation logic
```

### Model Training

We use the public Chinese text matching dataset LCQMC as the example dataset and build unsupervised training data from its text only. Run the following command to start training and evaluate the Spearman correlation on the LCQMC dev set:

```shell
unset CUDA_VISIBLE_DEVICES
python -u -m paddle.distributed.launch --gpus '0' \
    train.py \
    --device gpu \
    --save_dir ./checkpoints/ \
    --batch_size 64 \
    --learning_rate 5E-5 \
    --epochs 1 \
    --save_steps 100 \
    --eval_steps 100 \
    --max_seq_length 64 \
    --infer_with_fc_pooler \
    --dropout 0.3 \
    --train_set_file "./senteval_cn/LCQMC/train.txt" \
    --test_set_file "./senteval_cn/LCQMC/dev.tsv"
```

Configurable parameters:

* `infer_with_fc_pooler`: optional; whether the forward pass goes through the final fc layer from training when computing text embeddings during evaluation. Enabling it is recommended for the best results.
* `scale`: optional; factor by which the cosine similarity is scaled before computing the cross-entropy loss. Default: 20.
* `dropout`: optional; dropout rate used in the SimCSE forward pass. Default: 0.1.
* `save_dir`: optional; directory where checkpoints are saved. Default: the `checkpoints` folder under the current directory.
* `max_seq_length`: optional; maximum sequence length used by the ERNIE model, at most 512. Lower this value if you run out of GPU memory. Default: 128.
* `batch_size`: optional; batch size. Adjust it to your GPU memory and lower it if you run out of memory. Default: 32.
* `learning_rate`: optional; maximum learning rate for fine-tuning. Default: 5e-5.
* `weight_decay`: optional; weight of the regularization term, used to reduce overfitting. Default: 0.0.
* `epochs`: number of training epochs. Default: 3.
* `warmup_proption`: optional; proportion of training steps used for learning-rate warmup. With 0.1, the learning rate grows from 0 to `learning_rate` over the first 10% of training steps and then slowly decays. Default: 0.0.
* `init_from_ckpt`: optional; path to model parameters for warm-starting training. Default: None.
* `seed`: optional; random seed. Default: 1000.
* `device`: device to train on, either cpu or gpu. When using gpu, the `gpus` argument selects the GPU card.

Training and evaluation run automatically, and checkpoints are saved to the specified `save_dir` during training, e.g.:

```text
checkpoints/
├── model_100
│   ├── model_state.pdparams
│   ├── tokenizer_config.json
│   └── vocab.txt
└── ...
```

**NOTE:**
* To resume training from a checkpoint, set `init_from_ckpt`, e.g. `init_from_ckpt=checkpoints/model_100/model_state.pdparams`.
### Prediction with the Dynamic-Graph Model

We use the LCQMC test set as the prediction data. Sample pairs look like this:

```text
谁有狂三这张高清的 这张高清图,谁有
英雄联盟什么英雄最好 英雄联盟最好英雄是什么
这是什么意思,被蹭网吗 我也是醉了,这是什么意思
现在有什么动画片好看呢? 现在有什么好看的动画片吗?
请问晶达电子厂现在的工资待遇怎么样要求有哪些 三星电子厂工资待遇怎么样啊
```

Run the following command to start prediction:

```shell
python -u -m paddle.distributed.launch --gpus "0" \
    predict.py \
    --device gpu \
    --params_path "./checkpoints/model_4400/model_state.pdparams" \
    --batch_size 64 \
    --max_seq_length 64 \
    --input_file 'test.tsv'
```

The predicted similarity scores are:

```text
0.7201147675514221
0.9010907411575317
0.5393891334533691
0.9698929786682129
0.6056119203567505
```
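
predict.py itself is not included in this commit excerpt; the snippet below is only a rough sketch of how a trained checkpoint could be used to score a single pair (the `ernie-1.0` model name, tokenizer usage, and `output_emb_size=0` are assumptions for illustration):

```python
import paddle
from paddlenlp.transformers import ErnieModel, ErnieTokenizer

from model import SimCSE  # the SimCSE class added by this commit

tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
model = SimCSE(ErnieModel.from_pretrained("ernie-1.0"), output_emb_size=0)
model.set_state_dict(paddle.load("./checkpoints/model_4400/model_state.pdparams"))
model.eval()

def encode(text):
    # Tokenize one sentence and wrap it in a batch dimension of 1.
    inputs = tokenizer(text=text, max_seq_len=64)
    ids = paddle.to_tensor([inputs["input_ids"]])
    types = paddle.to_tensor([inputs["token_type_ids"]])
    return ids, types

q_ids, q_types = encode("英雄联盟什么英雄最好")
t_ids, t_types = encode("英雄联盟最好英雄是什么")
with paddle.no_grad():
    # with_pooler=False skips the training-time fc, per the evaluation note above.
    sim = model.cosine_sim(q_ids, t_ids,
                           query_token_type_ids=q_types,
                           title_token_type_ids=t_types,
                           with_pooler=False)
print(float(sim))  # one cosine similarity per input pair
```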
## Reference
[1] Gao, Tianyu, Xingcheng Yao, and Danqi Chen. “SimCSE: Simple Contrastive Learning of Sentence Embeddings.” ArXiv:2104.08821 [Cs], April 18, 2021. http://arxiv.org/abs/2104.08821.

examples/text_matching/simcse/data.py

+102
@@ -0,0 +1,102 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import paddle

from paddlenlp.utils.log import logger


def create_dataloader(dataset,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None,
                      trans_fn=None):
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = True if mode == 'train' else False
    if mode == 'train':
        batch_sampler = paddle.io.DistributedBatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)
    else:
        batch_sampler = paddle.io.BatchSampler(
            dataset, batch_size=batch_size, shuffle=shuffle)

    return paddle.io.DataLoader(
        dataset=dataset,
        batch_sampler=batch_sampler,
        collate_fn=batchify_fn,
        return_list=True)


def convert_example(example, tokenizer, max_seq_length=512, do_evalute=False):
    """
    Builds model inputs from a sequence.

    A BERT sequence has the following format:

    - single sequence: ``[CLS] X [SEP]``

    Args:
        example(obj:`dict`): Dict mapping field names ('text_a', 'text_b' and
            optionally 'label') to raw text.
        tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from
            :class:`~paddlenlp.transformers.PretrainedTokenizer` which contains most
            of the methods. Users should refer to the superclass for more
            information regarding methods.
        max_seq_length(obj:`int`): The maximum total input sequence length after
            tokenization. Sequences longer than this will be truncated,
            sequences shorter will be padded.
        do_evalute(obj:`bool`, defaults to `False`): Whether the example contains a label.

    Returns:
        result(obj:`list`): input_ids and token_type_ids for each text field,
            followed by the label if present.
    """

    result = []

    for key, text in example.items():
        if 'label' in key:
            # Evaluation data: keep the label as-is.
            result += [example['label']]
        else:
            # Training data: tokenize the text field.
            encoded_inputs = tokenizer(text=text, max_seq_len=max_seq_length)
            input_ids = encoded_inputs["input_ids"]
            token_type_ids = encoded_inputs["token_type_ids"]
            result += [input_ids, token_type_ids]

    return result


def read_simcse_text(data_path):
    """Reads unsupervised training data: each line yields the same sentence twice,
    so the two forward passes (with different dropout masks) form a positive pair."""
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = line.rstrip()
            yield {'text_a': data, 'text_b': data}


def read_text_pair(data_path, is_test=False):
    """Reads tab-separated text pairs, with a label column unless is_test is True."""
    with open(data_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = line.rstrip().split("\t")
            if not is_test:
                if len(data) != 3:
                    continue
                yield {'text_a': data[0], 'text_b': data[1], 'label': data[2]}
            else:
                if len(data) != 2:
                    continue
                yield {'text_a': data[0], 'text_b': data[1]}
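
train.py is not part of this excerpt; as a rough illustration of how these helpers are typically wired together in PaddleNLP examples (the `ernie-1.0` tokenizer name is an assumption, and the exact arguments in train.py may differ), one might build the unsupervised dataloader like this:

```python
from functools import partial

from paddlenlp.data import Pad, Tuple
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import ErnieTokenizer

from data import read_simcse_text, convert_example, create_dataloader

tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")

# Build an unsupervised dataset: every line becomes a (text_a, text_b) pair of
# identical sentences; dropout provides the "augmentation" at training time.
train_ds = load_dataset(
    read_simcse_text, data_path="./senteval_cn/LCQMC/train.txt", lazy=False)

trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_length=64)

# One Pad per field produced by convert_example:
# query input_ids, query token_type_ids, title input_ids, title token_type_ids.
batchify_fn = lambda samples, fn=Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
    Pad(axis=0, pad_val=tokenizer.pad_token_id),
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
): [data for data in fn(samples)]

train_data_loader = create_dataloader(
    train_ds, mode="train", batch_size=64,
    batchify_fn=batchify_fn, trans_fn=trans_func)
```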

examples/text_matching/simcse/model.py

+140
@@ -0,0 +1,140 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import abc
import sys

import numpy as np

import paddle
import paddle.nn as nn
import paddle.nn.functional as F


class SimCSE(nn.Layer):
    def __init__(self,
                 pretrained_model,
                 dropout=None,
                 margin=0.0,
                 scale=20,
                 output_emb_size=None):

        super().__init__()

        self.ptm = pretrained_model
        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)

        # If output_emb_size is greater than 0, add a linear layer to reduce the
        # embedding size. We recommend output_emb_size = 256 as a trade-off between
        # recall performance and efficiency.
        self.output_emb_size = output_emb_size
        if output_emb_size is not None and output_emb_size > 0:
            weight_attr = paddle.ParamAttr(
                initializer=paddle.nn.initializer.TruncatedNormal(std=0.02))
            self.emb_reduce_linear = paddle.nn.Linear(
                768, output_emb_size, weight_attr=weight_attr)

        self.margin = margin
        # Cosine similarity is scaled to ease convergence.
        self.scale = scale

    def get_pooled_embedding(self,
                             input_ids,
                             token_type_ids=None,
                             position_ids=None,
                             attention_mask=None,
                             with_pooler=True):

        # Note: cls_embedding is the pooled embedding with a tanh activation.
        sequence_output, cls_embedding = self.ptm(input_ids, token_type_ids,
                                                  position_ids, attention_mask)

        # with_pooler=False skips the training-time fc (pooler) and uses the raw
        # [CLS] hidden state, which gives better unsupervised matching results.
        if not with_pooler:
            cls_embedding = sequence_output[:, 0, :]

        if self.output_emb_size is not None and self.output_emb_size > 0:
            cls_embedding = self.emb_reduce_linear(cls_embedding)

        cls_embedding = self.dropout(cls_embedding)
        cls_embedding = F.normalize(cls_embedding, p=2, axis=-1)

        return cls_embedding

    def cosine_sim(self,
                   query_input_ids,
                   title_input_ids,
                   query_token_type_ids=None,
                   query_position_ids=None,
                   query_attention_mask=None,
                   title_token_type_ids=None,
                   title_position_ids=None,
                   title_attention_mask=None,
                   with_pooler=True):

        query_cls_embedding = self.get_pooled_embedding(
            query_input_ids,
            query_token_type_ids,
            query_position_ids,
            query_attention_mask,
            with_pooler=with_pooler)

        title_cls_embedding = self.get_pooled_embedding(
            title_input_ids,
            title_token_type_ids,
            title_position_ids,
            title_attention_mask,
            with_pooler=with_pooler)

        # Embeddings are L2-normalized, so the dot product equals the cosine similarity.
        cosine_sim = paddle.sum(query_cls_embedding * title_cls_embedding,
                                axis=-1)
        return cosine_sim

    def forward(self,
                query_input_ids,
                title_input_ids,
                query_token_type_ids=None,
                query_position_ids=None,
                query_attention_mask=None,
                title_token_type_ids=None,
                title_position_ids=None,
                title_attention_mask=None):

        query_cls_embedding = self.get_pooled_embedding(
            query_input_ids, query_token_type_ids, query_position_ids,
            query_attention_mask)

        title_cls_embedding = self.get_pooled_embedding(
            title_input_ids, title_token_type_ids, title_position_ids,
            title_attention_mask)

        # In-batch negatives: every other title in the batch serves as a negative
        # for a given query, so the similarity matrix is [N, N].
        cosine_sim = paddle.matmul(
            query_cls_embedding, title_cls_embedding, transpose_y=True)

        # Subtract the margin from the positive (diagonal) similarities.
        margin_diag = paddle.full(
            shape=[query_cls_embedding.shape[0]],
            fill_value=self.margin,
            dtype=paddle.get_default_dtype())

        cosine_sim = cosine_sim - paddle.diag(margin_diag)

        # Scale cosine similarities to ease training convergence.
        cosine_sim *= self.scale

        # The correct "class" for the i-th query is the i-th title.
        labels = paddle.arange(0, query_cls_embedding.shape[0], dtype='int64')
        labels = paddle.reshape(labels, shape=[-1, 1])

        loss = F.cross_entropy(input=cosine_sim, label=labels)

        return loss
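
To make the in-batch-negatives objective concrete, here is a rough sketch of a single unsupervised training step. The real training loop lives in train.py, which is not shown in this excerpt; the `ernie-1.0` model name and tokenizer arguments are assumptions:

```python
import paddle
from paddlenlp.transformers import ErnieModel, ErnieTokenizer

from model import SimCSE

tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
model = SimCSE(ErnieModel.from_pretrained("ernie-1.0"), dropout=0.3, output_emb_size=0)
optimizer = paddle.optimizer.AdamW(learning_rate=5e-5, parameters=model.parameters())

# A toy "unsupervised" batch: each sentence is paired with itself, exactly as
# read_simcse_text in data.py produces it.
sentences = ["英雄联盟什么英雄最好", "现在有什么动画片好看呢?"]
examples = [tokenizer(text=s, max_seq_len=64, pad_to_max_seq_len=True) for s in sentences]
input_ids = paddle.to_tensor([e["input_ids"] for e in examples])
token_type_ids = paddle.to_tensor([e["token_type_ids"] for e in examples])

# The same token ids are fed twice; the two forward passes see different dropout
# masks, so the diagonal pairs become positives and all other in-batch pairs
# become negatives.
loss = model(query_input_ids=input_ids,
             title_input_ids=input_ids,
             query_token_type_ids=token_type_ids,
             title_token_type_ids=token_type_ids)
loss.backward()
optimizer.step()
optimizer.clear_grad()
```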
