<p align="center"><img width="55%" src="docs/_static/img/logo.svg" /></p>

- <h3 align="center">Supporting Rapid Prototyping with a Deep Learning NLP Toolkit&nbsp;&nbsp;
- <a href="https://twitter.com/intent/tweet?text=Supporting%20rapid%20prototyping%20for%20research,%20PyTorch-NLP%20has%20LAUNCHED,%20a%20deep%20learning%20natural%20language%20processing%20(NLP)%20toolkit!%20&url=https://github.com/PetrochukM/PyTorch-NLP&hashtags=pytorch,nlp,research">
- <img style='vertical-align: text-bottom !important;' src="https://img.shields.io/twitter/url/http/shields.io.svg?style=social" alt="Tweet">
- </a>
- </h3>
+ <h3 align="center">Basic Utilities for PyTorch NLP Software</h3>

- PyTorch-NLP, or torchnlp for short, is a library of neural network layers, text processing modules and datasets designed to accelerate Natural Language Processing (NLP) research.
-
- Join our community, add datasets and neural network layers! Chat with us on [Gitter](https://gitter.im/PyTorch-NLP/Lobby) and join the [Google Group](https://groups.google.com/forum/#!forum/pytorch-nlp), we're eager to collaborate with you.
+ PyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch
+ Natural Language Processing (NLP). `torchnlp` extends PyTorch to provide you with
+ basic text data processing functions.

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pytorch-nlp.svg?style=flat-square)
[![Codecov](https://img.shields.io/codecov/c/github/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://codecov.io/gh/PetrochukM/PyTorch-NLP)
[![Downloads](http://pepy.tech/badge/pytorch-nlp)](http://pepy.tech/project/pytorch-nlp)
- [![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
+ [![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest&style=flat-square)
[![Build Status](https://img.shields.io/travis/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://travis-ci.org/PetrochukM/PyTorch-NLP)
+ [![Twitter: PetrochukM](https://img.shields.io/twitter/follow/MPetrochuk.svg?style=social)](https://twitter.com/MPetrochuk)

- _Logo by [Chloe Yeo](http://www.yeochloe.com/)_
+ _Logo by [Chloe Yeo](http://www.yeochloe.com/), Corporate Sponsorship by [WellSaid Labs](https://wellsaidlabs.com/)_

- ## Installation
+ ## Installation 🐾

- Make sure you have Python 3.6+ and PyTorch 1.0+. You can then install `pytorch-nlp` using
+ Make sure you have Python 3.5+ and PyTorch 1.0+. You can then install `pytorch-nlp` using
pip:

- pip install pytorch-nlp
+ ```python
+ pip install pytorch-nlp
+ ```

Or to install the latest code via:

- pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
+ ```python
+ pip install git+https://github.com/PetrochukM/PyTorch-NLP.git
+ ```

- ## Docs 📖
+ ## Docs

- The complete documentation for PyTorch-NLP is available via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).
+ The complete documentation for PyTorch-NLP is available
+ via [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).

- ## Basics
+ ## Get Started

- Add PyTorch-NLP to your project by following one of the common use cases:
+ Within an NLP data pipeline, you'll want to implement these basic steps:

- ### Load a [Dataset](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html)
+ ### Load Your Data 🐿

Load the IMDB dataset, for example:

@@ -49,51 +51,133 @@ train = imdb_dataset(train=True)
train[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}
```
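+
+ More datasets are available in the [torchnlp.datasets](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.datasets.html) package.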

- ### Apply [Neural Networks](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.nn.html) Layers
+ Load a custom dataset, for example:
+
+ ```python
+ from pathlib import Path
+
+ from torchnlp.download import download_file_maybe_extract
+
+ directory_path = Path('data/')
+ train_file_path = Path('trees/train.txt')
+
+ download_file_maybe_extract(
+     url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',
+     directory=directory_path,
+     check_files=[train_file_path])
+
+ open(directory_path / train_file_path)
+ ```
+
+ Don't worry, we'll handle caching for you!

- For example, from the neural network package, apply state-of-the-art LockedDropout:
+ ### Text To Tensor
+
+ Tokenize and encode your text as a tensor. For example, a `WhitespaceEncoder` breaks
+ text into terms whenever it encounters a whitespace character.
+
+ ```python
+ from torchnlp.encoders.text import WhitespaceEncoder
+
+ loaded_data = ["now this ain't funny", "so don't you dare laugh"]
+ encoder = WhitespaceEncoder(loaded_data)
+ encoded_data = [encoder.encode(example) for example in loaded_data]
+ ```
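+
+ If you need the original text back, the same encoder can also decode; a minimal sketch reusing `encoder` from above:
+
+ ```python
+ # Round-trip: encode to a tensor, then decode back to a string.
+ encoder.decode(encoder.encode("so don't you dare laugh"))  # RETURNS: "so don't you dare laugh"
+ ```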
+
+ ### Tensor To Batch
+
+ With your loaded and encoded data in hand, you'll want to batch your dataset.

```python
import torch
- from torchnlp.nn import LockedDropout
+ from torchnlp.samplers import BucketBatchSampler
+ from torchnlp.utils import collate_tensors
+ from torchnlp.encoders.text import stack_and_pad_tensors

- input_ = torch.randn(6, 3, 10)
- dropout = LockedDropout(0.5)
+ encoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]

- # Apply a LockedDropout to `input_`
- dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
+ train_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)
+ train_batch_sampler = BucketBatchSampler(
+     train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])
+
+ batches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]
+ batches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]
```

- ### [Encode Text](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.encoders.text.html)
+ PyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack`
+ and `default_collate` to support sequential inputs of varying lengths!
+
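+ For example, the same pieces plug into a standard `torch.utils.data.DataLoader`; a minimal sketch, reusing `encoded_data` and `train_batch_sampler` from above:
+
+ ```python
+ from functools import partial
+
+ from torch.utils.data import DataLoader
+
+ # `batch_sampler` yields the bucketed indices; `collate_fn` pads each batch.
+ loader = DataLoader(
+     encoded_data,
+     batch_sampler=train_batch_sampler,
+     collate_fn=partial(collate_tensors, stack_tensors=stack_and_pad_tensors))
+ ```
+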
+ ### You're Good To Go!
+
+ With your batch in hand, you can use PyTorch to develop and train your model using gradient descent.
+
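+ For instance, a minimal training-loop sketch, assuming a hypothetical toy `model` and the padded `batches` built above (each batch assumed to be a `(padded, lengths)` pair):
+
+ ```python
+ import torch
+
+ model = torch.nn.Linear(1, 1)  # Hypothetical toy model, for illustration only.
+ optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
+
+ for padded, lengths in batches:
+     optimizer.zero_grad()
+     # Stand-in objective; a real model would compute a task loss here.
+     loss = model(padded.unsqueeze(-1)).sum()
+     loss.backward()
+     optimizer.step()
+ ```
+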
+ ### Last But Not Least

- Tokenize and encode text as a tensor. For example, a `WhitespaceEncoder` breaks text into terms whenever it encounters a whitespace character.
+ PyTorch-NLP has a couple more NLP-focused utility packages to support you! 🤗
+
+ #### Deterministic Functions
+
+ Now that you've set up your pipeline, you may want to ensure that some functions run deterministically.
+ Wrap any code that's random with `fork_rng` and you'll be good to go, like so:

```python
+ import random
+ import numpy
+ import torch
+
+ from torchnlp.random import fork_rng
+
+ with fork_rng(seed=123):  # Ensure determinism
+     print('Random:', random.randint(1, 2**31))
+     print('Numpy:', numpy.random.randint(1, 2**31))
+     print('Torch:', int(torch.randint(1, 2**31, (1,))))
+ ```
+
+ This will always print:
+
+ ```text
+ Random: 224899943
+ Numpy: 843828735
+ Torch: 843828736
+ ```
+
+ #### Pre-Trained Word Vectors
+
+ Now that you've computed your vocabulary, you may want to make use of
+ pre-trained word vectors, like so:
+
+ ```python
+ import torch
from torchnlp.encoders.text import WhitespaceEncoder
+ from torchnlp.word_to_vector import GloVe

- # Create a `WhitespaceEncoder` with a corpus of text
encoder = WhitespaceEncoder(["now this ain't funny", "so don't you dare laugh"])

- # Encode and decode phrases
- encoder.encode("this ain't funny.")  # RETURNS: torch.Tensor([6, 7, 1])
- encoder.decode(encoder.encode("This ain't funny."))  # RETURNS: "this ain't funny."
+ vocab = set(encoder.vocab)
+ pretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab)
+ embedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)
+ for i, token in enumerate(encoder.vocab):
+     embedding_weights[i] = pretrained_embedding[token]
```
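+
+ From here, you can hand the copied weights to a standard embedding layer; a minimal sketch using PyTorch's `torch.nn.Embedding.from_pretrained`:
+
+ ```python
+ # Pass freeze=False to fine-tune the vectors during training.
+ embedding = torch.nn.Embedding.from_pretrained(embedding_weights)
+ ```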

- ### Load [Word Vectors](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.word_to_vector.html)
+ #### Neural Network Layers

- For example, load FastText, state-of-the-art English word vectors:
+ For example, from the neural network package, apply the state-of-the-art `LockedDropout`:

```python
- from torchnlp.word_to_vector import FastText
+ import torch
+ from torchnlp.nn import LockedDropout
+
+ input_ = torch.randn(6, 3, 10)
+ dropout = LockedDropout(0.5)

- vectors = FastText()
- # Load vectors for any word as a `torch.FloatTensor`
- vectors['hello']  # RETURNS: [torch.FloatTensor of size 300]
+ # Apply a LockedDropout to `input_`
+ dropout(input_)  # RETURNS: torch.FloatTensor (6x3x10)
```

- ### Compute [Metrics](http://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.metrics.html)
+ #### Metrics

- Finally, compute common metrics such as the BLEU score.
+ Compute common NLP metrics such as the BLEU score.

```python
from torchnlp.metrics import get_moses_multi_bleu
@@ -131,8 +215,8 @@ AllenNLP is designed to be a platform for research. PyTorch-NLP is designed to b

## Authors

- * [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
- * [Chloe Yeo](http://www.yeochloe.com/) — Logo Design
+ - [Michael Petrochuk](https://github.com/PetrochukM/) — Developer
+ - [Chloe Yeo](http://www.yeochloe.com/) — Logo Design

## Citing
