diff --git a/encoder-decoder.md b/encoder-decoder.md
index 8db8d2d936..1a7b5f3aa2 100644
--- a/encoder-decoder.md
+++ b/encoder-decoder.md
@@ -352,7 +352,7 @@ mapping.
 
 Similar to RNN-based encoder-decoder models, the transformer-based
 encoder-decoder models define a conditional distribution of target
-vectors \\(\mathbf{Y}_{1:n}\\) given an input sequence \\(\mathbf{X}_{1:n}\\):
+vectors \\(\mathbf{Y}_{1:m}\\) given an input sequence \\(\mathbf{X}_{1:n}\\):
 
 $$
 p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}).
@@ -366,10 +366,10 @@ $$ f_{\theta_{\text{enc}}}: \mathbf{X}_{1:n} \to \mathbf{\overline{X}}_{1:n}. $$
 
 The transformer-based decoder part then models the conditional
 probability distribution of the target vector sequence
-\\(\mathbf{Y}_{1:n}\\) given the sequence of encoded hidden states
+\\(\mathbf{Y}_{1:m}\\) given the sequence of encoded hidden states
 \\(\mathbf{\overline{X}}_{1:n}\\):
 
-$$ p_{\theta_{dec}}(\mathbf{Y}_{1:n} | \mathbf{\overline{X}}_{1:n}).$$
+$$ p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{\overline{X}}_{1:n}).$$
 
 By Bayes\' rule, this distribution can be factorized to a product of
 conditional probability distribution of the target vector \\(\mathbf{y}_i\\)
@@ -377,7 +377,7 @@ given the encoded hidden states \\(\mathbf{\overline{X}}_{1:n}\\) and
 all previous target vectors \\(\mathbf{Y}_{0:i-1}\\):
 
 $$
-p_{\theta_{dec}}(\mathbf{Y}_{1:n} | \mathbf{\overline{X}}_{1:n}) = \prod_{i=1}^{n} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}). $$
+p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{\overline{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}). $$
 
 The transformer-based decoder hereby maps the sequence of encoded
 hidden states \\(\mathbf{\overline{X}}_{1:n}\\) and all previous target vectors
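Outside the patch itself, here is a minimal sketch of the factorization the corrected formula describes: the log-probability of a target sequence \\(\mathbf{Y}_{1:m}\\) under an encoder-decoder model is the sum over the \\(m\\) per-token conditional log-probabilities \\(\log p(\mathbf{y}_i | \mathbf{Y}_{0:i-1}, \mathbf{\overline{X}}_{1:n})\\). The checkpoint name ("t5-small") and the example sentence pair are arbitrary assumptions chosen for illustration, not taken from the post.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Arbitrary checkpoint, used only to make the sketch runnable.
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Input sequence X_{1:n} and target sequence Y_{1:m}; n and m generally differ.
input_ids = tokenizer(
    "translate English to German: I love dogs.", return_tensors="pt"
).input_ids
labels = tokenizer("Ich liebe Hunde.", return_tensors="pt").input_ids

with torch.no_grad():
    # The encoder maps X_{1:n} to the hidden states Xbar_{1:n}; passing `labels`
    # teacher-forces the decoder, so logits[:, i] parameterizes
    # p(y_i | Y_{0:i-1}, Xbar_{1:n}) for every target position i at once.
    logits = model(input_ids=input_ids, labels=labels).logits  # (1, m, vocab)

log_probs = torch.log_softmax(logits, dim=-1)
per_token = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (1, m)

# The product over i of p(y_i | Y_{0:i-1}, Xbar_{1:n}) becomes a sum in log space.
print("log p(Y_{1:m} | X_{1:n}) =", per_token.sum().item())
```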