The first pitch is computed in `sample()` as follows:
naturalspeech2-pytorch/naturalspeech2_pytorch/naturalspeech2_pytorch.py
Lines 1478 to 1479 in 659bec7

    duration, pitch = self.duration_pitch(phoneme_enc, prompt_enc)
    pitch = rearrange(pitch, 'b n -> b 1 n')

The second pitch is computed in the `forward()` of `NaturalSpeech2` as follows:
naturalspeech2-pytorch/naturalspeech2_pytorch/naturalspeech2_pytorch.py
Lines 1543 to 1556 in 659bec7

    if not exists(pitch):
        assert exists(audio) and audio.ndim == 2
        assert exists(self.target_sample_hz)

        if self.calc_pitch_with_pyworld:
            pitch = compute_pitch_pyworld(
                audio,
                sample_rate = self.target_sample_hz,
                hop_length = self.mel_hop_length
            )
        else:
            pitch = compute_pitch_pytorch(audio, self.target_sample_hz)

        pitch = rearrange(pitch, 'b n -> b 1 n')

- I think the first pitch comes from the prompt and the second pitch comes from the training data. Is that right?
- I also think the prompt is a small segment of the training data. For example, if a training clip is 10 s long, the prompt takes 2 s of it. Is that right?
- If the prompt and the training data have the same input format, why is pitch calculated differently in the two places?
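
To make the question concrete, here is a minimal pure-Python sketch of how I read the two quoted code paths. All names in it (`add_channel_axis`, `pitch_at_inference`, `pitch_at_training`, and the `predict` / `extract` callables) are hypothetical stand-ins, not functions from the repository, and nested lists stand in for tensors. It also assumes that a pitch passed in by the caller already has shape `b 1 n`, which I have not verified.

```python
from typing import Callable, List, Optional

Pitch2D = List[List[float]]        # shape (batch, frames)
Pitch3D = List[List[List[float]]]  # shape (batch, 1, frames)


def add_channel_axis(pitch: Pitch2D) -> Pitch3D:
    # Pure-Python stand-in for rearrange(pitch, 'b n -> b 1 n').
    return [[row] for row in pitch]


def pitch_at_inference(predict: Callable[[], Pitch2D]) -> Pitch3D:
    # sample() path as quoted: pitch is returned by a predictor
    # (self.duration_pitch in the snippet), not measured from audio.
    return add_channel_axis(predict())


def pitch_at_training(
    pitch: Optional[Pitch3D],
    audio: Optional[List[List[float]]],
    extract: Callable[[List[List[float]]], Pitch2D],
) -> Pitch3D:
    # forward() path as quoted: pitch is extracted from audio only
    # when the caller did not supply one.
    if pitch is not None:
        return pitch  # assumption: already shaped (batch, 1, frames)
    assert audio is not None and all(isinstance(a, list) for a in audio)
    return add_channel_axis(extract(audio))


# Toy callables standing in for the predictor and the pitch extractor.
predicted = pitch_at_inference(lambda: [[110.0, 112.0, 115.0]])
extracted = pitch_at_training(None, [[0.0] * 8], lambda a: [[100.0, 101.0] for _ in a])
print(predicted)  # [[[110.0, 112.0, 115.0]]]
print(extracted)  # [[[100.0, 101.0]]]
```

If this reading is correct, both paths end with the same `b 1 n` shape and differ only in where the pitch values come from.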