
Commit 4032cf9

add pndm plugin
1 parent a977f60 commit 4032cf9

4 files changed

Lines changed: 169 additions & 11 deletions

File tree

README.md
docs/README-SVS-opencpop-pndm.md
usr/configs/midi/e2e/opencpop/ds1000.yaml
usr/diff/shallow_diffusion_tts.py

README.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ This repository is the official PyTorch implementation of our AAAI-2022 [paper](
 </table>

 :tada: :tada: :tada: **Updates**:
+ - Sep.11, 2022: :electric_plug: [DiffSinger-PN](docs/README-SVS-opencpop-pndm.md). Add the plug-in [PNDM](https://arxiv.org/abs/2202.09778) (ICLR 2022, from our laboratory) to accelerate DiffSinger for free.
 - Jul.27, 2022: Update documents for [SVS](docs/README-SVS.md). Add easy inference [A](docs/README-SVS-opencpop-cascade.md#4-inference-from-raw-inputs) & [B](docs/README-SVS-opencpop-e2e.md#4-inference-from-raw-inputs); Add Interactive SVS running on [HuggingFace🤗 SVS](https://huggingface.co/spaces/Silentlin/DiffSinger).
 - Mar.2, 2022: MIDI-B-version.
 - Mar.1, 2022: [NeuralSVB](https://github.com/MoonInTheRiver/NeuralSVB), for singing voice beautifying, has been released.

docs/README-SVS-opencpop-pndm.md

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# DiffSinger-PNDM

[![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2105.02446)
[![GitHub Stars](https://img.shields.io/github/stars/MoonInTheRiver/DiffSinger?style=social)](https://github.com/MoonInTheRiver/DiffSinger)
[![downloads](https://img.shields.io/github/downloads/MoonInTheRiver/DiffSinger/total.svg)](https://github.com/MoonInTheRiver/DiffSinger/releases)

Highlights:

- Training diffusion model: 1000 steps
- Default pndm_speedup: 40
- Inference diffusion model: (1000 / pndm_speedup) steps = 25 steps

You can freely control the number of inference steps by adding one of these arguments to your experiment scripts:
`--hparams="pndm_speedup=40"` or `--hparams="pndm_speedup=20"` or `--hparams="pndm_speedup=10"`.
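For intuition, here is a minimal standalone sketch (illustrative code, not from this repo) of which diffusion timesteps the accelerated sampler visits for a given `pndm_speedup`:

```python
# Sketch: the timesteps visited during accelerated sampling.
timesteps = 1000     # diffusion steps used at training time
pndm_speedup = 40    # stride between visited timesteps at inference

visited = list(reversed(range(0, timesteps, pndm_speedup)))
print(len(visited))              # 25 inference steps (1000 / 40)
print(visited[:2], visited[-1])  # [960, 920] ... 0
```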
Contributed by @luping-liu.

## DiffSinger (MIDI SVS | B version | +PNDM)
### 0. Data Acquirement
For the Opencpop dataset: please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We have no right to grant you access to Opencpop.

The pipeline below is designed for the Opencpop dataset:

### 1. Preparation
#### Data Preparation
a) Download and extract Opencpop, then create a link to the dataset folder: `ln -s /xxx/opencpop data/raw/`

b) Run the following scripts to pack the dataset for training/inference.

```sh
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python data_gen/tts/bin/binarize.py --config usr/configs/midi/cascade/opencs/aux_rel.yaml

# `data/binary/opencpop-midi-dp` will be generated.
```

#### Vocoder Preparation
We provide the pre-trained model of [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip), which is specially designed for SVS with the NSF mechanism.

Also, please unzip the pre-trained vocoder and [this companion pitch extractor](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0102_xiaoma_pe.zip) into `checkpoints` before training your acoustic model.

(Update: you can also move [a ckpt with more training steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into this vocoder directory.)

This singing vocoder is trained on ~70 hours of singing data and can be viewed as a universal singing vocoder.

#### Exp Name Preparation
```bash
export MY_DS_EXP_NAME=0831_opencpop_ds1000
```

```
.
|--data
    |--raw
        |--opencpop
            |--segments
                |--transcriptions.txt
                |--wavs
|--checkpoints
    |--MY_DS_EXP_NAME (optional)
    |--0109_hifigan_bigpopcs_hop128 (vocoder)
        |--model_ckpt_steps_1512000.ckpt
        |--config.yaml
```
67+
68+
### 2. Training Example
69+
```sh
70+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds1000.yaml --exp_name $MY_DS_EXP_NAME --reset
71+
```
72+
73+
### 3. Inference from packed test set
74+
```sh
75+
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config usr/configs/midi/e2e/opencpop/ds1000.yaml --exp_name $MY_DS_EXP_NAME --reset --infer
76+
```
77+
78+
We also provide:
79+
- the pre-trained model of DiffSinger;
80+
81+
They can be found in [here](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0831_opencpop_ds1000.zip).
82+
83+
Remember to put the pre-trained models in `checkpoints` directory.
84+
85+
### 4. Inference from raw inputs
86+
```sh
87+
python inference/svs/ds_e2e.py --config usr/configs/midi/e2e/opencpop/ds1000.yaml --exp_name $MY_DS_EXP_NAME
88+
```
89+
Raw inputs:
90+
```
91+
inp = {
92+
'text': '小酒窝长睫毛AP是你最美的记号',
93+
'notes': 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | F4 | C#4/Db4',
94+
'notes_duration': '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | 0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | 0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340',
95+
'input_type': 'word'
96+
} # user input: Chinese characters
97+
or,
98+
inp = {
99+
'text': '小酒窝长睫毛AP是你最美的记号',
100+
'ph_seq': 'x iao j iu w o ch ang ang j ie ie m ao AP sh i n i z ui m ei d e j i h ao',
101+
'note_seq': 'C#4/Db4 C#4/Db4 F#4/Gb4 F#4/Gb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 F#4/Gb4 F#4/Gb4 F#4/Gb4 C#4/Db4 C#4/Db4 C#4/Db4 rest C#4/Db4 C#4/Db4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 A#4/Bb4 A#4/Bb4 G#4/Ab4 G#4/Ab4 F4 F4 C#4/Db4 C#4/Db4',
102+
'note_dur_seq': '0.407140 0.407140 0.376190 0.376190 0.242180 0.242180 0.509550 0.509550 0.183420 0.315400 0.315400 0.235020 0.361660 0.361660 0.223070 0.377270 0.377270 0.340550 0.340550 0.299620 0.299620 0.344510 0.344510 0.283770 0.283770 0.323390 0.323390 0.360340 0.360340',
103+
'is_slur_seq': '0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
104+
'input_type': 'phoneme'
105+
} # input like Opencpop dataset.
106+
```
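For clarity, here is a minimal standalone sketch (illustrative only, not a function from this repo) of how the pipe-separated word-level fields line up; multi-note groups appear to correspond to slurred syllables (cf. `is_slur_seq` in the second example):

```python
# Sketch: align note groups with duration groups from the first example above.
notes = 'C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4'
durations = '0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420'

note_groups = [g.split() for g in notes.split(' | ')]
dur_groups = [[float(d) for d in g.split()] for g in durations.split(' | ')]
assert len(note_groups) == len(dur_groups)

for ns, ds in zip(note_groups, dur_groups):
    print(ns, ds)  # last pair: ['A#4/Bb4', 'F#4/Gb4'] [0.50955, 0.18342]
```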
### 5. Some issues
a) The HifiGAN-Singing vocoder is trained on our [vocoder dataset](https://dl.acm.org/doi/abs/10.1145/3474085.3475437) and the training set of [PopCS](https://arxiv.org/abs/2105.02446). Opencpop is an out-of-domain dataset (unseen speaker), which may degrade audio quality; we are considering fine-tuning this vocoder on the training set of Opencpop.

b) In this version of the code, we use the melody frontend ([lyric + MIDI] -> [ph_dur]) to predict phoneme durations. The F0 curve is implicitly predicted together with the mel-spectrogram.
usr/configs/midi/e2e/opencpop/ds1000.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
base_config:
  - usr/configs/popcs_ds_beta6.yaml
  - usr/configs/midi/cascade/opencs/opencpop_statis.yaml

binarizer_cls: data_gen.singing.binarize.OpencpopBinarizer
binary_data_dir: 'data/binary/opencpop-midi-dp'

#switch_midi2f0_step: 174000
use_midi: true # for midi exp
use_gt_dur: false # for further midi exp
lambda_ph_dur: 1.0
lambda_sent_dur: 1.0
lambda_word_dur: 1.0
predictor_grad: 0.1
dur_predictor_layers: 5 # *

fs2_ckpt: '' #
#num_valid_plots: 0
task_cls: usr.diffsinger_task.DiffSingerMIDITask

# for diffusion schedule
timesteps: 1000
K_step: 1000
max_beta: 0.02
max_tokens: 36000
max_updates: 320000
gaussian_start: True
pndm_speedup: 40

use_pitch_embed: false
use_gt_f0: false # for midi exp

lambda_f0: 0.
lambda_uv: 0.
dilation_cycle_length: 4 # *
rel_pos: true
predictor_layers: 5
pe_enable: true
pe_ckpt: 'checkpoints/0102_xiaoma_pe'
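For intuition about the schedule fields above, here is a standalone sketch that assumes a simple linear beta ramp (the repo's actual schedule function may differ) and counts the PLMS iterations implied by `pndm_speedup`:

```python
# Sketch (assumed linear schedule): derive the cumulative alphas that PLMS
# sampling indexes into, and count the accelerated sampling iterations.
import numpy as np

timesteps, max_beta, pndm_speedup = 1000, 0.02, 40
betas = np.linspace(1e-4, max_beta, timesteps)  # assumption: linear ramp to max_beta
alphas_cumprod = np.cumprod(1.0 - betas)

print(alphas_cumprod[0])          # ~0.9999 (nearly clean signal)
print(alphas_cumprod[-1])         # ~4e-5 (nearly pure noise)
print(timesteps // pndm_speedup)  # 25 PLMS iterations
```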

usr/diff/shallow_diffusion_tts.py

Lines changed: 15 additions & 11 deletions
@@ -16,6 +16,7 @@
 from utils.hparams import hparams
 
 
+
 def exists(x):
     return x is not None
 
@@ -69,8 +70,7 @@ def cosine_beta_schedule(timesteps, s=0.008):
 
 class GaussianDiffusion(nn.Module):
     def __init__(self, phone_encoder, out_dims, denoise_fn,
-                 timesteps=1000, K_step=1000, loss_type=hparams.get('diff_loss_type', 'l1'), betas=None, spec_min=None,
-                 spec_max=None):
+                 timesteps=1000, K_step=1000, loss_type=hparams.get('diff_loss_type', 'l1'), betas=None, spec_min=None, spec_max=None):
         super().__init__()
         self.denoise_fn = denoise_fn
         if hparams.get('use_midi') is not None and hparams['use_midi']:
@@ -173,10 +173,7 @@ def p_sample_plms(self, x, t, interval, cond, clip_denoised=True, repeat_noise=F
 
         def get_x_pred(x, noise_t, t):
             a_t = extract(self.alphas_cumprod, t, x.shape)
-            if t[0] < interval:
-                a_prev = torch.ones_like(a_t)
-            else:
-                a_prev = extract(self.alphas_cumprod, t - interval, x.shape)
+            a_prev = extract(self.alphas_cumprod, torch.max(t-interval, torch.zeros_like(t)), x.shape)
             a_t_sq, a_prev_sq = a_t.sqrt(), a_prev.sqrt()
 
             x_delta = (a_prev - a_t) * ((1 / (a_t_sq * (a_t_sq + a_prev_sq))) * x - 1 / (a_t_sq * (((1 - a_prev) * a_t).sqrt() + ((1 - a_t) * a_prev).sqrt())) * noise_t)
@@ -189,7 +186,7 @@ def get_x_pred(x, noise_t, t):
 
         if len(noise_list) == 0:
             x_pred = get_x_pred(x, noise_pred, t)
-            noise_pred_prev = self.denoise_fn(x_pred, torch.max(t-interval, torch.zeros_like(t)), cond=cond)
+            noise_pred_prev = self.denoise_fn(x_pred, max(t-interval, 0), cond=cond)
             noise_pred_prime = (noise_pred + noise_pred_prev) / 2
         elif len(noise_list) == 1:
             noise_pred_prime = (3 * noise_pred - noise_list[-1]) / 2
@@ -257,10 +254,17 @@ def forward(self, txt_tokens, mel2ph=None, spk_embed=None,
             print('===> gaussion start.')
             shape = (cond.shape[0], 1, self.mel_bins, cond.shape[2])
             x = torch.randn(shape, device=device)
-            self.noise_list = deque(maxlen=4)
-            iteration_interval = 5
-            for i in tqdm(reversed(range(0, t, iteration_interval)), desc='sample time step', total=t):
-                x = self.p_sample_plms(x, torch.full((b,), i, device=device, dtype=torch.long), iteration_interval, cond)
+
+            if hparams.get('pndm_speedup'):
+                self.noise_list = deque(maxlen=4)
+                iteration_interval = hparams['pndm_speedup']
+                for i in tqdm(reversed(range(0, t, iteration_interval)), desc='sample time step',
+                              total=t // iteration_interval):
+                    x = self.p_sample_plms(x, torch.full((b,), i, device=device, dtype=torch.long), iteration_interval,
+                                           cond)
+            else:
+                for i in tqdm(reversed(range(0, t)), desc='sample time step', total=t):
+                    x = self.p_sample(x, torch.full((b,), i, device=device, dtype=torch.long), cond)
         x = x[:, 0].transpose(1, 2)
         if mel2ph is not None:  # for singing
             ret['mel_out'] = self.denorm_spec(x) * ((mel2ph > 0).float()[:, :, None])
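For context on the `noise_list` branches in `p_sample_plms`: they are the linear multistep (Adams-Bashforth) combinations of past noise predictions from the PNDM paper. A minimal standalone sketch (hypothetical helper, not the repo's API; the diff above shows the 1st- and 2nd-order branches):

```python
# Sketch of the PLMS multistep noise combinations, with coefficients as
# published in the PNDM paper. `noise_list` holds past noise predictions.
def plms_noise_prime(noise_pred, noise_list):
    if len(noise_list) == 0:
        # First step has no history; the repo instead averages two denoiser calls.
        return noise_pred
    if len(noise_list) == 1:
        return (3 * noise_pred - noise_list[-1]) / 2
    if len(noise_list) == 2:
        return (23 * noise_pred - 16 * noise_list[-1] + 5 * noise_list[-2]) / 12
    return (55 * noise_pred - 59 * noise_list[-1] + 37 * noise_list[-2] - 9 * noise_list[-3]) / 24
```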
