You can freely control the number of inference steps by adding one of these arguments to your experiment scripts: `--hparams="pndm_speedup=5"` or `--hparams="pndm_speedup=10"`.
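As a hedged illustration, the hparam is simply appended to whatever inference command you already use; the script and config paths below are placeholders, not verified paths from this repository:

```
# Hypothetical inference invocation -- only the --hparams flag is from the text above;
# replace the script/config/exp names with your actual ones.
python tasks/run.py --config <your_config.yaml> --exp_name <your_exp_name> --infer \
    --hparams="pndm_speedup=10"
```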
Contributed by @luping-liu.
## DiffSinger (MIDI SVS | B version | +PNDM)
### 0. Data Acquirement
For the Opencpop dataset: please strictly follow the instructions of [Opencpop](https://wenet.org.cn/opencpop/). We are not authorized to grant you access to Opencpop.
The pipeline below is designed for the Opencpop dataset:
### 1. Preparation
#### Data Preparation
a) Download and extract Opencpop, then create a link to the dataset folder: `ln -s /xxx/opencpop data/raw/`
b) Run the following scripts to pack the dataset for training/inference.

```
# `data/binary/opencpop-midi-dp` will be generated.
```
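The packing command itself is missing from this excerpt; as a hedged sketch, the step typically invokes the repository's binarizer on a MIDI config (the script path and config name here are assumptions, not verified against this repository):

```
# Hypothetical packing step -- verify the script and config paths in your checkout.
export PYTHONPATH=.
python data_gen/tts/bin/binarize.py --config <your_midi_config.yaml>
# Expected result (from the text above): data/binary/opencpop-midi-dp is generated.
```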
#### Vocoder Preparation
We provide a pre-trained model of [HifiGAN-Singing](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0109_hifigan_bigpopcs_hop128.zip), which is specially designed for SVS with the NSF mechanism.
Also, please unzip the pre-trained vocoder and [this companion model for the vocoder](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0102_xiaoma_pe.zip) into `checkpoints` before training your acoustic model.
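A minimal sketch of that placement step, using the archive names from the release URLs above (the exact directory layout inside each archive is an assumption; check it after extraction):

```
# Hypothetical extraction commands -- archive names come from the release links,
# but the resulting folder structure under checkpoints/ should be verified.
mkdir -p checkpoints
unzip 0109_hifigan_bigpopcs_hop128.zip -d checkpoints/
unzip 0102_xiaoma_pe.zip -d checkpoints/
```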
(Update: you can also move [a ckpt with more training steps](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/model_ckpt_steps_1512000.ckpt) into this vocoder directory.)
This singing vocoder is trained on ~70 hours of singing data, so it can be viewed as a universal vocoder.

a) HifiGAN-Singing is trained on our [vocoder dataset](https://dl.acm.org/doi/abs/10.1145/3474085.3475437) and the training set of [PopCS](https://arxiv.org/abs/2105.02446). Opencpop is an out-of-domain dataset (unseen speaker), which may degrade audio quality; we are considering fine-tuning this vocoder on the training set of Opencpop.
b) In this version of the code, we use the melody frontend ([lyric + MIDI] -> [ph_dur]) to predict phoneme durations. The F0 curve is predicted implicitly together with the mel-spectrogram.
c) Example [generated audio](https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/demos_0221/DS/).
More generated audio demos can be found in [DiffSinger](https://github.com/MoonInTheRiver/DiffSinger/releases/download/pretrain-model/0228_opencpop_ds100_rel.zip).