Abstract

Neural sequence-to-sequence models are well established for applications that can be cast as mapping a single input sequence to a single output sequence. In this work, we focus on one-to-many sequence transduction problems, such as extracting multiple sequential sources from a mixture sequence. We extend the standard sequence-to-sequence model to a novel conditional multi-sequence model, which explicitly models the relevance among multiple output sequences with the probabilistic chain rule. We take speech data as the primary test field for our methods, since observed speech is often composed of multiple sources owing to the superposition principle of sound waves. Experiments on several tasks, including speech separation and multi-speaker speech recognition, show that our conditional multi-sequence models yield consistent improvements over conventional non-conditional models.

Motivation

For clarity, we refer to our method as the Conditional Chain model: it combines serial mapping and parallel mapping through the probabilistic chain rule. Modeling these two paradigms simultaneously not only makes the framework more flexible but also encourages the model to automatically learn an efficient relationship between the multiple outputs.
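
Written out, the factorization behind this combination is the standard chain rule below. The notation is ours for exposition; the exact conditioning set used in the implementation may be narrower (e.g., only the immediately preceding output).

```latex
% Chain-rule factorization behind the Conditional Chain model:
% the parallel mapping conditions every output on the input X, while
% the serial mapping additionally conditions each output Y^i on the
% previously generated outputs.
P(Y^1, \ldots, Y^N \mid X) = \prod_{i=1}^{N} P\left(Y^i \mid Y^1, \ldots, Y^{i-1}, X\right)
```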

Supplementary Material (Appendix):

Please see: https://github.com/demotoshow/demotoshow.github.io/blob/master/Conditional_Model_NeurIPS_2020%20appendix.pdf

Source code

Please see: https://github.com/demotoshow/demotoshow.github.io/tree/master/codes

Proposed methods

The raw input sequence serves as the input for every output sequence, while the output sequences are generated one by one, each conditioned on the previously generated sequence. The intuition is that multiple outputs derived from the same input share relevance at the information level. By combining cascading and parallel connections, our model learns both the mapping from the input to each output sequence and the dependencies among the output sequences. The proposed structure also offers a solution when the number of output sequences is variable or unknown.
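
As an illustration of this generation loop, here is a minimal PyTorch sketch of the conditional-chain idea. All module choices, names, and sizes are our own assumptions for exposition, not the exact architecture from the paper (see the source code linked above for that).

```python
# Minimal PyTorch sketch of the conditional-chain idea (illustrative only:
# layer choices, names, and sizes are assumptions, not the paper's exact model).
import torch
import torch.nn as nn


class ConditionalChainSeparator(nn.Module):
    """Generates N output sequences one by one; step i is conditioned on
    the shared input encoding and on the output produced at step i-1."""

    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                               bidirectional=True)
        # Fuses the shared input encoding with the previous output sequence.
        self.fusion = nn.LSTM(2 * hidden_dim + feat_dim, hidden_dim,
                              batch_first=True)
        self.decoder = nn.Linear(hidden_dim, feat_dim)

    def forward(self, mixture, num_sources):
        # mixture: (batch, time, feat_dim)
        enc, _ = self.encoder(mixture)          # parallel mapping, computed once
        prev = torch.zeros_like(mixture)        # "empty" condition at step 1
        outputs = []
        for _ in range(num_sources):            # serial (chain) mapping
            fused, _ = self.fusion(torch.cat([enc, prev], dim=-1))
            prev = self.decoder(fused)          # estimate of the next source
            outputs.append(prev)
        return torch.stack(outputs, dim=1)      # (batch, sources, time, feat)


if __name__ == "__main__":
    model = ConditionalChainSeparator()
    mix = torch.randn(2, 100, 128)              # dummy 2-utterance batch
    sources = model(mix, num_sources=4)
    print(sources.shape)                        # torch.Size([2, 4, 100, 128])
```

Because every step reuses the same fusion module, the loop can in principle keep running until a stop criterion fires, which is one way a variable or unknown number of sources can be handled at inference time.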

Speech separation samples

We re-use WSJ0-2mix and WSJ0-3mix to create the 4-speaker and 5-speaker mixtures. That is, we did not use any additional data beyond WSJ0-2mix and WSJ0-3mix.
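
For instance, a 4-speaker mixture can be produced by summing the waveforms of two 2-speaker mixtures. The sketch below illustrates one such re-mixing step; the file paths and the peak normalization are our own assumptions, not the exact recipe used for the samples on this page.

```python
# Illustrative sketch: build a 4-speaker mixture by summing two existing
# WSJ0-2mix mixtures (paths and normalization are hypothetical).
import numpy as np
import soundfile as sf

mix_a, sr = sf.read("wsj0_2mix/mix/utt_a.wav")   # hypothetical paths
mix_b, _ = sf.read("wsj0_2mix/mix/utt_b.wav")

n = min(len(mix_a), len(mix_b))                  # truncate to common length
four_mix = mix_a[:n] + mix_b[:n]
four_mix /= max(1.0, np.abs(four_mix).max())     # rescale only if clipping

sf.write("wsj0_4mix/mix/utt_ab.wav", four_mix, sr)
```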

WSJ0-4mix Sample 1 (2 Female & 2 Male)

Mixture and separated spectrograms

Mixed audio

Separated sources

WSJ0-4mix Sample 2 (1 Female & 3 Male)

Mixture and separated spectrograms

Mixed audio

Separated sources

WSJ0-4mix Sample 3 (3 Female & 1 Male)

Mixed audio

Separated sources

WSJ0-5mix Sample (2 Female & 3 Male)

Mixture and separated spectrograms

Mixed audio

Separated sources