Recent advances in time-domain audio separation networks (TasNets) have markedly propelled the field of speech separation. Unlike conventional time-frequency domain methods, TasNets model the mixed speech signal directly in the time domain, using a convolutional encoder-decoder architecture and performing separation on the encoder output. However, the original dual-path framework uses a fixed feature dimension and a constant segment size across all RNN layers, which limits its ability to produce high-resolution features. In this study, we present the Multi-Scale Feature Fusion Transformer Network (MSFFT-Net). In its separation stage, MSFFT-Net runs multiple dual-path processing paths in parallel, each dedicated to feature modeling at a different scale, so that coarse-grained and fine-grained features are obtained simultaneously. In addition, features from one dual-path processing path are exchanged and shared with the other paths, yielding high feature resolution across layers and thus more accurate mask estimation. Experimental results show that MSFFT-Net outperforms state-of-the-art baselines on several single-channel speech separation benchmarks.
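To make the multi-scale idea concrete, below is a minimal PyTorch sketch of two parallel dual-path branches operating on the same encoder features at different segment sizes, with a simple cross-branch feature exchange after each block. Everything here is an illustrative assumption rather than the authors' implementation: the names (`segment`, `DualPathBlock`, `MSFFTSketch`), the segment sizes, the nearest-neighbour resampling used for the exchange, and the use of BiLSTM dual-path blocks (as in DPRNN) in place of MSFFT-Net's transformer blocks.

```python
# Minimal sketch of multi-scale dual-path processing with feature exchange.
# Hypothetical names and hyperparameters; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def segment(x, k):
    """Split a sequence [B, N, T] into non-overlapping segments [B, N, K, S]."""
    b, n, t = x.shape
    pad = (k - t % k) % k                     # pad T to a multiple of K
    x = F.pad(x, (0, pad))
    return x.reshape(b, n, -1, k).transpose(2, 3)  # [B, N, K, S]


class DualPathBlock(nn.Module):
    """One dual-path block: intra-segment RNN, then inter-segment RNN.
    (A BiLSTM stand-in for the transformer blocks used in MSFFT-Net.)"""
    def __init__(self, n):
        super().__init__()
        self.intra = nn.LSTM(n, n // 2, batch_first=True, bidirectional=True)
        self.inter = nn.LSTM(n, n // 2, batch_first=True, bidirectional=True)
        self.norm_intra = nn.GroupNorm(1, n)
        self.norm_inter = nn.GroupNorm(1, n)

    def forward(self, x):                     # x: [B, N, K, S]
        b, n, k, s = x.shape
        # Intra-segment pass: each segment is a short sequence of length K.
        y = x.permute(0, 3, 2, 1).reshape(b * s, k, n)
        y, _ = self.intra(y)
        x = x + self.norm_intra(y.reshape(b, s, k, n).permute(0, 3, 2, 1))
        # Inter-segment pass: a length-S sequence across segments.
        z = x.permute(0, 2, 3, 1).reshape(b * k, s, n)
        z, _ = self.inter(z)
        return x + self.norm_inter(z.reshape(b, k, s, n).permute(0, 3, 1, 2))


class MSFFTSketch(nn.Module):
    """Two dual-path branches at different segment sizes with cross-branch fusion."""
    def __init__(self, n=64, k_fine=50, k_coarse=100, n_blocks=2):
        super().__init__()
        self.k_fine, self.k_coarse = k_fine, k_coarse
        self.fine = nn.ModuleList(DualPathBlock(n) for _ in range(n_blocks))
        self.coarse = nn.ModuleList(DualPathBlock(n) for _ in range(n_blocks))

    def forward(self, feats):                 # feats: [B, N, T] encoder output
        xf = segment(feats, self.k_fine)      # fine-grained view
        xc = segment(feats, self.k_coarse)    # coarse-grained view
        for blk_f, blk_c in zip(self.fine, self.coarse):
            xf, xc = blk_f(xf), blk_c(xc)
            # Feature exchange: resample each branch onto the other's grid and add.
            xf = xf + F.interpolate(xc, size=xf.shape[2:], mode="nearest")
            xc = xc + F.interpolate(xf, size=xc.shape[2:], mode="nearest")
        return xf                             # fed to mask estimation in the full model


x = torch.randn(2, 64, 4000)                  # a batch of encoder feature sequences
print(MSFFTSketch()(x).shape)                 # torch.Size([2, 64, 50, 80])
```

The key point the sketch isolates is that both branches see the whole signal, but one models short segments (fine temporal detail) while the other models long segments (coarse context), and the per-block exchange lets each branch benefit from the other's resolution.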

Several samples from the WSJ0-2mix dataset
[Audio demos: each row pairs a mixed recording with the two clean sources, Speaker1 and Speaker2.]

Separation results of our proposed MSFFT-3P and MSFFT-2P
[Audio demos: for each mixed recording, the clean references for spk1 and spk2 are presented alongside the separated outputs of DPRNN, MSFFT-3P, and MSFFT-2P.]