Enhancing Bone-Conducted Speech with CATAD (Conformer-Based Adversarial Training with Adaptive Diffusion)

Because human tissue acts as a low-pass filter, the high-frequency components of bone-conducted (BC) speech are severely attenuated, which reduces speech quality and intelligibility. To address this issue, this paper proposes a novel adversarial learning network based on the Conformer architecture and an adaptive diffusion process. The generator uses a TS-Conformer consisting of two Conformer modules, one capturing temporal dependencies and the other capturing frequency dependencies, to make the speech features more expressive. To stabilize adversarial training and keep the generated results diverse, we adopt an adaptive diffusion process that adds noise to both generated and real data, so the discriminator must distinguish diffused real data from diffused generated data. Experimental results on the ESMB dataset demonstrate that the proposed method significantly improves BC speech recovery, enhancing both speech quality and intelligibility.
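The idea of diffusing both real and generated data before they reach the discriminator can be illustrated with a small sketch. This is not the CATAD implementation: the linear noise schedule, the function names (`alpha_bar_schedule`, `diffuse`, `adapt_steps`), the target score of 0.6, and the rule for adjusting the diffusion step are all assumptions chosen for illustration, and plain Python lists stand in for speech features.

```python
import math
import random


def alpha_bar_schedule(max_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(max_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(max_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def diffuse(x, t, alpha_bars, rng):
    """Forward diffusion at step t: shrink the signal and add Gaussian noise.

    Step 0 means "no diffusion" and returns a copy of the input.
    """
    if t == 0:
        return list(x)
    a = alpha_bars[t - 1]
    return [math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for v in x]


def adapt_steps(current_t, disc_real_score, target=0.6, step=1, max_steps=100):
    """Hypothetical adaptation rule (an assumption, not taken from the paper):
    raise the noise level when the discriminator is too confident on real
    data, and lower it otherwise."""
    if disc_real_score > target:
        return min(current_t + step, max_steps)
    return max(current_t - step, 0)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(100)

real = [math.sin(0.1 * n) for n in range(16)]  # stand-in for real target features
fake = [0.5 * v for v in real]                 # stand-in for generator output

# The discriminator scored real data confidently, so the noise level goes up.
t = adapt_steps(current_t=10, disc_real_score=0.9)

# Both inputs are diffused with the same step before being shown to the discriminator.
noisy_real = diffuse(real, t, alpha_bars, rng)
noisy_fake = diffuse(fake, t, alpha_bars, rng)
```

In this sketch the discriminator would only ever see `noisy_real` and `noisy_fake`, never the clean signals, and the diffusion step `t` would be updated periodically during training from the discriminator's recent scores.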

Figure: CATAD model architecture

Demo for CATAD