Demo for BCSE-SSAL
Enhancing Bone-Conducted Speech with Spectrum Similarity Metric in Adversarial Learning
- Although bone-conducted (BC) speech has the advantage of being insusceptible to background noise, its transmission path through bone tissue causes not only severe attenuation of high-frequency components but also speech distortion and the loss of unvoiced speech, which substantially degrades both speech quality and intelligibility. Existing BC speech enhancement methods focus mainly on restoring high-frequency components and overlook the restoration of missing unvoiced speech and the mitigation of speech distortion, leaving a noticeable gap in speech quality and intelligibility compared to air-conducted (AC) speech. In this paper, an adversarial learning method based on a spectrum similarity metric is proposed for bone-conducted speech enhancement. The acoustic features corresponding to source excitation and filter response are disentangled using the WORLD vocoder and mapped to their AC speech counterparts with logarithmic Gaussian normalization and a vocal tract converter, respectively. To reconstruct unvoiced speech from BC speech and reduce the nonlinear distortion in BC speech, the vocal tract converter predicts low-dimensional Mel-cepstral coefficients of AC speech using a generator that is supervised by a classification discriminator and a spectrum similarity discriminator. The classification discriminator distinguishes between authentic AC speech and enhanced BC speech, while the spectrum similarity discriminator evaluates the spectrum similarity between enhanced BC speech and its AC counterpart. To evaluate spectrum similarity, the correlation between time-frequency units in long-duration spectra is captured by a self-attention layer embedded in the spectrum similarity discriminator.
Experimental results on various speech datasets show that the proposed method is capable of restoring unvoiced speech segments and reducing speech distortion, which yields accurate, fine-grained predictions of the AC spectrum and thus significant improvements in speech quality and intelligibility.
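The abstract mentions mapping the source-excitation feature with logarithmic Gaussian normalization but does not give a formula. The sketch below assumes the form commonly used in voice conversion: log-F0 of voiced BC frames is standardized with BC statistics and rescaled with AC statistics. The function names and the F0 values are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def log_f0_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    log_f0 = [math.log(v) for v in f0 if v > 0]
    return mean(log_f0), pstdev(log_f0)


def convert_f0(f0_bc, bc_stats, ac_stats):
    """Map a BC F0 contour onto AC log-F0 statistics.

    Unvoiced frames (F0 == 0) are left at 0.
    """
    mu_bc, sigma_bc = bc_stats
    mu_ac, sigma_ac = ac_stats
    converted = []
    for v in f0_bc:
        if v > 0:
            # Standardize in the log domain, then rescale to AC statistics.
            z = (math.log(v) - mu_bc) / sigma_bc
            converted.append(math.exp(z * sigma_ac + mu_ac))
        else:
            converted.append(0.0)
    return converted


# Hypothetical F0 contours in Hz (0 marks an unvoiced frame).
f0_bc_train = [110.0, 0.0, 120.0, 130.0, 0.0, 125.0]
f0_ac_train = [115.0, 0.0, 128.0, 140.0, 0.0, 133.0]

bc_stats = log_f0_stats(f0_bc_train)
ac_stats = log_f0_stats(f0_ac_train)
converted = convert_f0(f0_bc_train, bc_stats, ac_stats)
```

Under this assumption, the voiced frames of the converted contour have the same log-domain mean and standard deviation as the AC training data, while the voiced/unvoiced pattern of the BC input is unchanged.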
Ground truth target samples
Speakers | BC Speech | AC Speech |
---|---|---|
Female | ||
Male | ||
01 | ||
02 | ||
03 | ||
04 | ||
05 | ||
06 |
Speakers “Female” and “Male” are from the AEUCHSAC&BC-2017 corpus. The paper is available here.
Speakers “01”, “02”, “03”, “04”, “05”, and “06” are from the paper available here.
Comparison of the proposed method with baseline methods on the TMHINT and AEUCHSAC&BC-2017 datasets.
Speakers | GMM | BLSTM | CycleGAN | CycleGAN-VC2 | CycleGAN-DAL | Ours |
---|---|---|---|---|---|---|
Female | ||||||
Male | ||||||
01 | ||||||
02 | ||||||
03 | ||||||
04 | ||||||
05 | ||||||
06 |
Comparison of the proposed method to the Two-stage method on the ESMB BC speech dataset.
Speakers | AC Speech | BC Speech | Two-Stage Method | Ours |
---|---|---|---|---|
01 | ||||
02 | ||||
03 | ||||
04 | ||||
05 | ||||
06 | ||||
07 | ||||
08 | ||||
09 | ||||
10 |
References
2024
- Enhancing Bone-Conducted Speech with Spectrum Similarity Metric in Adversarial Learning (Under Revision R3). Speech Communication, 2024