CN112037759A - Anti-noise perception sensitivity curve establishing and voice synthesizing method - Google Patents
- Publication number: CN112037759A (application CN202010686375.3A)
- Authority
- CN
- China
- Prior art keywords
- noise
- critical
- sensitivity curve
- voice
- perception
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/69—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for evaluating synthetic or decoded voice signals
Abstract
The invention provides a method for establishing an anti-noise perception sensitivity curve and synthesizing speech. The method comprises: dividing noise by band-pass filtering according to the critical frequency bands perceived by the human ear to obtain a plurality of critical-band noises; for each critical-band noise, recording a corresponding anti-noise voice sequence at different noise decibel levels; determining a perception threshold based on the SII objective test index and performing a noise decibel-level perception test on each critical band to obtain updated critical decibels; generating an anti-noise perception sensitivity curve from the updated critical decibels; and obtaining critical decibel values from the anti-noise perception sensitivity curve, selecting anti-noise voices at different critical decibel values, training an anti-noise voice feature mapping model, and performing speech synthesis with the mapped anti-noise voice features. By exploiting human hearing characteristics in noisy environments, the method is better suited to the practical application scenarios of anti-noise voice conversion.
Description
Technical Field
The invention belongs to the technical field of acoustics, and particularly relates to an anti-noise perception sensitivity curve establishing and voice synthesizing method.
Background
An equal-loudness curve plots the sound pressure level against frequency for pure tones that a typical listener perceives as equally loud. Among the equal-loudness curves of binaural audiometry, the curve with the lowest threshold, the minimum audible field for pure tones, serves as the hearing threshold curve. Loudness is mainly determined by sound intensity: increasing the intensity correspondingly raises the loudness level. However, loudness is not determined by intensity alone; it also depends on frequency, and pure tones of different frequencies grow in loudness at different rates, with low-frequency pure tones growing faster than mid-frequency ones.
Thus, similarly to the equal-loudness curve, speakers perceive ambient noise differently at different frequencies and noise levels, and accordingly trigger different anti-noise sound-production patterns. By determining a speaker's discrimination threshold curve for changes in the decibel level of environmental noise, one can build an anti-noise sound-production model based on the Lombard effect, trigger the corresponding anti-noise voice conversion at the right time, and keep the converted anti-noise speech consistent with the various real noise scenes. However, the prior art focuses on the acoustic features changed by the Lombard effect and their importance for improving the intelligibility of anti-noise speech. Lacking guidance from anti-noise perception sensitivity, the converted anti-noise speech does not match the real scene, which degrades subsequent speech applications.
To make full use of human perceptual characteristics in different noise environments, the invention studies the anti-noise vocalization mechanism from the perspective of auditory perception, establishes a speaker's perception sensitivity curve for environmental noise, and thereby addresses the current disconnect between anti-noise voice conversion and real scenes caused by the lack of an auditory perception model guiding anti-noise speech production.
Disclosure of Invention
The invention provides an anti-noise perception sensitivity curve establishing and voice synthesizing method, aiming to solve the problem that existing anti-noise voice production lacks the guidance of an auditory perception model, and to reduce the detail differences across frequency.
The technical scheme adopted by the invention is that the method for establishing the anti-noise perception sensitivity curve comprises the following steps,
step 1, dividing noise according to critical frequency bands sensed by human ears by using band-pass filtering to obtain a plurality of critical frequency band noises;
step 2, recording corresponding anti-noise voice sequences according to different noise decibels aiming at each critical frequency band noise in the step 1;
step 3, determining a perception threshold value based on the SII objective test index, and performing noise decibel level perception test on each critical frequency band to obtain an updated critical decibel;
and 4, generating an anti-noise perception sensitivity curve according to the updated critical decibels obtained in the step 3.
In step 1, white noise is used as the noise.
In step 1, Bark band or Mel band is used as the critical band of human ear perception.
Moreover, step 2 is implemented as follows: firstly, for each critical-band noise obtained in step 1, data are collected through an artificial head, each critical-band noise is adjusted according to a preset signal-to-noise ratio, and its decibel level is calibrated; then, for each critical-band noise, voice sequences are recorded at the different decibel levels.
Given a preset lower limit MIN, upper limit MAX, and step size d of the signal-to-noise-ratio range, recordings are made at signal-to-noise ratios of MIN, MIN + d, MIN + 2d, …, MAX to obtain the corresponding voice sequences.
In step 3, the noise decibel level sensing test for each critical frequency band is realized by using the MUSHRA standard.
The invention also provides a speech synthesis method based on the anti-noise perception sensitivity curve, which comprises the following steps,
step 1, dividing noise according to critical frequency bands sensed by human ears by using band-pass filtering to obtain a plurality of critical frequency band noises;
step 2, recording corresponding anti-noise voice sequences according to different noise decibels aiming at each critical frequency band noise in the step 1;
step 3, determining a perception threshold value based on the SII objective test index, and performing noise decibel level perception test on each critical frequency band to obtain an updated critical decibel;
step 4, generating an anti-noise perception sensitivity curve according to the updated critical decibels obtained in the step 3;
and 5, obtaining critical decibel values from the anti-noise perception sensitivity curve obtained in the step 4, selecting anti-noise voices with different critical decibel values, training an anti-noise voice feature mapping model, and performing voice synthesis by using the mapped anti-noise voice features.
Furthermore, in step 5, the WORLD vocoder is used to extract acoustic features, including the fundamental frequency and the spectral envelope.
In step 5, the anti-noise speech feature mapping model is obtained by training a Gaussian mixture model on the spectral envelope using the EM method.
Moreover, speech synthesis is performed by combining the fundamental frequency feature with the spectral envelope feature conversion result obtained by the anti-noise voice feature mapping model.
By exploiting human hearing characteristics and the special vocal mechanism in noisy environments, the method of the invention provides an anti-noise perception sensitivity curve establishing and voice synthesizing method that is better suited to practical anti-noise voice-conversion scenarios, with high accuracy and broad application prospects; for example, large anti-noise voice data sets are needed in practical applications such as speech separation and conference transcription.
Detailed Description
To facilitate understanding and practice of the invention by those of ordinary skill in the art, the present invention is described in further detail below with reference to examples; it should be understood that the examples are illustrative and are not intended to limit the invention.
The method provided by the invention can be implemented with computer software and corresponding hardware; the process of the invention is explained in detail below.
Example one
The embodiment of the invention provides a speech synthesis method based on the establishment of an anti-noise perception sensitivity curve, with the following specific implementation steps:
step 1: dividing the noise according to the critical frequency band sensed by the human ear by using band-pass filtering to obtain a plurality of critical frequency band noises;
the noise used in the embodiment is white noise, a Bark band is used as a critical frequency band of human ear perception, and the white noise is divided according to the Bark band by using band-pass filtering.
Step 2: recording a corresponding anti-noise voice sequence according to different noise decibels aiming at each critical frequency band noise obtained in the step 1;
for step 2, this embodiment may be implemented by the following steps:
step 2.1: and (3) aiming at each Bark band noise in the step (1), acquiring data through a manual head, correspondingly adjusting each Bark band noise according to a preset signal-to-noise ratio, and calibrating the decibel level.
Considering that common scene noise is about 35 dB and the pain threshold of the human ear is 85 dB, the preset signal-to-noise-ratio range in this embodiment is 40–85 dB, i.e. MIN = 40, MAX = 85, with step size d = 5 dB. For each Bark-band noise, recordings are made at signal-to-noise ratios of 40 dB, 45 dB, …, 80 dB and 85 dB to obtain the corresponding voice data.
The preferred recording materials and specific settings used in the examples are as follows:
The embodiment uses an artificial-head device for recording, such as a G.R.A.S. KEMAR 45BA 1/2-inch low-noise ear simulator system, which includes a highly realistic extended ear canal. To avoid wall reflections and other extraneous noise, the various environmental noises are played through headphones worn by the artificial head, so that an accurate signal-to-noise ratio can be obtained from the artificial-head recording.
The signal-to-noise ratio is calculated as follows:

$$\mathrm{SNR} = 10\log_{10}\frac{p_s}{p_d},\qquad p_s = \frac{1}{N}\sum_{n=1}^{N} s^2(n),\qquad p_d = \frac{1}{N}\sum_{n=1}^{N} d^2(n)$$

where s(n) is the speech signal, d(n) is the noise signal, p_s is the speech signal power, p_d is the noise signal power, n indexes the samples, and N is the number of samples.
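The SNR computation and the noise calibration of step 2.1 can be sketched in a few lines of numpy; the function names here are illustrative, not from the patent:

```python
import numpy as np

def snr_db(s, d):
    """SNR = 10*log10(p_s / p_d), the mean-power ratio of speech to noise."""
    return 10.0 * np.log10(np.mean(s ** 2) / np.mean(d ** 2))

def scale_noise_to_snr(s, d, target_db):
    """Scale the noise d so that mixing it with speech s gives the target SNR."""
    gain = 10.0 ** ((snr_db(s, d) - target_db) / 20.0)
    return d * gain

# toy calibration: drive an arbitrary noise to a 40 dB SNR against "speech"
rng = np.random.default_rng(1)
speech = rng.standard_normal(16000)
noise = 0.3 * rng.standard_normal(16000)
calibrated = scale_noise_to_snr(speech, noise, 40.0)
print(round(snr_db(speech, calibrated), 6))  # 40.0
```

Stepping `target_db` through MIN, MIN + d, …, MAX reproduces the calibration grid described above.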
Step 2.2: and respectively recording voice sequences for different decibel levels according to the noise of each Bark band.
In specific implementation, each speaker wears headphones playing the noise calibrated in step 2.1, and a voice sequence is recorded for each speaker at the different decibel levels of each Bark-band noise. The corresponding experiment of this embodiment was carried out in an anechoic chamber at Wuhan University, using a high-fidelity microphone to record the voice data at the corresponding decibel levels.
Specifically, step 1 and step 2 may be performed in advance as input data.
And step 3: determining a perception threshold value based on a Speech Intelligibility Index (SII) objective Test index, and then performing noise decibel level perception Test on each critical frequency band by using a MUSHRA (Multi-Stimulus Test with high Reference and Anchor) standard to obtain an updated critical decibel level;
in specific implementation, other objective test indexes can be adopted, for example; other criteria may also be used for testing, such as the clarity Index (AI)
For step 3, this embodiment may be implemented by the following steps:
step 3.1: the improvement is carried out based on a definition index SII, the SII depends on the audible proportion of a listener in the spectrum information, the step uses a definition formula of the SII, and the critical decibel is calculated under the condition of a determined SII score, and the definition formula of the SII is as follows:
wherein, SII score is 0-1, and 0.35 is taken for determining decibel threshold value in the embodiment; n isf20 for the total number of frequency bands; wfA human ear perception weight representing the frequency band f; l isfA variable element representing a speech level distortion; efAnd DfDecibels representing speech and interference noise, respectively;representing the audible threshold for that band.
By the formula, while the speech intelligibility is ensured, the noise signal-to-noise ratio (critical decibel) corresponding to the anti-noise speech is obtained, namely Ef-Df。
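A minimal numerical sketch of this inversion, assuming the ANSI S3.5-style band audibility A_f = clip((E_f − D_f + 15)/30, 0, 1) with the level-distortion factor L_f fixed to 1 and equal weights summing to 1 (assumptions; the patent states only that the critical decibel is solved for a fixed SII score of 0.35):

```python
import numpy as np

def sii(e_db, d_db, w):
    """Simplified SII: sum of weighted band audibilities (ANSI S3.5 style,
    with the level-distortion factor L_f taken as 1 -- an assumption)."""
    e, d, w = (np.asarray(a, dtype=float) for a in (e_db, d_db, w))
    a_f = np.clip((e - d + 15.0) / 30.0, 0.0, 1.0)
    return float(np.sum(w * a_f))

def critical_snr(target_sii=0.35):
    """Invert the simplified SII for E_f - D_f under equal weights summing
    to 1 and a flat speech/noise spectrum."""
    return 30.0 * target_sii - 15.0

n_f = 20
w = np.full(n_f, 1.0 / n_f)
snr = critical_snr(0.35)
print(snr, sii(np.full(n_f, snr), np.zeros(n_f), w))  # -4.5 and 0.35
```

With frequency-dependent weights W_f the inversion would instead be solved numerically per band, which is what motivates the per-band critical decibels of the sensitivity curve.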
Step 3.2: fine-tuning the critical decibel value in step 3.1: noise decibel level perception experiments were performed on each Bark band noise, where hearing perception tests were performed using the MUSHRA standard, and Word Error Rate (WER) was calculated. In order to keep the recognized word sequence consistent with the standard sequence, some words are replaced, deleted, or inserted, and the total number of words is divided by the total number of words in the standard sequence, multiplied by a percentage. The final word error rate calculation formula is as follows:
the obtained error rate is a score, and the statistical significance is required to be taken as a reference, and the average score of each voice sequence is calculated firstly
Wherein, scoreijkAnd (4) representing the score of the ith listener on the kth voice under the jth signal-to-noise ratio level, wherein N is the total number of listeners in the subjective experiment. Confidence intervals for each average score were then calculated:
the confidence coefficient is 95%, and non-repeated boundary values are found by comparing confidence intervals of different signal-to-noise ratios, and the critical decibel is updated.
Step 4: Generate an anti-noise perception sensitivity curve from the test result of step 3 (the updated critical decibels obtained in step 3.2).
In the present embodiment Bark bands are used, so the sensitivity curve is plotted with the Bark band on the horizontal axis and the Bark-band noise decibel level on the vertical axis. In specific implementation, other frequency bands, such as Mel bands, may be used to generate the corresponding curve.
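Once the curve exists, reading off a critical decibel for a given frequency might look like the sketch below; the Zwicker Bark-scale formula and the per-band values are illustrative assumptions:

```python
import math

def hz_to_bark(f_hz):
    """Zwicker's approximation of the Bark scale (an assumption; the patent
    does not specify which Bark formula it uses)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def critical_db(curve, f_hz):
    """Look up the critical decibel for a frequency from a per-Bark-band curve."""
    band = min(int(hz_to_bark(f_hz)), len(curve) - 1)
    return curve[band]

# hypothetical sensitivity curve: one critical-dB value per Bark band
curve = [62, 61, 60, 58, 57, 55, 54, 52, 51, 50,
         49, 49, 48, 48, 47, 47, 46, 46, 45, 45, 44]
print(critical_db(curve, 1000.0))  # band ~8.5 -> curve[8]
```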
And 5: and 4, obtaining critical decibel values from the anti-noise perception sensitivity curve in the step 4, selecting anti-noise voices with different critical decibel values, training an anti-noise voice feature mapping model, and performing voice synthesis by using the mapped anti-noise voice features.
For step 5, this embodiment may be implemented by the following steps:
step 5.1: and selecting anti-noise voices with different critical decibel values and corresponding common voices in the anti-noise perception sensitivity curve, and extracting acoustic features such as fundamental frequency (f0) and spectral envelope (spec).
In this embodiment, the method of extracting acoustic features by using the WORLD vocoder includes:
f0=DIO(x,fs)
spec=CheapTrick(x,fs,f0)
where x is the input speech signal, fs is the sampling rate, and DIO and CheapTrick are prior art in the WORLD vocoder, which the present invention does not describe in detail.
Step 5.2: and (5) training an anti-noise voice feature mapping model by using the acoustic features extracted in the step (5.1), and performing feature conversion by using the feature mapping model.
The anti-noise speech feature mapping model used in this embodiment is a Gaussian Mixture Model (GMM), trained with the Expectation-Maximization (EM) algorithm on the spec features of step 5.1, where the spec feature is 24-dimensional. The GMM is prior art and is not described in detail.
In this embodiment, the GMM is used as the feature mapping model, and neural network models such as CycleGAN and StarGAN may also be used.
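The GMM mapping can be illustrated in its single-component (K = 1) special case, where the EM-trained mixture reduces to one joint Gaussian over stacked source/target features and the conversion is the closed-form MMSE regression y' = μ_y + Σ_yx Σ_xx⁻¹ (x − μ_x). This numpy sketch is a deliberate simplification of the patent's 24-dimensional multi-component GMM:

```python
import numpy as np

def fit_joint_gaussian(x, y):
    """Fit one joint Gaussian over stacked [x; y] features -- the K = 1
    special case of the GMM mapping (an illustrative simplification)."""
    z = np.hstack([x, y])
    mu = z.mean(axis=0)
    cov = np.cov(z, rowvar=False)
    d = x.shape[1]
    return mu[:d], mu[d:], cov[:d, :d], cov[d:, :d]

def convert(x, mu_x, mu_y, cov_xx, cov_yx):
    """MMSE conversion: y' = mu_y + Cov_yx Cov_xx^{-1} (x - mu_x)."""
    return mu_y + (x - mu_x) @ np.linalg.solve(cov_xx, cov_yx.T)

# toy data: the target is a known linear map of the source plus small noise
rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 3))
A = np.array([[2.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, -1.0]])
y = x @ A + 0.01 * rng.standard_normal((2000, 3))
params = fit_joint_gaussian(x, y)
y_hat = convert(x, *params)
print(np.allclose(y_hat, y, atol=0.1))  # True: the mapping recovers the relation
```

A full K-component GMM replaces the single regression with a posterior-weighted sum of such per-component regressions, which is what the EM training of the embodiment provides.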
Step 5.3: and converting the spec characteristic into spe' by using the mapping model in the step 5.2, and combining other characteristics in the step 5.1 for voice synthesis.
This step adopts WORLD vocoder to carry out speech synthesis, includes:
source=Platinum(x,f0,spec)
y=SynthesisByWORLD(source,spec')
where y is the synthesized voice; Platinum and SynthesisByWORLD are prior art of the WORLD vocoder and are not repeated in the present invention.
In this embodiment, the WORLD vocoder is preferably used for speech analysis and synthesis; alternatively, a STRAIGHT vocoder or the like can be used for analysis, and neural network models such as WaveNet and WaveGAN can be used for synthesis.
Example two
The second embodiment of the invention fully utilizes the auditory characteristic of people in a noise environment, provides an anti-noise perception sensitivity curve establishing method, and can provide key guidance for anti-noise voice conversion in practical application. In specific implementation, the steps 1 to 4 in the first embodiment are implemented.
In specific implementation, the method provided by the technical scheme of the invention can be used for realizing an automatic operation process by a person skilled in the art by adopting a computer software technology to carry out operations such as generating an anti-noise perception sensitivity curve, synthesizing voice and the like. The system device for operating the method, such as a computer readable storage medium storing the corresponding computer program of the technical solution of the present invention and a computer apparatus including the corresponding computer program, should also be within the scope of the present invention.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. An anti-noise perception sensitivity curve establishing method, characterized by comprising the following steps:
step 1, dividing noise according to critical frequency bands sensed by human ears by using band-pass filtering to obtain a plurality of critical frequency band noises;
step 2, recording corresponding anti-noise voice sequences according to different noise decibels aiming at each critical frequency band noise in the step 1;
step 3, determining a perception threshold value based on the objective test index, and performing noise decibel level perception test on each critical frequency band to obtain updated critical decibels;
and 4, generating an anti-noise perception sensitivity curve according to the updated critical decibels obtained in the step 3.
2. The anti-noise perceptual sensitivity curve creation method of claim 1, wherein: in step 1, the noise is white noise.
3. The anti-noise perceptual sensitivity curve creation method of claim 1, wherein: in step 1, Bark band or Mel band is used as the critical band of human ear perception.
4. The anti-noise perceptual sensitivity curve creation method of claim 1, wherein: step 2 is implemented by firstly, for each critical-band noise obtained in step 1, collecting data through an artificial head, adjusting each critical-band noise according to a preset signal-to-noise ratio, and calibrating the decibel level; and then recording voice sequences at the different decibel levels for each critical-band noise.
5. The anti-noise perceptual sensitivity curve creation method of claim 4, wherein: and recording according to the preset lower limit MIN, the preset upper limit MAX and the preset step length d of the signal-to-noise ratio range and the signal-to-noise ratio of MIN, MIN + d, MIN +2d, … and MAX respectively to obtain a corresponding voice sequence.
6. The anti-noise perceptual sensitivity curve creation method of claim 1, wherein: in step 3, a perception threshold value is determined based on the SII objective test index, and a noise decibel level perception test is carried out on each critical frequency band by adopting the MUSHRA standard.
7. A speech synthesis method based on anti-noise perception sensitivity curve establishment, characterized by comprising the following steps:
step 1, dividing noise according to critical frequency bands sensed by human ears by using band-pass filtering to obtain a plurality of critical frequency band noises;
step 2, recording corresponding anti-noise voice sequences according to different noise decibels aiming at each critical frequency band noise in the step 1;
step 3, determining a perception threshold value based on the objective test index, and performing noise decibel level perception test on each critical frequency band to obtain updated critical decibels;
step 4, generating an anti-noise perception sensitivity curve according to the updated critical decibels obtained in the step 3;
and 5, obtaining critical decibel values from the anti-noise perception sensitivity curve obtained in the step 4, selecting anti-noise voices with different critical decibel values, training an anti-noise voice feature mapping model, and performing voice synthesis by using the mapped anti-noise voice features.
8. The method of speech synthesis based on antinoise perceptual sensitivity curve creation as defined in claim 7, wherein: in step 5, a WORLD vocoder is used to extract acoustic features, including fundamental frequency and spectral envelope.
9. The speech synthesis method based on an anti-noise perceptual sensitivity curve according to claim 8, wherein: in step 5, the anti-noise voice feature mapping model is obtained by training a Gaussian mixture model on the spectral envelope using the EM (Expectation-Maximization) method.
10. The speech synthesis method based on an anti-noise perceptual sensitivity curve according to claim 9, wherein: speech synthesis is performed by combining the fundamental frequency features with the spectral envelope feature conversion result obtained by the anti-noise voice feature mapping model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010686375.3A CN112037759B (en) | 2020-07-16 | 2020-07-16 | Anti-noise perception sensitivity curve establishment and voice synthesis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112037759A true CN112037759A (en) | 2020-12-04 |
CN112037759B CN112037759B (en) | 2022-08-30 |
Family
ID=73579514
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010686375.3A Active CN112037759B (en) | 2020-07-16 | 2020-07-16 | Anti-noise perception sensitivity curve establishment and voice synthesis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112037759B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113450780A (en) * | 2021-06-16 | 2021-09-28 | 武汉大学 | Lombard effect classification method for auditory perception loudness space |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1460992A (en) * | 2003-07-01 | 2003-12-10 | 北京阜国数字技术有限公司 | Low-time-delay adaptive multi-resolution filter group for perception voice coding/decoding |
US20040024591A1 (en) * | 2001-10-22 | 2004-02-05 | Boillot Marc A. | Method and apparatus for enhancing loudness of an audio signal |
US20110178799A1 (en) * | 2008-07-25 | 2011-07-21 | The Board Of Trustees Of The University Of Illinois | Methods and systems for identifying speech sounds using multi-dimensional analysis |
CN103165136A (en) * | 2011-12-15 | 2013-06-19 | 杜比实验室特许公司 | Audio processing method and audio processing device |
CN103390408A (en) * | 2012-05-09 | 2013-11-13 | 奥迪康有限公司 | Method and apparatus for processing audio signal |
CN105869652A (en) * | 2015-01-21 | 2016-08-17 | 北京大学深圳研究院 | Psychological acoustic model calculation method and device |
US20190156855A1 (en) * | 2016-05-11 | 2019-05-23 | Nuance Communications, Inc. | Enhanced De-Esser For In-Car Communication Systems |
CN110085245A (en) * | 2019-04-09 | 2019-08-02 | 武汉大学 | A kind of speech intelligibility Enhancement Method based on acoustic feature conversion |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040024591A1 (en) * | 2001-10-22 | 2004-02-05 | Boillot Marc A. | Method and apparatus for enhancing loudness of an audio signal |
CN1460992A (en) * | 2003-07-01 | 2003-12-10 | 北京阜国数字技术有限公司 | Low-time-delay adaptive multi-resolution filter group for perception voice coding/decoding |
US20110178799A1 (en) * | 2008-07-25 | 2011-07-21 | The Board Of Trustees Of The University Of Illinois | Methods and systems for identifying speech sounds using multi-dimensional analysis |
CN103165136A (en) * | 2011-12-15 | 2013-06-19 | 杜比实验室特许公司 | Audio processing method and audio processing device |
CN103390408A (en) * | 2012-05-09 | 2013-11-13 | 奥迪康有限公司 | Method and apparatus for processing audio signal |
CN105869652A (en) * | 2015-01-21 | 2016-08-17 | 北京大学深圳研究院 | Psychological acoustic model calculation method and device |
US20190156855A1 (en) * | 2016-05-11 | 2019-05-23 | Nuance Communications, Inc. | Enhanced De-Esser For In-Car Communication Systems |
CN110085245A (en) * | 2019-04-09 | 2019-08-02 | 武汉大学 | A kind of speech intelligibility Enhancement Method based on acoustic feature conversion |
Non-Patent Citations (5)
Title |
---|
G. LI et al.: "Normal-To-Lombard Speech Conversion by LSTM Network and BGMM for Intelligibility Enhancement of Telephone Speech", 2020 IEEE International Conference on Multimedia and Expo (ICME) *
S. SESHADRI et al.: "Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning", IEEE Access *
TIAN Bin et al.: "A noisy Lombard and Loud speech compensation method for speech recognition in strong noise environments", Acta Acustica (Chinese edition) *
TIAN Bin et al.: "A noisy Lombard and Loud speech compensation method for speech recognition in strong noise environments", Acta Acustica *
CHEN Sheng et al.: "A subspace speech enhancement algorithm based on the perceptual masking effect of the human ear", Electronics Quality *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113450780A (en) * | 2021-06-16 | 2021-09-28 | 武汉大学 | Lombard effect classification method for auditory perception loudness space |
CN113450780B (en) * | 2021-06-16 | 2023-02-24 | 武汉大学 | Lombard effect classification method for auditory perception loudness space |
Also Published As
Publication number | Publication date |
---|---|
CN112037759B (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5737719A (en) | Method and apparatus for enhancement of telephonic speech signals | |
US9943253B2 (en) | System and method for improved audio perception | |
Humes et al. | Application of the Articulation Index and the Speech Transmission Index to the recognition of speech by normal-hearing and hearing-impaired listeners | |
US8369549B2 (en) | Hearing aid system adapted to selectively amplify audio signals | |
US8867764B1 (en) | Calibrated hearing aid tuning appliance | |
CN109246515B (en) | A kind of intelligent earphone and method promoting personalized sound quality function | |
CN107293286B (en) | Voice sample collection method based on network dubbing game | |
US20140309549A1 (en) | Methods for testing hearing | |
Boothroyd et al. | The hearing aid input: A phonemic approach to assessing the spectral distribution of speech | |
Marzinzik | Noise reduction schemes for digital hearing aids and their use for the hearing impaired | |
US6956955B1 (en) | Speech-based auditory distance display | |
Kates et al. | The hearing-aid audio quality index (HAAQI) | |
Monson et al. | The maximum audible low-pass cutoff frequency for speech | |
CN112037759B (en) | Anti-noise perception sensitivity curve establishment and voice synthesis method | |
WO2022240346A1 (en) | Voice optimization in noisy environments | |
KR100888049B1 (en) | A method for reinforcing speech using partial masking effect | |
DK2584795T3 (en) | Method for determining a compression characteristic | |
Herzke et al. | Effects of instantaneous multiband dynamic compression on speech intelligibility | |
CN113450780B (en) | Lombard effect classification method for auditory perception loudness space | |
Salehi et al. | Electroacoustic assessment of wireless remote microphone systems | |
CN114205724B (en) | Hearing aid earphone debugging method, device and equipment | |
Bouserhal et al. | On the potential for artificial bandwidth extension of bone and tissue conducted speech: A mutual information study | |
Patel et al. | Frequency-based multi-band adaptive compression for hearing aid application | |
JP7404664B2 (en) | Audio processing device and audio processing method | |
RU2589298C1 (en) | Method of increasing legible and informative audio signals in the noise situation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |