CN107910016B

CN107910016B - Noise tolerance judgment method for noisy speech

Info

Publication number: CN107910016B
Application number: CN201711372174.0A
Authority: CN
Inventors: 王亦红
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2017-12-19
Filing date: 2017-12-19
Publication date: 2021-07-27
Anticipated expiration: 2037-12-19
Also published as: CN107910016A

Abstract

The invention discloses a method for judging whether the noise in noisy speech is tolerable. Noisy speech samples are selected, the a priori signal-to-noise ratio (SNR) of every frequency bin of each noisy speech sample is calculated, and the minimum a priori SNR of each sample is found; the minima of all samples are then compared to determine the noise tolerance threshold for noisy speech. To judge a given signal, the noisy speech is framed and pre-emphasized, endpoint detection is performed, the power spectra of the noise and of the noisy speech are estimated separately, and the a priori SNR of every frequency bin of each frame is calculated. If the a priori SNR of one or more frequency bins is below the threshold, the noise in that frame is not tolerable and speech enhancement is required. If the a priori SNR of every frequency bin of the frame is above the threshold, the noise in that frame is tolerable and no speech enhancement is needed. The method thus identifies noisy speech whose noise is acceptable; such speech is left unenhanced, which avoids losing useful speech information.

Description

Noise tolerance judgment method for noisy speech

Technical Field

The invention relates to a method for judging whether the noise in noisy speech is tolerable, and belongs to the technical field of noisy speech signal processing.

Background

In speech signal processing, speech enhancement raises the signal-to-noise ratio of noisy speech but also removes part of the useful speech information, which distorts the speech signal. Owing to the auditory masking effect of the human ear, the presence of a speech signal can raise the threshold at which its background noise becomes audible. Consequently, the background noise of some noisy speech signals is not heard at all or, even if heard, causes no discomfort. The present invention calls such background noise tolerable noise and provides a method for judging whether the noise is tolerable.

Disclosure of Invention

Purpose of the invention: to address the problems of the prior art, the invention provides a noise tolerance judgment method for noisy speech. If the background noise of a noisy speech signal is judged tolerable, the noise need not be removed during signal processing, which avoids the loss of useful speech information caused by noise removal and retains that information to the greatest possible extent.

Technical solution: the noise tolerance judgment method for noisy speech consists of two parts, determining a threshold and judging tolerance against that threshold.

Part 1: determining the threshold

First, several segments of clean speech are recorded.

Second, several scene noise samples are extracted from each of four noise categories in the Noisex-92 noise library: impulse noise, broadband noise, periodic noise and speech interference.

Third, each noise sample is added to each clean speech signal recorded in the first step to form a variety of noisy speech signals, and each noisy speech signal is enhanced at different signal-to-noise ratios within the range 0 dB to 20 dB.

Fourth, each noisy speech signal is given a mean opinion score (MOS) before and after speech enhancement.

Fifth, noisy speech signals of the same kind at different signal-to-noise ratios are taken as a group; within each group, the signals whose MOS scores before and after enhancement converge are found, and the one with the lowest signal-to-noise ratio among them is taken as the sample signal for that noisy speech.

Sixth, each noisy speech sample is framed and pre-emphasized, endpoint detection is performed on it, and the power spectra of the noise and of the noisy speech are estimated separately.
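The framing, pre-emphasis and power spectrum estimation of the sixth step can be sketched with NumPy as follows. The pre-emphasis coefficient 0.97, the 256-sample frame with a 128-sample hop and the Hamming window are common choices assumed here, not values given in the text, and the function names are hypothetical; endpoint detection is left out of this sketch.

```python
import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis y[n] = x[n] - mu * x[n-1] (mu = 0.97 is an assumption)."""
    return np.append(x[0], x[1:] - mu * x[:-1])

def frame_signal(x: np.ndarray, frame_len: int = 256, hop: int = 128) -> np.ndarray:
    """Split x into overlapping frames, one per row; a trailing partial frame is dropped."""
    if len(x) < frame_len:
        raise ValueError("signal shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def power_spectrum(frames: np.ndarray) -> np.ndarray:
    """Hamming-windowed periodogram |X(n, k)|^2 for bins k = 0 .. frame_len / 2."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
```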

Seventh, the a priori signal-to-noise ratio of each frequency bin in each frame of each noisy speech sample is calculated according to formula (1), in which:

ξ(n, k) is the a priori signal-to-noise ratio of the kth frequency bin of the nth frame;

the two measured quantities are the noisy speech power and the noise power at the kth frequency bin of the nth frame;

α takes the value 0.98.
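Only the ingredients of formula (1) are listed here: the noisy speech power, the noise power and α = 0.98. Since 0.98 is the weight commonly used in the decision-directed a priori SNR estimator, the sketch below assumes a simplified recursive form of that estimator, in which the previous frame's a priori SNR stands in for the previous clean-speech estimate. The names `a_priori_snr`, `P_Y` and `P_D` are illustrative, and the patent's exact formula may differ.

```python
import numpy as np

ALPHA = 0.98  # constant given in the text

def a_priori_snr(noisy_psd: np.ndarray, noise_psd: np.ndarray, alpha: float = ALPHA) -> np.ndarray:
    """Recursive a priori SNR xi(n, k); frames are rows, frequency bins are columns.

    ASSUMED form (a simplified decision-directed estimator, not confirmed by the text):
        gamma(n, k) = P_Y(n, k) / P_D(n, k)                      a posteriori SNR
        xi(n, k)    = alpha * xi(n-1, k) + (1 - alpha) * max(gamma(n, k) - 1, 0)
    noise_psd may be per frame (same shape) or a single row broadcast to all frames.
    """
    eps = np.finfo(float).eps
    gamma = noisy_psd / np.maximum(noise_psd, eps)
    inst = np.maximum(gamma - 1.0, 0.0)   # instantaneous (maximum-likelihood) estimate
    xi = np.empty_like(inst)
    xi[0] = inst[0]                        # no history for the first frame
    for n in range(1, inst.shape[0]):
        xi[n] = alpha * xi[n - 1] + (1.0 - alpha) * inst[n]
    return xi
```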

The minimum a priori signal-to-noise ratio of each noisy speech sample is found, and the minima of all samples are compared to determine the noise tolerance threshold for noisy speech. Following these steps, the invention determines the threshold to be 0.95 to 1.05.
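The text says only that the per-sample minima are compared. One plausible reading, assumed in this sketch, is that the candidate threshold interval spans the smallest and largest of those minima. `threshold_range` is a hypothetical helper, and its inputs are assumed to contain speech frames only (after endpoint detection).

```python
import numpy as np

def threshold_range(xi_per_sample):
    """Return (smallest, largest) of the per-sample minimum a priori SNRs.

    xi_per_sample: iterable of arrays (frames x bins), one per selected noisy-speech sample.
    ASSUMPTION: the span of the per-sample minima is the candidate threshold interval.
    """
    minima = [float(np.min(xi)) for xi in xi_per_sample]
    return min(minima), max(minima)
```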

Part 2: judging whether the noise of a noisy speech signal is tolerable

First, the noisy speech signal is framed and pre-emphasized.

Second, endpoint detection is performed on the noisy speech signal.

Third, the power spectra of the noise and of the noisy speech are estimated separately.
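The endpoint detection and the separate noise and noisy-speech power spectra of the second and third steps could be obtained as below. The text does not specify a detector or noise estimator; this is a hypothetical energy-based detector that assumes the first `n_init` frames are noise only and marks a frame as speech when its energy exceeds `factor` times that noise floor. The noise power spectrum is then the average spectrum of the non-speech frames.

```python
import numpy as np

def detect_speech_frames(frames: np.ndarray, n_init: int = 10, factor: float = 3.0) -> np.ndarray:
    """Energy-based endpoint detection; True marks a speech frame.

    ASSUMPTIONS: the first n_init frames are noise only, and a frame is speech
    when its short-time energy exceeds `factor` times their mean energy.
    """
    energy = np.sum(frames ** 2, axis=1)
    floor = np.mean(energy[:n_init])
    return energy > factor * floor

def noise_psd_from_silence(psd: np.ndarray, is_speech: np.ndarray) -> np.ndarray:
    """Average the power spectra of non-speech frames to estimate the noise power per bin."""
    silence = psd[~is_speech]
    if silence.size == 0:
        raise ValueError("no non-speech frames found")
    return silence.mean(axis=0)
```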

Fourth, the a priori signal-to-noise ratio of each frequency bin in each frame of the noisy speech is calculated according to formula (1).

Fifth, a threshold is selected within the range 0.95 to 1.05. If the a priori signal-to-noise ratio of any frequency bin of a noisy speech frame is below the selected threshold, the noise of that frame is considered intolerable and speech enhancement is required. Otherwise, the noise is considered tolerable and no enhancement is applied.
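The frame-level decision in the fifth step reduces to checking whether any bin of a frame falls below the threshold. A minimal sketch, assuming `xi` holds the a priori SNRs of one signal with frames as rows and frequency bins as columns; the function name is hypothetical.

```python
import numpy as np

def frames_needing_enhancement(xi: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """True for frames whose noise is NOT tolerable: at least one bin has xi below threshold."""
    if not 0.95 <= threshold <= 1.05:
        raise ValueError("threshold should lie in the 0.95 to 1.05 range")
    return np.any(xi < threshold, axis=1)
```

Frames flagged `False` are passed through unchanged, so only frames with at least one poorly masked bin are enhanced.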

Drawings

FIG. 1 is a flowchart of the method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of the waveform of a clean speech signal according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the waveform of a noisy speech signal with a signal-to-noise ratio of 15 dB according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the filtered waveform obtained with the noise tolerance judgment applied, according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of the filtered waveform obtained without the noise tolerance judgment, according to an embodiment of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

The noise tolerance judgment method for noisy speech consists of two parts, threshold determination and threshold-based judgment:

Part 1: determination of the threshold

First, noise samples are extracted from the Noisex-92 noise library

Noise samples of four types of additive noise, namely impulse noise, broadband noise, periodic noise and speech interference, are extracted from the Noisex-92 noise library for different scenes. For impulse noise, hand clapping and hammer knocking are selected as noise samples; for broadband noise, the noise inside a car travelling at 130 km/h, the noise in a noisy workshop and road noise are selected; for periodic noise, the sounds of an air-conditioner outdoor unit, an electric fan, a hair dryer and similar appliances are selected; for speech interference, multi-talker speech in an office and single-talker speech are selected.

Second, forming and enhancing the noisy speech

Each noise sample is added to each of the following five clean speech recordings: 'The highest good is like water; great virtue carries all things', 'Water can carry a boat, and it can also capsize it', 'Rowing against the current, one must advance or be driven back', 'speech enhancement algorithm' and 'selected noise samples'. This forms a variety of noisy speech signals, each of which is enhanced at different signal-to-noise ratios within the range 0 dB to 20 dB. Speech with hammer-knocking and clapping noise is enhanced with a short-time energy mean speech enhancement algorithm; speech with the noise of a car at 130 km/h, noisy-workshop noise and road noise is enhanced with spectral subtraction based on the auditory masking effect; speech with hair-dryer, electric-fan and air-conditioner outdoor-unit noise is enhanced with LMS adaptive filtering; and speech with single-talker or multi-talker interference is enhanced with comb filtering.
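Forming a noisy mixture at a prescribed signal-to-noise ratio amounts to scaling the noise so that the power ratio matches the target. This sketch assumes the SNR is defined from mean power over the whole utterance and that mixtures are formed in 5 dB steps (the text gives only the 0 dB to 20 dB range); `add_noise_at_snr` and `SNR_GRID_DB` are hypothetical names.

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that 10*log10(P_clean / P_noise) == snr_db, then add it to `clean`.

    The noise is repeated or truncated to the length of the clean signal.
    """
    noise = np.resize(noise, clean.shape)
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# ASSUMPTION: one mixture every 5 dB across the 0 dB to 20 dB range.
SNR_GRID_DB = range(0, 21, 5)
```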

Third, selecting the sample signals

Each noisy speech signal is given a MOS score before and after speech enhancement. The same noisy speech at different signal-to-noise ratios forms a group; within each group, the signals whose MOS scores before and after enhancement converge are found, and the one with the lowest signal-to-noise ratio is taken as the sample of that noisy speech.

Fourth, determining the threshold from the minimum a priori signal-to-noise ratio of the sample signals
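The selection rule can be expressed in a few lines. The text does not define "converge", so this sketch assumes it means the MOS scores before and after enhancement differ by at most a hypothetical tolerance `tol` of 0.1 points; `select_sample_snr` is an illustrative name.

```python
def select_sample_snr(mos_by_snr, tol=0.1):
    """Return the lowest SNR (dB) at which the MOS before and after enhancement converge.

    mos_by_snr maps SNR in dB -> (mos_before, mos_after).
    ASSUMPTION: 'converge' means |mos_after - mos_before| <= tol.
    Returns None when no SNR in the group converges.
    """
    converged = [snr for snr, (before, after) in mos_by_snr.items()
                 if abs(after - before) <= tol]
    return min(converged) if converged else None
```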

The a priori signal-to-noise ratio of each frequency bin in each frame of each noisy speech sample is calculated according to formula (1), in which:

ξ(n, k) is the a priori signal-to-noise ratio of the kth frequency bin of the nth frame;

the two measured quantities are the noisy speech power and the noise power at the kth frequency bin of the nth frame;

α takes the value 0.98.

The minimum a priori signal-to-noise ratio of each noisy speech sample is found, and the minima of all samples are compared, which gives a noise tolerance threshold of 0.95 to 1.05 for noisy speech.

Part 2: judging whether the noise of a noisy speech signal is tolerable

A clean speech signal of a female speaker saying 'speech enhancement algorithm' is recorded; its waveform is shown in FIG. 2. The additive noise is the factory noise factory2 from the Noisex-92 standard noise library. Mixing gives a noisy speech signal with a signal-to-noise ratio of 15 dB, shown in FIG. 3. Wiener filtering is applied to the noisy speech signal of FIG. 3 twice, once without and once with the noise tolerance judgment. The implementation steps are as follows:

First, the noisy speech signal is pre-emphasized, windowed and framed, and endpoint detection is performed;

second, the power spectra of the noise and of the noisy speech are estimated separately;

then, the a priori signal-to-noise ratio of each frequency bin in each frame of the noisy speech is calculated according to formula (1);

finally, 0.95 is taken from the range 0.95 to 1.05 as the noise tolerance threshold. If the a priori signal-to-noise ratio of any frequency bin of a noisy speech frame is below 0.95, the noise of that frame is considered intolerable and the frame is Wiener filtered; otherwise, the noise is considered tolerable and the frame is not filtered. The noisy speech signal of FIG. 3 is processed frame by frame in this way, giving the waveform shown in FIG. 4. For comparison, the noisy speech signal of FIG. 3 is also Wiener filtered without the noise tolerance judgment; the result is shown in FIG. 5.
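The gated processing that produces FIG. 4 can be sketched in the spectral domain. The sketch assumes the standard Wiener gain written in terms of the a priori SNR, ξ/(1 + ξ), which the text does not state explicitly; frames whose noise is tolerable pass through unchanged. `gated_wiener` is a hypothetical name, and resynthesis by inverse FFT and overlap-add is omitted.

```python
import numpy as np

def gated_wiener(spectra: np.ndarray, xi: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Apply the Wiener gain xi / (1 + xi) only to frames judged intolerable.

    spectra: complex STFT of the noisy speech (frames x bins).
    xi:      a priori SNR per frame and bin, same shape as spectra.
    A frame is intolerable when any of its bins has xi below the threshold.
    """
    intolerable = np.any(xi < threshold, axis=1)   # frame-level tolerance decision
    gain = np.ones_like(xi)                         # unity gain: tolerable frames untouched
    gain[intolerable] = xi[intolerable] / (1.0 + xi[intolerable])
    return spectra * gain
```

Setting the threshold to a value that no ξ can reach from below (for example 0) disables the gate, which is how the ungated comparison of FIG. 5 would be reproduced with the same function.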

Claims

1. A noise tolerance judgment method for noisy speech, characterized by comprising two parts, threshold determination and threshold-based judgment, and specifically comprising the following steps:

(1) determining the threshold:

(1.1) recording several segments of clean speech signals;

(1.2) extracting several scene noise samples from each of four types of noise signal in the Noisex-92 noise library, namely impulse noise, broadband noise, periodic noise and speech interference;

(1.3) adding each noise sample to each clean speech signal recorded in step (1.1) to form various noisy speech signals, and enhancing each noisy speech signal at different signal-to-noise ratios within the range 0 dB to 20 dB;

(1.4) scoring each noisy speech signal by MOS before and after speech enhancement;

(1.5) taking the same noisy speech at different signal-to-noise ratios as a group, finding the signals whose MOS scores before and after speech enhancement converge, and taking the one with the lowest signal-to-noise ratio as the sample signal of that noisy speech;

(1.6) framing and pre-emphasizing the noisy speech sample, performing endpoint detection on it, and estimating the power spectra of the noise and of the noisy speech separately;

(1.7) calculating the a priori signal-to-noise ratio of each frequency bin in each frame of each noisy speech sample according to formula (1), where:

ξ(n, k) is the a priori signal-to-noise ratio of the kth frequency bin of the nth frame;

the noisy speech power at the kth frequency bin of the nth frame is one measured quantity;

the noise power at the kth frequency bin of the nth frame is the other measured quantity; α takes the value 0.98;

finding the minimum a priori signal-to-noise ratio of each noisy speech sample and comparing the minima of the samples to determine the noise tolerance threshold for noisy speech, the threshold determined by these steps being 0.95 to 1.05;

(2) judging whether the noise of a noisy speech signal is tolerable:

(2.1) framing and pre-emphasizing the noisy speech signal;

(2.2) performing endpoint detection on the noisy speech signal;

(2.3) estimating the power spectra of the noise and of the noisy speech separately;

(2.4) calculating the a priori signal-to-noise ratio of each frequency bin in each frame of the noisy speech according to formula (1);

(2.5) selecting a threshold within the range 0.95 to 1.05; if the a priori signal-to-noise ratio of any frequency bin of a noisy speech frame is below the selected threshold, the noise of that frame is considered intolerable and speech enhancement is required; otherwise, the noise is considered tolerable and no enhancement is applied.