CN114913876B

CN114913876B - Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism

Info

Publication number: CN114913876B
Application number: CN202210508346.7A
Authority: CN
Inventors: 李晔; 沈自强; 白全民; 张存阳; 王金颖
Original assignee: Shandong Institute Of Science And Technology Development Strategy
Current assignee: Shandong Institute Of Science And Technology Development Strategy
Priority date: 2022-05-11
Filing date: 2022-05-11
Publication date: 2024-08-20
Anticipated expiration: 2042-05-11
Also published as: CN114913876A

Abstract

The invention discloses a noisy speech endpoint detection method based on multi-resolution KL divergence and a voting mechanism, comprising the following steps: framing the sampled speech signal samples in time order; dividing the frequency band of the current frame at different granularities to obtain a plurality of sub-bands at each granularity; calculating the KL divergence of the sub-band energy distribution at each granularity, and comparing the KL divergence at each granularity with a first threshold to obtain a discrimination value for each granularity; and obtaining a comprehensive value from the discrimination values of the different granularities, and obtaining a speech/noise decision by comparing the comprehensive value with a second threshold. The method effectively addresses the misjudgment of noise as speech caused by fluctuation of the noise spectrum across adjacent sub-bands, thereby improving the accuracy of speech endpoint detection.

Description

Noisy speech end point detection method based on multi-resolution KL divergence and voting mechanism

Technical Field

The invention relates to the technical field of voice signal processing, in particular to a noisy voice endpoint detection method based on multi-resolution KL divergence and a voting mechanism.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

Speech endpoint detection is widely used in speech signal processing. For example, in variable-rate speech coding systems, endpoint detection distinguishes speech segments from non-speech segments, so that fewer bits are spent on non-speech segments and the average coding rate is reduced. In speech noise reduction systems, endpoint detection distinguishes speech segments from non-speech segments so that the noise characteristics can be estimated and updated during non-speech segments, giving a better noise reduction effect. The technique is also used in speech recognition, echo cancellation, voice wake-up, and related fields.

However, background noise degrades the accuracy of speech endpoint detection, and endpoint detection for noisy speech remains both a difficulty and an active topic of research. Techniques such as frequency-domain feature methods have been proposed to improve detection. Endpoint detection based on frequency-domain features makes better use of the differences between the spectral statistics of noise and of speech, and achieves good detection results; methods based on spectral entropy are typical representatives of this class of algorithms.

However, the robustness of spectral-entropy endpoint detection still needs improvement. In speech corrupted by noise such as white noise or pink noise, the noise spectrum is relatively stable but still varies to some degree, for example through fluctuation of the energy distribution across adjacent sub-bands, and a single fixed sub-band division can then cause noise to be misjudged as speech. As a further example, under a single fixed sub-band division, the spectral entropy of an individual speech frame may be very close to that of a noise frame, so that the speech frame is misjudged as noise. Both problems directly degrade speech endpoint detection.

Disclosure of Invention

In order to solve the problems, the invention provides a noisy speech endpoint detection method based on multi-resolution KL divergence and voting mechanism, which effectively solves the problem that noise caused by fluctuation of a noise spectrum in a nearby subband is misjudged as speech, thereby effectively improving the accuracy of speech endpoint detection.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect, the present invention provides a noisy speech endpoint detection method based on multi-resolution KL divergence and a voting mechanism, including:

framing sampled speech signal samples in time order;

dividing the frequency band of the current frame at different granularities to obtain a plurality of sub-bands at each granularity;

calculating the KL divergence of the sub-band energy distribution at each granularity, and comparing the KL divergence at each granularity with a first threshold to obtain a discrimination value for each granularity;

and obtaining a comprehensive value by a voting mechanism from the discrimination values of the different granularities, and obtaining a speech/noise decision by comparing the comprehensive value with a second threshold.

As an alternative implementation, the first N frames of speech signal samples are taken as a noise segment; if the current frame number is smaller than N, the spectrum of the noise segment is updated from the spectrum of the current frame.

In this update, the quantity being updated is the energy of the j-th sub-band at the k-th granularity of the noise to be estimated, and it is updated with the energy of the j-th sub-band at the k-th granularity of the current frame i.

As an alternative embodiment, the KL divergence is computed between the sub-band energy distribution of the noise to be estimated and that of the current frame, where KL_k is the KL divergence of the sub-band energy distribution at the k-th granularity and L is the number of sub-bands at the k-th granularity. The quantities involved are the energy of the m-th sub-band and the energy of the j-th sub-band at the k-th granularity of the noise to be estimated, and the energy of the m-th sub-band and the energy of the j-th sub-band at the k-th granularity of the current frame i.
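The symbol definitions above suggest a KL divergence between two normalized sub-band energy distributions, with m indexing the terms of the divergence and j indexing the normalizing sums. The following is a sketch only: the symbols E_N^k (noise estimate) and E_i^k (current frame i) and the direction of the divergence (noise relative to frame) are assumptions, not notation confirmed by this text.

```latex
KL_k = \sum_{m=1}^{L}
       \frac{E_N^k(m)}{\sum_{j=1}^{L} E_N^k(j)}
       \log \frac{E_N^k(m) \big/ \sum_{j=1}^{L} E_N^k(j)}
                 {E_i^k(m) \big/ \sum_{j=1}^{L} E_i^k(j)}
```

Under this reading, KL_k is zero when the current frame has the same relative energy distribution as the noise estimate and grows as the two distributions diverge, whatever the overall energy of the frame.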

As an alternative implementation manner, comparing the KL divergence of each granularity with a first threshold value, and if the KL divergence is greater than the first threshold value, determining that the current frame is a voice frame and the discrimination value is 1; otherwise, the current frame is noise, and the discrimination value is zero.

As an alternative implementation, a voting mechanism is used to obtain a comprehensive value from the discrimination values of the different granularities, where SN_total is the comprehensive value, w_k is the number of sub-bands at the k-th granularity, SN_k(i) is the discrimination value, and KN is the total number of granularity categories.
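A weighted vote consistent with these definitions can be written as a weighted sum of the per-granularity discrimination values. This is a sketch under the assumption that the sum is not normalized; dividing by the sum of the weights would only rescale the second threshold.

```latex
SN\_total = \sum_{k=1}^{KN} w_k \, SN_k(i)
```

For instance, with weights 64, 32, 16, 8 and 4, if only the two finest granularities vote "speech", the sum is 64 + 32 = 96 out of a possible 124.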

As an alternative implementation, the current frame is a speech frame if the comprehensive value is greater than the second threshold, and is a noise frame otherwise.

As an alternative embodiment, after a fast Fourier transform is performed on the current frame, the frequency band is divided at different granularities.

In a second aspect, the present invention provides a noisy speech endpoint detection system based on a multi-resolution KL divergence and voting mechanism, comprising:

the framing module is configured to frame the sampled voice signal samples according to the time sequence;

the sub-band dividing module is configured to divide the frequency band of the current frame according to different granularities, and a plurality of sub-bands are obtained under each granularity;

the first judging module is configured to calculate the KL divergence of the sub-band energy distribution under different granularities respectively, and compare the KL divergence of each granularity with a first threshold value to obtain a judging value of each granularity;

the second judging module is configured to obtain a comprehensive value by a voting mechanism from the discrimination values of the different granularities, and to obtain a speech/noise decision by comparing the comprehensive value with a second threshold.

In a third aspect, the invention provides an electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of the first aspect.

In a fourth aspect, the present invention provides a computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides noisy speech endpoint detection based on multi-resolution KL divergence and a voting mechanism. The input speech frame is subjected to short-time frequency analysis and divided into sub-bands at multiple resolutions, that is, at different granularities; the KL divergence is calculated for each sub-band division and used for endpoint detection, giving a noise/speech decision at each granularity. A voting mechanism then weights these decisions by resolution, with a higher resolution receiving a larger weight, and combines the noise/speech decisions from the several granularities into the final noise/speech decision. This effectively addresses the misjudgment of noise as speech caused by fluctuation of the noise spectrum across adjacent sub-bands, thereby improving the accuracy of speech endpoint detection.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

FIG. 1 is a flow chart of a conventional speech end point detection method based on spectral entropy;

FIG. 2 is a flowchart of the noisy speech endpoint detection method according to embodiment 1 of the present invention.

Detailed Description

The invention is further described below with reference to the drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms also are intended to include the plural forms, and furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions, such as, for example, processes, methods, systems, products or devices that comprise a series of steps or units, are not necessarily limited to those steps or units that are expressly listed, but may include other steps or units that are not expressly listed or inherent to such processes, methods, products or devices.

Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

Example 1

Fig. 1 is a flowchart of a conventional speech endpoint detection method based on spectral entropy, specifically:

(1) Framing input speech signal samples according to a time sequence;

(2) Performing fast Fourier transform on each frame of signal, and dividing the signal into N sub-bands;

(3) Calculating the energy E_i of each sub-band, and calculating the spectral entropy H from the proportion of each sub-band energy E_i in the total energy E_total of the speech frame:

$$H = -\sum_{i=1}^{N} \frac{E_i}{E_{total}} \log \frac{E_i}{E_{total}}$$

(4) Calculating the absolute difference H_err = |H - H_N| between the spectral entropy H of the current frame and the spectral entropy H_N of the noise frames, where H_N is obtained statistically from a number of input frames;

(5) Comparing the absolute spectral entropy difference H_err with an empirical threshold T to decide between speech frame and non-speech frame: if H_err is greater than T, the current frame is a speech frame; otherwise, the current frame is noise.
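Steps (3) to (5) of the conventional method can be sketched as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and the threshold used in the test is an arbitrary example value.

```python
import math


def spectral_entropy(band_energies):
    """Shannon entropy of the normalized sub-band energy distribution."""
    total = sum(band_energies)
    if total <= 0:
        return 0.0  # silent frame: no distribution to measure
    entropy = 0.0
    for energy in band_energies:
        p = energy / total
        if p > 0:  # 0 * log(0) is taken as 0
            entropy -= p * math.log(p)
    return entropy


def is_speech_by_entropy(h_frame, h_noise, threshold):
    """Step (5): speech if |H - H_N| exceeds the empirical threshold T."""
    return abs(h_frame - h_noise) > threshold
```

A flat spectrum gives the maximum entropy (log N for N sub-bands), while energy concentrated in one sub-band gives zero, which is why a frame whose entropy departs from the noise entropy is treated as speech.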

The robustness of the method for endpoint detection by using the spectral entropy still needs to be further improved, and the problem of misjudgment exists.

Therefore, this embodiment proposes a noisy speech endpoint detection method based on a multi-resolution KL divergence (Kullback-Leibler divergence) and a weighted voting mechanism, as shown in fig. 2, specifically including:

framing sampled voice signal samples according to time sequence;

In this embodiment, the speech signal is sampled at 8 kHz and high-pass filtered to remove power-line interference, giving the speech signal samples, which are then framed in time order.

For example, and without limitation, every 32 ms, that is, every 256 speech signal samples, forms one frame.
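The framing step can be sketched as below. The text gives only the frame length (256 samples at 8 kHz); that frames do not overlap and that an incomplete final frame is dropped are assumptions of this sketch, and `frame_signal` is a hypothetical name.

```python
def frame_signal(samples, frame_len=256):
    """Split samples into consecutive, non-overlapping frames of frame_len.

    Assumption: a trailing partial frame is discarded.
    """
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
```

At 8 kHz, one second of signal (8000 samples) yields 31 full frames under these assumptions.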

In this embodiment, after a fast Fourier transform is performed on the current frame, the frequency band is divided at different granularities, and a plurality of sub-bands is obtained at each granularity;

In this embodiment, a granularity set K = {2, 4, 8, 16, 32} is used, that is, five sub-band division granularities of 2, 4, 8, 16 and 32 are set, where each value in the set is the number of Fourier transform frequency bins contained in each sub-band, and the total number of granularity categories is denoted KN.

As an alternative implementation, a 256-point fast Fourier transform is applied to the current frame, the first 128 values are selected for band division, and the current frame is divided into 64, 32, 16, 8 and 4 sub-bands at the five granularity levels 2, 4, 8, 16 and 32, respectively.
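The multi-resolution division can be sketched as follows. The text does not say how sub-band energy is computed, so taking it as the sum of squared spectral magnitudes over the bins of the sub-band is an assumption. A naive DFT is used only to keep the sketch dependency-free; an FFT routine such as `numpy.fft.rfft` would be the usual choice. Function names are hypothetical.

```python
import cmath
import math

GRANULARITIES = (2, 4, 8, 16, 32)  # frequency bins per sub-band, from the embodiment


def power_spectrum(frame, n_bins=128):
    """Squared DFT magnitudes of the first n_bins bins (naive O(n^2) DFT)."""
    n = len(frame)
    spectrum = []
    for b in range(n_bins):
        acc = sum(x * cmath.exp(-2j * math.pi * b * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def subband_energies(power, granularities=GRANULARITIES):
    """For each granularity k, sum every k consecutive bins into one sub-band."""
    return {k: [sum(power[s:s + k]) for s in range(0, len(power), k)]
            for k in granularities}
```

With 128 bins this gives 64, 32, 16, 8 and 4 sub-bands for the five granularities, matching the counts stated above.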

This embodiment takes the first N frames of the speech signal as a noise segment, for example N = 20, without being limited thereto;

The frame number i is initialized to 0. If the current frame number i is less than N, the spectrum of the noise segment is updated from the spectrum of the current frame, i is incremented to i + 1, and the next frame is framed and divided into sub-bands; this repeats until the current frame number i is greater than or equal to N.

In the spectrum update of the noise segment, the quantity being updated is the energy of the j-th sub-band at the k-th granularity of the background noise to be estimated, whose initial value is 0, and it is updated with the energy of the j-th sub-band at the k-th granularity of the current frame i.
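The exact update rule is not given in this text. One plausible form, shown purely as an assumption, is a running mean of the sub-band energies over the first N frames, which is consistent with an initial value of 0 and a frame counter starting at 0. The function name and the dictionary layout (granularity mapped to a list of sub-band energies) are hypothetical.

```python
def update_noise(noise, frame_energies, i):
    """Fold frame i (0-based) into a running mean of sub-band energies.

    Assumption: the noise estimate is the mean over frames 0..i; the source
    does not confirm this rule.
    """
    updated = {}
    for k, energies in frame_energies.items():
        previous = noise.get(k, [0.0] * len(energies))  # initial value is 0
        updated[k] = [(n * i + e) / (i + 1) for n, e in zip(previous, energies)]
    return updated
```

Calling this for i = 0, 1, ..., N - 1 leaves `noise` holding the average sub-band energy of the N leading frames at every granularity.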

Otherwise, the KL divergence of the sub-band energy distribution is calculated at each granularity, where KL_k is the KL divergence of the sub-band energy distribution at the k-th granularity and L is the number of sub-bands at the k-th granularity. It is computed from the energy of the m-th sub-band at the k-th granularity of the background noise to be estimated and the energy of the m-th sub-band at the k-th granularity of the current frame.
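A minimal sketch of the per-granularity KL divergence follows. It assumes that each set of sub-band energies is normalized to sum to 1 and that the divergence is taken from the first argument to the second; the text does not fix the direction. The small `eps` floor is an added safeguard against empty sub-bands, not part of the described method, and the function name is hypothetical.

```python
import math


def kl_divergence(ref_energies, test_energies, eps=1e-12):
    """KL divergence D(ref || test) between normalized sub-band energies."""
    ref_total = sum(ref_energies) + eps * len(ref_energies)
    test_total = sum(test_energies) + eps * len(test_energies)
    kl = 0.0
    for r, t in zip(ref_energies, test_energies):
        p = (r + eps) / ref_total   # noise-side probability of this sub-band
        q = (t + eps) / test_total  # frame-side probability of this sub-band
        kl += p * math.log(p / q)
    return kl
```

Because both inputs are normalized, scaling the frame energy up or down leaves the result unchanged; only the shape of the distribution across sub-bands matters.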

In this embodiment, the KL divergence at each granularity is compared with a first threshold T₁: if KL_k > T₁, the current frame is a speech frame at the k-th granularity and the discrimination value SN_k(i) = 1 is recorded; otherwise, the current frame is environmental noise and SN_k(i) = 0 is recorded. T₁ is an empirical statistical value obtained offline.

In this embodiment, the discrimination values of the different granularities are combined by a weighted voting mechanism, in which a finer granularity has more sub-bands and therefore a larger weight, to obtain the comprehensive value SN_total, where w_k is the number of sub-bands at the k-th granularity.

In this embodiment, the comprehensive value is compared with a second threshold T₂: if SN_total > T₂, the current frame is a speech frame; otherwise it is a noise frame. T₂ is an empirical statistical value obtained offline. Each frame of the speech signal is processed in this way in turn until the whole speech signal has been processed.
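The two-threshold decision can be sketched as below. The weights are the sub-band counts of the embodiment (64, 32, 16, 8, 4); treating SN_total as an unnormalized weighted sum is an assumption, and the thresholds in the test are arbitrary illustrative values, since T₁ and T₂ are empirical and not given. Function names are hypothetical.

```python
WEIGHTS = {2: 64, 4: 32, 8: 16, 16: 8, 32: 4}  # w_k: sub-bands per granularity


def granularity_votes(kl_values, t1):
    """SN_k(i): 1 if the KL divergence at granularity k exceeds T1, else 0."""
    return {k: 1 if kl > t1 else 0 for k, kl in kl_values.items()}


def weighted_vote(votes, weights=WEIGHTS):
    """SN_total as a weighted sum of the per-granularity votes (assumed form)."""
    return sum(weights[k] * vote for k, vote in votes.items())


def is_speech(kl_values, t1, t2, weights=WEIGHTS):
    """Final decision: speech if SN_total exceeds T2, otherwise noise."""
    return weighted_vote(granularity_votes(kl_values, t1), weights) > t2
```

Because finer granularities carry larger weights, a frame flagged only at the coarsest divisions contributes little to SN_total, whereas agreement among the fine divisions can carry the vote on its own.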

The noisy speech endpoint detection method of this embodiment improves the accuracy of speech endpoint detection in noisy environments. Compared with spectral entropy, the KL divergence describes the difference between spectral energy distributions more effectively and is more robust across signal-to-noise ratios. Because a conventional KL divergence on its own is not robust to fluctuation of the noise spectrum energy between sub-bands, this embodiment additionally uses a voting mechanism and derives a comprehensive discrimination value from the KL divergences at the different granularities. The method can serve the speech endpoint detection needs of many speech signal processing applications in noisy environments, such as variable-rate speech compression coding, speech enhancement, speech noise reduction, echo cancellation, speech recognition, and voice wake-up.

Example 2

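The embodiment provides a noisy speech endpoint detection system based on multi-resolution KL divergence and a voting mechanism, comprising:

a framing module, configured to frame the sampled speech signal samples in time order;

a sub-band dividing module, configured to divide the frequency band of the current frame at different granularities to obtain a plurality of sub-bands at each granularity;

a first judging module, configured to calculate the KL divergence of the sub-band energy distribution at each granularity, and to compare the KL divergence at each granularity with a first threshold to obtain a discrimination value for each granularity;

a second judging module, configured to obtain a comprehensive value by a voting mechanism from the discrimination values of the different granularities, and to obtain a speech/noise decision by comparing the comprehensive value with a second threshold.

It should be noted that the above modules correspond to the steps described in embodiment 1; the examples and application scenarios implemented by the modules are the same as those of the corresponding steps, but are not limited to what is disclosed in embodiment 1. The modules may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.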

In further embodiments, there is also provided:

An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method described in embodiment 1. For brevity, the description is omitted here.

It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.

A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method described in embodiment 1.

The method in embodiment 1 may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory; the processor reads the information in the memory and performs the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is not provided here.

Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims

1. The noisy speech end point detection method based on the multi-resolution KL divergence and voting mechanism is characterized by comprising the following steps:

framing sampled voice signal samples according to time sequence;

Obtaining a comprehensive value by adopting a voting mechanism according to the judgment values of different granularities, and obtaining a judgment result of voice and noise according to the comparison of the comprehensive value and a second threshold value;

Setting the first N frames of voice signal sample points as noise segments, and if the current frame number is smaller than N, updating the frequency spectrum of the noise segments according to the frequency spectrum of the current frame, wherein the method specifically comprises the following steps:

；

wherein the quantity being updated is the energy of the j-th sub-band at the k-th granularity of the noise to be estimated, and it is updated with the energy of the j-th sub-band at the k-th granularity of the current frame i;

The KL divergence is:

；

wherein KL_k is the KL divergence of the sub-band energy distribution at the k-th granularity, and L is the number of sub-bands at the k-th granularity; the divergence is computed from the energy of the m-th sub-band at the k-th granularity of the noise to be estimated and the energy of the m-th sub-band at the k-th granularity of the current frame i;

Comparing the KL divergence of each granularity with a first threshold value, and if the KL divergence is larger than the first threshold value, judging that the current frame is a voice frame and the judgment value is 1; otherwise, the current frame is noise, and the discrimination value is zero;

the comprehensive value is obtained by adopting a voting mechanism according to the judgment values of different granularities, and is specifically as follows:

；

wherein SN_total is the comprehensive value, w_k is the number of sub-bands at the k-th granularity, SN_k(i) is the discrimination value, and KN is the total number of granularity categories.

2. The method for detecting a noisy speech endpoint based on the multi-resolution KL divergence and voting mechanism according to claim 1, wherein, based on the comparison of the comprehensive value with the second threshold, the current frame is a speech frame if the comprehensive value is greater than the second threshold, and is a noise frame otherwise.

3. The method for detecting a noisy speech endpoint based on the multi-resolution KL divergence and voting mechanism according to claim 1, wherein the current frame is subjected to a fast Fourier transform and then to frequency band division at different granularities.

4. A noisy speech endpoint detection system based on a multi-resolution KL-divergence and voting mechanism, characterized in that it is configured to perform the noisy speech endpoint detection method based on a multi-resolution KL-divergence and voting mechanism as defined in any one of claims 1 to 3, and that it comprises:

5. An electronic device comprising a memory and a processor and computer instructions stored on the memory and running on the processor, which when executed by the processor, perform the method of any one of claims 1-3.

6. A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of any of claims 1-3.