CN110047519A

CN110047519A - Voice endpoint detection method, device and equipment

Info

Publication number: CN110047519A
Application number: CN201910311947.7A
Authority: CN
Inventors: 张承云; 梁龙腾
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2019-04-16
Filing date: 2019-04-16
Publication date: 2019-07-23
Anticipated expiration: 2039-04-16
Also published as: CN110047519B

Abstract

The invention discloses a voice endpoint detection method, comprising: filtering and framing a received voice signal to obtain a primary signal; calculating the energy and the frequency spectrum of each frame of the primary signal; constructing a weighting factor from the energy and using it to spectrally weight the frequency spectrum, obtaining a secondary signal; calculating the power spectrum and the spectral energy sum of each frame of the secondary signal; calculating the short-time spectral entropy of each frame of the secondary signal from the power spectrum and the spectral energy sum; and using the average of the reciprocals of the short-time spectral entropy values of several frames as the detection threshold of the voice endpoint to distinguish speech frames from noise frames. The voice endpoint detection method provided by the invention is suitable for noise types whose power spectrum distribution is relatively concentrated, and improves the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, device and equipment

Technical Field

The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for detecting a speech endpoint.

Background

Voice endpoint detection is a technique used in speech front-end processing. An endpoint detection algorithm extracts the speech segments from a noisy signal, providing useful information for downstream algorithms and technologies such as sound source localization, speech enhancement, speech recognition and speech coding. Prior-art voice endpoint detection methods mainly comprise two steps: speech signal feature extraction and speech signal detection. First, the features of the voice signal are extracted by one of various algorithms so that the sound signal can be distinguished from the noise signal; the extracted speech signal is then examined by one of various detection methods. Feature extraction is the core of voice endpoint detection and determines the accuracy of the final result.

In terms of processing domain, voice endpoint detection is mainly performed in the frequency domain. Frequency-domain endpoint detection based on the spectral entropy method distinguishes signals by exploiting the fact that voice signals and noise signals have different spectral entropies: endpoints are detected by measuring the flatness of the power spectrum, which requires computing the spectral entropy from the spectral probability density function (PDF). When the power spectrum of a signal is relatively flat or uniform, the signal tends toward an equiprobable distribution, so the entropy function takes a larger value and its reciprocal a smaller value; conversely, when the power spectrum is more concentrated or uneven, the entropy function takes a smaller value and its reciprocal a larger value. Because the voice signal has a formant structure, its power spectrum is concentrated and uneven, so its spectral entropy is low and the reciprocal is large; the power spectrum of noise signals (white noise, pink noise and the like) is relatively spread out, so their spectral entropy is high and the reciprocal is small. Voice signals and noise signals can therefore be distinguished.
Endpoint detection based on the spectral entropy method is only weakly affected by the energy of the sound signal and is therefore reasonably robust to noise. However, in real noisy environments such as a restaurant or a subway, which are full of babble noise, vehicle noise and the like, both the noise signal and the voice signal have relatively concentrated power spectra, so a spectral-entropy-based method has difficulty estimating the voice endpoints accurately.
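The relationship between spectral flatness, entropy and its reciprocal can be illustrated with a minimal sketch. The two 8-bin power spectra below are hypothetical values chosen only for illustration, and the natural logarithm is an assumption, since the text does not state the base of the logarithm.

```python
import math

def spectral_entropy(power_spectrum):
    """Shannon entropy (natural log, an assumption) of a normalized power spectrum."""
    total = sum(power_spectrum)
    probabilities = [s / total for s in power_spectrum]
    # Bins with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

# Hypothetical spectra, for illustration only.
flat_spectrum = [1.0] * 8               # noise-like: power spread evenly
peaked_spectrum = [0.01] * 7 + [10.0]   # speech-like: one dominant peak

h_flat = spectral_entropy(flat_spectrum)
h_peaked = spectral_entropy(peaked_spectrum)
print("flat:   H =", h_flat, " 1/H =", 1.0 / h_flat)
print("peaked: H =", h_peaked, " 1/H =", 1.0 / h_peaked)
```

The flat spectrum reaches the maximum entropy for 8 bins, while the peaked spectrum gives a much smaller entropy and hence a much larger reciprocal, which is the property the spectral entropy method relies on.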

Disclosure of Invention

The invention provides a voice endpoint detection method that aims to solve the technical problem that prior-art voice endpoint detection methods have difficulty estimating endpoints accurately under noise with a concentrated power spectrum distribution; the invention is suitable for noise types with relatively concentrated power spectrum distribution and improves the accuracy of voice endpoint detection.

In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection method, including:

filtering and framing the received voice signal to obtain a primary signal;

calculating the energy and frequency spectrum of the primary signal of each frame;

constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;

calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;

calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;

and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.

As a preferred scheme, the average value of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to judge speech frames and noise frames, and specifically:

comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;

when the short-term spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-term spectrum entropy value is a speech frame;

and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.

As a preferred scheme, the calculating the energy and the frequency spectrum of the primary signal per frame specifically includes:

calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;

calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;

wherein n = 1, 2, 3, …, N; the primary signal is x(n, m), n = 1, 2, 3, …, N, m = 1, 2, 3, …, M, where N is the number of frames and M is the frame length;

X(n, l) = fft(x(n, m)), where fft is the fast Fourier transform and l is the frequency.
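A minimal sketch of the per-frame energy and spectrum computation follows. The energy formula is not reproduced in the text above, so the sum of squared samples used here is an assumption (it is the usual short-time energy). The naive `dft` helper is a hypothetical stand-in that returns the same values an FFT routine would, and the two example frames are made up for illustration.

```python
import cmath

def frame_energy(frame):
    """Short-time energy E(n), assumed here to be the sum of squared samples."""
    return sum(sample * sample for sample in frame)

def dft(frame):
    """Naive O(M^2) discrete Fourier transform; same result as an FFT of the frame."""
    size = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * l * m / size) for m in range(size))
        for l in range(size)
    ]

# Hypothetical frames x(n, m) with M = 4 samples each.
frames = [
    [0.0, 1.0, 0.0, -1.0],   # one cycle of a sinusoid
    [0.1, 0.1, 0.1, 0.1],    # constant frame (DC only)
]
energies = [frame_energy(f) for f in frames]   # E(n)
spectra = [dft(f) for f in frames]             # X(n, l)

for n, (energy, spectrum) in enumerate(zip(energies, spectra), start=1):
    print(n, energy, [round(abs(x), 6) for x in spectrum])
```

In practice a library FFT would replace `dft`; the naive form is used only to keep the sketch free of external dependencies.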

Preferably, the method is characterized in that a weighting factor is constructed according to the energy, and the spectrum is subjected to spectrum weighting by using the weighting factor to obtain a secondary signal, specifically:

normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);

carrying out spectrum weighting on the frequency spectrum X(n, l) of each frame of the primary signal by using the weighting factor e(n) to obtain each frame of the secondary signal X_g(n, l);

wherein e(n) is the weighting factor, e(n) = 1 - E_g(n), E_g(n) = E(n)/max(E(n));

X_g(n, l) = X(n, l)./|X(n, l)|^e(n).
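The weighting step can be sketched as follows. The formulas for e(n) and X_g(n, l) are taken from the text; the `floor` guard against empty bins and the example values are hypothetical additions, since the text does not say how zero-magnitude bins are handled.

```python
def weighting_factors(energies):
    """e(n) = 1 - E(n) / max(E(n)): 0 for the loudest frame, close to 1 for quiet ones."""
    peak = max(energies)
    return [1.0 - energy / peak for energy in energies]

def weight_spectrum(spectrum, exponent, floor=1e-12):
    """X_g(n, l) = X(n, l) / |X(n, l)|**e(n), applied bin by bin.

    `floor` is a hypothetical guard that avoids dividing by zero in empty bins.
    """
    return [x / max(abs(x), floor) ** exponent for x in spectrum]

# Hypothetical frame energies and one example spectrum reused for all three factors.
energies = [8.0, 2.0, 0.0]
spectrum = [4.0 + 0.0j, 0.0 + 2.0j, 1.0 + 0.0j]

factors = weighting_factors(energies)
weighted = [weight_spectrum(spectrum, e) for e in factors]
magnitudes = [[abs(x) for x in row] for row in weighted]

for e, row in zip(factors, magnitudes):
    print("e(n) =", e, "-> |X_g| =", [round(v, 4) for v in row])
```

With e(n) = 0 the spectrum is left unchanged, and with e(n) = 1 every non-empty bin is scaled to unit magnitude, so the exponent sets how strongly a frame is whitened.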

As a preferred scheme, the calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal specifically includes:

calculating a power spectrum module value S (n, l) and a spectrum energy sum Y (n) of each frame of the secondary signal;

wherein S(n, l) = |X_g(n, l).*X_g(n, l)|, and L is the length of the Fourier transform;

as a preferred scheme, the calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy specifically includes:

calculating a spectral probability density function P (n, l) of each frame of the secondary signal according to the power spectrum modulus S (n, l) and the spectrum energy sum Y (n);

calculating a short-time spectrum entropy value H (n) of each frame of the secondary signal according to a spectrum probability density function P (n, l) of each frame of the secondary signal;

wherein P(n, l) = S(n, l)/Y(n);

taking the average of the reciprocals of the spectral entropy values of the first Z consecutive frames, out of the N frames of spectral entropy values, as the detection threshold K of the voice endpoint;

wherein Z << N and J(n) = 1/H(n).
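The entropy and threshold steps can be sketched as below, under several assumptions that are not confirmed by the text: Y(n) is taken as the sum of S(n, l) over all bins, H(n) is taken as the Shannon entropy with a natural logarithm, and the frame decision compares the reciprocal J(n) with K, which is one reading of how a threshold built from reciprocals would be applied. The example power spectra are hypothetical.

```python
import math

def spectral_entropy(power_spectrum):
    """H(n), assumed to be the Shannon entropy of P(n, l) = S(n, l) / Y(n)."""
    total = sum(power_spectrum)                  # Y(n), assumed sum over all bins
    pdf = [s / total for s in power_spectrum]    # P(n, l)
    return -sum(p * math.log(p) for p in pdf if p > 0.0)

def classify_frames(power_spectra, z):
    """Return (K, J, flags); a frame is flagged as speech when J(n) > K (assumed rule)."""
    reciprocals = [1.0 / spectral_entropy(s) for s in power_spectra]   # J(n) = 1/H(n)
    threshold = sum(reciprocals[:z]) / z                               # K from first Z frames
    return threshold, reciprocals, [j > threshold for j in reciprocals]

# Hypothetical power spectra S(n, l): two flat leading frames, one peaked, one flat.
power_spectra = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.1, 0.1, 9.0, 0.1],
    [1.0, 1.0, 1.0, 1.0],
]
threshold, reciprocals, is_speech = classify_frames(power_spectra, z=2)
print(threshold, is_speech)
```

Because K is estimated from the first Z frames, this sketch presumably only behaves sensibly when those frames contain noise alone.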

in order to solve the same technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including:

the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;

the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;

the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;

the second calculation module is used for calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;

the third calculation module is used for calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy;

and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.

In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the voice endpoint detection method as described above when executing the computer program.

Compared with the prior art, the embodiment of the invention has the beneficial effects that the embodiment of the invention provides a voice endpoint detection method, which comprises the steps of filtering and framing a received voice signal to obtain a primary signal; calculating the energy and frequency spectrum of the primary signal of each frame; constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal; calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies; and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.

Under noise types with relatively concentrated power spectrum distribution, the frequency spectrum of each frame of the primary signal is spectrally weighted with a weighting factor constructed from the energy calculation result, yielding the secondary signal. This whitens the spectrum of the noise signal to a certain degree and makes its power spectrum distribution flatter and more uniform, which increases the short-time spectral entropy of the noise signal and makes its reciprocal smaller. Meanwhile, the power spectrum of the voice signal is preserved, so the short-time spectral entropy of the voice signal stays small and its reciprocal large. The voice signal and the noise signal can therefore be distinguished, and the accuracy of voice endpoint detection is improved. The energy-based endpoint detection method is integrated into the spectral entropy method, and the energy enters the spectral whitening as an exponent, so that the degree of whitening can be controlled; more accurate endpoint detection can thus be performed under noise types with relatively concentrated power spectrum distribution, and the accuracy of spectral-entropy-based voice endpoint detection is effectively improved.
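The way the exponent controls the degree of whitening can be made explicit with a short derivation from the formulas given above. The last two lines additionally assume that H(n) is the Shannon entropy of P(n, l) and that every bin of the frame is non-zero; neither is stated in the text.

```latex
\begin{align*}
X_g(n,l) &= \frac{X(n,l)}{|X(n,l)|^{e(n)}}, \qquad
  e(n) = 1 - E_g(n), \qquad E_g(n) = \frac{E(n)}{\max_n E(n)} \\
|X_g(n,l)| &= |X(n,l)|^{\,1 - e(n)} = |X(n,l)|^{\,E_g(n)} \\
S(n,l) &= |X_g(n,l) \cdot X_g(n,l)| = |X(n,l)|^{\,2E_g(n)} \\
E_g(n) = 1 &\;\Rightarrow\; S(n,l) = |X(n,l)|^{2}
  \quad \text{(loudest frame: power spectrum unchanged)} \\
E_g(n) \to 0 &\;\Rightarrow\; S(n,l) \to 1 \ \text{for all } l
  \;\Rightarrow\; P(n,l) \to \tfrac{1}{L}, \quad H(n) \to \log L
\end{align*}
```

Low-energy frames are therefore pushed toward a flat spectrum and maximum entropy, while high-energy frames keep their original spectral shape.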

Drawings

FIG. 1 is a flow chart illustrating the steps of a method for detecting a voice endpoint according to the present invention;

FIG. 2 is a schematic flow chart of a voice endpoint detection method provided in the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The first embodiment of the present invention:

Referring to FIG. 1, a first embodiment of the present invention provides a method for detecting a voice endpoint, which at least includes:

S1: filtering and framing the received voice signal to obtain a primary signal;

S2: calculating the energy and frequency spectrum of the primary signal of each frame;

S3: constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;

Here, the energy-based endpoint detection method is integrated into the spectral entropy method, and the energy enters the spectral whitening as an exponent, so that the degree of whitening can be controlled; endpoint detection with the spectral entropy method can thus be carried out accurately under noise types with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is effectively improved.

S4: calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;

S5: calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;

S6: taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.

In this embodiment, under noise types with relatively concentrated power spectrum distribution, the frequency spectrum of each frame of the primary signal is spectrally weighted with a weighting factor constructed from the energy calculation result to obtain the secondary signal. This whitens the spectrum of the noise signal to a certain extent, so that its power spectrum distribution becomes flatter and more uniform, its short-time spectral entropy increases, and the reciprocal of that entropy becomes smaller. The power spectrum of the voice signal is retained, so its short-time spectral entropy is small and the reciprocal is large. The voice signal and the noise signal can therefore be distinguished, and the accuracy of spectral-entropy-based voice endpoint detection is improved.

In the embodiment of the present invention, the average of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to determine a speech frame and a noise frame, specifically:

In this embodiment of the present invention, the calculating the energy and the frequency spectrum of the primary signal per frame specifically includes:

X(n, l) = fft(x(n, m)), where fft is the fast Fourier transform and l is the frequency.

In this embodiment of the present invention, the constructing a weighting factor according to the energy, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal specifically includes:

X_g(n, l) = X(n, l)./|X(n, l)|^e(n).

Therefore, the energy-based endpoint detection method is integrated into the spectral entropy method, and the energy is weighted to spectral whitening through an exponential form, so that the effect of controlling the spectral whitening degree can be achieved, the endpoint detection can be accurately carried out by using the spectral entropy method under the noise type with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is improved.

In this embodiment of the present invention, the calculating a sum of a power spectrum and a spectral energy of each frame of the secondary signal specifically includes:

In this embodiment of the present invention, the calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy specifically includes:

wherein P(n, l) = S(n, l)/Y(n);

wherein Z << N and J(n) = 1/H(n).

Referring to fig. 2, a flow of one possible embodiment of the voice endpoint detection method of the present invention is as follows:

1. receiving a voice signal to be detected by a microphone, and recording the voice signal to be detected as x (t);

2. filtering and framing the received voice signal to obtain a primary signal, recorded as x(n, m), where n = 1, 2, 3, …, N, N is the number of frames, m = 1, 2, 3, …, M, and M is the frame length of each frame;

3. estimating the energy of each frame of the primary signal x(n, m), the energy E(n) of each frame of the primary signal being calculated as follows:

4. normalizing the energy E (n) of the primary signal of each frame to obtain E_g(n) and constructing a weighting factor e (n), wherein the calculation process is as follows:

E_g(n) = E(n)/max(E(n)),

e(n) = 1 - E_g(n);

5. performing a Fourier transform on each frame of the primary signal x(n, m) to obtain the frequency spectrum X(n, l) of each frame of the primary signal, calculated as follows:

X(n, l) = fft(x(n, m)),

wherein fft is the fast Fourier transform and l is the frequency;

6. performing spectrum weighting on the frequency spectrum X(n, l) by using the weighting factor to obtain the secondary signal X_g(n, l), calculated as follows:

X_g(n, l) = X(n, l)./|X(n, l)|^e(n);

7. calculating the power spectrum modulus S(n, l) of each frame of the secondary signal, as follows:

S(n, l) = |X_g(n, l).*X_g(n, l)|;

8. calculating the spectral energy sum Y(n) of each frame of the secondary signal, as follows:

wherein L is the length of the Fourier transform;

9. calculating the spectral probability density function P(n, l) of each frame of the secondary signal, as follows:

P(n, l) = S(n, l)/Y(n);

10. calculating the short-time spectral entropy H(n) of each frame of the secondary signal, as follows:

11. calculating the reciprocal J(n) of the short-time spectral entropy of each frame of the secondary signal, as follows:

J(n) = 1/H(n);

12. taking the average of the reciprocals of the spectral entropy values of the first 20 frames as the detection threshold K, as follows:
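Steps 2 to 12 above can be put together in a minimal end-to-end sketch. Several parts are assumptions rather than the patent's stated implementation: filtering is omitted, framing is non-overlapping and unwindowed, E(n) is the sum of squared samples, Y(n) is the sum of S(n, l) over all bins, H(n) is the Shannon entropy with a natural logarithm, a frame is flagged as speech when J(n) > K, and the `floor` guard and all function names are hypothetical.

```python
import cmath
import math

def split_into_frames(samples, frame_length):
    """Step 2 (framing only): non-overlapping frames; overlap is not specified."""
    count = len(samples) // frame_length
    return [samples[i * frame_length:(i + 1) * frame_length] for i in range(count)]

def dft(frame):
    """Naive DFT standing in for fft(x(n, m))."""
    size = len(frame)
    return [sum(frame[m] * cmath.exp(-2j * cmath.pi * l * m / size) for m in range(size))
            for l in range(size)]

def detect_speech_frames(samples, frame_length, z=20, floor=1e-12):
    frames = split_into_frames(samples, frame_length)
    energies = [sum(s * s for s in f) for f in frames]           # step 3 (assumed form)
    peak = max(energies)
    factors = [1.0 - e / peak for e in energies]                 # step 4: e(n)
    reciprocals = []
    for frame, factor in zip(frames, factors):
        spectrum = dft(frame)                                    # step 5: X(n, l)
        weighted = [x / max(abs(x), floor) ** factor for x in spectrum]  # step 6: X_g
        power = [abs(x * x) for x in weighted]                   # step 7: S(n, l)
        total = sum(power)                                       # step 8: Y(n) (assumed)
        pdf = [s / total for s in power]                         # step 9: P(n, l)
        entropy = -sum(p * math.log(p) for p in pdf if p > 0.0)  # step 10: H(n) (assumed)
        reciprocals.append(1.0 / entropy)                        # step 11: J(n)
    threshold = sum(reciprocals[:z]) / z                         # step 12: K
    return threshold, [j > threshold for j in reciprocals]

# Hypothetical test signal: quiet low-pass "noise" frames around two loud tone frames.
frame_length = 16
noise = [0.01 * 0.5 ** m for m in range(frame_length)]
speech = [math.sin(2.0 * math.pi * 2 * m / frame_length) for m in range(frame_length)]
samples = noise + noise + speech + speech + noise

threshold, is_speech = detect_speech_frames(samples, frame_length, z=2)
print(threshold, is_speech)
```

The example uses z=2 only because the test signal is five frames long; the text uses the first 20 frames. The quiet frames have a spectrum that is concentrated at low frequencies before weighting, yet their weighted spectrum is nearly flat, which is the behavior the method is designed to produce.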

Compared with the prior art, the voice endpoint detection method provided by the embodiment of the invention has the following beneficial effects:

(1) under the noise type with relatively concentrated power spectrum distribution, performing spectrum weighting processing by using a weighting factor constructed by an energy calculation result and the frequency spectrum of each frame of primary signal to obtain a secondary signal, thereby whitening the frequency spectrum of the noise signal to a certain degree, enabling the power spectrum distribution of the noise signal to be flatter and more uniform, further increasing the short-term spectrum entropy value of the noise signal, and enabling the reciprocal of the short-term spectrum entropy value of the noise signal to be a smaller value; meanwhile, a power spectrum of the voice signal is reserved, the short-time spectrum entropy value of the voice signal is small, and the reciprocal of the short-time spectrum entropy value is large; therefore, the voice signal and the noise signal can be distinguished, and the accuracy of voice endpoint detection is improved.

(2) The energy-based endpoint detection method is integrated into the spectral entropy method, and the energy enters the spectral whitening as an exponent, so that the degree of whitening can be controlled; more accurate endpoint detection can thus be performed under noise types with relatively concentrated power spectrum distribution, and the accuracy of spectral-entropy-based voice endpoint detection is effectively improved.

(3) The spectrum whitening technology is utilized to whiten the frequency spectrum of the noise part signal to a certain degree, so that the power spectrum distribution of the noise signal is flatter and more uniform, and the spectrum entropy is increased; the power spectrum of the voice signal is reserved, the spectrum entropy is less, and the spectrum entropy of the voice signal and the spectrum entropy of the noise signal can be distinguished, so that the accuracy of detection under various noises is improved.

(4) An energy-based endpoint detection method is integrated into a spectrum entropy method, the method has the advantage of insensitivity to noise types, and energy is weighted to a spectrum whitening method in an exponential mode, so that the spectrum whitening degree is controlled; the method for weighting the frequency spectrum is combined with the method for weighting the energy to the spectral whitening in an exponential mode, and more accurate endpoint detection can be carried out under various noise types, so that the detection accuracy under various noises is improved.

Second embodiment of the invention:

a second embodiment of the present invention provides a voice endpoint detection apparatus, including:

In an embodiment of the present invention, the determining module is further configured to:

The first computing module is further configured to:

X(n, l) = fft(x(n, m)), where fft is the fast Fourier transform and l is the frequency.

The spectral weighting module is further configured to:

X_g(n, l) = X(n, l)./|X(n, l)|^e(n).

The second computing module is further configured to:

wherein S(n, l) = |X_g(n, l).*X_g(n, l)|, and L is the length of the Fourier transform.

The third computing module is further configured to:

wherein P(n, l) = S(n, l)/Y(n);

The judging module is further configured to:

wherein Z << N and J(n) = 1/H(n).

Third embodiment of the invention:

The third embodiment of the present invention also provides a voice endpoint detection apparatus comprising a processor, a memory, and a computer program, such as a voice endpoint detection program, stored in the memory and configured to be executed by the processor. The processor, when executing the computer program, implements the steps of the voice endpoint detection method as described above, such as step S1 shown in FIG. 1. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units in the above-mentioned device embodiments, such as the judging module.

Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the voice endpoint detection apparatus.

The voice endpoint detection device can be a desktop computer, a notebook computer, a palm computer, an intelligent tablet and other computing devices. The voice endpoint detection device may include, but is not limited to, a processor, a memory. It will be appreciated by those skilled in the art that the above components are merely examples of a voice endpoint detection device and do not constitute a limitation of a voice endpoint detection device and may include more or less components than those described above, or some components in combination, or different components, e.g., the voice endpoint detection device may also include input output devices, network access devices, buses, etc.

The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the voice endpoint detection device and connects the various parts of the entire voice endpoint detection device using various interfaces and lines.

The memory may be used to store the computer programs and/or modules, and the processor may implement the various functions of the voice endpoint detection apparatus by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

Wherein, the integrated module/unit of the voice endpoint detection device can be stored in a computer readable storage medium if the integrated module/unit is implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A voice endpoint detection method is characterized by comprising the following steps:

filtering and framing the received voice signal to obtain a primary signal;

2. The method for detecting a speech endpoint according to claim 1, wherein the average of the reciprocals of the short-term spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:

3. The method for detecting a voice endpoint according to claim 1, wherein the calculating the energy and the spectrum of the primary signal per frame specifically comprises:

X(n, l) = fft(x(n, m)), where fft is the fast Fourier transform and l is the frequency.

4. The method according to claim 3, wherein the step of constructing a weighting factor according to the energy and performing spectral weighting on the spectrum by using the weighting factor to obtain a secondary signal comprises:

X_g(n, l) = X(n, l)./|X(n, l)|^e(n).

5. The method for detecting a voice endpoint according to claim 4, wherein the calculating the sum of the power spectrum and the spectral energy of the secondary signal of each frame is specifically:

6. The method according to claim 5, wherein the calculating a short-term spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy comprises:

wherein P(n, l) = S(n, l)/Y(n);

7. The method for detecting a speech endpoint according to claim 6, wherein the average of the reciprocals of the short-term spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:

wherein Z << N and J(n) = 1/H(n).

8. A voice endpoint detection apparatus, comprising:

9. A voice endpoint detection device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the voice endpoint detection method according to any one of claims 1-7 when executing the computer program.