CN110047519B - Voice endpoint detection method, device and equipment - Google Patents

Publication number
CN110047519B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910311947.7A
Other languages
Chinese (zh)
Other versions
CN110047519A (en)
Inventor
张承云
梁龙腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Abstract

The invention discloses a voice endpoint detection method, which comprises the steps of filtering and framing a received voice signal to obtain a primary signal; calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal; constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; calculating the power spectrum of each frame of the secondary signal, and calculating the sum of spectral energy; calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies; and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame. The voice endpoint detection method provided by the invention can be suitable for noise types with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is improved.

Description

Voice endpoint detection method, device and equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for detecting a speech endpoint.
Background
Voice endpoint detection is a technique used in voice front-end processing. An endpoint detection algorithm extracts the voice segments from a noisy signal, providing useful information for subsequent algorithms and technologies such as sound source localization, voice enhancement, voice recognition, and voice coding. Prior-art voice endpoint detection methods mainly comprise two steps: feature extraction from the voice signal, and detection of the voice signal. First, features of the voice signal are extracted by various algorithms to distinguish the voice signal from the noise signal; the extracted voice signal is then checked by various detection methods. Feature extraction is the core of voice endpoint detection and determines the accuracy of the final result.
In terms of processing domain, voice endpoint detection is mainly performed in the frequency domain. Frequency-domain endpoint detection is based on the spectral entropy method: it exploits the fact that voice signals and noise signals have different spectral entropies, and detects voice endpoints by measuring how flat the power spectrum is, which requires computing the spectral entropy from a spectral probability density function (PDF). When the power spectrum distribution of the signal is relatively flat or uniform, the signal tends toward an equiprobable distribution, the entropy function takes a larger value, and its reciprocal takes a smaller value; conversely, when the power spectrum distribution is more concentrated or uneven, the entropy function takes a smaller value and its reciprocal takes a larger value. Because a voice signal has a formant structure, its power spectrum distribution is concentrated and uneven, so its spectral entropy is low and the reciprocal is large; the power spectrum of noise signals (white noise, pink noise, and the like) is relatively spread out, so the spectral entropy is large and the reciprocal is small. Voice signals and noise signals can therefore be distinguished.
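The relationship between spectral flatness, entropy, and its reciprocal can be illustrated with a small sketch. The two toy spectra, the natural logarithm, and the helper name `spectral_entropy` are assumptions chosen for illustration; they are not taken from the patent.

```python
import numpy as np

def spectral_entropy(power):
    """Shannon entropy (natural log) of a power spectrum normalized to sum to 1."""
    p = np.asarray(power, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

# Hypothetical toy spectra with 64 frequency bins each.
flat = np.ones(64)                    # noise-like: equal power in every bin
peaked = np.full(64, 1e-3)            # voice-like: power concentrated in a few bins
peaked[[5, 12, 20]] = [1.0, 0.6, 0.3]

h_flat = spectral_entropy(flat)
h_peaked = spectral_entropy(peaked)
print(f"flat:   H = {h_flat:.3f}, 1/H = {1 / h_flat:.3f}")
print(f"peaked: H = {h_peaked:.3f}, 1/H = {1 / h_peaked:.3f}")
```

The flat spectrum reaches the maximum entropy log(64), while the peaked spectrum has a lower entropy and therefore a larger reciprocal.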
The endpoint detection method based on spectral entropy is only weakly affected by the energy of the sound signal and is therefore somewhat robust to noise. However, in a real noisy environment such as a restaurant or a subway, filled with babble, vehicle noise, and the like, both the noise signal and the voice signal have relatively concentrated power spectrum distributions, so a spectral-entropy-based voice endpoint detection method has difficulty detecting endpoints accurately.
Disclosure of Invention
The invention provides a voice endpoint detection method, which aims to solve the technical problem that the voice endpoint detection method in the prior art is difficult to accurately estimate under the noise with concentrated power spectrum distribution; the invention can be suitable for noise types with relatively concentrated power spectrum distribution and improve the accuracy of voice endpoint detection.
In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection method, including:
filtering and framing the received voice signal to obtain a primary signal;
calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal;
constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; specifically: normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$; performing spectrum weighting on the frequency spectrum $X(n,l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n,l)$; wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$; $X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$; wherein
[Equation image not available in this text]
the primary signal is $x(n,m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length; $X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and $l$ is the frequency index;
calculating the power spectrum of each frame of the secondary signal, and calculating the sum of spectral energy;
calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;
and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.
As a preferred scheme, the average value of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to judge speech frames and noise frames, and specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-time spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-time spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
As a preferred scheme, the calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal specifically includes:
calculating the short-time amplitude $E(n)$ of each frame of the primary signal by using an energy-based endpoint detection method;
calculating the frequency spectrum $X(n,l)$ of each frame of the primary signal by using the Fourier transform;
wherein the primary signal is $x(n,m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length;
$X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and $l$ is the frequency index.
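A minimal sketch of this step follows. The exact formula for $E(n)$ is not available in this text, so the sum of absolute sample values per frame is used as an assumed definition. The frame length, hop size, sampling rate, and test signal are hypothetical, and the filtering step is omitted.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames; returns an array of shape (N, M)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def short_time_amplitude(frames):
    """E(n): sum of absolute sample values in each frame (assumed definition)."""
    return np.abs(frames).sum(axis=1)

fs = 8000                                          # hypothetical sampling rate in Hz
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
tone = np.where(t > 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
x = tone + 0.01 * rng.standard_normal(fs)          # weak noise, tone in the second half

frames = frame_signal(x, frame_len=256, hop=128)   # x(n, m)
E = short_time_amplitude(frames)                   # E(n)
X = np.fft.fft(frames, axis=1)                     # X(n, l) = fft(x(n, m))
print(frames.shape, E.shape, X.shape)
```

Frames that contain the tone have a much larger `E` than the noise-only frames at the start, which is the property the weighting factor relies on.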
As a preferred scheme, the constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal, specifically includes:
normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$;
performing spectrum weighting on the frequency spectrum $X(n,l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n,l)$;
wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$;
$X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$.
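The weighting step can be sketched as below. The function name `weight_spectrum`, the `eps` guard against zero magnitudes, and the toy input values are assumptions added for illustration.

```python
import numpy as np

def weight_spectrum(X, E, eps=1e-12):
    """Return X_g(n,l) = X(n,l) / |X(n,l)|**e(n) and e(n) = 1 - E(n)/max(E)."""
    E_g = E / E.max()                    # normalized short-time amplitude
    e = 1.0 - E_g                        # weighting factor per frame
    mag = np.maximum(np.abs(X), eps)     # eps avoids 0**e; not part of the source
    return X / mag ** e[:, None], e

# Three hypothetical frames with identical spectra but different amplitudes E(n).
X = np.array([[1 + 0j, 4j, -9, 16]] * 3)
E = np.array([4.0, 1.0, 0.0])
Xg, e = weight_spectrum(X, E)
print(e)
print(np.abs(Xg))
```

The loudest frame (`e = 0`) keeps its spectrum, the silent frame (`e = 1`) ends up with unit magnitude in every bin, and the frame in between is partially flattened. The phase of each bin is unchanged because the division is by a positive real number.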
As a preferred scheme, the calculating the power spectrum of each frame of the secondary signal and calculating the sum of spectral energy specifically includes:
calculating a power spectrum modulus $S(n,l)$ of each frame of the secondary signal, and calculating a spectral energy sum $Y(n)$;
wherein $S(n,l) = |X_g(n,l) \cdot X_g(n,l)|$,
$Y(n) = \sum_{l=1}^{L} S(n,l)$,
where $L$ is the length of the Fourier transform;
As a preferred scheme, the calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of spectral energy specifically includes:
calculating a spectral probability density function $P(n,l)$ of each frame of the secondary signal according to the power spectrum modulus $S(n,l)$ and the spectral energy sum $Y(n)$;
calculating a short-time spectrum entropy value $H(n)$ of each frame of the secondary signal according to the spectral probability density function $P(n,l)$ of each frame of the secondary signal;
wherein $P(n,l) = S(n,l)/Y(n)$;
$H(n) = -\sum_{l=1}^{L} P(n,l) \log P(n,l)$.
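The power spectrum, spectral PDF, and short-time spectral entropy can be sketched together as follows. The Shannon form with a natural logarithm is assumed, since the entropy equation image is not available in this text; the `eps` guard and the two toy frames are also assumptions.

```python
import numpy as np

def short_time_spectral_entropy(Xg, eps=1e-12):
    """H(n) for each row of the weighted spectrum X_g(n, l)."""
    S = np.abs(Xg * Xg)                       # S(n,l) = |X_g(n,l) .* X_g(n,l)|
    Y = S.sum(axis=1, keepdims=True)          # Y(n): spectral energy sum
    P = S / np.maximum(Y, eps)                # P(n,l): spectral PDF
    return -(P * np.log(P + eps)).sum(axis=1) # assumed Shannon entropy, natural log

L = 8                                         # hypothetical transform length
flat = np.ones((1, L), dtype=complex)         # equal magnitude in all bins
peaked = np.zeros((1, L), dtype=complex)      # all energy in one bin
peaked[0, 2] = 3.0
H = short_time_spectral_entropy(np.vstack([flat, peaked]))
print(H)
```

The flat frame gives an entropy close to log(L), the largest possible value, and the single-bin frame gives an entropy close to zero.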
As a preferred scheme, the taking the average value of the reciprocals of the short-time spectrum entropy values of a plurality of frames as the detection threshold of the voice endpoint to judge speech frames and noise frames specifically includes:
taking the average value of the reciprocals of the spectrum entropy values of the first $Z$ consecutive frames among the $N$ frames of spectrum entropy values as the detection threshold $K$ of the voice endpoint;
wherein
$K = \frac{1}{Z} \sum_{n=1}^{Z} J(n)$,
$Z \ll N$, $J(n) = 1/H(n)$.
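A sketch of the threshold and the frame decision is given below. It assumes that the quantity compared with $K$ is the reciprocal $J(n)$, because $K$ is built from reciprocals and voice frames are described as having the larger reciprocal; the text itself words the comparison in terms of the short-time spectrum entropy value. It also assumes the first $Z$ frames contain only noise. The function name and the toy entropy values are hypothetical.

```python
import numpy as np

def classify_frames(H, Z):
    """Flag frame n as speech when J(n) = 1/H(n) exceeds K, the mean of J over the first Z frames."""
    J = 1.0 / np.asarray(H, dtype=float)   # reciprocal of the short-time spectral entropy
    K = J[:Z].mean()                       # detection threshold from the first Z frames
    return J > K, K                        # assumption: the comparison is made on J(n)

# Hypothetical entropies: noise frames near 4.0, two voice frames near 2.0.
H = [4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 4.0]
is_speech, K = classify_frames(H, Z=4)
print(K, is_speech)
```

With a threshold equal to the noise average, noise frames that fluctuate slightly above that average would also be flagged, so a practical implementation might add a margin; that refinement is not described in the source.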
In order to solve the same technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the short-time amplitude and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the short-time amplitude and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; specifically: normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$; performing spectrum weighting on the frequency spectrum $X(n,l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n,l)$; wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$; $X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$; wherein the primary signal is $x(n,m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length; $X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and $l$ is the frequency index;
the second calculation module is used for calculating the power spectrum of each frame of the secondary signal and calculating the sum of spectral energy;
the third calculation module is used for calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
In order to solve the same technical problem, an embodiment of the present invention further provides voice endpoint detection equipment, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the voice endpoint detection method described above when executing the computer program.
Compared with the prior art, the embodiment of the invention has the beneficial effects that the embodiment of the invention provides a voice endpoint detection method, which comprises the steps of filtering and framing a received voice signal to obtain a primary signal; calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal; constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; calculating the power spectrum of each frame of the secondary signal, and calculating the sum of spectral energy; calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies; and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.
For noise types with a relatively concentrated power spectrum distribution, a weighting factor constructed from the short-time amplitude is used to spectrally weight the frequency spectrum of each frame of the primary signal, yielding the secondary signal. This whitens the spectrum of the noise signal to a certain extent, so that its power spectrum distribution becomes flatter and more uniform; the short-time spectrum entropy value of the noise signal therefore increases and its reciprocal becomes smaller. Meanwhile, the power spectrum of the voice signal is preserved, so its short-time spectrum entropy value is small and the reciprocal is large. The voice signal and the noise signal can thus be distinguished, and the accuracy of voice endpoint detection is improved. The energy-based endpoint detection method is integrated into the spectral entropy method, and the short-time amplitude is applied to the spectral whitening as an exponent, which controls the degree of whitening; accurate endpoint detection can therefore be performed under noise types with a relatively concentrated power spectrum distribution, effectively improving the accuracy of spectral-entropy-based voice endpoint detection.
Drawings
FIG. 1 is a flow chart illustrating the steps of the voice endpoint detection method provided by the present invention;
FIG. 2 is a schematic flow chart of the voice endpoint detection method provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment of the present invention:
Referring to FIG. 1, the first embodiment of the present invention provides a voice endpoint detection method, which at least includes:
S1: filtering and framing the received voice signal to obtain a primary signal;
S2: calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal;
S3: constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the energy-based endpoint detection method is integrated into the spectral entropy method, and the short-time amplitude is weighted on spectral whitening through an exponential form, so that the effect of controlling the spectral whitening degree can be achieved, the endpoint detection can be accurately carried out by using the spectral entropy method under the noise type with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is effectively improved.
S4: calculating the power spectrum of each frame of the secondary signal, and calculating the sum of spectral energy;
S5: calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;
S6: taking the average value of the reciprocals of the short-time spectrum entropy values of a plurality of frames as the detection threshold of the voice endpoint to judge speech frames and noise frames.
In this embodiment, for noise types with a relatively concentrated power spectrum distribution, a weighting factor constructed from the short-time amplitude is used to spectrally weight the frequency spectrum of each frame of the primary signal, yielding the secondary signal. This whitens the spectrum of the noise signal to a certain extent, so that its power spectrum distribution becomes flatter and more uniform, its short-time spectrum entropy value increases, and the reciprocal of that value becomes smaller. The power spectrum of the voice signal is preserved, so its short-time spectrum entropy value is small and the reciprocal is large. The voice signal and the noise signal can thus be distinguished, improving the accuracy of spectral-entropy-based voice endpoint detection.
In the embodiment of the present invention, the average of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to determine a speech frame and a noise frame, specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-time spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-time spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
In the embodiment of the present invention, the calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal specifically includes:
calculating the short-time amplitude $E(n)$ of each frame of the primary signal by using an energy-based endpoint detection method;
calculating the frequency spectrum $X(n,l)$ of each frame of the primary signal by using the Fourier transform;
wherein the primary signal is $x(n,m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length;
$X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and $l$ is the frequency index.
In this embodiment of the present invention, the constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal specifically includes:
normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$;
performing spectrum weighting on the frequency spectrum $X(n,l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n,l)$;
wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$;
$X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$.
Therefore, the energy-based endpoint detection method is integrated into the spectral entropy method, and the short-time amplitude is weighted to spectral whitening in an exponential mode, so that the effect of controlling the spectral whitening degree can be achieved, the accurate endpoint detection can be carried out by using the spectral entropy method under the noise type with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is improved.
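How the exponent controls the degree of whitening can be made explicit by taking magnitudes in the weighting formula. The short derivation below is a worked illustration, valid for bins with $|X(n,l)| > 0$; the limiting entropy value assumes the usual Shannon form $H(n) = -\sum_l P(n,l) \log P(n,l)$.

```latex
\begin{align*}
|X_g(n,l)| &= \frac{|X(n,l)|}{|X(n,l)|^{e(n)}} = |X(n,l)|^{1-e(n)} = |X(n,l)|^{E_g(n)} \\
E_g(n) = 1,\ e(n) = 0 &: \quad X_g(n,l) = X(n,l)
  \quad \text{(loudest frame, spectrum unchanged)} \\
E_g(n) \to 0,\ e(n) \to 1 &: \quad |X_g(n,l)| \to 1,\ \ P(n,l) \to \tfrac{1}{L},\ \ H(n) \to \log L
  \quad \text{(quiet frame, fully whitened)}
\end{align*}
```

Low-amplitude frames are therefore pushed toward the maximum entropy regardless of how concentrated their original spectrum was, while high-amplitude frames keep an entropy close to that of their original spectrum.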
In this embodiment of the present invention, the calculating a power spectrum of each frame of the secondary signal and calculating a sum of spectral energies specifically includes:
calculating a power spectrum modulus $S(n,l)$ of each frame of the secondary signal, and calculating a spectral energy sum $Y(n)$;
wherein $S(n,l) = |X_g(n,l) \cdot X_g(n,l)|$,
$Y(n) = \sum_{l=1}^{L} S(n,l)$,
where $L$ is the length of the Fourier transform;
In this embodiment of the present invention, the calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of spectral energy specifically includes:
calculating a spectral probability density function $P(n,l)$ of each frame of the secondary signal according to the power spectrum modulus $S(n,l)$ and the spectral energy sum $Y(n)$;
calculating a short-time spectrum entropy value $H(n)$ of each frame of the secondary signal according to the spectral probability density function $P(n,l)$ of each frame of the secondary signal;
wherein $P(n,l) = S(n,l)/Y(n)$;
$H(n) = -\sum_{l=1}^{L} P(n,l) \log P(n,l)$.
In the embodiment of the present invention, the taking the average of the reciprocals of the short-time spectrum entropy values of a plurality of frames as the detection threshold of the voice endpoint to determine speech frames and noise frames specifically includes:
taking the average value of the reciprocals of the spectrum entropy values of the first $Z$ consecutive frames among the $N$ frames of spectrum entropy values as the detection threshold $K$ of the voice endpoint;
wherein
$K = \frac{1}{Z} \sum_{n=1}^{Z} J(n)$,
$Z \ll N$, $J(n) = 1/H(n)$.
Referring to FIG. 2, the flow of one possible embodiment of the voice endpoint detection method of the present invention is as follows:
1. receiving a voice signal to be detected by a microphone, and recording the voice signal to be detected as $x(t)$;
2. filtering and framing the received voice signal to obtain a primary signal, recorded as $x(n,m)$, where $n = 1, 2, 3, \ldots, N$ is the frame index, $N$ is the number of frames, $m = 1, 2, 3, \ldots, M$, and $M$ is the frame length of each frame;
3. estimating the short-time amplitude $E(n)$ of each frame of the primary signal $x(n,m)$, calculated as follows:
[Equation image not available in this text: definition of the short-time amplitude $E(n)$]
4. normalizing the short-time amplitude $E(n)$ of each frame of the primary signal to obtain $E_g(n)$ and constructing a weighting factor $e(n)$, calculated as follows:
$E_g(n) = E(n)/\max(E(n))$,
$e(n) = 1 - E_g(n)$;
5. performing a Fourier transform on each frame of the primary signal $x(n,m)$ to obtain the frequency spectrum $X(n,l)$ of each frame of the primary signal, calculated as follows:
$X(n,l) = \mathrm{fft}(x(n,m))$,
where fft is the fast Fourier transform and $l$ is the frequency index;
6. performing spectrum weighting on the frequency spectrum $X(n,l)$ by using the weighting factor to obtain the secondary signal $X_g(n,l)$, calculated as follows:
$X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$;
7. calculating the power spectrum modulus $S(n,l)$ of each frame of the secondary signal, calculated as follows:
$S(n,l) = |X_g(n,l) \cdot X_g(n,l)|$;
8. calculating the spectral energy sum $Y(n)$ of each frame of the secondary signal, calculated as follows:
$Y(n) = \sum_{l=1}^{L} S(n,l)$,
where $L$ is the length of the Fourier transform;
9. calculating the spectral probability density function $P(n,l)$ of each frame of the secondary signal, calculated as follows:
$P(n,l) = S(n,l)/Y(n)$;
10. calculating the short-time spectrum entropy value $H(n)$ of each frame of the secondary signal, calculated as follows:
$H(n) = -\sum_{l=1}^{L} P(n,l) \log P(n,l)$;
11. calculating the reciprocal $J(n)$ of the short-time spectrum entropy value of each frame of the secondary signal, calculated as follows:
$J(n) = 1/H(n)$;
12. taking the average value of the reciprocals of the spectrum entropy values of the first 20 frames as the detection threshold $K$, calculated as follows:
$K = \frac{1}{20} \sum_{n=1}^{20} J(n)$.
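Steps 2 to 12 above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the filtering step is omitted, $E(n)$ is assumed to be the sum of absolute sample values, the entropy uses the natural logarithm, the decision compares $J(n)$ with $K$, and the first 20 frames are assumed to contain only noise. The function name, frame parameters, and synthetic test signal are hypothetical.

```python
import numpy as np

def detect_speech_frames(x, frame_len=256, hop=128, Z=20, eps=1e-12):
    # Step 2: framing (filtering omitted in this sketch).
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]                                   # x(n, m)
    # Step 3: short-time amplitude (assumed: sum of absolute values).
    E = np.abs(frames).sum(axis=1)
    # Step 4: normalized amplitude and weighting factor.
    e = 1.0 - E / E.max()
    # Step 5: spectrum of each frame.
    X = np.fft.fft(frames, axis=1)
    # Step 6: spectrum weighting; eps guards bins with zero magnitude.
    Xg = X / np.maximum(np.abs(X), eps) ** e[:, None]
    # Steps 7 to 9: power spectrum modulus, energy sum, spectral PDF.
    S = np.abs(Xg * Xg)
    P = S / S.sum(axis=1, keepdims=True)
    # Step 10: short-time spectral entropy (assumed Shannon form).
    H = -(P * np.log(P + eps)).sum(axis=1)
    # Steps 11 and 12: reciprocal and threshold from the first Z frames.
    J = 1.0 / H
    K = J[:Z].mean()
    return J > K, J, K                                # assumed decision rule: J(n) > K

# Synthetic test signal: narrow-band hum plus weak white noise throughout,
# with a louder harmonic "voice" segment between 1.0 s and 1.5 s.
fs = 8000
t = np.arange(2 * fs) / fs
rng = np.random.default_rng(0)
x = 0.05 * np.sin(2 * np.pi * 100 * t) + 0.01 * rng.standard_normal(t.size)
voiced = (t >= 1.0) & (t < 1.5)
harmonics = (np.sin(2 * np.pi * 200 * t)
             + 0.5 * np.sin(2 * np.pi * 400 * t)
             + 0.3 * np.sin(2 * np.pi * 600 * t))
x[voiced] += harmonics[voiced]

is_speech, J, K = detect_speech_frames(x)
print("threshold K:", K)
print("frames flagged as speech:", np.flatnonzero(is_speech))
```

In this example the hum has a concentrated spectrum, yet its frames are whitened because their amplitude is low, so their reciprocal entropy stays near the threshold while the voiced frames lie well above it. Because the threshold equals the noise average, some noise-only frames can still fall marginally above it.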
compared with the prior art, the voice endpoint detection method provided by the embodiment of the invention has the following beneficial effects:
(1) under the noise type with relatively concentrated power spectrum distribution, performing spectrum weighting processing on a weighting factor constructed by using a short-time amplitude value calculation result and the frequency spectrum of each frame of primary signal to obtain a secondary signal, so that whitening is performed on the frequency spectrum of the noise signal to a certain extent, the power spectrum distribution of the noise signal can be more flat and uniform, the short-time spectrum entropy value of the noise signal is further increased, and the reciprocal of the short-time spectrum entropy value of the noise signal is smaller; meanwhile, a power spectrum of the voice signal is reserved, the short-time spectrum entropy value of the voice signal is small, and the reciprocal of the short-time spectrum entropy value is large; therefore, the voice signal and the noise signal can be distinguished, and the accuracy of voice endpoint detection is improved.
(2) The energy-based endpoint detection method is integrated into the spectral entropy method, and the short-time amplitude is applied to the spectral whitening as an exponent, which controls the degree of whitening; accurate endpoint detection can therefore be performed under noise types with a relatively concentrated power spectrum distribution, effectively improving the accuracy of spectral-entropy-based voice endpoint detection.
(3) The spectrum whitening technology is utilized to whiten the frequency spectrum of the noise part signal to a certain degree, so that the power spectrum distribution of the noise signal is flatter and more uniform, and the spectrum entropy is increased; the power spectrum of the voice signal is reserved, the spectrum entropy is less, and the spectrum entropy of the voice signal and the spectrum entropy of the noise signal can be distinguished, so that the accuracy of detection under various noises is improved.
(4) An energy-based endpoint detection method is integrated into a spectrum entropy method, the method has the advantage of insensitivity to noise types, and the short-time amplitude is weighted to a spectrum whitening method in an exponential mode, so that the spectrum whitening degree is controlled; the method for weighting the frequency spectrum is combined with the method for weighting the short-time amplitude to the spectral whitening in an exponential mode, so that more accurate endpoint detection can be performed under various noise types, and the detection accuracy under various noises is improved.
Second embodiment of the invention:
a second embodiment of the present invention provides a voice endpoint detection apparatus, including:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the short-time amplitude and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the short-time amplitude value and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the power spectrum of each frame of the secondary signal and calculating the sum of spectral energy;
the third calculation module is used for calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
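The filtering and framing performed by the preprocessing module could look like the following minimal sketch. The text does not specify the filter or the frame overlap, so the first-order pre-emphasis filter, the coefficient `alpha`, and the parameters `frame_len` and `hop` are all illustrative assumptions.

```python
def pre_emphasis(samples, alpha=0.97):
    """First-order high-pass filter y[t] = x[t] - alpha * x[t-1] (assumed filter choice)."""
    if not samples:
        return []
    out = [samples[0]]
    for t in range(1, len(samples)):
        out.append(samples[t] - alpha * samples[t - 1])
    return out


def split_into_frames(samples, frame_len, hop):
    """Cut the signal into frames of frame_len samples, advancing hop samples each time.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


# Hypothetical 20-sample input; real input would be the received voice signal.
signal = [float(i % 5) for i in range(20)]
primary = split_into_frames(pre_emphasis(signal), frame_len=8, hop=4)
```

Each element of `primary` plays the role of one frame x(n, m) of the primary signal.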
In an embodiment of the present invention, the determining module is further configured to:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-time spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-time spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
The first computing module is further configured to:
calculating the short-time amplitude $E(n)$ of the primary signal of each frame by using an energy-based endpoint detection method;
calculating a frequency spectrum $X(n, l)$ of the primary signal of each frame by using the Fourier transform;
wherein the primary signal is $x(n, m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length;
$X(n, l) = \mathrm{fft}(x(n, m))$, where fft is the fast Fourier transform and $l$ is the frequency.
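The first computing module's two outputs can be sketched as follows. This is a minimal illustration, assuming the short-time amplitude is the sum of absolute sample values in a frame (the defining expression is not reproduced in this text) and using a naive DFT in place of an FFT so that the sketch needs only the standard library. The function names are hypothetical.

```python
import cmath


def short_time_amplitude(frame):
    """E(n): sum of absolute sample values in one frame (assumed definition)."""
    return sum(abs(sample) for sample in frame)


def dft(frame):
    """Naive DFT of one frame; an FFT computes the same X(n, l) faster."""
    size = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * k * m / size) for m in range(size))
        for k in range(size)  # k plays the role of the frequency index l
    ]


# Two hypothetical 4-sample frames x(n, m) of the primary signal.
frames = [[1.0, 0.0, -1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
E = [short_time_amplitude(frame) for frame in frames]
X = [dft(frame) for frame in frames]
```

The second frame is constant, so all of its energy lands in the zero-frequency bin, while the first frame alternates in sign and concentrates in the non-zero bins.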
The spectral weighting module is further configured to:
normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$;
performing spectrum weighting on the frequency spectrum $X(n, l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n, l)$;
wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n) / \max(E(n))$;
$X_g(n, l) = X(n, l) \mathbin{./} |X(n, l)|^{e(n)}$.
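The weighting step can be sketched as below, reading the weighting factor e(n) as an exponent on the bin magnitude, consistent with the statement that the short-time amplitude is applied to the whitening in an exponential manner. The `floor` guard against empty bins and the sample values are assumptions added for illustration.

```python
def weighting_factors(amplitudes):
    """e(n) = 1 - E(n) / max(E): loud frames get e near 0, quiet frames e near 1."""
    peak = max(amplitudes)
    return [1.0 - a / peak for a in amplitudes]


def weight_spectrum(spectrum, e, floor=1e-12):
    """X_g(n, l) = X(n, l) / |X(n, l)|**e, bin by bin.

    floor is a hypothetical guard against division by zero for empty bins.
    """
    return [x / max(abs(x), floor) ** e for x in spectrum]


# Hypothetical short-time amplitudes and spectra: one loud frame, one quiet frame.
E = [8.0, 2.0]
X = [[4 + 0j, 0 + 2j, 1 + 0j], [2 + 0j, 0.5 + 0j, 0.5 + 0j]]
e = weighting_factors(E)
X_g = [weight_spectrum(X[n], e[n]) for n in range(len(X))]
```

The loud frame has e = 0 and passes through unchanged. The quiet frame has e = 0.75, and the ratio between its largest and smallest bin magnitudes shrinks from 4 to about 1.41, which is the partial whitening the text describes. Note that max(E(n)) is taken over all frames, so this sketch needs the amplitudes of the whole utterance before any frame is weighted.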
The second computing module is further configured to:
calculating a power spectrum modulus $S(n, l)$ of each frame of the secondary signal, and calculating a spectrum energy sum $Y(n)$;
wherein $S(n, l) = |X_g(n, l) \mathbin{.*} X_g(n, l)|$,
$Y(n) = \sum_{l=1}^{L} S(n, l)$,
and $L$ is the length of the Fourier transform.
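A minimal sketch of the second computing module follows, assuming the spectrum energy sum is the plain sum of S(n, l) over all L bins. The sample spectrum is hypothetical.

```python
def power_spectrum(weighted_spectrum):
    """S(n, l) = |X_g(n, l) * X_g(n, l)|, which equals |X_g(n, l)|**2."""
    return [abs(x * x) for x in weighted_spectrum]


def spectral_energy_sum(power):
    """Y(n): sum of S(n, l) over all bins of one frame."""
    return sum(power)


# One hypothetical frame of the secondary signal with three bins.
X_g = [3 + 4j, 0 + 2j, 1 + 0j]
S = power_spectrum(X_g)
Y = spectral_energy_sum(S)
```

For this frame the bin magnitudes are 5, 2 and 1, so S is approximately [25, 4, 1] and Y is approximately 30.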
The third computing module is further configured to:
calculating a spectral probability density function $P(n, l)$ of each frame of the secondary signal according to the power spectrum modulus $S(n, l)$ and the spectrum energy sum $Y(n)$;
calculating a short-time spectrum entropy value $H(n)$ of each frame of the secondary signal according to the spectral probability density function $P(n, l)$ of each frame of the secondary signal;
wherein $P(n, l) = S(n, l) / Y(n)$;
$H(n) = -\sum_{l=1}^{L} P(n, l) \log P(n, l)$.
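The third computing module can be sketched as follows. The entropy is taken in the usual Shannon form with the natural logarithm, which is an assumption because the base of the logarithm is not stated here. The two example power spectra are hypothetical.

```python
import math


def spectral_probabilities(power):
    """P(n, l) = S(n, l) / Y(n), with Y(n) the sum of S(n, l) over the frame."""
    total = sum(power)
    return [s / total for s in power]


def short_time_spectral_entropy(probs):
    """H(n) = -sum of P * log(P); bins with P = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical power spectra: a peaked (speech-like) frame and a flat (whitened noise) frame.
S_speech = [25.0, 4.0, 1.0, 1.0]
S_noise = [1.0, 1.0, 1.0, 1.0]

H_speech = short_time_spectral_entropy(spectral_probabilities(S_speech))
H_noise = short_time_spectral_entropy(spectral_probabilities(S_noise))
```

The flat frame reaches the maximum entropy log 4, while the peaked frame stays well below it, so its reciprocal 1/H(n) is larger.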
The judging module is further configured to:
taking the average of the reciprocals of the spectrum entropy values of the first Z consecutive frames among the N frames of spectrum entropy values as the detection threshold $K$ of the voice endpoint;
wherein
$K = \frac{1}{Z} \sum_{n=1}^{Z} J(n)$,
$Z \ll N$, $J(n) = 1 / H(n)$.
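The threshold and the frame decision can be sketched as below. Two assumptions are made and should be checked against the intended method: the first Z frames are treated as containing only noise, and the comparison is performed on the reciprocal J(n) = 1/H(n), since K is itself a mean of reciprocals and voice frames are described as having a large reciprocal. The entropy values are hypothetical.

```python
def detection_threshold(entropies, z):
    """K: mean of J(n) = 1 / H(n) over the first z frames (assumed to be noise only)."""
    leading = entropies[:z]
    return sum(1.0 / h for h in leading) / len(leading)


def classify_frames(entropies, threshold):
    """Label a frame 'speech' when J(n) = 1 / H(n) exceeds the threshold, else 'noise'.

    Assumption: the comparison uses the reciprocal, because K is a mean of reciprocals.
    """
    return ["speech" if 1.0 / h > threshold else "noise" for h in entropies]


# Hypothetical entropy values H(n): three leading noise frames, two speech frames, one noise frame.
H = [2.0, 2.0, 2.0, 0.5, 0.4, 2.0]
K = detection_threshold(H, z=3)
labels = classify_frames(H, K)
```

Because frames equal to the threshold are labelled as noise, the leading frames in this example are all classified as noise and only the two low-entropy frames are classified as speech.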
Third embodiment of the invention:
The third embodiment of the present invention further provides a voice endpoint detection device comprising a processor, a memory, and a computer program, such as a voice endpoint detection program, stored in the memory and configured to be executed by the processor. When executing the computer program, the processor implements the steps of the voice endpoint detection method described above, such as step S1 shown in Fig. 1. Alternatively, when executing the computer program, the processor implements the functions of the modules/units in the above apparatus embodiment, such as the preprocessing module.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the voice endpoint detection apparatus.
The voice endpoint detection device may be a computing device such as a desktop computer, a notebook computer, a handheld computer, or a tablet. The voice endpoint detection device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that these components are merely an example and do not limit the voice endpoint detection device, which may include more or fewer components than those described, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the voice endpoint detection device and connects the various parts of the entire device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the voice endpoint detection device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created through use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated modules/units of the voice endpoint detection device are implemented as software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. On this understanding, all or part of the flow of the method embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A voice endpoint detection method is characterized by comprising the following steps:
filtering and framing the received voice signal to obtain a primary signal;
calculating the short-time amplitude and the frequency spectrum of each frame of the primary signal;
constructing a weighting factor according to the short-time amplitude, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; specifically: normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$; performing spectrum weighting on the frequency spectrum $X(n, l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n, l)$; wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n) / \max(E(n))$; $X_g(n, l) = X(n, l) \mathbin{./} |X(n, l)|^{e(n)}$; wherein
[Equation image not reproduced: definition of the short-time amplitude $E(n)$]
the primary signal is $x(n, m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length; $X(n, l) = \mathrm{fft}(x(n, m))$, fft is the fast Fourier transform, and $l$ is the frequency;
calculating the power spectrum of each frame of the secondary signal, and calculating the sum of spectral energy;
calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;
and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.
2. The method for detecting a speech endpoint according to claim 1, wherein the average of the reciprocals of the short-term spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-time spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-time spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
3. The method for detecting a voice endpoint according to claim 1, wherein the calculating the short-time amplitude and the frequency spectrum of the primary signal per frame specifically comprises:
calculating the short-time amplitude $E(n)$ of the primary signal of each frame by using an energy-based endpoint detection method;
calculating a frequency spectrum $X(n, l)$ of the primary signal of each frame by using the Fourier transform;
wherein
[Equation image not reproduced: definition of the short-time amplitude $E(n)$]
the primary signal is $x(n, m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length;
$X(n, l) = \mathrm{fft}(x(n, m))$, where fft is the fast Fourier transform and $l$ is the frequency.
4. The method for detecting a voice endpoint according to claim 1, wherein the calculating the power spectrum of the secondary signal of each frame and the calculating the sum of the spectral energies are specifically as follows:
calculating a power spectrum modulus $S(n, l)$ of each frame of the secondary signal, and calculating a spectrum energy sum $Y(n)$;
wherein $S(n, l) = |X_g(n, l) \mathbin{.*} X_g(n, l)|$,
$Y(n) = \sum_{l=1}^{L} S(n, l)$,
and $L$ is the length of the Fourier transform.
5. The method according to claim 4, wherein the calculating a short-term spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy comprises:
calculating a spectral probability density function $P(n, l)$ of each frame of the secondary signal according to the power spectrum modulus $S(n, l)$ and the spectrum energy sum $Y(n)$;
calculating a short-time spectrum entropy value $H(n)$ of each frame of the secondary signal according to the spectral probability density function $P(n, l)$ of each frame of the secondary signal;
wherein $P(n, l) = S(n, l) / Y(n)$;
$H(n) = -\sum_{l=1}^{L} P(n, l) \log P(n, l)$.
6. The method for detecting a speech endpoint according to claim 5, wherein the average of the reciprocals of the short-time spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:
taking the average of the reciprocals of the spectrum entropy values of the first Z consecutive frames among the N frames of spectrum entropy values as the detection threshold $K$ of the voice endpoint;
wherein
$K = \frac{1}{Z} \sum_{n=1}^{Z} \frac{1}{H(n)}$.
7. A voice endpoint detection apparatus, comprising:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the short-time amplitude and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the short-time amplitude and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal; specifically: normalizing the short-time amplitude $E(n)$ of each frame of the primary signal, and constructing a weighting factor $e(n)$; performing spectrum weighting on the frequency spectrum $X(n, l)$ of each frame of the primary signal by using the weighting factor $e(n)$ to obtain each frame of the secondary signal $X_g(n, l)$; wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n) / \max(E(n))$; $X_g(n, l) = X(n, l) \mathbin{./} |X(n, l)|^{e(n)}$; wherein
[Equation image not reproduced: definition of the short-time amplitude $E(n)$]
the primary signal is $x(n, m)$, $n = 1, 2, 3, \ldots, N$, $m = 1, 2, 3, \ldots, M$, $N$ is the number of frames, and $M$ is the frame length; $X(n, l) = \mathrm{fft}(x(n, m))$, fft is the fast Fourier transform, and $l$ is the frequency;
the second calculation module is used for calculating the power spectrum of each frame of the secondary signal and calculating the sum of spectral energy;
the third calculation module is used for calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
8. A voice endpoint detection device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the voice endpoint detection method according to any one of claims 1 to 6 when executing the computer program.
CN201910311947.7A 2019-04-16 2019-04-16 Voice endpoint detection method, device and equipment Active CN110047519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910311947.7A CN110047519B (en) 2019-04-16 2019-04-16 Voice endpoint detection method, device and equipment

Publications (2)

Publication Number Publication Date
CN110047519A (en) 2019-07-23
CN110047519B (en) 2021-08-24

Family

ID=67277750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910311947.7A Active CN110047519B (en) 2019-04-16 2019-04-16 Voice endpoint detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN110047519B (en)

Also Published As

Publication number Publication date
CN110047519A (en) 2019-07-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant