CN110047519A - Voice endpoint detection method, apparatus and device
- Publication number
- CN110047519A (application CN201910311947.7A)
- Authority
- CN
- China
- Prior art keywords
- spectrum
- frame
- energy
- signal
- short
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L2025/783—Detection of presence or absence of voice signals based on threshold decision
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Telephonic Communication Services (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The invention discloses a voice endpoint detection method, comprising: filtering and framing a received voice signal to obtain a primary signal; calculating the energy and frequency spectrum of each frame of the primary signal; constructing a weighting factor from the energy and using it to spectrally weight the frequency spectrum, obtaining a secondary signal; calculating the power spectrum and the spectral energy sum of each frame of the secondary signal; calculating the short-time spectral entropy value of each frame of the secondary signal from the power spectrum and the spectral energy sum; and taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold to classify frames as speech frames or noise frames. The method is suited to noise types whose power spectrum distribution is relatively concentrated, and improves the accuracy of voice endpoint detection.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for detecting a speech endpoint.
Background
Voice endpoint detection is a technology applied in voice front-end processing. It uses an endpoint detection algorithm to extract the speech segments from a noisy signal, providing useful information for later algorithms and technologies such as sound source localization, speech enhancement, speech recognition and speech coding. Prior-art voice endpoint detection methods mainly comprise two steps: feature extraction from the voice signal, and detection of the voice signal. First, features of the voice signal are extracted by various algorithms in order to distinguish the voice signal from the noise signal; the extracted voice signal is then examined by various detection methods. Feature extraction is the core of voice endpoint detection and determines the accuracy of the final result.
In terms of processing domain, voice endpoint detection is mainly performed in the frequency domain. Frequency-domain endpoint detection based on the spectral entropy method distinguishes signals by exploiting the fact that voice signals and noise signals have different spectral entropies: it detects endpoints by measuring how flat the power spectrum is, which requires calculating the spectral entropy from the spectral probability density function (PDF). When the power spectrum of a signal is relatively flat or uniform, its distribution approaches equal probability, so the entropy function takes a larger value and its reciprocal a smaller value. Conversely, when the power spectrum is concentrated or uneven, the entropy function takes a smaller value and its reciprocal a larger value. Because a voice signal has a formant structure, its power spectrum is concentrated and uneven, so its spectral entropy is low and the reciprocal is large. The power spectrum of noise signals (white noise, pink noise and the like) is relatively spread out, so the spectral entropy is large and the reciprocal is small. The voice signal and the noise signal can therefore be distinguished.
Endpoint detection based on the spectral entropy method is only weakly affected by the energy of the sound signal, so it has a certain robustness to noise. However, in real noisy environments such as a restaurant or a subway, which are full of babble noise, vehicle noise and the like, both the noise signal and the voice signal have a relatively concentrated power spectrum distribution, so a voice endpoint detection method based on the spectral entropy method struggles to detect endpoints accurately.
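The relationship between spectral flatness, entropy and its reciprocal can be illustrated with a minimal sketch. The two toy power spectra, the helper name `spectral_entropy` and the use of the natural logarithm are assumptions made for illustration; they are not taken from the patent.

```python
import numpy as np

def spectral_entropy(power):
    """Shannon entropy (natural log) of a power spectrum normalized to sum to 1."""
    p = np.asarray(power, dtype=float)
    p = p / p.sum()              # spectral probability density function
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

# Hypothetical toy spectra with 64 bins each.
flat = np.ones(64)               # noise-like: equal power in every bin
peaked = np.full(64, 1e-3)       # speech-like: power concentrated in a few bins
peaked[[5, 12, 25]] = 1.0

h_flat = spectral_entropy(flat)      # large entropy, small reciprocal
h_peaked = spectral_entropy(peaked)  # small entropy, large reciprocal
```

The flat spectrum reaches the maximum possible entropy for 64 bins, while the peaked spectrum yields a much lower entropy and therefore a larger reciprocal.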
Disclosure of Invention
The invention provides a voice endpoint detection method that addresses the technical problem that prior-art voice endpoint detection methods struggle to detect endpoints accurately under noise with a concentrated power spectrum distribution. The method is suited to noise types with a relatively concentrated power spectrum distribution and improves the accuracy of voice endpoint detection.
In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection method, including:
filtering and framing the received voice signal to obtain a primary signal;
calculating the energy and frequency spectrum of the primary signal of each frame;
constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
and taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold, and classifying each frame as a speech frame or a noise frame.
As a preferred scheme, taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold and classifying speech frames and noise frames specifically comprises:
comparing the detection threshold with the reciprocal of the short-time spectral entropy value of each frame of the secondary signal;
when the reciprocal of the short-time spectral entropy value is greater than the detection threshold, judging the corresponding signal frame to be a speech frame;
and when the reciprocal of the short-time spectral entropy value is less than or equal to the detection threshold, judging the corresponding signal frame to be a noise frame.
As a preferred scheme, the calculating the energy and the frequency spectrum of the primary signal per frame specifically includes:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein $E(n) = \sum_{m=1}^{M} x(n,m)^2$, n = 1, 2, 3, …, N; the primary signal is x(n, m), with n = 1, 2, 3, …, N and m = 1, 2, 3, …, M, where N is the number of frames and M is the frame length;
$X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and l is the frequency index.
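The per-frame energy and spectrum can be sketched as follows. This is a minimal illustration, assuming the energy is the sum of squared samples in each frame and that the frames are already stored as rows of an N-by-M array; the function name and the optional `n_fft` parameter are hypothetical.

```python
import numpy as np

def frame_energy_and_spectrum(frames, n_fft=None):
    """Return E(n) and X(n, l) for frames given as an (N, M) array.

    Assumption: E(n) is the sum of squared samples of frame n.
    """
    frames = np.asarray(frames, dtype=float)
    energy = np.sum(frames ** 2, axis=1)              # E(n), shape (N,)
    spectrum = np.fft.fft(frames, n=n_fft, axis=1)    # X(n, l), shape (N, L)
    return energy, spectrum

# Two toy frames of length 4: an impulse and a constant.
energy, spectrum = frame_energy_and_spectrum([[1, 0, 0, 0], [1, 1, 1, 1]])
```

The impulse frame has a flat spectrum and the constant frame concentrates everything in the zero-frequency bin, which makes the pair a convenient sanity check.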
Preferably, the method is characterized in that a weighting factor is constructed according to the energy, and the spectrum is subjected to spectrum weighting by using the weighting factor to obtain a secondary signal, specifically:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein e(n) is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$;
$X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$, with the division and exponentiation applied element-wise.
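A minimal sketch of the weighting step, under the reading that each spectral bin is divided by its own magnitude raised to the power e(n). The function name and the small `eps` guard against zero-magnitude bins are assumptions added for numerical safety, and the sketch assumes at least one frame has non-zero energy.

```python
import numpy as np

def spectral_weighting(energy, spectrum, eps=1e-12):
    """Return e(n) and X_g(n, l) = X(n, l) / |X(n, l)| ** e(n)."""
    energy = np.asarray(energy, dtype=float)
    e_g = energy / energy.max()          # E_g(n): energy normalized to [0, 1]
    e = 1.0 - e_g                        # e(n): near 1 for weak frames, 0 for the strongest
    magnitude = np.maximum(np.abs(spectrum), eps)   # eps is an added guard, not in the source
    x_g = spectrum / magnitude ** e[:, None]        # element-wise, one exponent per frame
    return e, x_g

toy_spectrum = np.array([[2.0, 4.0], [2.0, 4.0]], dtype=complex)
e, x_g = spectral_weighting([1.0, 4.0], toy_spectrum)
```

The highest-energy frame gets e(n) = 0 and keeps its spectrum unchanged, while lower-energy frames have their magnitudes compressed toward 1.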
As a preferred scheme, calculating the power spectrum and the spectral energy sum of each frame of the secondary signal specifically comprises:
calculating a power spectrum modulus S(n, l) and a spectral energy sum Y(n) of each frame of the secondary signal;
wherein $S(n,l) = |X_g(n,l) \cdot X_g(n,l)|$, $Y(n) = \sum_{l=1}^{L} S(n,l)$, and L is the length of the Fourier transform;
As a preferred scheme, calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum specifically comprises:
calculating a spectral probability density function P(n, l) of each frame of the secondary signal according to the power spectrum modulus S(n, l) and the spectral energy sum Y(n);
calculating a short-time spectral entropy value H(n) of each frame of the secondary signal according to the spectral probability density function P(n, l) of each frame of the secondary signal;
wherein $P(n,l) = S(n,l)/Y(n)$ and $H(n) = -\sum_{l=1}^{L} P(n,l) \log P(n,l)$;
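The power spectrum, spectral probability density and short-time spectral entropy can be sketched together. This is an illustrative sketch: the function name is hypothetical, the natural logarithm is an assumed choice of base, and bins with zero probability are treated as contributing nothing to the entropy. It also assumes every frame has a non-zero spectral energy sum.

```python
import numpy as np

def short_time_spectral_entropy(x_g):
    """Return P(n, l) and H(n) for a weighted spectrum X_g of shape (N, L)."""
    s = np.abs(x_g * x_g)                    # S(n, l): power spectrum modulus
    y = s.sum(axis=1, keepdims=True)         # Y(n): spectral energy sum per frame
    p = s / y                                # P(n, l): rows sum to 1
    safe_p = np.where(p > 0, p, 1.0)         # log(1) = 0, so empty bins add nothing
    h = -(p * np.log(safe_p)).sum(axis=1)    # H(n)
    return p, h

toy_x_g = np.array([[1, 1, 1, 1], [1, 0, 0, 0]], dtype=complex)
p, h = short_time_spectral_entropy(toy_x_g)
```

A perfectly flat frame gives the maximum entropy for its length, and a frame with a single non-zero bin gives zero entropy, so its reciprocal is unbounded; a practical implementation would likely need to handle that case.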
As a preferred scheme, taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold and classifying speech frames and noise frames specifically comprises:
taking the mean of the reciprocals of the spectral entropy values of the first Z consecutive frames among the N frames as the voice endpoint detection threshold K;
wherein $K = \frac{1}{Z}\sum_{n=1}^{Z} J(n)$, $Z \ll N$, $J(n) = 1/H(n)$.
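The threshold and the frame decision can be sketched as below. The sketch assumes the first Z frames contain only noise and applies the comparison to the reciprocal J(n), since K is defined on reciprocals; the function name and the toy entropy values are hypothetical.

```python
import numpy as np

def classify_frames(h, z):
    """Return threshold K and a boolean speech mask from entropies H(n)."""
    j = 1.0 / np.asarray(h, dtype=float)   # J(n) = 1 / H(n)
    k = j[:z].mean()                       # K: mean of J over the first Z frames
    is_speech = j > k                      # J(n) <= K is treated as noise
    return k, is_speech

# Toy entropies: four leading noise frames, two low-entropy frames, one noise frame.
k, is_speech = classify_frames([4, 4, 4, 4, 1, 1, 4], z=4)
```

Frames whose reciprocal entropy merely equals the threshold are classified as noise, matching the "less than or equal to" branch of the rule.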
in order to solve the same technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
the third calculation module is used for calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
and the judging module is used for classifying speech frames and noise frames by taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold.
In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the voice endpoint detection method as described above when executing the computer program.
Compared with the prior art, the embodiments of the invention have the following beneficial effects. The embodiments provide a voice endpoint detection method comprising: filtering and framing a received voice signal to obtain a primary signal; calculating the energy and frequency spectrum of each frame of the primary signal; constructing a weighting factor from the energy and using it to spectrally weight the frequency spectrum, obtaining a secondary signal; calculating the power spectrum and the spectral energy sum of each frame of the secondary signal; calculating a short-time spectral entropy value of each frame of the secondary signal from the power spectrum and the spectral energy sum; and taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold to classify speech frames and noise frames.
Under noise types with a relatively concentrated power spectrum distribution, the spectrum of each frame of the primary signal is weighted by a factor constructed from the energy calculation result, yielding the secondary signal. This whitens the spectrum of the noise signal to a certain degree, making its power spectrum distribution flatter and more uniform, which increases the short-time spectral entropy of the noise signal and makes its reciprocal smaller. At the same time, the power spectrum of the voice signal is preserved, so its short-time spectral entropy stays small and its reciprocal large. The voice signal and the noise signal can therefore be distinguished, and the accuracy of voice endpoint detection is improved. The energy-based endpoint detection method is integrated into the spectral entropy method: the energy enters the spectral whitening as an exponent, which controls the degree of whitening, so that more accurate endpoint detection can be performed under noise types with a relatively concentrated power spectrum distribution, and the accuracy of spectral-entropy-based voice endpoint detection is effectively improved.
Drawings
FIG. 1 is a flow chart illustrating the steps of a method for detecting a voice endpoint according to the present invention;
FIG. 2 is a schematic flow chart of the voice endpoint detection method provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment of the present invention:
referring to fig. 1, a first embodiment of the present invention provides a method for detecting a voice endpoint, which at least includes:
S1: filtering and framing the received voice signal to obtain a primary signal;
S2: calculating the energy and frequency spectrum of each frame of the primary signal;
S3: constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
In step S3 the energy-based endpoint detection method is integrated into the spectral entropy method: the energy enters the spectral whitening as an exponent, which controls the degree of whitening, so that the spectral entropy method can detect endpoints accurately under noise types with a relatively concentrated power spectrum distribution.
S4: calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
S5: calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
S6: taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold, and classifying each frame as a speech frame or a noise frame.
In this embodiment, under noise types with a relatively concentrated power spectrum distribution, the spectrum of each frame of the primary signal is weighted by a factor constructed from the energy calculation result, yielding the secondary signal. This whitens the spectrum of the noise signal to a certain extent, so that its power spectrum distribution becomes flatter and more uniform, its short-time spectral entropy increases, and the reciprocal of that entropy becomes smaller. The power spectrum of the voice signal is retained, so its short-time spectral entropy is smaller and the reciprocal larger. The voice signal and the noise signal can therefore be distinguished, and the accuracy of spectral-entropy-based voice endpoint detection is improved.
In the embodiment of the present invention, taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold and classifying speech frames and noise frames specifically comprises:
comparing the detection threshold with the reciprocal of the short-time spectral entropy value of each frame of the secondary signal;
when the reciprocal of the short-time spectral entropy value is greater than the detection threshold, judging the corresponding signal frame to be a speech frame;
and when the reciprocal of the short-time spectral entropy value is less than or equal to the detection threshold, judging the corresponding signal frame to be a noise frame.
In this embodiment of the present invention, the calculating the energy and the frequency spectrum of the primary signal per frame specifically includes:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein $E(n) = \sum_{m=1}^{M} x(n,m)^2$, n = 1, 2, 3, …, N; the primary signal is x(n, m), with n = 1, 2, 3, …, N and m = 1, 2, 3, …, M, where N is the number of frames and M is the frame length;
$X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and l is the frequency index.
In this embodiment of the present invention, the constructing a weighting factor according to the energy, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal specifically includes:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein e(n) is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$;
$X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$, with the division and exponentiation applied element-wise.
The energy-based endpoint detection method is thus integrated into the spectral entropy method: the energy enters the spectral whitening as an exponent, which controls the degree of whitening, so that the spectral entropy method can detect endpoints accurately under noise types with a relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is improved.
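How the exponent controls the degree of whitening can be checked numerically with a small sketch. The toy spectrum, the chosen exponent values and the helper name are assumptions for illustration, and the sketch requires every bin to have non-zero magnitude.

```python
import numpy as np

def entropy_after_weighting(spectrum, e):
    """Spectral entropy of X / |X| ** e for a single frame and a fixed exponent e."""
    x_g = spectrum / np.abs(spectrum) ** e   # e = 0 keeps X, e = 1 forces unit magnitude
    s = np.abs(x_g * x_g)
    p = s / s.sum()
    return float(-(p * np.log(p)).sum())

# Hypothetical concentrated spectrum with an arbitrary common phase.
spectrum = np.array([8.0, 4.0, 2.0, 1.0]) * np.exp(1j * 0.3)
entropies = [entropy_after_weighting(spectrum, e) for e in (0.0, 0.5, 1.0)]
```

As the exponent moves from 0 to 1 the entropy of this frame rises toward the flat-spectrum maximum, which is the behavior the text relies on: low-energy frames (exponent near 1) are whitened strongly, high-energy frames (exponent near 0) are left almost untouched.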
In this embodiment of the present invention, calculating the power spectrum and the spectral energy sum of each frame of the secondary signal specifically comprises:
calculating a power spectrum modulus S(n, l) and a spectral energy sum Y(n) of each frame of the secondary signal;
wherein $S(n,l) = |X_g(n,l) \cdot X_g(n,l)|$, $Y(n) = \sum_{l=1}^{L} S(n,l)$, and L is the length of the Fourier transform;
In this embodiment of the present invention, calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum specifically comprises:
calculating a spectral probability density function P(n, l) of each frame of the secondary signal according to the power spectrum modulus S(n, l) and the spectral energy sum Y(n);
calculating a short-time spectral entropy value H(n) of each frame of the secondary signal according to the spectral probability density function P(n, l) of each frame of the secondary signal;
wherein $P(n,l) = S(n,l)/Y(n)$ and $H(n) = -\sum_{l=1}^{L} P(n,l) \log P(n,l)$;
In the embodiment of the present invention, taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold and classifying speech frames and noise frames specifically comprises:
taking the mean of the reciprocals of the spectral entropy values of the first Z consecutive frames among the N frames as the voice endpoint detection threshold K;
wherein $K = \frac{1}{Z}\sum_{n=1}^{Z} J(n)$, $Z \ll N$, $J(n) = 1/H(n)$.
Referring to fig. 2, a flow of one possible embodiment of the voice endpoint detection method of the present invention is as follows:
1. receiving a voice signal to be detected by a microphone, and recording the voice signal to be detected as x (t);
2. filtering and framing the received voice signal to obtain a primary signal, recorded as x(n, m), where n = 1, 2, 3, …, N indexes the N frames, m = 1, 2, 3, …, M, and M is the frame length of each frame;
3. estimating the energy E(n) of each frame of the primary signal x(n, m), calculated as follows:
$E(n) = \sum_{m=1}^{M} x(n,m)^2$;
4. normalizing the energy E(n) of each frame of the primary signal to obtain E_g(n) and constructing the weighting factor e(n), calculated as follows:
$E_g(n) = E(n)/\max(E(n))$,
$e(n) = 1 - E_g(n)$;
5. performing a Fourier transform on each frame of the primary signal x(n, m) to obtain the frequency spectrum X(n, l) of each frame, calculated as follows:
$X(n,l) = \mathrm{fft}(x(n,m))$,
where fft is the fast Fourier transform and l is the frequency index;
6. performing spectrum weighting on the frequency spectrum X(n, l) by using the weighting factor to obtain the secondary signal X_g(n, l), calculated as follows:
$X_g(n,l) = X(n,l) / |X(n,l)|^{e(n)}$;
7. calculating the power spectrum modulus S(n, l) of each frame of the secondary signal, calculated as follows:
$S(n,l) = |X_g(n,l) \cdot X_g(n,l)|$;
8. calculating the spectral energy sum Y(n) of each frame of the secondary signal, calculated as follows:
$Y(n) = \sum_{l=1}^{L} S(n,l)$,
where L is the length of the Fourier transform;
9. calculating the spectral probability density function P(n, l) of each frame of the secondary signal, calculated as follows:
$P(n,l) = S(n,l)/Y(n)$;
10. calculating the short-time spectral entropy H(n) of each frame of the secondary signal, calculated as follows:
$H(n) = -\sum_{l=1}^{L} P(n,l) \log P(n,l)$;
11. calculating the reciprocal J(n) of the short-time spectral entropy value of each frame of the secondary signal, calculated as follows:
$J(n) = 1/H(n)$;
12. taking the mean of the reciprocals J(n) of the spectral entropy values of the first 20 frames as the detection threshold K, calculated as follows:
$K = \frac{1}{20}\sum_{n=1}^{20} J(n)$.
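The numbered flow above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the filtering in step 2 is omitted because the filter is not specified, the frame length of 256 with a hop of 128 is a hypothetical choice, the energy is taken as the sum of squared samples, the logarithm is natural, and the final comparison is applied to J(n). The names `frame_signal`, `detect_speech_frames` and `eps` are illustrative.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, shape (N, M)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def detect_speech_frames(x, frame_len=256, hop=128, n_noise_frames=20, eps=1e-12):
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)  # step 2, no filter
    energy = np.sum(frames ** 2, axis=1)                       # step 3: E(n), assumed form
    e = 1.0 - energy / energy.max()                            # step 4: e(n)
    spectrum = np.fft.fft(frames, axis=1)                      # step 5: X(n, l)
    magnitude = np.maximum(np.abs(spectrum), eps)              # eps guard is an addition
    x_g = spectrum / magnitude ** e[:, None]                   # step 6: X_g(n, l)
    s = np.abs(x_g * x_g)                                      # step 7: S(n, l)
    y = s.sum(axis=1, keepdims=True)                           # step 8: Y(n)
    p = s / y                                                  # step 9: P(n, l)
    h = -(p * np.log(np.where(p > 0, p, 1.0))).sum(axis=1)     # step 10: H(n)
    j = 1.0 / h                                                # step 11: J(n)
    k = j[:n_noise_frames].mean()                              # step 12: K
    return j > k, j, k

# Synthetic input: weak noise throughout, with a 1 kHz tone (8 kHz sampling) in the middle.
rng = np.random.default_rng(0)
signal = 0.01 * rng.standard_normal(12800)
t = np.arange(6400, 9600)
signal[6400:9600] += np.sin(2 * np.pi * 1000 * t / 8000)
is_speech, j, k = detect_speech_frames(signal)
```

In this synthetic case the frames that lie fully inside the tone keep their concentrated spectrum and end up with a much larger J(n) than the whitened noise frames. Because K is the plain mean of J(n) over the leading frames, noise frames sitting marginally above that mean also pass the strict comparison; the sketch adds no margin, since the text does not specify one.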
Compared with the prior art, the voice endpoint detection method provided by the embodiment of the invention has the following beneficial effects:
(1) Under noise types with a relatively concentrated power spectrum distribution, the spectrum of each frame of the primary signal is weighted by a factor constructed from the energy calculation result, yielding the secondary signal. This whitens the spectrum of the noise signal to a certain degree, making its power spectrum distribution flatter and more uniform, which increases the short-time spectral entropy of the noise signal and makes its reciprocal smaller. At the same time, the power spectrum of the voice signal is preserved, so its short-time spectral entropy stays small and its reciprocal large. The voice signal and the noise signal can therefore be distinguished, and the accuracy of voice endpoint detection is improved.
(2) The energy-based endpoint detection method is integrated into the spectral entropy method: the energy enters the spectral whitening as an exponent, which controls the degree of whitening, so that more accurate endpoint detection can be performed under noise types with a relatively concentrated power spectrum distribution, and the accuracy of spectral-entropy-based voice endpoint detection is effectively improved.
(3) Spectral whitening is used to whiten the spectrum of the noise portion of the signal to a certain degree, so that the power spectrum distribution of the noise signal becomes flatter and more uniform and its spectral entropy increases. The power spectrum of the voice signal is preserved and its spectral entropy remains smaller, so the spectral entropies of the voice signal and the noise signal can be told apart, which improves detection accuracy under various noises.
(4) Integrating an energy-based endpoint detection method into the spectral entropy method retains the latter's insensitivity to noise type, while the energy is applied to the spectral whitening as an exponent so that the degree of whitening is controlled. Combining spectrum weighting with this exponential energy weighting allows more accurate endpoint detection under various noise types, which improves detection accuracy under various noises.
Second embodiment of the invention:
a second embodiment of the present invention provides a voice endpoint detection apparatus, including:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
the third calculation module is used for calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
and the judging module is used for classifying speech frames and noise frames by taking the mean of the reciprocals of the short-time spectral entropy values of several frames as the voice endpoint detection threshold.
In an embodiment of the present invention, the determining module is further configured to:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-time spectrum entropy value is larger than the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
The first computing module is further configured to:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein $E(n)$ is evaluated for $n = 1, 2, 3, \dots, N$; the primary signal is $x(n, m)$, with $n = 1, 2, 3, \dots, N$ and $m = 1, 2, 3, \dots, M$, where $N$ is the number of frames and $M$ is the frame length;
$X(n, l) = \mathrm{fft}(x(n, m))$, where fft is the fast Fourier transform and $l$ is the frequency.
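The two quantities computed by the first computing module can be sketched as follows. The formula for E(n) is not legible in this text, so the sketch assumes the usual short-time energy (sum of squared samples); the direct DFT stands in for the FFT, which yields the same values. The sample frames are hypothetical, and the frequency index is 0-based in the code.

```python
import cmath

def frame_energy(frame):
    """Short-time energy: sum of squared samples (assumed form of E(n))."""
    return sum(s * s for s in frame)

def dft(frame):
    """Direct O(M^2) discrete Fourier transform; an FFT returns the same values faster."""
    m_len = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * l * m / m_len) for m in range(m_len))
        for l in range(m_len)
    ]

# Hypothetical primary signal with N = 2 frames of M = 4 samples each.
frames = [[1.0, 0.0, -1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
E = [frame_energy(f) for f in frames]   # one energy per frame
X = [dft(f) for f in frames]            # one complex spectrum per frame
```

The constant second frame has all of its spectral content in bin 0, while the alternating first frame has none there.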
The spectral weighting module is further configured to:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, and $E_g(n) = E(n)/\max(E(n))$;
$X_g(n, l) = X(n, l) / |X(n, l)|^{e(n)}$ (element-wise division).
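A minimal sketch of the spectrum weighting step, reading the weighting factor e(n) as the exponent applied to the spectrum magnitude. The small constant `eps` is an added safeguard against empty bins and is not part of the source; the sketch also assumes at least one frame has non-zero energy. Input values are hypothetical.

```python
def weighting_factors(energies):
    """e(n) = 1 - E(n) / max(E(n)) for each frame (assumes max(E(n)) > 0)."""
    peak = max(energies)
    return [1.0 - e / peak for e in energies]

def weight_spectrum(spectrum, e_n, eps=1e-12):
    """X_g(n, l) = X(n, l) / |X(n, l)|**e(n); eps guards empty bins (an added safeguard)."""
    return [x / (max(abs(x), eps) ** e_n) for x in spectrum]

# Hypothetical energies for two frames and one shared example spectrum.
energies = [4.0, 1.0]
e = weighting_factors(energies)                   # loudest frame: e = 0, quiet frame: e = 0.75
loud = weight_spectrum([3 + 4j, 2 + 0j], e[0])    # e = 0 leaves the spectrum unchanged
quiet = weight_spectrum([3 + 4j, 2 + 0j], e[1])   # e = 0.75 compresses magnitudes towards 1
```

With e(n) = 0 the spectrum is kept as it is, and as e(n) approaches 1 every bin magnitude approaches 1, which is full whitening. Low-energy frames are therefore whitened more strongly than high-energy frames.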
The second computing module is further configured to:
calculating a power spectrum module value S (n, l) and a spectrum energy sum Y (n) of each frame of the secondary signal;
wherein $S(n, l) = |X_g(n, l) \cdot X_g(n, l)|$ (element-wise product), and $L$ is the length of the Fourier transform.
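The second computing module can be sketched as below. The formula for Y(n) is not legible in this text; the sketch assumes it is the sum of S(n, l) over all L bins of the transform. The weighted spectrum `Xg` is a hypothetical example.

```python
def power_spectrum(weighted):
    """S(n, l) = |X_g(n, l) * X_g(n, l)|, i.e. the squared magnitude of each bin."""
    return [abs(x * x) for x in weighted]

def spectral_energy_sum(power):
    """Y(n): sum of S(n, l) over all bins (assumed summation range: the full transform length)."""
    return sum(power)

# Hypothetical weighted spectrum of one frame with L = 3 bins.
Xg = [3 + 4j, 0 + 1j, 2 + 0j]
S = power_spectrum(Xg)
Y = spectral_energy_sum(S)
```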
The third computing module is further configured to:
calculating a spectral probability density function P (n, l) of each frame of the secondary signal according to the power spectrum modulus S (n, l) and the spectrum energy sum Y (n);
calculating a short-time spectrum entropy value H (n) of each frame of the secondary signal according to a spectrum probability density function P (n, l) of each frame of the secondary signal;
wherein $P(n, l) = S(n, l)/Y(n)$.
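A minimal sketch of the third computing module. The formula for H(n) is not legible in this text, so the sketch assumes the Shannon entropy with the natural logarithm, in which bins of zero probability contribute nothing. The power spectrum values are hypothetical.

```python
import math

def spectral_pdf(power, total):
    """P(n, l) = S(n, l) / Y(n) for every bin of one frame."""
    return [s / total for s in power]

def short_time_entropy(pdf):
    """Shannon entropy with natural log (assumed form of H(n)); zero bins contribute 0."""
    return -sum(p * math.log(p) for p in pdf if p > 0)

# Hypothetical power spectrum of one frame.
S = [1.0, 1.0, 2.0]
P = spectral_pdf(S, sum(S))   # [0.25, 0.25, 0.5]
H = short_time_entropy(P)     # 1.5 * log(2)
```

Because P(n, l) sums to 1 over the bins of a frame, H(n) depends only on the shape of the spectrum and not on its overall level.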
The judging module is further configured to:
taking the average of the reciprocals of the short-time spectrum entropy values of the first Z consecutive frames, among the N frames of spectrum entropy values, as the detection threshold K of the voice endpoint;
wherein $Z \ll N$ and $J(n) = 1/H(n)$.
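The threshold and the frame decision can be sketched as follows. The threshold K follows the wording above (mean of J(n) over the first Z frames). For the decision, this sketch assumes that the quantity compared against K is J(n) = 1/H(n), because K is itself an average of J(n) and speech frames are described as having the lower entropy; the text states the comparison in terms of the short-time spectrum entropy value, so this reading is an interpretation. The entropy values are hypothetical.

```python
def detection_threshold(entropies, z):
    """K = mean of J(n) = 1 / H(n) over the first z frames."""
    return sum(1.0 / h for h in entropies[:z]) / z

def classify_frames(entropies, k):
    """Label frame n as speech when J(n) = 1 / H(n) exceeds K (assumed reading of the comparison)."""
    return ["speech" if 1.0 / h > k else "noise" for h in entropies]

# Hypothetical entropies; the leading Z = 3 frames are assumed to contain noise only.
H = [2.0, 2.0, 2.0, 1.0, 0.8, 2.0]
K = detection_threshold(H, z=3)   # mean of 0.5, 0.5, 0.5
labels = classify_frames(H, K)
```

Using only the first Z frames presumes that the recording starts with a short stretch of background noise, so that K reflects the noise level of the current environment.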
Third embodiment of the invention:
The third embodiment of the present invention also provides a voice endpoint detection device comprising a processor, a memory, and a computer program, such as a voice endpoint detection program, stored in the memory and configured to be executed by the processor. The processor, when executing the computer program, implements the steps of the voice endpoint detection method described above, such as step S1 shown in fig. 1. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units in the above-mentioned apparatus embodiment, such as the preprocessing module.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the voice endpoint detection apparatus.
The voice endpoint detection device can be a desktop computer, a notebook computer, a palmtop computer, a smart tablet, or another computing device. The voice endpoint detection device may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the above components are merely examples of a voice endpoint detection device and do not limit it; the device may include more or fewer components than those described above, combine some components, or use different components. For example, the voice endpoint detection device may also include input/output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the voice endpoint detection device and connects the various parts of the entire device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the voice endpoint detection device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created through use of the device, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
The integrated modules/units of the voice endpoint detection device can be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as a stand-alone product. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, in accordance with legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.
Claims (9)
1. A voice endpoint detection method is characterized by comprising the following steps:
filtering and framing the received voice signal to obtain a primary signal;
calculating the energy and frequency spectrum of the primary signal of each frame;
constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;
calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;
and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.
2. The method for detecting a speech endpoint according to claim 1, wherein the average of the reciprocals of the short-term spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-time spectrum entropy value is larger than the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
3. The method for detecting a voice endpoint according to claim 1, wherein the calculating the energy and the spectrum of the primary signal per frame specifically comprises:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein $E(n)$ is evaluated for $n = 1, 2, 3, \dots, N$; the primary signal is $x(n, m)$, with $n = 1, 2, 3, \dots, N$ and $m = 1, 2, 3, \dots, M$, where $N$ is the number of frames and $M$ is the frame length;
$X(n, l) = \mathrm{fft}(x(n, m))$, where fft is the fast Fourier transform and $l$ is the frequency.
4. The method according to claim 3, wherein the step of constructing a weighting factor according to the energy and performing spectral weighting on the spectrum by using the weighting factor to obtain a secondary signal comprises:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein $e(n)$ is the weighting factor, $e(n) = 1 - E_g(n)$, and $E_g(n) = E(n)/\max(E(n))$;
$X_g(n, l) = X(n, l) / |X(n, l)|^{e(n)}$ (element-wise division).
5. The method for detecting a voice endpoint according to claim 4, wherein the calculating the sum of the power spectrum and the spectral energy of the secondary signal of each frame is specifically:
calculating a power spectrum module value S (n, l) and a spectrum energy sum Y (n) of each frame of the secondary signal;
wherein $S(n, l) = |X_g(n, l) \cdot X_g(n, l)|$ (element-wise product), and $L$ is the length of the Fourier transform.
6. The method according to claim 5, wherein the calculating a short-term spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy comprises:
calculating a spectral probability density function P (n, l) of each frame of the secondary signal according to the power spectrum modulus S (n, l) and the spectrum energy sum Y (n);
calculating a short-time spectrum entropy value H (n) of each frame of the secondary signal according to a spectrum probability density function P (n, l) of each frame of the secondary signal;
wherein $P(n, l) = S(n, l)/Y(n)$.
7. The method for detecting a speech endpoint according to claim 6, wherein the average of the reciprocals of the short-time spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:
taking the average of the reciprocals of the short-time spectrum entropy values of the first Z consecutive frames, among the N frames of spectrum entropy values, as the detection threshold K of the voice endpoint;
wherein $Z \ll N$ and $J(n) = 1/H(n)$.
8. A voice endpoint detection apparatus, comprising:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;
the third calculation module is used for calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
9. A voice endpoint detection device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the voice endpoint detection method according to any one of claims 1-7 when executing the computer program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910311947.7A CN110047519B (en) | 2019-04-16 | 2019-04-16 | Voice endpoint detection method, device and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110047519A (en) | 2019-07-23 |
CN110047519B (en) | 2021-08-24 |
Family
ID=67277750
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910311947.7A (Active) CN110047519B (en) | 2019-04-16 | 2019-04-16 | Voice endpoint detection method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110047519B (en) |
Patent Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1689072A (en) * | 2002-08-16 | 2005-10-26 | 数字信号处理工厂有限公司 | Method and system for processing subband signals using adaptive filters |
CN1503467A (en) * | 2002-11-25 | 2004-06-09 | | Noise matching for echo cancellers |
KR100930061B1 (en) * | 2008-01-22 | 2009-12-08 | 성균관대학교산학협력단 | Signal detection method and apparatus |
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN102044243A (en) * | 2009-10-15 | 2011-05-04 | 华为技术有限公司 | Method and device for voice activity detection (VAD) and encoder |
CN101777349A (en) * | 2009-12-08 | 2010-07-14 | 中国科学院自动化研究所 | Auditory perception property-based signal subspace microphone array voice enhancement method |
US20130267796A1 (en) * | 2010-12-01 | 2013-10-10 | Universitat Politecnica De Catalunya | System and method for the simultaneous, non-invasive estimation of blood glucose, glucocorticoid level and blood pressure |
US20120232890A1 (en) * | 2011-03-11 | 2012-09-13 | Kabushiki Kaisha Toshiba | Apparatus and method for discriminating speech, and computer readable medium |
US9123351B2 (en) * | 2011-03-31 | 2015-09-01 | Oki Electric Industry Co., Ltd. | Speech segment determination device, and storage medium |
CN103426440A (en) * | 2013-08-22 | 2013-12-04 | 厦门大学 | Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information |
CN106536011A (en) * | 2014-05-15 | 2017-03-22 | 布莱阿姆青年大学 | Low-power miniature LED-based UV absorption detector with low detection limits for capillary liquid chromatography |
EP3443557A1 (en) * | 2016-04-12 | 2019-02-20 | Fraunhofer Gesellschaft zur Förderung der Angewand | Audio encoder for encoding an audio signal, method for encoding an audio signal and computer program under consideration of a detected peak spectral region in an upper frequency band |
WO2018069719A1 (en) * | 2016-10-16 | 2018-04-19 | Sentimoto Limited | Voice activity detection method and apparatus |
Non-Patent Citations (8)
Title |
---|
CHAITANYA K, SINHA R.: "Energy and Entropy based Switching Algorithm for Speech Endpoint Detection in Varying SNR Conditions", 《NINTH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION》 * |
RENEVEY P, DRYGAJLO A.: "Entropy based voice activity detection in very noisy conditions", 《SEVENTH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY》 * |
VLAJ D, KAČIČ Z, KOS M.: "Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria", 《COMPUTERS & ELECTRICAL ENGINEERING》 * |
WU B F, WANG K C: "Robust endpoint detection algorithm based on the adaptive band-partitioning spectral entropy in adverse environments", 《IEEE TRANSACTIONS ON SPEECH & AUDIO PROCESSING》 * |
徐望: "连续语音识别的稳健性技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
梁龙腾: "基于传声器阵列的声源定位算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
王博,郭英,韩立峰: "基于熵函数的语音端点检测算法研究", 《信号处理》 * |
郑秋菊,李强,王岑: "噪声估计和谱熵结合的语音激活检测算法", 《现代电信科技》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110648692A (en) * | 2019-09-26 | 2020-01-03 | 苏州思必驰信息科技有限公司 | Voice endpoint detection method and system |
CN110648692B (en) * | 2019-09-26 | 2022-04-12 | 思必驰科技股份有限公司 | Voice endpoint detection method and system |
CN110995821A (en) * | 2019-11-28 | 2020-04-10 | 深圳供电局有限公司 | Power distribution network inspection system based on AI and intelligent helmet |
CN111540368A (en) * | 2020-05-07 | 2020-08-14 | 广州大学 | Stable bird sound extraction method and device and computer readable storage medium |
CN111540368B (en) * | 2020-05-07 | 2023-03-14 | 广州大学 | Stable bird sound extraction method and device and computer readable storage medium |
CN111650559A (en) * | 2020-06-12 | 2020-09-11 | 深圳市裂石影音科技有限公司 | Real-time processing two-dimensional sound source positioning method |
CN112612008A (en) * | 2020-12-08 | 2021-04-06 | 中国人民解放军陆军工程大学 | Method and device for extracting initial parameters of echo signals of high-speed projectile |
CN116665717A (en) * | 2023-08-02 | 2023-08-29 | 广东技术师范大学 | Cross-subband spectral entropy weighted likelihood ratio voice detection method and system |
CN116665717B (en) * | 2023-08-02 | 2023-09-29 | 广东技术师范大学 | Cross-subband spectral entropy weighted likelihood ratio voice detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110047519B (en) | 2021-08-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||