CN110047519A - Voice endpoint detection method, device and equipment - Google Patents

Voice endpoint detection method, device and equipment

Info

Publication number
CN110047519A
Authority
CN
China
Prior art keywords
spectrum
frame
energy
signal
short
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910311947.7A
Other languages
Chinese (zh)
Other versions
CN110047519B (en
Inventor
张承云
梁龙腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou University
Original Assignee
Guangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou University filed Critical Guangzhou University
Priority to CN201910311947.7A priority Critical patent/CN110047519B/en
Publication of CN110047519A publication Critical patent/CN110047519A/en
Application granted granted Critical
Publication of CN110047519B publication Critical patent/CN110047519B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L2025/783 Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)

Abstract

The invention discloses a voice endpoint detection method, comprising: filtering and framing a received voice signal to obtain a primary signal; calculating the energy and the frequency spectrum of each frame of the primary signal; constructing a weighting factor from the energy and using it to spectrally weight the frequency spectrum, obtaining a secondary signal; calculating the power spectrum and the spectral energy sum of each frame of the secondary signal; calculating the short-time spectral entropy of each frame of the secondary signal from the power spectrum and the spectral energy sum; and taking the average of the reciprocals of the short-time spectral entropy values of several frames as the detection threshold of the voice endpoint to distinguish speech frames from noise frames. The method provided by the invention is applicable to noise types whose power spectrum distribution is relatively concentrated, and improves the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, device and equipment
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a device for detecting a speech endpoint.
Background
Voice endpoint detection is a technique used in speech front-end processing. An endpoint detection algorithm locates the speech segments within a noisy signal, providing useful information to later stages such as sound source localization, speech enhancement, speech recognition and speech coding. Prior-art voice endpoint detection methods mainly comprise two steps: feature extraction and detection. First, features of the signal are extracted by some algorithm so that speech can be distinguished from noise; then the extracted features are examined by a detection method. Feature extraction is the core of voice endpoint detection and determines the accuracy of the final result.
In terms of processing domain, voice endpoint detection is mainly carried out in the frequency domain. A typical frequency-domain approach is the spectral entropy method, which exploits the fact that speech and noise have different spectral entropies: it detects endpoints by measuring how flat the power spectrum is, which requires computing the spectral entropy from the spectral probability density function (PDF). When the power spectrum of a signal is relatively flat or uniform, the distribution approaches equal probability, so the entropy takes a larger value and its reciprocal a smaller one; conversely, when the power spectrum is concentrated or uneven, the entropy takes a smaller value and its reciprocal a larger one. Because a speech signal has a formant structure, its power spectrum is concentrated and uneven, so its spectral entropy is low and the reciprocal is large. The power spectrum of noise such as white or pink noise is relatively spread out, so its spectral entropy is large and the reciprocal is small. Speech and noise can therefore be distinguished.
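The relationship between spectral flatness, entropy and its reciprocal can be illustrated with a minimal sketch. The function name `spectral_entropy`, the natural logarithm and the two toy spectra below are assumptions chosen for illustration and do not come from the patent text.

```python
import numpy as np

def spectral_entropy(power):
    """Shannon entropy (natural log, an assumption) of a power spectrum
    after normalising it to a probability distribution."""
    p = np.asarray(power, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(p * np.log(p)).sum())

# Hypothetical 8-bin spectra, for illustration only.
flat = np.ones(8)                     # noise-like: equal power in every bin
peaked = np.array([0.01, 0.01, 4.0, 0.01, 0.01, 2.0, 0.01, 0.01])  # formant-like

h_flat = spectral_entropy(flat)
h_peaked = spectral_entropy(peaked)

print("flat:   entropy", h_flat, "reciprocal", 1.0 / h_flat)
print("peaked: entropy", h_peaked, "reciprocal", 1.0 / h_peaked)
```

The flat spectrum reaches the maximum possible entropy for 8 bins, while the peaked spectrum yields a lower entropy and therefore a larger reciprocal.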
Because the spectral entropy method is little affected by the energy of the sound signal, it is fairly robust to noise. However, in real noisy environments such as restaurants or subways, which are filled with babble, vehicle noise and the like, both the noise and the speech have relatively concentrated power spectra, so endpoint detection based on the spectral entropy method becomes inaccurate.
Disclosure of Invention
The invention provides a voice endpoint detection method that aims to solve the technical problem that prior-art voice endpoint detection methods are inaccurate under noise with a concentrated power spectrum distribution. The invention is applicable to noise types with a relatively concentrated power spectrum distribution and improves the accuracy of voice endpoint detection.
In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection method, including:
filtering and framing the received voice signal to obtain a primary signal;
calculating the energy and frequency spectrum of the primary signal of each frame;
constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.
As a preferred scheme, the average value of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to judge speech frames and noise frames, and specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-term spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-term spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
As a preferred scheme, the calculating the energy and the frequency spectrum of the primary signal per frame specifically includes:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein $E(n)=\sum_{m=1}^{M} x(n,m)^{2}$, n = 1,2,3,…,N; the primary signal is x(n,m), n = 1,2,3,…,N, m = 1,2,3,…,M, where N is the number of frames and M is the frame length;
$X(n,l)=\mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and l is the frequency index.
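The energy and spectrum calculation can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper `frame_signal`, the frame length of 256, the hop of 128 and the random test signal are all assumptions, the filtering step is omitted because the text does not specify the filter, and the energy is taken as the per-frame sum of squared samples.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into (possibly overlapping) frames of shape (N, M)."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

rng = np.random.default_rng(0)
x = rng.standard_normal(1600)                       # stand-in for a filtered signal

frames = frame_signal(x, frame_len=256, hop=128)    # x(n, m)
E = np.sum(frames ** 2, axis=1)                     # E(n): short-time energy per frame
X = np.fft.fft(frames, axis=1)                      # X(n, l): one FFT per frame

print(frames.shape, E.shape, X.shape)
```

With NumPy's unnormalised FFT, the sum of |X(n,l)|^2 over l divided by the frame length equals E(n), which is a convenient sanity check on the framing.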
Preferably, the method is characterized in that a weighting factor is constructed according to the energy, and the spectrum is subjected to spectrum weighting by using the weighting factor to obtain a secondary signal, specifically:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein e(n) is the weighting factor, $e(n)=1-E_g(n)$, $E_g(n)=E(n)/\max(E(n))$;
$X_g(n,l)=X(n,l)\,./\,|X(n,l)|^{e(n)}$
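A minimal sketch of the weighting step is given below, reading e(n) as an exponent on the spectral magnitude. The function name `spectral_weighting`, the `eps` guard against zero-magnitude bins and the two toy frames are assumptions added for illustration.

```python
import numpy as np

def spectral_weighting(X, E, eps=1e-12):
    """Return Xg(n,l) = X(n,l) / |X(n,l)|**e(n) and the factors e(n).

    eps is an added safeguard (an assumption) so that a zero-magnitude bin
    does not cause a division by zero.
    """
    E = np.asarray(E, dtype=float)
    Eg = E / E.max()                  # Eg(n): energy normalised to [0, 1]
    e = 1.0 - Eg                      # e(n): close to 1 for quiet frames, 0 for the loudest
    mag = np.maximum(np.abs(X), eps)
    Xg = X / mag ** e[:, None]        # element-wise division, one exponent per frame
    return Xg, e

# Two hypothetical frames with the same spectral shape: one loud, one quiet.
X = np.array([[4.0, 1.0, 0.25, 1.0],
              [0.4, 0.1, 0.025, 0.1]], dtype=complex)
E = np.array([10.0, 0.1])

Xg, e = spectral_weighting(X, E)
print(e)
print(np.abs(Xg))
```

The loudest frame gets e(n) = 0 and passes through unchanged, while the quiet frame gets e(n) close to 1 and its magnitudes are pushed toward 1, that is, toward a flat (whitened) spectrum.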
As a preferred scheme, calculating the power spectrum and the spectral energy sum of each frame of the secondary signal specifically includes:
calculating a power spectrum modulus S(n,l) and a spectral energy sum Y(n) of each frame of the secondary signal;
wherein $S(n,l)=|X_g(n,l)\,.{*}\,X_g(n,l)|$, $Y(n)=\sum_{l=1}^{L} S(n,l)$, and L is the length of the Fourier transform;
as a preferred scheme, calculating the short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum specifically includes:
calculating a spectral probability density function P (n, l) of each frame of the secondary signal according to the power spectrum modulus S (n, l) and the spectrum energy sum Y (n);
calculating a short-time spectrum entropy value H (n) of each frame of the secondary signal according to a spectrum probability density function P (n, l) of each frame of the secondary signal;
wherein $P(n,l)=S(n,l)/Y(n)$, $H(n)=-\sum_{l=1}^{L} P(n,l)\log P(n,l)$;
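The power spectrum, probability density and entropy steps can be sketched together. The function name `short_time_spectral_entropy`, the natural logarithm and the convention 0·log 0 = 0 are assumptions, and the sketch assumes every frame has non-zero spectral energy.

```python
import numpy as np

def short_time_spectral_entropy(Xg):
    """Compute H(n) and P(n,l) from the weighted spectrum Xg(n,l)."""
    S = np.abs(Xg * Xg)                       # S(n,l) = |Xg .* Xg|
    Y = S.sum(axis=1, keepdims=True)          # Y(n): spectral energy sum (assumed non-zero)
    P = S / Y                                 # P(n,l): each row sums to 1
    logP = np.log(np.where(P > 0, P, 1.0))    # log(1) = 0, so empty bins contribute 0
    H = -np.sum(P * logP, axis=1)             # H(n)
    return H, P

# Hypothetical frames: a perfectly flat spectrum and a single-bin spectrum.
Xg = np.array([[1, 1, 1, 1],
               [2, 0, 0, 0]], dtype=complex)
H, P = short_time_spectral_entropy(Xg)
print(H)
```

The flat frame reaches the maximum entropy for four bins and the single-bin frame has zero entropy, the two extremes between which real frames fall.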
as a preferred scheme, the average value of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to judge speech frames and noise frames, and specifically:
taking the average value of the reciprocal of the continuous front Z frame spectrum entropy value in the N frames of spectrum entropy values as the detection threshold K of the voice endpoint;
wherein $K=\frac{1}{Z}\sum_{n=1}^{Z} J(n)$, Z<<N, J(n)=1/H(n).
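The threshold and the frame decision can be sketched as below. Two assumptions are made and should be read as such: the first Z frames are taken to contain only noise, and the comparison against K is made on the reciprocal J(n), since K is itself an average of reciprocals and speech is described as having the larger reciprocal. The function name `classify_frames` and the sample entropy values are hypothetical.

```python
import numpy as np

def classify_frames(H, Z):
    """Return a boolean speech mask and the threshold K.

    Assumes H(n) > 0 for every frame and that the first Z frames are noise.
    """
    J = 1.0 / np.asarray(H, dtype=float)   # J(n) = 1 / H(n)
    K = J[:Z].mean()                       # K: mean reciprocal over the first Z frames
    is_speech = J > K                      # J(n) <= K is judged to be noise
    return is_speech, K

# Hypothetical entropies: four leading noise frames, two speech frames, two noise frames.
H = np.array([2.0, 2.0, 2.0, 2.0, 0.5, 0.4, 2.0, 2.5])
is_speech, K = classify_frames(H, Z=4)
print(K, is_speech)
```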
in order to solve the same technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
the third calculation module is used for calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
In order to solve the above technical problem, an embodiment of the present invention provides a voice endpoint detection apparatus, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, where the processor implements the voice endpoint detection method as described above when executing the computer program.
Compared with the prior art, the embodiment of the invention has the following beneficial effects. The embodiment provides a voice endpoint detection method comprising: filtering and framing a received voice signal to obtain a primary signal; calculating the energy and the frequency spectrum of each frame of the primary signal; constructing a weighting factor from the energy and using it to spectrally weight the frequency spectrum, obtaining a secondary signal; calculating the power spectrum and the spectral energy sum of each frame of the secondary signal; calculating the short-time spectral entropy value of each frame of the secondary signal from the power spectrum and the spectral energy sum; and taking the average of the reciprocals of the short-time spectral entropy values of several frames as the detection threshold of the voice endpoint to judge speech frames and noise frames.
Under a noise type with a relatively concentrated power spectrum distribution, the weighting factor constructed from the energy is used to spectrally weight the frequency spectrum of each frame of the primary signal, yielding the secondary signal. This whitens the spectrum of the noise to a certain degree, making its power spectrum flatter and more uniform, which increases the short-time spectral entropy of the noise and makes its reciprocal smaller. Meanwhile, the power spectrum of the speech is preserved, so the short-time spectral entropy of the speech stays small and its reciprocal large. Speech and noise can therefore be distinguished, and the accuracy of voice endpoint detection is improved. The energy-based endpoint detection method is thus integrated into the spectral entropy method: the energy enters the spectral whitening as an exponent, which controls the degree of whitening, so that more accurate endpoint detection is possible under noise types with a relatively concentrated power spectrum distribution and the accuracy of spectral-entropy-based voice endpoint detection is effectively improved.
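How the exponent controls the degree of whitening can be seen by applying different values of e to one fixed, concentrated spectrum and watching the entropy. The helper `entropy_after_weighting`, the natural logarithm and the toy magnitudes are assumptions for illustration only.

```python
import numpy as np

def entropy_after_weighting(mag, e):
    """Spectral entropy of a single frame after dividing |X| by |X|**e."""
    S = (mag ** (1.0 - e)) ** 2        # power spectrum of the weighted frame
    P = S / S.sum()
    return float(-(P * np.log(P)).sum())

# Hypothetical concentrated magnitude spectrum (each bin half the previous one).
mag = np.array([8.0, 4.0, 2.0, 1.0, 0.5, 0.25, 0.125, 0.0625])

h0, h05, h1 = (entropy_after_weighting(mag, e) for e in (0.0, 0.5, 1.0))
print(h0, h05, h1)
```

With e = 0 the spectrum is untouched and the entropy stays low; with e = 1 the spectrum is fully whitened and the entropy reaches its maximum for 8 bins; intermediate values fall in between. Since e(n) = 1 - Eg(n), low-energy frames are whitened strongly and high-energy frames only slightly.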
Drawings
FIG. 1 is a flow chart illustrating the steps of a method for detecting a voice endpoint according to the present invention;
FIG. 2 is a schematic flow chart of a voice endpoint detection method provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The first embodiment of the present invention:
referring to fig. 1, a first embodiment of the present invention provides a method for detecting a voice endpoint, which at least includes:
S1: filtering and framing the received voice signal to obtain a primary signal;
S2: calculating the energy and frequency spectrum of the primary signal of each frame;
S3: constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the energy-based endpoint detection method is integrated into the spectral entropy method, and the energy is weighted to spectral whitening through an exponential form, so that the effect of controlling the spectral whitening degree can be achieved, the endpoint detection can be accurately carried out by using the spectral entropy method under the noise type with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is effectively improved.
S4: calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
S5: calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
S6: taking the average value of the reciprocals of the short-time spectral entropy values of a plurality of frames as the detection threshold of the voice endpoint to judge speech frames and noise frames.
In this embodiment, under a noise type with a relatively concentrated power spectrum distribution, the weighting factor constructed from the energy is used to spectrally weight the frequency spectrum of each frame of the primary signal, yielding the secondary signal. This whitens the spectrum of the noise to a certain degree, so that its power spectrum becomes flatter and more uniform, its short-time spectral entropy increases and the reciprocal of that entropy becomes smaller. The power spectrum of the speech is preserved, so its short-time spectral entropy is small and the reciprocal is large. Speech and noise can therefore be distinguished, and the accuracy of voice endpoint detection by the spectral entropy method is improved.
In the embodiment of the present invention, the average of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to determine a speech frame and a noise frame, specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-term spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-term spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
In this embodiment of the present invention, the calculating the energy and the frequency spectrum of the primary signal per frame specifically includes:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein $E(n)=\sum_{m=1}^{M} x(n,m)^{2}$, n = 1,2,3,…,N; the primary signal is x(n,m), n = 1,2,3,…,N, m = 1,2,3,…,M, where N is the number of frames and M is the frame length;
$X(n,l)=\mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and l is the frequency index.
In this embodiment of the present invention, the constructing a weighting factor according to the energy, and performing spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal specifically includes:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein e(n) is the weighting factor, $e(n)=1-E_g(n)$, $E_g(n)=E(n)/\max(E(n))$;
$X_g(n,l)=X(n,l)\,./\,|X(n,l)|^{e(n)}$
Therefore, the energy-based endpoint detection method is integrated into the spectral entropy method, and the energy is weighted to spectral whitening through an exponential form, so that the effect of controlling the spectral whitening degree can be achieved, the endpoint detection can be accurately carried out by using the spectral entropy method under the noise type with relatively concentrated power spectrum distribution, and the accuracy of voice endpoint detection is improved.
In this embodiment of the present invention, calculating the power spectrum and the spectral energy sum of each frame of the secondary signal specifically includes:
calculating a power spectrum modulus S(n,l) and a spectral energy sum Y(n) of each frame of the secondary signal;
wherein $S(n,l)=|X_g(n,l)\,.{*}\,X_g(n,l)|$, $Y(n)=\sum_{l=1}^{L} S(n,l)$, and L is the length of the Fourier transform;
in this embodiment of the present invention, calculating the short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum specifically includes:
calculating a spectral probability density function P (n, l) of each frame of the secondary signal according to the power spectrum modulus S (n, l) and the spectrum energy sum Y (n);
calculating a short-time spectrum entropy value H (n) of each frame of the secondary signal according to a spectrum probability density function P (n, l) of each frame of the secondary signal;
wherein $P(n,l)=S(n,l)/Y(n)$, $H(n)=-\sum_{l=1}^{L} P(n,l)\log P(n,l)$;
in the embodiment of the present invention, the average of the reciprocals of the short-time spectrum entropy values of a plurality of frames is used as a detection threshold of a speech endpoint to determine a speech frame and a noise frame, specifically:
taking the average value of the reciprocal of the continuous front Z frame spectrum entropy value in the N frames of spectrum entropy values as the detection threshold K of the voice endpoint;
wherein $K=\frac{1}{Z}\sum_{n=1}^{Z} J(n)$, Z<<N, J(n)=1/H(n).
Referring to fig. 2, a flow of one possible embodiment of the voice endpoint detection method of the present invention is as follows:
1. receiving a voice signal to be detected by a microphone, and recording the voice signal to be detected as x (t);
2. filtering and framing the received voice signal to obtain a primary signal, recorded as x(n,m), where n = 1,2,3,…,N indexes the N frames, m = 1,2,3,…,M indexes the samples within a frame, and M is the frame length;
3. estimating the energy E(n) of each frame of the primary signal x(n,m), calculated as follows:
$E(n)=\sum_{m=1}^{M} x(n,m)^{2}$;
4. normalizing the energy E(n) of each frame of the primary signal to obtain $E_g(n)$, and constructing the weighting factor e(n), calculated as follows:
$E_g(n)=E(n)/\max(E(n))$,
$e(n)=1-E_g(n)$;
5. performing a Fourier transform on each frame of the primary signal x(n,m) to obtain the frequency spectrum X(n,l) of each frame, calculated as follows:
$X(n,l)=\mathrm{fft}(x(n,m))$,
wherein fft is the fast Fourier transform and l is the frequency index;
6. performing spectrum weighting on the frequency spectrum X(n,l) by using the weighting factor to obtain the secondary signal $X_g(n,l)$, calculated as follows:
$X_g(n,l)=X(n,l)\,./\,|X(n,l)|^{e(n)}$;
7. calculating the power spectrum modulus S(n,l) of each frame of the secondary signal, calculated as follows:
$S(n,l)=|X_g(n,l)\,.{*}\,X_g(n,l)|$;
8. calculating the spectral energy sum Y(n) of each frame of the secondary signal, calculated as follows:
$Y(n)=\sum_{l=1}^{L} S(n,l)$,
wherein L is the length of the Fourier transform;
9. calculating the spectral probability density function P(n,l) of each frame of the secondary signal, calculated as follows:
$P(n,l)=S(n,l)/Y(n)$;
10. calculating the short-time spectral entropy H(n) of each frame of the secondary signal, calculated as follows:
$H(n)=-\sum_{l=1}^{L} P(n,l)\log P(n,l)$;
11. calculating the reciprocal J(n) of the short-time spectral entropy value of each frame of the secondary signal, calculated as follows:
$J(n)=1/H(n)$;
12. taking the average of the reciprocals J(n) of the spectral entropy values of the first 20 frames as the detection threshold K, calculated as follows:
$K=\frac{1}{20}\sum_{n=1}^{20} J(n)$.
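The flow above can be sketched end to end on a synthetic signal. This is an illustration under stated assumptions rather than the patented implementation: the function name `vad_weighted_spectral_entropy`, the frame length, hop, sampling rate and test signal are hypothetical; the filtering of step 2 is omitted because the filter is not specified; the natural logarithm is used; a small guard avoids raising zero to a power; and the final comparison is made on J(n) against K.

```python
import numpy as np

def vad_weighted_spectral_entropy(x, frame_len=256, hop=128, Z=20):
    """Return J(n), the threshold K and a boolean speech mask J(n) > K."""
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]                                   # step 2: x(n, m), no filtering here
    E = np.sum(frames ** 2, axis=1)                   # step 3: E(n)
    e = 1.0 - E / E.max()                             # step 4: e(n) = 1 - Eg(n)
    X = np.fft.fft(frames, axis=1)                    # step 5: X(n, l)
    mag = np.maximum(np.abs(X), 1e-12)                # guard for zero-magnitude bins
    Xg = X / mag ** e[:, None]                        # step 6: Xg(n, l)
    S = np.abs(Xg * Xg)                               # step 7: S(n, l)
    Y = S.sum(axis=1, keepdims=True)                  # step 8: Y(n)
    P = S / Y                                         # step 9: P(n, l)
    H = -np.sum(P * np.log(np.where(P > 0, P, 1.0)), axis=1)   # step 10: H(n)
    J = 1.0 / H                                       # step 11: J(n)
    K = J[:Z].mean()                                  # step 12: mean of the first Z values
    return J, K, J > K

# Synthetic test signal: weak background noise with a harmonic burst from 0.5 s to 0.8 s.
fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(fs)
burst = (t >= 0.5) & (t < 0.8)
x[burst] += sum(np.sin(2 * np.pi * f * t[burst]) for f in (200, 400, 600))

J, K, is_speech = vad_weighted_spectral_entropy(x)
print("threshold K:", K)
print("frames flagged as speech:", np.flatnonzero(is_speech))
```

With these settings, frames 32 to 48 lie entirely inside the burst and are the ones expected to stand clear of K. Because K is the mean of the leading frames' J(n), any leading frame whose J(n) lies slightly above that mean is also flagged; the text describes no margin or smoothing, so none is added here.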
compared with the prior art, the voice endpoint detection method provided by the embodiment of the invention has the following beneficial effects:
(1) under the noise type with relatively concentrated power spectrum distribution, performing spectrum weighting processing by using a weighting factor constructed by an energy calculation result and the frequency spectrum of each frame of primary signal to obtain a secondary signal, thereby whitening the frequency spectrum of the noise signal to a certain degree, enabling the power spectrum distribution of the noise signal to be flatter and more uniform, further increasing the short-term spectrum entropy value of the noise signal, and enabling the reciprocal of the short-term spectrum entropy value of the noise signal to be a smaller value; meanwhile, a power spectrum of the voice signal is reserved, the short-time spectrum entropy value of the voice signal is small, and the reciprocal of the short-time spectrum entropy value is large; therefore, the voice signal and the noise signal can be distinguished, and the accuracy of voice endpoint detection is improved.
(2) The energy-based endpoint detection method is integrated into the spectral entropy method: the energy enters the spectral whitening as an exponent, which controls the degree of whitening, so that more accurate endpoint detection is possible under noise types with a relatively concentrated power spectrum distribution and the accuracy of spectral-entropy-based voice endpoint detection is effectively improved.
(3) The spectrum whitening technology is utilized to whiten the frequency spectrum of the noise part signal to a certain degree, so that the power spectrum distribution of the noise signal is flatter and more uniform, and the spectrum entropy is increased; the power spectrum of the voice signal is reserved, the spectrum entropy is less, and the spectrum entropy of the voice signal and the spectrum entropy of the noise signal can be distinguished, so that the accuracy of detection under various noises is improved.
(4) An energy-based endpoint detection method is integrated into a spectrum entropy method, the method has the advantage of insensitivity to noise types, and energy is weighted to a spectrum whitening method in an exponential mode, so that the spectrum whitening degree is controlled; the method for weighting the frequency spectrum is combined with the method for weighting the energy to the spectral whitening in an exponential mode, and more accurate endpoint detection can be carried out under various noise types, so that the detection accuracy under various noises is improved.
Second embodiment of the invention:
a second embodiment of the present invention provides a voice endpoint detection apparatus, including:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the power spectrum and the spectral energy sum of each frame of the secondary signal;
the third calculation module is used for calculating a short-time spectral entropy value of each frame of the secondary signal according to the power spectrum and the spectral energy sum;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
In an embodiment of the present invention, the determining module is further configured to:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-term spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-term spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
The first calculation module is further configured to:
calculate the energy E(n) of each frame of the primary signal by an energy-based endpoint detection method;
calculate the frequency spectrum X(n, l) of each frame of the primary signal by using the Fourier transform;
wherein n = 1, 2, 3, …, N; the primary signal is x(n, m), with n = 1, 2, 3, …, N and m = 1, 2, 3, …, M; N is the number of frames and M is the frame length;
$X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and l is the frequency.
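A minimal sketch of the per-frame energy and spectrum computation follows. The energy formula is not legible in the text, so the usual short-time energy (sum of squared samples) is assumed here. A naive discrete Fourier transform stands in for the FFT to keep the example dependency-free; it produces the same values as an FFT of the same length. The function names are hypothetical.

```python
import cmath
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Assumed short-time energy E(n): the sum of squared samples of one frame."""
    return sum(s * s for s in frame)


def frame_spectrum(frame: List[float]) -> List[complex]:
    """Spectrum X(n, l) of one frame via a naive O(M^2) DFT.

    An FFT of the same length yields the same values, only faster.
    """
    m_len = len(frame)
    return [
        sum(frame[m] * cmath.exp(-2j * cmath.pi * l * m / m_len) for m in range(m_len))
        for l in range(m_len)
    ]
```

In practice an FFT routine from a numerical library would replace `frame_spectrum`.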
The spectrum weighting module is further configured to:
normalize the energy E(n) of each frame of the primary signal and construct a weighting factor e(n);
carry out spectrum weighting on the frequency spectrum X(n, l) of each frame of the primary signal by using the weighting factor e(n), to obtain each frame of the secondary signal $X_g(n,l)$;
wherein e(n) is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$;
$X_g(n,l) = X(n,l) \,./\, |X(n,l)|^{e(n)}$, where ./ denotes element-wise division.
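The weighting step can be sketched as follows. With e(n) close to 1 (a low-energy frame) the spectrum is almost fully whitened, and with e(n) = 0 (the highest-energy frame) it is left unchanged. The small constant `eps` that guards against division by zero, the handling of all-zero input, and the function names are assumptions that do not come from the text.

```python
from typing import List


def weighting_factors(energies: List[float]) -> List[float]:
    """e(n) = 1 - E(n) / max(E(n)) for every frame."""
    peak = max(energies)
    if peak == 0:
        # Assumption: for all-zero input, take E_g(n) = 0 so that e(n) = 1.
        return [1.0] * len(energies)
    return [1.0 - e / peak for e in energies]


def whiten_frame(spectrum: List[complex], e_n: float, eps: float = 1e-12) -> List[complex]:
    """X_g(n, l) = X(n, l) / |X(n, l)| ** e(n), applied bin by bin.

    eps (an assumption) bounds the magnitude from below to avoid dividing by zero.
    """
    return [c / (max(abs(c), eps) ** e_n) for c in spectrum]
```

For example, `whiten_frame(spectrum, 1.0)` scales every non-zero bin to unit magnitude while keeping its phase.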
The second calculation module is further configured to:
calculate the power spectrum modulus S(n, l) and the spectrum energy sum Y(n) of each frame of the secondary signal;
wherein $S(n,l) = |X_g(n,l) \,.\!*\, X_g(n,l)|$, $Y(n) = \sum_{l=1}^{L} S(n,l)$, and L is the length of the Fourier transform.
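A short sketch of the power spectrum modulus and its sum is given here. It assumes that the spectrum energy sum Y(n) is the sum of S(n, l) over all L bins of the frame, which is what the later normalization P(n, l) = S(n, l)/Y(n) requires. The function names are hypothetical.

```python
from typing import List


def power_spectrum(weighted_spectrum: List[complex]) -> List[float]:
    """S(n, l) = |X_g(n, l) * X_g(n, l)|, which equals |X_g(n, l)| squared."""
    return [abs(c * c) for c in weighted_spectrum]


def spectral_energy_sum(power: List[float]) -> float:
    """Y(n): assumed to be the sum of S(n, l) over all frequency bins of the frame."""
    return sum(power)
```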
The third calculation module is further configured to:
calculate a spectral probability density function P(n, l) of each frame of the secondary signal according to the power spectrum modulus S(n, l) and the spectrum energy sum Y(n);
calculate a short-time spectrum entropy value H(n) of each frame of the secondary signal according to the spectral probability density function P(n, l) of each frame of the secondary signal;
wherein $P(n,l) = S(n,l)/Y(n)$;
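The entropy computation can be sketched as follows. The entropy formula is not legible in the text, so the standard Shannon form, H(n) = -Σ P(n, l) log P(n, l), with the natural logarithm is assumed. The treatment of an all-zero frame as a flat distribution is also an assumption, and the function names are hypothetical.

```python
import math
from typing import List


def spectral_probabilities(power: List[float]) -> List[float]:
    """P(n, l) = S(n, l) / Y(n), where Y(n) is the sum of S(n, l) over l."""
    total = sum(power)
    if total == 0:
        # Assumption: an all-zero frame is treated as a flat distribution.
        return [1.0 / len(power)] * len(power)
    return [v / total for v in power]


def spectral_entropy(probs: List[float]) -> float:
    """Assumed Shannon entropy with natural log; zero-probability bins contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A flat spectrum over L bins gives the maximum value log(L), while a spectrum concentrated in a single bin gives 0.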
The judging module is further configured to:
take the average value of the reciprocals of the spectrum entropy values of the first Z consecutive frames among the N frames of spectrum entropy values as the detection threshold K of the voice endpoint;
wherein $K = \frac{1}{Z}\sum_{n=1}^{Z} J(n)$, $Z \ll N$, and $J(n) = 1/H(n)$.
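The threshold is the mean of J(n) = 1/H(n) over the first Z frames. A minimal sketch follows; the function name and the input checks are assumptions added for illustration.

```python
from typing import List


def detection_threshold(entropies: List[float], z: int) -> float:
    """K = (1/Z) * sum of J(n) over the first Z frames, with J(n) = 1 / H(n)."""
    if not 0 < z <= len(entropies):
        raise ValueError("z must satisfy 0 < z <= number of frames")
    head = entropies[:z]
    if any(h <= 0 for h in head):
        raise ValueError("entropy values must be positive to take reciprocals")
    return sum(1.0 / h for h in head) / z
```

The condition Z ≪ N suggests that only a short leading stretch of the recording is used to set the threshold; treating those leading frames as noise-only is an assumption of this sketch.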
Third embodiment of the invention:
The third embodiment of the present invention also provides a voice endpoint detection device comprising a processor, a memory, and a computer program, such as a voice endpoint detection program, stored in the memory and configured to be executed by the processor. When executing the computer program, the processor implements the steps of the voice endpoint detection method described above, such as step S1 shown in Fig. 1. Alternatively, when executing the computer program, the processor implements the functions of the modules/units in the above device embodiment, such as the preprocessing module.
Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the voice endpoint detection apparatus.
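As one hypothetical illustration of such a partition, the sketch chains one function per module into a single detection routine. It is a sketch under stated assumptions rather than the patented program: filtering is omitted, the energy is taken as the sum of squared samples, the entropy uses the natural logarithm, a naive DFT replaces the FFT, and the per-frame value compared with the threshold K is J(n) = 1/H(n). All names are invented for this example.

```python
import cmath
import math
from typing import List, Tuple


def framing(signal: List[float], m: int) -> List[List[float]]:
    """Preprocessing module (filtering omitted): non-overlapping frames of length m."""
    return [signal[i:i + m] for i in range(0, len(signal) - m + 1, m)]


def energy_and_spectrum(frames: List[List[float]]) -> Tuple[List[float], List[List[complex]]]:
    """First calculation module: assumed energy (sum of squares) and a naive DFT."""
    energies = [sum(s * s for s in f) for f in frames]
    spectra = [
        [sum(f[k] * cmath.exp(-2j * cmath.pi * l * k / len(f)) for k in range(len(f)))
         for l in range(len(f))]
        for f in frames
    ]
    return energies, spectra


def spectral_weighting(energies, spectra, eps: float = 1e-12):
    """Spectrum weighting module: e(n) = 1 - E(n)/max(E), X_g = X / |X| ** e(n)."""
    peak = max(energies) or 1.0  # assumption: avoid dividing by zero on silent input
    return [
        [c / max(abs(c), eps) ** (1.0 - e / peak) for c in spec]
        for e, spec in zip(energies, spectra)
    ]


def entropy_reciprocals(weighted, eps: float = 1e-12) -> List[float]:
    """Second and third calculation modules: S, Y, P, H, then J(n) = 1 / H(n)."""
    result = []
    for spec in weighted:
        s = [abs(c * c) for c in spec]
        y = sum(s)
        if y == 0:
            h = math.log(len(s))  # assumption: treat an empty spectrum as flat
        else:
            probs = [v / y for v in s]
            h = -sum(p * math.log(p) for p in probs if p > 0)
        result.append(1.0 / max(h, eps))
    return result


def judge(j_values: List[float], z: int) -> List[bool]:
    """Judging module: K is the mean of J over the first z frames; J > K means speech."""
    if not 0 < z <= len(j_values):
        raise ValueError("z must satisfy 0 < z <= number of frames")
    k = sum(j_values[:z]) / z
    return [j > k for j in j_values]


def detect(signal: List[float], m: int, z: int) -> List[bool]:
    """Run the modules in sequence and return one speech/noise label per frame."""
    energies, spectra = energy_and_spectrum(framing(signal, m))
    return judge(entropy_reciprocals(spectral_weighting(energies, spectra)), z)
```

Calling `detect(samples, m, z)` returns one boolean per frame, where `True` marks a frame judged to be speech.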
The voice endpoint detection device may be a computing device such as a desktop computer, a notebook computer, a handheld computer, or a tablet. The voice endpoint detection device may include, but is not limited to, a processor and a memory. Those skilled in the art will appreciate that the above components are merely examples and do not limit the voice endpoint detection device, which may include more or fewer components than those described above, may combine some components, or may use different components. For example, the voice endpoint detection device may also include input/output devices, network access devices, buses, and the like.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the voice endpoint detection device and connects the various parts of the entire voice endpoint detection device through various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the voice endpoint detection device by running or executing the computer programs and/or modules stored in the memory and by invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function or an image playing function). The data storage area may store data created through use of the device (such as audio data or a phonebook). In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated modules/units of the voice endpoint detection device are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above are implemented. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, or a software distribution medium. It should be noted that the content covered by the computer-readable medium may be expanded or restricted as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (9)

1. A voice endpoint detection method is characterized by comprising the following steps:
filtering and framing the received voice signal to obtain a primary signal;
calculating the energy and frequency spectrum of the primary signal of each frame;
constructing a weighting factor according to the energy, and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;
calculating a short-time spectrum entropy value of each frame of the secondary signal according to the power spectrum and the sum of the spectrum energies;
and taking the average value of the reciprocal of the short-time spectrum entropy values of a plurality of frames as the detection threshold value of the voice endpoint to judge the voice frame and the noise frame.
2. The method for detecting a speech endpoint according to claim 1, wherein the average of the reciprocals of the short-term spectrum entropy values of the frames is used as a detection threshold of the speech endpoint to determine the speech frame and the noise frame, specifically:
comparing the detection threshold with a short-time spectrum entropy value of each frame of the secondary signal;
when the short-term spectrum entropy value is larger than the detection threshold value, judging that a signal frame corresponding to the short-term spectrum entropy value is a speech frame;
and when the short-time spectrum entropy value is smaller than or equal to the detection threshold value, judging that the signal frame corresponding to the short-time spectrum entropy value is a noise frame.
3. The method for detecting a voice endpoint according to claim 1, wherein the calculating the energy and the spectrum of the primary signal per frame specifically comprises:
calculating the energy E (n) of the primary signal of each frame by an energy-based endpoint detection method;
calculating a frequency spectrum X (n, l) of the primary signal per frame by using Fourier transform;
wherein n = 1, 2, 3, …, N; the primary signal is x(n, m), with n = 1, 2, 3, …, N and m = 1, 2, 3, …, M; N is the number of frames and M is the frame length;
$X(n,l) = \mathrm{fft}(x(n,m))$, where fft is the fast Fourier transform and l is the frequency.
4. The method according to claim 3, wherein the step of constructing a weighting factor according to the energy and performing spectral weighting on the spectrum by using the weighting factor to obtain a secondary signal comprises:
normalizing the energy E (n) of the primary signal of each frame, and constructing a weighting factor e (n);
carrying out spectrum weighting on the frequency spectrum X (n, l) of each frame of the primary signal by using the weighting factor e (n) to obtain each frame of the secondary signal Xg(n,l);
wherein e(n) is the weighting factor, $e(n) = 1 - E_g(n)$, $E_g(n) = E(n)/\max(E(n))$;
$X_g(n,l) = X(n,l) \,./\, |X(n,l)|^{e(n)}$.
5. The method for detecting a voice endpoint according to claim 4, wherein the calculating the sum of the power spectrum and the spectral energy of the secondary signal of each frame is specifically:
calculating a power spectrum module value S (n, l) and a spectrum energy sum Y (n) of each frame of the secondary signal;
wherein $S(n,l) = |X_g(n,l) \,.\!*\, X_g(n,l)|$, $Y(n) = \sum_{l=1}^{L} S(n,l)$, and L is the length of the Fourier transform.
6. The method according to claim 5, wherein the calculating a short-term spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy comprises:
calculating a spectral probability density function P (n, l) of each frame of the secondary signal according to the power spectrum modulus S (n, l) and the spectrum energy sum Y (n);
calculating a short-time spectrum entropy value H (n) of each frame of the secondary signal according to a spectrum probability density function P (n, l) of each frame of the secondary signal;
wherein $P(n,l) = S(n,l)/Y(n)$.
7. The method for detecting a voice endpoint according to claim 6, wherein taking the average of the reciprocals of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint to judge the speech frames and the noise frames specifically comprises:
taking the average value of the reciprocals of the spectrum entropy values of the first Z consecutive frames among the N frames of spectrum entropy values as the detection threshold K of the voice endpoint;
wherein $K = \frac{1}{Z}\sum_{n=1}^{Z} J(n)$, $Z \ll N$, and $J(n) = 1/H(n)$.
8. A voice endpoint detection apparatus, comprising:
the preprocessing module is used for filtering and framing the received voice signal to obtain a primary signal;
the first calculation module is used for calculating the energy and the frequency spectrum of the primary signal of each frame;
the spectrum weighting module is used for constructing a weighting factor according to the energy and carrying out spectrum weighting on the frequency spectrum by using the weighting factor to obtain a secondary signal;
the second calculation module is used for calculating the sum of the power spectrum and the spectral energy of each frame of the secondary signal;
the third calculation module is used for calculating a short-time spectrum entropy value of each frame of the secondary signal according to the sum of the power spectrum and the spectrum energy;
and the judging module is used for judging the voice frame and the noise frame by taking the average value of the reciprocal of the short-time spectrum entropy values of the frames as the detection threshold of the voice endpoint.
9. A voice endpoint detection device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the voice endpoint detection method according to any one of claims 1-7 when executing the computer program.
CN201910311947.7A 2019-04-16 2019-04-16 Voice endpoint detection method, device and equipment Active CN110047519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910311947.7A CN110047519B (en) 2019-04-16 2019-04-16 Voice endpoint detection method, device and equipment

Publications (2)

Publication Number Publication Date
CN110047519A (en) 2019-07-23
CN110047519B (en) 2021-08-24

Family

ID=67277750

Country Status (1)

Country Link
CN (1) CN110047519B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1689072A (en) * 2002-08-16 2005-10-26 数字信号处理工厂有限公司 Method and system for processing subband signals using adaptive filters
CN1503467A (en) * 2002-11-25 2004-06-09 Noise matching for echo cancellers
KR100930061B1 (en) * 2008-01-22 2009-12-08 성균관대학교산학협력단 Signal detection method and apparatus
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN102044243A (en) * 2009-10-15 2011-05-04 华为技术有限公司 Method and device for voice activity detection (VAD) and encoder
CN101777349A (en) * 2009-12-08 2010-07-14 中国科学院自动化研究所 Auditory perception property-based signal subspace microphone array voice enhancement method
US20130267796A1 (en) * 2010-12-01 2013-10-10 Universitat Politecnica De Catalunya System and method for the simultaneous, non-invasive estimation of blood glucose, glucocorticoid level and blood pressure
US20120232890A1 (en) * 2011-03-11 2012-09-13 Kabushiki Kaisha Toshiba Apparatus and method for discriminating speech, and computer readable medium
US9123351B2 (en) * 2011-03-31 2015-09-01 Oki Electric Industry Co., Ltd. Speech segment determination device, and storage medium
CN103426440A (en) * 2013-08-22 2013-12-04 厦门大学 Voice endpoint detection device and voice endpoint detection method utilizing energy spectrum entropy spatial information
CN106536011A (en) * 2014-05-15 2017-03-22 布莱阿姆青年大学 Low-power miniature LED-based UV absorption detector with low detection limits for capillary liquid chromatography
EP3443557A1 (en) * 2016-04-12 2019-02-20 Fraunhofer Gesellschaft zur Förderung der Angewand Audio encoder for encoding an audio signal, method for encoding an audio signal and computer program under consideration of a detected peak spectral region in an upper frequency band
WO2018069719A1 (en) * 2016-10-16 2018-04-19 Sentimoto Limited Voice activity detection method and apparatus

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
CHAITANYA K, SINHA R.: "Energy and Entropy based Switching Algorithm for Speech Endpoint Detection in Varying SNR Conditions", 《NINTH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION》 *
RENEVEY P, DRYGAJLO A.: "Entropy based voice activity detection in very noisy conditions", 《SEVENTH EUROPEAN CONFERENCE ON SPEECH COMMUNICATION AND TECHNOLOGY》 *
VLAJ D, KAČIČ Z, KOS M.: "Voice activity detection algorithm using nonlinear spectral weights, hangover and hangbefore criteria", 《COMPUTERS & ELECTRICAL ENGINEERING》 *
WU B F, WANG K C: "Robust endpoint detection algorithm based on the adaptive band-partitioning spectral entropy in adverse environments", 《IEEE TRANSACTIONS ON SPEECH & AUDIO PROCESSING》 *
徐望: "连续语音识别的稳健性技术研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
梁龙腾: "基于传声器阵列的声源定位算法研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王博,郭英,韩立峰: "基于熵函数的语音端点检测算法研究", 《信号处理》 *
郑秋菊,李强,王岑: "噪声估计和谱熵结合的语音激活检测算法", 《现代电信科技》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110648692A (en) * 2019-09-26 2020-01-03 苏州思必驰信息科技有限公司 Voice endpoint detection method and system
CN110648692B (en) * 2019-09-26 2022-04-12 思必驰科技股份有限公司 Voice endpoint detection method and system
CN110995821A (en) * 2019-11-28 2020-04-10 深圳供电局有限公司 Power distribution network inspection system based on AI and intelligent helmet
CN111540368A (en) * 2020-05-07 2020-08-14 广州大学 Stable bird sound extraction method and device and computer readable storage medium
CN111540368B (en) * 2020-05-07 2023-03-14 广州大学 Stable bird sound extraction method and device and computer readable storage medium
CN111650559A (en) * 2020-06-12 2020-09-11 深圳市裂石影音科技有限公司 Real-time processing two-dimensional sound source positioning method
CN112612008A (en) * 2020-12-08 2021-04-06 中国人民解放军陆军工程大学 Method and device for extracting initial parameters of echo signals of high-speed projectile
CN116665717A (en) * 2023-08-02 2023-08-29 广东技术师范大学 Cross-subband spectral entropy weighted likelihood ratio voice detection method and system
CN116665717B (en) * 2023-08-02 2023-09-29 广东技术师范大学 Cross-subband spectral entropy weighted likelihood ratio voice detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant