WO2020097841A1 - Voice activity detection method and apparatus, storage medium and electronic device - Google Patents

Voice activity detection method and apparatus, storage medium and electronic device

Info

Publication number
WO2020097841A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
frame
signal
frequency domain
noise reduction
Prior art date
Application number
PCT/CN2018/115601
Other languages
French (fr)
Chinese (zh)
Inventor
陈岩
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to CN201880097699.4A priority Critical patent/CN112955951A/en
Priority to PCT/CN2018/115601 priority patent/WO2020097841A1/en
Publication of WO2020097841A1 publication Critical patent/WO2020097841A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection

Definitions

  • the present application belongs to the technical field of terminals, and particularly relates to a voice endpoint detection method, device, storage medium, and electronic equipment.
  • voice processing technologies such as voiceprint wake-up and voice recognition have become increasingly mature.
  • speech endpoint detection technology greatly affects the performance of speech processing technology.
  • the detection premise of the voice endpoint detection technology is to assume that the voice signal is a short-term stationary signal, which leads to a low accuracy of the voice endpoint detection technology in different non-stationary noise environments.
  • Embodiments of the present application provide a voice endpoint detection method, device, storage medium, and electronic equipment, which can improve the accuracy of voice endpoint detection.
  • an embodiment of the present application provides a voice endpoint detection method, including:
  • an embodiment of the present application provides a voice endpoint detection device, including:
  • Acquisition module for acquiring noisy speech signals
  • a noise reduction module configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal
  • a calculation module used to calculate the spectral entropy ratio of the noise-reduced speech signal, and calculate the short-term energy of the noise-reduced speech signal;
  • the detection module is configured to perform speech endpoint detection based on the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice endpoint detection method provided in this embodiment.
  • an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is used to execute the computer program by calling the computer program stored in the memory:
  • in this embodiment, since noise in the noisy voice signal affects the accuracy of voice endpoint detection, noise reduction processing is first performed on the noisy voice signal to obtain a noise-reduced voice signal. The spectral entropy ratio and short-term energy of the noise-reduced voice signal are then used to detect its voice endpoints, which effectively improves the accuracy of voice endpoint detection.
  • FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 2 is a second schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 3 is a third schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 4 is a fourth schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice endpoint detection device provided by an embodiment of the present application.
  • FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • the process of the voice endpoint detection method may include:
  • Voice endpoint detection refers to detecting the presence or absence of voice in a noisy environment. It is usually used in voice processing systems such as voice coding and voice enhancement, where it can reduce the voice coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.
  • the noisy speech signal may refer to speech signals in different unstable noise environments.
  • the noisy speech signal is subjected to noise reduction processing to obtain a noise-reduced speech signal.
  • the electronic device may perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.
  • the frequency of the noise signal is lower than the frequency of the speech signal, and the lower the frequency of the noise signal, the smaller the impact on the accuracy of the detection of speech endpoints. Therefore, the electronic device can increase the frequency of the voice signal in the noisy voice signal and reduce the frequency of the noise signal in the noisy voice signal, so as to reduce the influence of the noise signal on the accuracy of voice endpoint detection.
  • noise reduction processing of the noisy speech signal is not limited to the above method, but may be other methods as long as the purpose of reducing the noise of the noisy speech signal can be achieved.
  • the spectral entropy ratio of the noise-reduced speech signal is calculated, and the short-term energy of the noise-reduced speech signal is calculated.
  • the electronic device can calculate the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal, so that speech endpoint detection can be performed according to these two characteristics of the noise-reduced speech signal.
  • speech endpoint detection is performed according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  • the electronic device can use the spectral entropy ratio of the noise-reduced speech signal and the short-term energy value of the noise-reduced speech signal as threshold parameters to perform speech endpoint detection, that is, to detect the presence or absence of speech, thereby detecting a valid speech segment.
  • the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that when the value of the spectral entropy ratio is A, it can be detected that there is speech, and when the value of the spectral entropy ratio is B, it can be detected that there is no speech. It can be set that when the short-term energy value is C, it can detect the presence of voice; when the short-term energy value is D, it can detect that there is no voice.
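The threshold idea above can be sketched as a per-frame decision. This is a minimal sketch under stated assumptions: the text only says that the values are set according to the actual situation, so the function name `frame_has_speech`, the single threshold per feature, and the rule that both features must reach their thresholds are all hypothetical choices made for illustration.

```python
def frame_has_speech(entropy_ratio, short_term_energy,
                     ratio_threshold, energy_threshold):
    """Classify one frame as speech or non-speech.

    Assumption: a frame counts as speech only when BOTH features
    reach their thresholds. The source does not fix this rule.
    """
    return (entropy_ratio >= ratio_threshold
            and short_term_energy >= energy_threshold)


# Hypothetical threshold values, chosen only for the example.
print(frame_has_speech(2.0, 0.5, ratio_threshold=1.5, energy_threshold=0.1))
print(frame_has_speech(1.0, 0.5, ratio_threshold=1.5, energy_threshold=0.1))
```

Requiring both features to agree is a conservative choice that trades some missed speech for fewer false detections; an OR rule would behave the other way round.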
  • in this embodiment, since noise in the noisy voice signal affects the accuracy of voice endpoint detection, the noisy voice signal is subjected to noise reduction processing to obtain a noise-reduced voice signal. The spectral entropy ratio and short-term energy of the noise-reduced speech signal are then used to detect its speech endpoints, which effectively improves the accuracy of speech endpoint detection.
  • FIG. 2 is a second schematic flowchart of a voice endpoint detection method according to an embodiment of the present application.
  • the process of the voice endpoint detection method may include:
  • the electronic device acquires a noisy speech signal.
  • a noisy speech signal may refer to a speech signal in different unstable noise environments.
  • the noisy speech signal is a time-domain signal.
  • the electronic device performs frame windowing processing on the noisy speech signal to obtain a multi-frame windowed time domain signal.
  • the electronic device performs frame windowing on the noisy speech signal y (n) to obtain a multi-frame windowed time domain signal.
  • the electronic device can usually take a frame length of 20ms and take a frame shift of 10ms to frame the noisy speech signal.
  • the windowed time domain signal is a time domain signal, and the windowed time domain signal of each frame may include a noise signal part and a speech signal part.
  • the noise signal and speech signal here are both time-domain signals.
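The framing step with a 20 ms frame length and a 10 ms frame shift can be sketched as follows. The text does not name the window function, so the Hamming window here is an assumption, as are the function name `frame_and_window` and the 8000 Hz sample rate used in the example.

```python
import math


def frame_and_window(samples, sample_rate, frame_ms=20, shift_ms=10):
    """Split a time-domain signal into overlapping, windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    # Assumption: Hamming window; the source only says "windowing".
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Trailing samples that do not fill a whole frame are dropped.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


# 400 samples at an assumed 8000 Hz: 160-sample frames, 80-sample shift.
frames = frame_and_window([1.0] * 400, sample_rate=8000)
print(len(frames), len(frames[0]))
```

With a shift of half the frame length, each sample away from the signal edges belongs to two frames, which smooths the transition between neighbouring frames.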
  • the electronic device performs Fourier transform on each frame of the windowed time domain signal of the multi-frame windowed time domain signal to obtain a multi-frame frequency domain signal.
  • each frame frequency domain signal may include a noise signal part and a voice signal part.
  • the noise signal and speech signal here are both frequency domain signals.
  • f is the frequency component
  • i is the frame index.
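To make the time-domain to frequency-domain step concrete, the following sketch computes a direct discrete Fourier transform of one windowed frame. The function name `dft` is hypothetical, and the direct O(N²) sum is used only for readability; a practical implementation would more likely use a fast Fourier transform.

```python
import cmath


def dft(frame):
    """Direct discrete Fourier transform of one frame.

    Returns one complex Fourier coefficient per frequency bin.
    """
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


# A constant frame has all of its energy in bin 0.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
print([round(abs(c), 6) for c in spectrum])
```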
  • the electronic device estimates the Fourier coefficients of the frequency domain signal for each frame.
  • the Fourier coefficient of the frequency domain signal of the i-th frame can be estimated using the following formula:
  • ⁇ (f, i) is the estimated prior signal-to-noise ratio
  • ⁇ (f, i) is the estimated posterior signal-to-noise ratio
  • p (f, i) represents the probability of the presence of speech
  • q (f, i) represents the probability of the absence of speech.
  • the electronic device performs noise reduction processing on the frequency domain signal of each frame according to the Fourier coefficients of the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.
  • in order to reduce the influence of the noise signal on the detection of the voice endpoint, the electronic device can reduce the Fourier coefficient of the noise signal portion in the frequency domain signal of each frame by using an appropriate G 0 .
  • the electronic device can also increase the Fourier coefficient of the speech signal part in the frequency domain signal of each frame.
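The estimation formula itself is not reproduced in this text, so the following is not the estimator of this embodiment. It is only a simplified soft-mask sketch of the idea that bins likely to contain speech are kept while bins likely to contain only noise are scaled down by a small gain. The function name `apply_gain`, the linear mixing rule, and the default gain values are all assumptions.

```python
def apply_gain(spectrum, speech_prob, g_speech=1.0, g0=0.1):
    """Attenuate bins that probably contain only noise.

    spectrum    : complex Fourier coefficients of one frame
    speech_prob : p(f, i), probability of speech presence per bin
    g0          : assumed small gain applied when speech is absent
    Assumption: gain = p * g_speech + q * g0 with q = 1 - p.
    """
    denoised = []
    for coeff, p in zip(spectrum, speech_prob):
        q = 1.0 - p  # probability that speech is absent
        gain = p * g_speech + q * g0
        denoised.append(gain * coeff)
    return denoised


# First bin is certainly speech, second bin is certainly noise.
print(apply_gain([2 + 0j, 2 + 0j], [1.0, 0.0]))
```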
  • the electronic device calculates the energy spectrum of the noise-reduced frequency domain signal for each frame.
  • the process 206 may be implemented by the processes 2061, 2062, 2063, and 2064, which may be:
  • the electronic device acquires the frequency band information of the noise-reduced frequency domain signal for each frame.
  • the electronic device can acquire the frequency range of the noise-reduced frequency domain signal of the i-th frame.
  • the electronic device divides the noise reduction frequency domain signal of each frame according to the frequency band information to obtain multiple sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame.
  • the frequency band of the noise reduction frequency domain signal of the ith frame is 500 Hz to 1400 Hz.
  • the electronic device may divide the noise reduction frequency domain signal of the i-th frame into multiple sub-noise reduction frequency domain signals according to the frequency band range. For example, assuming that the noise reduction frequency domain signal of the i-th frame is divided into 3 sub-noise reduction frequency domain signals, the first sub-noise reduction frequency domain signal may cover the frequency band range of 500 Hz to 800 Hz, the second 800 Hz to 1100 Hz, and the third 1100 Hz to 1400 Hz.
  • how the noise reduction frequency domain signal of each frame is divided, and into how many sub-noise reduction frequency domain signals, can be determined according to actual needs, and no specific limitation is made here.
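The 500 Hz to 1400 Hz example can be reproduced with a short sketch. Equal-width sub-bands are an assumption (the text leaves the division open), and the function name `split_band` is hypothetical.

```python
def split_band(low_hz, high_hz, num_subbands):
    """Split [low_hz, high_hz] into equal-width sub-bands.

    Returns a list of (lower_edge, upper_edge) tuples in Hz.
    """
    width = (high_hz - low_hz) / num_subbands
    return [(low_hz + k * width, low_hz + (k + 1) * width)
            for k in range(num_subbands)]


for lower, upper in split_band(500, 1400, 3):
    print(f"{lower:.0f} Hz to {upper:.0f} Hz")
```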
  • the electronic device calculates the energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals.
  • the formula for calculating the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame can be:
  • E (w, i) represents the energy spectrum of the wth sub-noise reduction frequency domain signal in the i frame
  • N b represents the total number of sub-noise reduction frequency domain signals
  • N can be set according to actual needs and is usually a power of 2. For example, N can be set to 256, 512, 1024, etc.
  • the electronic device calculates the energy spectrum of the noise reduction frequency domain signal of each frame according to the energy spectrum of each sub-noise reduction frequency domain signal.
  • the energy spectrum of the noise reduction frequency domain signal in the i-th frame is the sum of the energy spectrums of all sub-noise reduction frequency domain signals divided in the frame.
  • the formula for calculating the energy spectrum of the noise-reduced frequency domain signal in the i-th frame can be:
  • N b is the total number of sub-noise reduction frequency domain signals
  • E (i) is the energy spectrum of the i-th frame noise reduction frequency domain signal
  • E (w, i) is the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame.
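As stated above, the frame energy spectrum E(i) is the sum of E(w, i) over all N b sub-bands. The sketch below illustrates this. The sub-band formula is not reproduced in this text, so taking E(w, i) as the sum of squared magnitudes of the Fourier coefficients in the sub-band is an assumption, as are the function names and the even split of bins.

```python
def subband_energies(spectrum, num_subbands):
    """Return E(w, i) for each sub-band w of one frame.

    Assumption: sub-band energy is the sum of |X|^2 over its bins,
    and the bins are split evenly (leftover bins are ignored).
    """
    bins_per_band = len(spectrum) // num_subbands
    energies = []
    for w in range(num_subbands):
        band = spectrum[w * bins_per_band:(w + 1) * bins_per_band]
        energies.append(sum(abs(c) ** 2 for c in band))
    return energies


def frame_energy(energies):
    """E(i): sum of the sub-band energies of the frame."""
    return sum(energies)


e = subband_energies([1 + 0j, 1j, 2 + 0j, 0j], num_subbands=2)
print(e, frame_energy(e))
```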
  • the electronic device calculates the spectral entropy of the noise-reduced frequency domain signal for each frame.
  • the process 207 may be implemented through the process 2071 and the process 2072, which may be:
  • the electronic device calculates the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of each frame of the denoise frequency domain signal.
  • the calculation formula of the normalized density of the w-th sub-noise reduction frequency domain signal in the i-th frame can be:
  • w represents the wth sub-noise reduction frequency domain signal
  • N b represents the total number of sub-noise reduction frequency domain signals
  • p (w, i) represents the normalized density of the wth sub-noise reduction frequency domain signal in the i frame
  • E (w, i) represents the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame
  • E (i) represents the energy spectrum of the i-th frame noise reduction frequency-domain signal.
  • the electronic device calculates the spectral entropy of the noise reduction frequency domain signal for each frame according to the normalized probability density of each sub-noise reduction frequency domain signal.
  • the formula for calculating the spectral entropy of the noise-reduced frequency-domain signal in frame i can be:
  • w represents the wth sub-noise reduction frequency domain signal
  • N b represents the total number of sub-noise reduction frequency domain signals
  • p (w, i) represents the normalized density of the wth sub-noise reduction frequency domain signal in the i frame
  • H (i) represents the spectral entropy of the noise-reduced frequency domain signal of the i-th frame.
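The two steps above can be sketched together. The formulas are not reproduced in this text, so the sketch assumes the usual definitions: p(w, i) = E(w, i) / E(i) for the normalized density and H(i) = -Σ p(w, i) · log p(w, i) for the spectral entropy. The natural logarithm and the function names are also assumptions.

```python
import math


def normalized_density(energies):
    """p(w, i) = E(w, i) / E(i) for every sub-band of one frame."""
    total = sum(energies)
    if total == 0:
        # Guard for an all-zero frame: no energy, so no distribution.
        return [0.0] * len(energies)
    return [e / total for e in energies]


def spectral_entropy(densities):
    """H(i) = -sum of p * log(p); terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in densities if p > 0)


p = normalized_density([1.0, 1.0, 2.0])
print(p, spectral_entropy(p))
```

A flat spectrum, typical of broadband noise, gives a high entropy, while a spectrum concentrated in a few sub-bands, typical of voiced speech, gives a low entropy.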
  • the electronic device calculates the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the energy spectrum of the noise reduction frequency domain signal of each frame and the spectral entropy of the noise reduction frequency domain signal of each frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
  • the formula for calculating the spectral entropy ratio of the noise-reduced frequency-domain signal in frame i can be:
  • EER TEO (i) represents the spectral entropy ratio of the noise reduction frequency domain signal of the i frame
  • E (i) represents the energy spectrum of the noise reduction frequency domain signal of the i frame
  • H (i) represents the spectral entropy of the noise reduction frequency domain signal of the i-th frame.
  • the electronic device can calculate the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the calculation formula of the spectral entropy ratio of the noise reduction frequency domain signal of the i-th frame above, so as to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
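The formula for the spectral entropy ratio is not reproduced in this text, so the exact expression used in this embodiment is unknown. Purely as an illustration of combining E(i) and H(i) into a single feature, the sketch below assumes a form sqrt(1 + |E(i) / H(i)|). That expression, the function name `energy_entropy_ratio`, and the small constant `eps` are assumptions, not the method of this application.

```python
import math


def energy_entropy_ratio(energy, entropy, eps=1e-12):
    """Combine frame energy and spectral entropy into one feature.

    Assumed form: sqrt(1 + |E / H|). eps avoids division by zero
    when the entropy of a frame is exactly 0.
    """
    return math.sqrt(1.0 + abs(energy / (entropy + eps)))


# High energy with low entropy (speech-like) gives a large ratio,
# low energy with high entropy (noise-like) gives a small ratio.
print(energy_entropy_ratio(8.0, 0.5))
print(energy_entropy_ratio(0.5, 2.0))
```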
  • the electronic device performs an inverse Fourier transform on the noise-reduced frequency domain signal of each frame to obtain a multi-frame noise-reduced time domain signal.
  • performing inverse Fourier transform on the frequency domain signal can convert the frequency domain signal into a time domain signal. Therefore, the electronic device performs inverse Fourier transform on the noise-reduced frequency domain signal of each frame to obtain multi-frame noise-reduced time domain signals.
  • the electronic device calculates the short-term energy of the noise reduction time-domain signal of each frame to obtain the short-term energy of all frames of the noise reduction speech signal.
  • the calculation formula of the short-term energy of the noise reduction time-domain signal of the i frame is:
  • Represents the short-term energy of the noise reduction time-domain signal of frame i Represents the noise reduction time-domain signal of frame i
  • the electronic device can calculate the short-term energy of the noise reduction time-domain signal of each frame according to the calculation formula of the short-term energy of the noise reduction time-domain signal of the i-th frame above, so as to obtain the short-term energy of all frames of the noise-reduced speech signal.
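The short-term energy formula is not reproduced in this text. The sketch below assumes the usual definition, the sum of the squared samples of one noise-reduced time-domain frame; the function name `short_term_energy` is hypothetical.

```python
def short_term_energy(frame):
    """Sum of squared samples of one time-domain frame (assumed form)."""
    return sum(x * x for x in frame)


def all_frame_energies(frames):
    """Short-term energy of every frame of the noise-reduced signal."""
    return [short_term_energy(frame) for frame in frames]


print(all_frame_energies([[1.0, -2.0, 3.0], [0.0, 0.1, 0.0]]))
```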
  • the electronic device determines the position of the voice starting point based on the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and / or the electronic device determines the position of the speech termination point based on the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
  • the process 211 may be: if the electronic device detects that no speech exists according to the spectral entropy ratio and the short-term energy of a first number of frames of the noise-reduced speech signal, and detects the presence of speech according to the spectral entropy ratio and the short-term energy of a second number of frames of the noise-reduced speech signal, the electronic device determines that the position of the first frame of the second number of frames is the position of the starting point of the voice.
  • the process 212 may be: if the electronic device detects the presence of speech according to the spectral entropy ratio and the short-term energy of a third number of frames of the noise-reduced speech signal, and detects that no speech exists according to the spectral entropy ratio and the short-term energy of a fourth number of frames of the noise-reduced speech signal, the electronic device determines that the position of the first frame of the fourth number of frames is the position of the voice termination point.
  • the electronic device can determine that the position of the first frame in the subsequent frames is the position of the starting point of speech.
  • the electronic device may determine that the position of the first frame in the subsequent frames is the position of the voice termination point.
  • the noise-reduced speech signal is divided into 20 frames, namely the first frame, the second frame, the third frame ... the 19th frame, the 20th frame.
  • if the electronic device detects that no speech exists according to the spectral entropy ratio and the short-term energy of the first frame to the fifth frame, and detects the presence of speech according to the spectral entropy ratio and the short-term energy of the sixth frame to the tenth frame, the electronic device determines that the position of frame 6 is the position of the starting point of the voice.
  • if the electronic device detects the presence of speech based on the spectral entropy ratio and the short-term energy of frames 11 to 15, and detects that there is no voice based on the spectral entropy ratio and the short-term energy of frames 16 to 20, the electronic device determines that the position of the 16th frame is the position of the voice termination point.
  • the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that when the value of the spectral entropy ratio is A, it can be detected that there is speech, and when the value of the spectral entropy ratio is B, it can be detected that there is no speech. It can be set that when the short-term energy value is C, it can detect the presence of voice, and when the short-term energy value is D, it can detect that there is no voice.
  • the above is only one example of determining the position of the voice start point and the position of the voice end point proposed in this embodiment. It can be understood that, within the protection scope of the embodiments of the present application, the position of the voice start point and the position of the voice end point may also be determined in other ways, and no specific limitation is made here.
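The 20-frame example can be sketched as a search over per-frame speech decisions. This is a minimal sketch, assuming that each frame has already been classified as speech or non-speech and that the first to fourth numbers of frames are all equal to one hypothetical parameter `run_length`; the function name `find_endpoints` is also an assumption.

```python
def find_endpoints(speech_flags, run_length=5):
    """Locate the start and end of speech from per-frame decisions.

    speech_flags : list of bool, one entry per frame
    Returns (start, end) as 1-based frame numbers, or None where
    no such point is found.
    """
    start = end = None
    n = len(speech_flags)
    for i in range(run_length, n - run_length + 1):
        before = speech_flags[i - run_length:i]
        after = speech_flags[i:i + run_length]
        if start is None and not any(before) and all(after):
            start = i + 1  # first frame of the speech run
        elif (start is not None and end is None
              and all(before) and not any(after)):
            end = i + 1    # first frame of the following silence
    return start, end


# Frames 1-5 silent, 6-15 speech, 16-20 silent.
flags = [False] * 5 + [True] * 10 + [False] * 5
print(find_endpoints(flags))
```

Requiring a run of several consistent frames on each side of the boundary makes the decision less sensitive to a single misclassified frame.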
  • FIG. 5 is a schematic structural diagram of a voice endpoint detection device 300 according to an embodiment of the present application.
  • the voice endpoint detection device 300 may include: an acquisition module 301, a noise reduction module 302, a calculation module 303, and a detection module 304.
  • the acquisition module 301 is used to obtain a noisy speech signal.
  • the noise reduction module 302 is configured to perform noise reduction processing on the noise-containing speech signal to obtain a noise-reduced speech signal.
  • the calculation module 303 is configured to calculate the spectral entropy ratio of the noise-reduced speech signal and calculate the short-term energy of the noise-reduced speech signal.
  • the detection module 304 is configured to perform speech endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  • the noise reduction module 302 may be used to: perform frame-by-frame windowing processing on the noisy speech signal to obtain a multi-frame windowed time-domain signal; perform a Fourier transform on the windowed time-domain signal of each frame to obtain a multi-frame frequency domain signal; estimate the Fourier coefficients of the frequency domain signal of each frame; and, according to the Fourier coefficients of the frequency domain signal of each frame, perform noise reduction processing on the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.
  • the calculation module 303 may be used to: calculate the energy spectrum of the noise reduction frequency domain signal of each frame; calculate the spectral entropy of the noise reduction frequency domain signal of each frame; and, according to the energy spectrum and the spectral entropy of the noise reduction frequency domain signal of each frame, calculate the spectral entropy ratio of the noise reduction frequency domain signal of each frame to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
  • the calculation module 303 may also be used to: obtain frequency band information of the noise reduction frequency domain signal of each frame; divide the noise reduction frequency domain signal of each frame according to the frequency band information to obtain a plurality of sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame; calculate an energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals; and calculate the energy spectrum of the noise reduction frequency domain signal of each frame according to the energy spectrum of each sub-noise reduction frequency domain signal.
  • the calculation module 303 may be further used to: calculate the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of the noise reduction frequency domain signal of each frame; and calculate the spectral entropy of the noise reduction frequency domain signal of each frame according to the normalized probability density of each sub-noise reduction frequency domain signal.
  • the calculation module 303 can also be used to: perform inverse Fourier transform on the noise reduction frequency domain signal of each frame to obtain a multi-frame noise reduction time domain signal; and calculate the short-term energy of the noise reduction time domain signal of each frame to obtain the short-term energy of all frames of the noise-reduced speech signal.
  • the detection module 304 may be used to: determine the position of the voice start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and / or determine the position of the voice termination point based on the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
  • the detection module 304 may be further configured to: if it is detected that no speech exists according to the spectral entropy ratio and the short-term energy of a first number of frames of the noise-reduced speech signal, and the presence of speech is detected according to the spectral entropy ratio and the short-term energy of a second number of frames of the noise-reduced speech signal, determine that the position of the first frame in the second number of frames is the position of the starting point of speech.
  • the detection module 304 may be further configured to: if the presence of speech is detected according to the spectral entropy ratio and the short-term energy of a third number of frames of the noise-reduced speech signal, and the absence of speech is detected according to the spectral entropy ratio and the short-term energy of a fourth number of frames of the noise-reduced speech signal, determine that the position of the first frame in the fourth number of frames is the position of the voice termination point.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed on a computer, the computer is caused to execute the process in the voice endpoint detection method provided in this embodiment.
  • An embodiment of the present application also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to execute the process in the voice endpoint detection method provided in this embodiment by calling the computer program stored in the memory.
  • the aforementioned electronic device may be a mobile terminal such as a tablet computer or a smart phone.
  • FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the mobile terminal 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art may understand that the structure of the mobile terminal shown in FIG. 6 does not constitute a limitation on the mobile terminal, and may include more or fewer components than those illustrated, or combine certain components, or arrange different components.
  • the microphone 401 can be used to pick up the voice uttered by the user and the like.
  • the memory 402 may be used to store application programs and data.
  • the application program stored in the memory 402 contains executable code.
  • the application program can form various functional modules.
  • the processor 403 executes application programs stored in the memory 402 to execute various functional applications and data processing.
  • the processor 403 is the control center of the mobile terminal. It uses various interfaces and lines to connect the various parts of the entire mobile terminal, and performs the various functions of the mobile terminal and processes data by running or executing application programs stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the mobile terminal as a whole.
  • the processor 403 in the mobile terminal loads the executable code corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application program stored in the memory 402, thereby implementing the process:
  • FIG. 7 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the mobile terminal 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, a speaker 506, and the like.
  • the microphone 501 can be used to pick up the voice uttered by the user and the like.
  • the memory 502 may be used to store application programs and data.
  • the application program stored in the memory 502 contains executable code.
  • the application program can form various functional modules.
  • the processor 503 executes application programs stored in the memory 502 to execute various functional applications and data processing.
  • the processor 503 is the control center of the mobile terminal. It uses various interfaces and lines to connect the various parts of the entire mobile terminal, and performs the various functions of the mobile terminal and processes data by running or executing application programs stored in the memory 502 and calling data stored in the memory 502, thereby monitoring the mobile terminal as a whole.
  • the input unit 504 may be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
  • the output unit 505 can be used to display information input by the user or provided to the user and various graphical user interfaces of the mobile terminal. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof.
  • the output unit may include a display panel.
  • the processor 503 in the mobile terminal loads the executable code corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 503 runs the application programs stored in the memory 502, thereby implementing the process:
  • when the processor 503 executes the process of performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal, it may perform: framing and windowing the noisy speech signal to obtain multi-frame windowed time-domain signals; performing a Fourier transform on each frame of the multi-frame windowed time-domain signals to obtain multi-frame frequency-domain signals; estimating the Fourier coefficients of the frequency-domain signal of each frame; and performing noise reduction processing on the frequency-domain signal of each frame according to its Fourier coefficients to obtain multi-frame noise-reduced frequency-domain signals.
  • when the processor 503 executes the process of calculating the spectral entropy ratio of the noise-reduced speech signal, it may perform: calculating the energy spectrum of the noise-reduced frequency-domain signal of each frame; calculating the spectral entropy of the noise-reduced frequency-domain signal of each frame; and calculating the spectral entropy ratio of the noise-reduced frequency-domain signal of each frame according to its energy spectrum and its spectral entropy, to obtain the spectral entropy ratios of all frames of the noise-reduced speech signal.
  • when the processor 503 executes the process of calculating the energy spectrum of the noise-reduced frequency-domain signal of each frame, it may perform: acquiring frequency band information of the noise-reduced frequency-domain signal of each frame; dividing the noise-reduced frequency-domain signal of each frame according to the frequency band information to obtain a plurality of sub noise-reduced frequency-domain signals corresponding to each frame; calculating the energy spectrum of each of the plurality of sub noise-reduced frequency-domain signals; and calculating the energy spectrum of the noise-reduced frequency-domain signal of each frame according to the energy spectrum of each sub noise-reduced frequency-domain signal.
  • when the processor 503 executes the process of calculating the spectral entropy of the noise-reduced frequency-domain signal of each frame, it may perform: calculating the normalized probability density of each sub noise-reduced frequency-domain signal according to the energy spectrum of each sub noise-reduced frequency-domain signal and the energy spectrum of the noise-reduced frequency-domain signal of each frame; and calculating the spectral entropy of the noise-reduced frequency-domain signal of each frame according to the normalized probability density of each sub noise-reduced frequency-domain signal.
  • when the processor 503 executes the process of calculating the short-term energy of the noise-reduced speech signal, it may perform: performing an inverse Fourier transform on the noise-reduced frequency-domain signal of each frame to obtain multi-frame noise-reduced time-domain signals; and calculating the short-term energy of the noise-reduced time-domain signal of each frame to obtain the short-term energy of all frames of the noise-reduced speech signal.
  • when the processor 503 executes the process of performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal, it may perform: determining the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or determining the position of the speech end point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
  • when the processor 503 executes the process of determining the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal, it may perform: if no speech is detected according to the spectral entropy ratios and the short-term energy of a first number of frames of the noise-reduced speech signal, and speech is detected according to the spectral entropy ratios and the short-term energy of a second number of frames of the noise-reduced speech signal, determining the position of the first frame of the second number of frames as the position of the speech start point.
  • when the processor 503 executes the process of determining the position of the speech end point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal, it may perform: if speech is detected according to the spectral entropy ratios and the short-term energy of a third number of frames of the noise-reduced speech signal, and no speech is detected according to the spectral entropy ratios and the short-term energy of a fourth number of frames of the noise-reduced speech signal, determining the position of the first frame of the fourth number of frames as the position of the speech end point.
  • the voice endpoint detection device provided by the embodiments of the present application and the voice endpoint detection method in the above embodiments belong to the same concept, and any method provided in the voice endpoint detection method embodiments can be run on the voice endpoint detection device.
  • for the voice endpoint detection method described in the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the method can be carried out under the control of a computer program.
  • the computer program may be stored in a computer-readable storage medium, such as a memory, and executed by at least one processor; its execution may include the processes of the embodiments of the voice endpoint detection method.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), and so on.
  • each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or software function modules. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.

Abstract

A voice activity detection method and apparatus, a storage medium and an electronic device. The method comprises: obtaining a noisy speech signal (101); performing noise reduction on the noisy speech signal to obtain a noise-reduced speech signal (102); calculating a spectral entropy ratio of the noise-reduced speech signal, and calculating short-time energy of the noise-reduced speech signal (103); and performing voice activity detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-time energy of the noise-reduced speech signal (104).

Description

Voice endpoint detection method, device, storage medium and electronic equipment
Technical field
The present application belongs to the technical field of terminals, and particularly relates to a voice endpoint detection method, device, storage medium, and electronic equipment.
Background
With the rapid development of terminal technology, speech processing technologies such as voiceprint wake-up and speech recognition have become increasingly mature. As an important part of speech preprocessing, voice endpoint detection greatly affects the performance of speech processing. In the related art, voice endpoint detection assumes that the speech signal is short-term stationary, which leads to low detection accuracy in different non-stationary noise environments.
Summary of the invention
Embodiments of the present application provide a voice endpoint detection method, device, storage medium, and electronic equipment, which can improve the accuracy of voice endpoint detection.
In a first aspect, an embodiment of the present application provides a voice endpoint detection method, including:
obtaining a noisy speech signal;
performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal;
performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
In a second aspect, an embodiment of the present application provides a voice endpoint detection device, including:
an acquisition module, configured to obtain a noisy speech signal;
a noise reduction module, configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
a calculation module, configured to calculate the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal;
a detection module, configured to perform voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice endpoint detection method provided in this embodiment.
In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor, by calling the computer program stored in the memory, is configured to execute:
obtaining a noisy speech signal;
performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal;
performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
In the embodiments of the present application, because the noise in a noisy speech signal affects the accuracy of voice endpoint detection, noise reduction processing is first performed on the noisy speech signal to obtain a noise-reduced speech signal. Voice endpoint detection is then performed on the noise-reduced speech signal using its spectral entropy ratio and short-term energy, two features that improve detection accuracy. This effectively improves the accuracy of voice endpoint detection.
Brief description of the drawings
The technical solutions and beneficial effects of the present application will become apparent from the following detailed description of specific embodiments of the present application in conjunction with the accompanying drawings.
FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
FIG. 2 is a second schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
FIG. 3 is a third schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
FIG. 4 is a fourth schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
FIG. 5 is a schematic structural diagram of a voice endpoint detection device provided by an embodiment of the present application.
FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description
Please refer to the drawings, in which the same component symbols represent the same components. The principle of the present application is illustrated by implementation in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not detailed herein.
Please refer to FIG. 1, which is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
Voice endpoint detection refers to detecting the presence or absence of speech in a noisy environment. It is commonly used in speech processing systems such as speech coding and speech enhancement, where it reduces the speech coding rate, saves communication bandwidth, reduces the energy consumption of mobile devices, and improves the recognition rate.
The process of the voice endpoint detection method may include:
In 101, a noisy speech signal is obtained.
The noisy speech signal may refer to a speech signal in different non-stationary noise environments. It can be expressed as y(n) = s(n) + u(n), where s(n) is the speech signal and u(n) is the noise signal.
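The additive model y(n) = s(n) + u(n) can be illustrated with a small synthetic example. The following is only a sketch: the function name make_noisy_signal, the sine tone standing in for speech, the sampling rate, and the Gaussian noise (which is stationary, unlike the non-stationary noise the application targets) are all assumptions made for illustration.

```python
import math
import random


def make_noisy_signal(num_samples, sample_rate=8000, tone_hz=440.0,
                      noise_std=0.1, seed=0):
    """Return (y, s, u) such that y[n] = s[n] + u[n] for every sample n."""
    rng = random.Random(seed)
    # s(n): a sine tone used as a stand-in for the clean speech signal.
    s = [math.sin(2.0 * math.pi * tone_hz * n / sample_rate)
         for n in range(num_samples)]
    # u(n): zero-mean Gaussian noise used as a stand-in for the noise signal.
    u = [rng.gauss(0.0, noise_std) for _ in range(num_samples)]
    # y(n): the noisy speech signal observed by the device.
    y = [sn + un for sn, un in zip(s, u)]
    return y, s, u


y, s, u = make_noisy_signal(160)
```

In a real device the components s(n) and u(n) are not observed separately; only y(n) is available, which is why the later steps estimate and suppress the noise part.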
In 102, noise reduction processing is performed on the noisy speech signal to obtain a noise-reduced speech signal.
In this embodiment, since the noise signal in the noisy speech signal affects the accuracy of voice endpoint detection, the electronic device may perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.
In general, the frequency of the noise signal is lower than that of the speech signal, and the lower the frequency of the noise signal, the smaller its impact on the accuracy of voice endpoint detection. The electronic device can therefore raise the frequency of the speech signal in the noisy speech signal and lower the frequency of the noise signal in the noisy speech signal, so as to reduce the influence of the noise signal on the accuracy of voice endpoint detection.
It should be noted that noise reduction of the noisy speech signal is not limited to the above method; other methods may be used as long as they reduce the noise in the noisy speech signal.
In 103, the spectral entropy ratio of the noise-reduced speech signal is calculated, and the short-term energy of the noise-reduced speech signal is calculated.
In this embodiment, since the spectral entropy ratio and the short-term energy of the noise-reduced speech signal are two features that can improve the accuracy of voice endpoint detection, the electronic device may calculate both, so that voice endpoint detection can be performed according to these two features.
In 104, voice endpoint detection is performed according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
For example, the electronic device may use the values of the spectral entropy ratio and the short-term energy of the noise-reduced speech signal as threshold parameters for voice endpoint detection, that is, to detect the presence or absence of speech and thereby identify the valid speech segments.
In this embodiment, the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that speech is detected as present when the spectral entropy ratio takes the value A and as absent when it takes the value B. Likewise, it can be set that speech is detected as present when the short-term energy takes the value C and as absent when it takes the value D.
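A threshold-based decision of this kind might look like the following sketch. The names is_speech_frame and detect_speech_flags, the threshold values, and the choice to require both features to exceed their thresholds are assumptions for illustration; the application leaves the concrete values and the way the two features are combined to the implementer.

```python
def is_speech_frame(entropy_ratio, short_term_energy,
                    ratio_threshold, energy_threshold):
    """Flag a frame as speech when both features reach their thresholds."""
    return (entropy_ratio >= ratio_threshold
            and short_term_energy >= energy_threshold)


def detect_speech_flags(entropy_ratios, energies,
                        ratio_threshold, energy_threshold):
    """Return one True/False speech flag per frame."""
    return [is_speech_frame(r, e, ratio_threshold, energy_threshold)
            for r, e in zip(entropy_ratios, energies)]


# Hypothetical per-frame feature values for four frames.
flags = detect_speech_flags(
    entropy_ratios=[1.0, 3.2, 3.5, 1.1],
    energies=[0.01, 0.8, 0.9, 0.02],
    ratio_threshold=2.0,
    energy_threshold=0.1,
)
```

With these example values, the first transition from False to True in flags would mark a candidate speech start point, and the first transition back to False a candidate speech end point.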
It can be understood that, in this embodiment, because the noise in a noisy speech signal affects the accuracy of voice endpoint detection, noise reduction processing is first performed on the noisy speech signal to obtain a noise-reduced speech signal. Voice endpoint detection is then performed on the noise-reduced speech signal using its spectral entropy ratio and short-term energy, two features that improve detection accuracy. This effectively improves the accuracy of voice endpoint detection.
Please refer to FIG. 2, which is a second schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application. The process of the voice endpoint detection method may include:
In 201, the electronic device obtains a noisy speech signal.
For example, the noisy speech signal may refer to a speech signal in different non-stationary noise environments. It can be expressed as y(n) = s(n) + u(n), where s(n) is the speech signal and u(n) is the noise signal. The noisy speech signal is a time-domain signal.
In 202, the electronic device performs framing and windowing on the noisy speech signal to obtain multi-frame windowed time-domain signals.
For example, the electronic device frames and windows the noisy speech signal y(n) to obtain multi-frame windowed time-domain signals. The electronic device can typically use a frame length of 20 ms and a frame shift of 10 ms to frame the noisy speech signal. When windowing the noisy speech signal, the window function may preferably, but not necessarily, be a rectangular window, that is, w(n) = 1.
The windowed time-domain signal of each frame may include a noise signal part and a speech signal part, both of which are time-domain signals.
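The framing and windowing step can be sketched as follows, using the 20 ms frame length, 10 ms frame shift, and rectangular window mentioned above. The function name frame_signal, the 8000 Hz sampling rate, and the decision to drop a trailing partial frame are assumptions for illustration.

```python
def frame_signal(signal, sample_rate, frame_ms=20, shift_ms=10):
    """Split a signal into overlapping frames and apply a rectangular window."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    window = [1.0] * frame_len                   # rectangular window, w(n) = 1
    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is discarded.
    while start + frame_len <= len(signal):
        frames.append([signal[start + n] * window[n]
                       for n in range(frame_len)])
        start += shift
    return frames


# 800 samples at 8000 Hz is 100 ms of audio.
frames = frame_signal([float(n) for n in range(800)], sample_rate=8000)
```

At 8000 Hz each frame holds 160 samples and consecutive frames overlap by 80 samples, so neighbouring frames share half of their content.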
In 203, the electronic device performs a Fourier transform on each frame of the multi-frame windowed time-domain signals to obtain multi-frame frequency-domain signals.
It can be understood that performing a Fourier transform on a time-domain signal converts it into a frequency-domain signal. Therefore, by performing a Fourier transform on the windowed time-domain signal of each frame, the electronic device obtains multi-frame frequency-domain signals. The frequency-domain signal of each frame may include a noise signal part and a speech signal part, both of which are frequency-domain signals.
In this embodiment, the frequency-domain signal of the i-th frame can be expressed as Y(f, i) = S(f, i) + U(f, i), where f is the frequency component and i is the frame index.
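The per-frame transform can be sketched with a direct discrete Fourier transform. This is a minimal illustration, assuming short frames; the function names dft and frames_to_spectra are hypothetical, and a practical implementation would normally use a fast Fourier transform library instead of this O(N^2) loop.

```python
import cmath
import math


def dft(frame):
    """Direct discrete Fourier transform of one windowed time-domain frame."""
    n_points = len(frame)
    return [
        sum(frame[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
            for n in range(n_points))
        for k in range(n_points)
    ]


def frames_to_spectra(frames):
    """Transform every frame, giving Y(f, i) for each frame index i."""
    return [dft(frame) for frame in frames]


# A cosine with exactly 2 cycles in a 16-sample frame peaks at bin 2.
frame = [math.cos(2.0 * math.pi * 2 * n / 16) for n in range(16)]
spectrum = dft(frame)
```

Because the transform is linear, transforming y(n) = s(n) + u(n) gives Y(f, i) = S(f, i) + U(f, i), which is the decomposition stated above.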
In 204, the electronic device estimates the Fourier coefficients of the frequency-domain signal of each frame.
For example, the Fourier coefficient of the frequency-domain signal of the i-th frame can be estimated using the following formula:
Figure PCTCN2018115601-appb-000001
where
Figure PCTCN2018115601-appb-000002
is the Fourier coefficient of the frequency-domain signal of the i-th frame, ζ(f, i) is the estimated a priori signal-to-noise ratio, γ(f, i) is the estimated a posteriori signal-to-noise ratio, p(f, i) is the probability that speech is present, and q(f, i) is the probability that speech is absent.
When speech is present, the gain is calculated as
Figure PCTCN2018115601-appb-000003
with p(f, i) = 1 and q(f, i) = 0.
When speech is absent,
Figure PCTCN2018115601-appb-000004
with p(f, i) = 0 and q(f, i) = 1, where G0 is a constant.
In 205, the electronic device performs noise reduction processing on the frequency-domain signal of each frame according to its Fourier coefficients to obtain multi-frame noise-reduced frequency-domain signals.
In this embodiment, to reduce the influence of the noise signal on voice endpoint detection, the electronic device can use a suitable G0 to lower the Fourier coefficients of the noise signal part in the frequency-domain signal of each frame. Correspondingly, the electronic device can also raise the Fourier coefficients of the speech signal part in the frequency-domain signal of each frame.
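The effect of a gain floor G0 can be illustrated with the sketch below. It does not reproduce the estimator of the application, whose formula is not available in this text. As a stand-in, it assumes a Wiener-style gain ζ/(1 + ζ) for speech and mixes it with G0 according to the speech presence probability; the function name denoise_bin and the value G0 = 0.1 are likewise assumptions.

```python
def denoise_bin(y_coeff, prior_snr, speech_prob, g0=0.1):
    """Scale one Fourier coefficient Y(f, i) by a presence-weighted gain."""
    # Stand-in speech gain (assumption): Wiener-style, from the a priori SNR.
    speech_gain = prior_snr / (1.0 + prior_snr)
    # q(f, i) = 1 - p(f, i): probability that speech is absent.
    absence_prob = 1.0 - speech_prob
    # With p = 1 the speech gain applies; with p = 0 only the floor G0 applies.
    gain = speech_prob * speech_gain + absence_prob * g0
    return gain * y_coeff


speech_bin = denoise_bin(2.0 + 0.0j, prior_snr=9.0, speech_prob=1.0)
noise_bin = denoise_bin(2.0 + 0.0j, prior_snr=9.0, speech_prob=0.0)
```

In this sketch a bin judged to contain speech keeps most of its magnitude, while a bin judged to contain only noise is attenuated to G0 times its original value, which matches the two limiting cases p = 1, q = 0 and p = 0, q = 1 described above.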
In 206, the electronic device calculates the energy spectrum of the noise-reduced frequency-domain signal of each frame.
As shown in FIG. 3, in some embodiments, process 206 may be implemented through processes 2061, 2062, 2063, and 2064, as follows:
In 2061, the electronic device obtains the frequency band information of the noise-reduced frequency-domain signal of each frame.
For example, the electronic device may obtain the frequency band range of the noise-reduced frequency-domain signal of the i-th frame.
In 2062, the electronic device divides the noise-reduced frequency-domain signal of each frame according to the frequency band information to obtain multiple sub noise-reduced frequency-domain signals corresponding to the noise-reduced frequency-domain signal of each frame.
For example, suppose the frequency band of the noise-reduced frequency-domain signal of the i-th frame ranges from 500 Hz to 1400 Hz. The electronic device may divide this signal equally into multiple sub noise-reduced frequency-domain signals according to the band range. If the signal is divided into 3 sub noise-reduced frequency-domain signals, the first covers 500 Hz to 800 Hz, the second covers 800 Hz to 1100 Hz, and the third covers 1100 Hz to 1400 Hz.
It should be noted that how the noise-reduced frequency-domain signal of each frame is divided, and into how many sub noise-reduced frequency-domain signals, can be determined according to actual needs and is not specifically limited here.
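The equal division in the example above can be sketched as follows. The function name split_band is hypothetical, and equal-width sub-bands are only one of the divisions the text allows.

```python
def split_band(low_hz, high_hz, num_subbands):
    """Divide [low_hz, high_hz] into equal-width (low, high) sub-band ranges."""
    width = (high_hz - low_hz) / num_subbands
    return [(low_hz + k * width, low_hz + (k + 1) * width)
            for k in range(num_subbands)]


# The 500 Hz to 1400 Hz band from the example, split into 3 sub-bands.
subbands = split_band(500.0, 1400.0, 3)
```

Each range then selects the frequency components of the frame that belong to one sub noise-reduced frequency-domain signal.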
In 2063, the electronic device calculates the energy spectrum of each of the multiple sub noise-reduced frequency-domain signals.
For example, the energy spectrum of the w-th sub noise-reduced frequency-domain signal of the i-th frame may be calculated as:
Figure PCTCN2018115601-appb-000005
where E(w, i) is the energy spectrum of the w-th sub noise-reduced frequency-domain signal of the i-th frame, N_b is the total number of sub noise-reduced frequency-domain signals, and N can be set according to actual needs, usually to a power of 2, for example 256, 512, or 1024.
In 2064, the electronic device calculates the energy spectrum of the noise-reduced frequency-domain signal of each frame according to the energy spectrum of each sub noise-reduced frequency-domain signal.
The energy spectrum of the noise-reduced frequency-domain signal of the i-th frame is the sum of the energy spectra of all sub noise-reduced frequency-domain signals into which that frame is divided.
That is, the energy spectrum of the noise-reduced frequency-domain signal of the i-th frame may be calculated as:
Figure PCTCN2018115601-appb-000006
where w denotes the w-th sub noise-reduced frequency-domain signal, N_b is the total number of sub noise-reduced frequency-domain signals, E(i) is the energy spectrum of the noise-reduced frequency-domain signal of the i-th frame, and E(w, i) is the energy spectrum of the w-th sub noise-reduced frequency-domain signal of the i-th frame.
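The two energy computations can be sketched together. The frame total follows the description above: it is the sum of the sub-band energy spectra. The sub-band energy itself is an assumption here, taken as the sum of squared magnitudes of the Fourier coefficients in the sub-band, because the exact formula is not available in this text. The function names and the grouping of bins into equal-sized sub-bands are also hypothetical.

```python
def subband_energies(spectrum, num_subbands):
    """Return E(w, i) for one frame: summed |Y|^2 over each group of bins."""
    bins_per_band = len(spectrum) // num_subbands
    return [
        sum(abs(c) ** 2
            for c in spectrum[w * bins_per_band:(w + 1) * bins_per_band])
        for w in range(num_subbands)
    ]


def frame_energy(energies):
    """Return E(i): the sum of E(w, i) over all sub-bands of the frame."""
    return sum(energies)


# Six hypothetical noise-reduced Fourier coefficients, grouped into 3 sub-bands.
spectrum = [1 + 0j, 0 + 2j, 3 + 0j, 0 + 0j, 0 + 1j, 2 + 0j]
energies = subband_energies(spectrum, 3)
total = frame_energy(energies)
```

Keeping the per-sub-band values alongside the total is useful because both are needed later to normalize the sub-band energies.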
In 207, the electronic device calculates the spectral entropy of the noise-reduced frequency-domain signal of each frame.
As shown in FIG. 4, in some embodiments, process 207 may be implemented through processes 2071 and 2072, as follows:
In 2071, the electronic device calculates the normalized probability density of each sub noise-reduced frequency-domain signal according to the energy spectrum of each sub noise-reduced frequency-domain signal and the energy spectrum of the noise-reduced frequency-domain signal of each frame.
For example, the normalized probability density of the w-th sub noise-reduced frequency-domain signal of the i-th frame may be calculated as:
Figure PCTCN2018115601-appb-000007
where w denotes the w-th sub noise-reduced frequency-domain signal, N_b is the total number of sub noise-reduced frequency-domain signals, p(w, i) is the normalized probability density of the w-th sub noise-reduced frequency-domain signal of the i-th frame, E(w, i) is the energy spectrum of the w-th sub noise-reduced frequency-domain signal of the i-th frame, and E(i) is the energy spectrum of the noise-reduced frequency-domain signal of the i-th frame.
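One plausible reading of this normalization, consistent with the quantities listed above, is p(w, i) = E(w, i) / E(i). The sketch below assumes that form; the function name normalized_density and the error raised for a frame with no energy are illustrative choices.

```python
def normalized_density(energies):
    """Return p(w, i) = E(w, i) / E(i) for every sub-band of one frame."""
    total = sum(energies)  # E(i), the frame energy spectrum
    if total <= 0.0:
        raise ValueError("frame has no energy, density is undefined")
    return [e / total for e in energies]


# Hypothetical sub-band energies E(w, i) for a frame with 3 sub-bands.
p = normalized_density([1.0, 1.0, 2.0])
```

The resulting values are non-negative and sum to 1, so they can be treated as a discrete probability distribution over the sub-bands.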
In 2072, the electronic device calculates the spectral entropy of each frame of the noise-reduced frequency-domain signal according to the normalized probability density of each noise-reduced frequency-domain sub-signal.
For example, the spectral entropy of the noise-reduced frequency-domain signal of the i-th frame may be calculated as:
Figure PCTCN2018115601-appb-000008
where w denotes the w-th noise-reduced frequency-domain sub-signal, N_b denotes the total number of noise-reduced frequency-domain sub-signals, p(w, i) denotes the normalized probability density of the w-th noise-reduced frequency-domain sub-signal of the i-th frame, and H(i) denotes the spectral entropy of the noise-reduced frequency-domain signal of the i-th frame.
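Processes 2071 and 2072 can be sketched as follows. This is a minimal illustration, not the formula from the figure: it assumes the Shannon form H(i) = -sum_w p(w, i) ln p(w, i) with the natural logarithm, and it assumes that E(i) is the sum of the sub-signal energy spectra. The function names and the uniform fallback for a silent frame are hypothetical.

```python
import math


def normalized_density(sub_energies):
    """Return p(w, i) = E(w, i) / E(i) for one frame.

    Assumes E(i) is the sum of the sub-signal energies E(w, i).
    """
    total = sum(sub_energies)
    if total <= 0.0:
        # Silent frame: fall back to a uniform density (an assumption of this sketch).
        return [1.0 / len(sub_energies)] * len(sub_energies)
    return [e / total for e in sub_energies]


def spectral_entropy(sub_energies):
    """Return H(i) = -sum_w p(w, i) * ln p(w, i); terms with p = 0 contribute 0."""
    p = normalized_density(sub_energies)
    return -sum(q * math.log(q) for q in p if q > 0.0)
```

A frame whose energy is spread evenly over the sub-signals gives the largest entropy (ln N_b under these assumptions), while a frame whose energy sits in a single sub-signal gives zero.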
In 208, the electronic device calculates the spectral entropy ratio of each frame of the noise-reduced frequency-domain signal according to the energy spectrum and the spectral entropy of that frame, obtaining the spectral entropy ratios of all frames of the noise-reduced speech signal.
For example, the spectral entropy ratio of the noise-reduced frequency-domain signal of the i-th frame may be calculated as:
Figure PCTCN2018115601-appb-000009
where EER_TEO(i) denotes the spectral entropy ratio of the noise-reduced frequency-domain signal of the i-th frame, E(i) denotes the energy spectrum of the noise-reduced frequency-domain signal of the i-th frame, and H(i) denotes the spectral entropy of the noise-reduced frequency-domain signal of the i-th frame.
It can be understood that the electronic device can apply the above formula for the i-th frame to every frame of the noise-reduced frequency-domain signal, thereby obtaining the spectral entropy ratios of all frames of the noise-reduced speech signal.
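The formula image for the ratio is not reproduced in this text, so its exact expression is unknown here. The sketch below assumes the form sqrt(1 + |E(i) / H(i)|), which is often used for energy-to-entropy ratios in endpoint detection; both the form and the `eps` guard are assumptions, and the function name is hypothetical.

```python
import math


def energy_entropy_ratio(frame_energy, entropy, eps=1e-12):
    """Assumed form: sqrt(1 + |E(i) / H(i)|).

    eps keeps the division defined when the entropy is zero.
    """
    return math.sqrt(1.0 + abs(frame_energy / max(entropy, eps)))
```

Under this assumed form the ratio grows when a frame has high energy and low entropy, which is the combination typical of voiced speech.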
In 209, the electronic device performs an inverse Fourier transform on each frame of the noise-reduced frequency-domain signal to obtain multiple frames of noise-reduced time-domain signals.
It can be understood that performing an inverse Fourier transform on a frequency-domain signal converts it into a time-domain signal. Therefore, by performing an inverse Fourier transform on each frame of the noise-reduced frequency-domain signal, the electronic device obtains multiple frames of noise-reduced time-domain signals.
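The conversion can be illustrated with a direct inverse discrete Fourier transform. This is a minimal sketch using only the standard library; it runs in O(N^2) time, and a practical implementation would normally use an FFT routine instead. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Forward DFT: X[k] = sum_t x[t] * exp(-2j*pi*k*t/N)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT: x[t] = (1/N) * sum_k X[k] * exp(2j*pi*k*t/N)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]
```

Applying `idft` to the output of `dft` recovers the original frame up to floating-point rounding, with a negligible imaginary part for real input.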
In 210, the electronic device calculates the short-term energy of each frame of the noise-reduced time-domain signal, obtaining the short-term energy of all frames of the noise-reduced speech signal.
For example, the short-term energy of the noise-reduced time-domain signal of the i-th frame is calculated as:
Figure PCTCN2018115601-appb-000010
where
Figure PCTCN2018115601-appb-000011
denotes the short-term energy of the noise-reduced time-domain signal of the i-th frame,
Figure PCTCN2018115601-appb-000012
denotes the noise-reduced time-domain signal of the i-th frame,
Figure PCTCN2018115601-appb-000013
denotes the noise-reduced time-domain signal of the (i+1)-th frame, and
Figure PCTCN2018115601-appb-000014
denotes the noise-reduced time-domain signal of the (i-1)-th frame.
It can be understood that the electronic device can apply the above formula for the i-th frame to every frame of the noise-reduced time-domain signal, thereby obtaining the short-term energy of all frames of the noise-reduced speech signal.
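The short-term energy formula is an image and is not reproduced in this text. Its symbol definitions involve the terms at positions i, i+1 and i-1, which matches the shape of the Teager energy operator, psi[n] = x[n]^2 - x[n-1] * x[n+1]. The sketch below assumes that form and applies it to a plain sequence of values; whether the patent's formula is exactly this operator is an assumption, and the function names are hypothetical.

```python
def teager_energy(x):
    """Teager energy operator on interior positions: x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def short_term_energy(x):
    """Aggregate the operator output into one value (summing magnitudes is an assumption)."""
    return sum(abs(v) for v in teager_energy(x))
```

For the sequence 1, 2, 3, 4 the operator yields 1 at both interior positions, so the aggregated value is 2.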
In 211, the electronic device determines the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or the electronic device determines the position of the speech end point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
In some embodiments, process 211 may be: if the electronic device detects that no speech is present according to the spectral entropy ratios and the short-term energy of a first number of frames of the noise-reduced speech signal, and detects that speech is present according to the spectral entropy ratios and the short-term energy of a second number of frames of the noise-reduced speech signal, the electronic device determines that the position of the first frame of the second number of frames is the position of the speech start point.
In some embodiments, process 211 may be: if the electronic device detects that speech is present according to the spectral entropy ratios and the short-term energy of a third number of frames of the noise-reduced speech signal, and detects that no speech is present according to the spectral entropy ratios and the short-term energy of a fourth number of frames of the noise-reduced speech signal, the electronic device determines that the position of the first frame of the fourth number of frames is the position of the speech end point.
For example, if the electronic device detects that no speech is present according to the spectral entropy ratios and short-term energy of several consecutive frames, and detects that speech is present according to the spectral entropy ratios and short-term energy of the several frames that follow, the electronic device may determine that the position of the first of those following frames is the position of the speech start point.
Similarly, if the electronic device detects that speech is present according to the spectral entropy ratios and short-term energy of several consecutive frames, and detects that no speech is present according to the spectral entropy ratios and short-term energy of the several frames that follow, the electronic device may determine that the position of the first of those following frames is the position of the speech end point.
For example, suppose the noise-reduced speech signal is divided into 20 frames: frame 1, frame 2, frame 3, ..., frame 19, frame 20.
If the electronic device detects that no speech is present according to the spectral entropy ratios and short-term energy of frames 1 to 5, and detects that speech is present according to the spectral entropy ratios and short-term energy of frames 6 to 10, the electronic device determines that the position of frame 6 is the position of the speech start point.
If the electronic device detects that speech is present according to the spectral entropy ratios and short-term energy of frames 11 to 15, and detects that no speech is present according to the spectral entropy ratios and short-term energy of frames 16 to 20, the electronic device determines that the position of frame 16 is the position of the speech end point.
In this embodiment, the decision values of the spectral entropy ratio and the short-term energy may be set according to actual conditions. For example, it may be set that speech is detected as present when the spectral entropy ratio takes the value A and as absent when it takes the value B. Likewise, it may be set that speech is detected as present when the short-term energy takes the value C and as absent when it takes the value D.
It should be noted that the above is only one example of determining the positions of the speech start point and the speech end point proposed in this embodiment. It can be understood that, within the protection scope of the embodiments of the present application, the positions of the speech start point and the speech end point may also be determined in other ways, which are not specifically limited here.
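The 20-frame example can be sketched as a block-wise decision over per-frame features. The sketch rests on several assumptions that the text leaves open: frames are grouped into non-overlapping blocks of five, a block counts as speech only when every frame meets both thresholds, and the thresholds `ratio_thr` and `energy_thr` are hypothetical stand-ins for the configurable values A to D.

```python
def block_has_speech(ratios, energies, ratio_thr, energy_thr):
    """Assumed rule: speech is present when every frame meets both thresholds."""
    return all(r >= ratio_thr and e >= energy_thr for r, e in zip(ratios, energies))


def find_endpoints(ratios, energies, ratio_thr, energy_thr, block=5):
    """Return 1-based (start_frame, end_frame); either is None if not found."""
    flags = [
        block_has_speech(ratios[b:b + block], energies[b:b + block], ratio_thr, energy_thr)
        for b in range(0, len(ratios) - block + 1, block)
    ]
    start = end = None
    for k in range(1, len(flags)):
        first_frame = k * block + 1  # 1-based index of the first frame of block k
        if start is None and not flags[k - 1] and flags[k]:
            start = first_frame  # no speech followed by speech
        if end is None and flags[k - 1] and not flags[k]:
            end = first_frame  # speech followed by no speech
    return start, end
```

With low feature values in frames 1 to 5 and 16 to 20 and high values in frames 6 to 15, this returns frame 6 as the start point and frame 16 as the end point, matching the example.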
Please refer to FIG. 5, which is a schematic structural diagram of a voice endpoint detection apparatus 300 provided by an embodiment of the present application. The voice endpoint detection apparatus 300 may include an acquisition module 301, a noise reduction module 302, a calculation module 303, and a detection module 304.
The acquisition module 301 is configured to acquire a noisy speech signal.
The noise reduction module 302 is configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.
The calculation module 303 is configured to calculate the spectral entropy ratio of the noise-reduced speech signal and to calculate the short-term energy of the noise-reduced speech signal.
The detection module 304 is configured to perform voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
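The division into four modules can be sketched as a small pipeline object. The class name, field names and call signatures below are hypothetical and chosen only for illustration; the text specifies the responsibilities of the modules, not their interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class EndpointDetector:
    # Each field stands in for one module (301 to 304); the signatures are assumptions.
    acquire: Callable[[], Sequence[float]]
    denoise: Callable[[Sequence[float]], List[List[float]]]
    compute: Callable[[List[List[float]]], Tuple[List[float], List[float]]]
    detect: Callable[[List[float], List[float]], Tuple[Optional[int], Optional[int]]]

    def run(self):
        noisy = self.acquire()                    # acquisition module
        frames = self.denoise(noisy)              # noise reduction module
        ratios, energies = self.compute(frames)   # calculation module
        return self.detect(ratios, energies)      # detection module
```

Keeping the modules as separate callables mirrors the statement that they may be integrated in one chip or exist as separate units.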
In some implementations, the noise reduction module 302 may be configured to: perform framing and windowing on the noisy speech signal to obtain multiple frames of windowed time-domain signals; perform a Fourier transform on each frame of the windowed time-domain signals to obtain multiple frames of frequency-domain signals; estimate the Fourier coefficients of each frame of the frequency-domain signals; and perform noise reduction processing on each frame of the frequency-domain signals according to its Fourier coefficients to obtain multiple frames of noise-reduced frequency-domain signals.
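The framing and windowing step can be sketched as follows. The choice of a Hamming window, the frame length and the hop size are assumptions of this sketch, since this passage does not fix them; the function names are hypothetical.

```python
import math


def hamming(n):
    """Hamming window of length n: 0.54 - 0.46 * cos(2*pi*k / (n - 1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def frame_and_window(signal, frame_len, hop):
    """Split the signal into overlapping frames and apply the window to each frame."""
    win = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + k] * win[k] for k in range(frame_len)])
    return frames
```

Trailing samples that do not fill a complete frame are dropped in this sketch; padding them with zeros is an equally common choice.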
In some implementations, the calculation module 303 may be configured to: calculate the energy spectrum of each frame of the noise-reduced frequency-domain signal; calculate the spectral entropy of each frame of the noise-reduced frequency-domain signal; and calculate the spectral entropy ratio of each frame of the noise-reduced frequency-domain signal according to the energy spectrum and the spectral entropy of that frame, obtaining the spectral entropy ratios of all frames of the noise-reduced speech signal.
In some implementations, the calculation module 303 may be further configured to: acquire frequency band information of each frame of the noise-reduced frequency-domain signal; divide each frame of the noise-reduced frequency-domain signal according to the frequency band information to obtain multiple noise-reduced frequency-domain sub-signals corresponding to that frame; calculate the energy spectrum of each of the multiple noise-reduced frequency-domain sub-signals; and calculate the energy spectrum of each frame of the noise-reduced frequency-domain signal according to the energy spectrum of each noise-reduced frequency-domain sub-signal.
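The sub-signal division can be sketched as follows. The representation of the frequency band information as a list of half-open bin ranges (`band_edges`), the use of the squared magnitude as the energy spectrum, and the summation over sub-signals are assumptions made for illustration; the function names are hypothetical.

```python
def subband_energies(spectrum, band_edges):
    """E(w, i) for one frame: sum of |X[k]|^2 over the bins [lo, hi) of each band."""
    return [sum(abs(spectrum[k]) ** 2 for k in range(lo, hi)) for lo, hi in band_edges]


def frame_energy(sub_energies):
    """E(i) for one frame, assumed here to be the sum of the sub-signal energies."""
    return sum(sub_energies)
```

For a four-bin spectrum split into two bands of two bins each, the function returns two sub-signal energies whose sum is the frame energy.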
In some implementations, the calculation module 303 may be further configured to: calculate the normalized probability density of each noise-reduced frequency-domain sub-signal according to the energy spectrum of each noise-reduced frequency-domain sub-signal and the energy spectrum of each frame of the noise-reduced frequency-domain signal; and calculate the spectral entropy of each frame of the noise-reduced frequency-domain signal according to the normalized probability density of each noise-reduced frequency-domain sub-signal.
In some implementations, the calculation module 303 may be further configured to: perform an inverse Fourier transform on each frame of the noise-reduced frequency-domain signal to obtain multiple frames of noise-reduced time-domain signals; and calculate the short-term energy of each frame of the noise-reduced time-domain signal, obtaining the short-term energy of all frames of the noise-reduced speech signal.
In some implementations, the detection module 304 may be configured to: determine the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or determine the position of the speech end point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
In some implementations, the detection module 304 may be further configured to: if no speech is detected according to the spectral entropy ratios and the short-term energy of a first number of frames of the noise-reduced speech signal, and speech is detected according to the spectral entropy ratios and the short-term energy of a second number of frames of the noise-reduced speech signal, determine that the position of the first frame of the second number of frames is the position of the speech start point.
In some implementations, the detection module 304 may be further configured to: if speech is detected according to the spectral entropy ratios and the short-term energy of a third number of frames of the noise-reduced speech signal, and no speech is detected according to the spectral entropy ratios and the short-term energy of a fourth number of frames of the noise-reduced speech signal, determine that the position of the first frame of the fourth number of frames is the position of the speech end point.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed on a computer, the computer is caused to execute the processes of the voice endpoint detection method provided in this embodiment.
An embodiment of the present application further provides an electronic device including a memory and a processor. A computer program is stored in the memory, and the processor executes the processes of the voice endpoint detection method provided in this embodiment by calling the computer program stored in the memory.
For example, the electronic device may be a mobile terminal such as a tablet computer or a smartphone. Please refer to FIG. 6, which is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
The mobile terminal 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art will understand that the mobile terminal structure shown in FIG. 6 does not limit the mobile terminal, which may include more or fewer components than illustrated, combine certain components, or use a different arrangement of components.
The microphone 401 may be used to pick up speech uttered by the user, among other sounds.
The memory 402 may be used to store application programs and data. The application programs stored in the memory 402 contain executable code and may form various functional modules. The processor 403 executes various functional applications and performs data processing by running the application programs stored in the memory 402.
The processor 403 is the control center of the mobile terminal. It connects the parts of the entire mobile terminal through various interfaces and lines, and performs the functions of the mobile terminal and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the mobile terminal as a whole.
In this embodiment, the processor 403 in the mobile terminal loads, according to the following instructions, the executable code corresponding to the processes of one or more application programs into the memory 402, and the processor 403 runs the application programs stored in the memory 402, thereby implementing the following processes:
acquiring a noisy speech signal;
performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal;
performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
Please refer to FIG. 7, which is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
The mobile terminal 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, and a speaker 506.
The microphone 501 may be used to pick up speech uttered by the user, among other sounds.
The memory 502 may be used to store application programs and data. The application programs stored in the memory 502 contain executable code and may form various functional modules. The processor 503 executes various functional applications and performs data processing by running the application programs stored in the memory 502.
The processor 503 is the control center of the mobile terminal. It connects the parts of the entire mobile terminal through various interfaces and lines, and performs the functions of the mobile terminal and processes data by running or executing the application programs stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the mobile terminal as a whole.
The input unit 504 may be used to receive input digits, character information, or user characteristic information (such as a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The output unit 505 may be used to display information input by the user or provided to the user, as well as the various graphical user interfaces of the mobile terminal. These graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. The output unit may include a display panel.
In this embodiment, the processor 503 in the mobile terminal loads, according to the following instructions, the executable code corresponding to the processes of one or more application programs into the memory 502, and the processor 503 runs the application programs stored in the memory 502, thereby implementing the following processes:
acquiring a noisy speech signal;
performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal;
performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
In some implementations, when executing the process of performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal, the processor 503 may: perform framing and windowing on the noisy speech signal to obtain multiple frames of windowed time-domain signals; perform a Fourier transform on each frame of the windowed time-domain signals to obtain multiple frames of frequency-domain signals; estimate the Fourier coefficients of each frame of the frequency-domain signals; and perform noise reduction processing on each frame of the frequency-domain signals according to its Fourier coefficients to obtain multiple frames of noise-reduced frequency-domain signals.
In some implementations, when executing the process of calculating the spectral entropy ratio of the noise-reduced speech signal, the processor 503 may: calculate the energy spectrum of each frame of the noise-reduced frequency-domain signal; calculate the spectral entropy of each frame of the noise-reduced frequency-domain signal; and calculate the spectral entropy ratio of each frame of the noise-reduced frequency-domain signal according to the energy spectrum and the spectral entropy of that frame, obtaining the spectral entropy ratios of all frames of the noise-reduced speech signal.
In some implementations, when executing the process of calculating the energy spectrum of each frame of the noise-reduced frequency-domain signal, the processor 503 may: acquire frequency band information of each frame of the noise-reduced frequency-domain signal; divide each frame of the noise-reduced frequency-domain signal according to the frequency band information to obtain multiple noise-reduced frequency-domain sub-signals corresponding to that frame; calculate the energy spectrum of each of the multiple noise-reduced frequency-domain sub-signals; and calculate the energy spectrum of each frame of the noise-reduced frequency-domain signal according to the energy spectrum of each noise-reduced frequency-domain sub-signal.
In some implementations, when executing the process of calculating the spectral entropy of each frame of the noise-reduced frequency-domain signal, the processor 503 may: calculate the normalized probability density of each noise-reduced frequency-domain sub-signal according to the energy spectrum of each noise-reduced frequency-domain sub-signal and the energy spectrum of each frame of the noise-reduced frequency-domain signal; and calculate the spectral entropy of each frame of the noise-reduced frequency-domain signal according to the normalized probability density of each noise-reduced frequency-domain sub-signal.
In some implementations, when executing the process of calculating the short-term energy of the noise-reduced speech signal, the processor 503 may: perform an inverse Fourier transform on each frame of the noise-reduced frequency-domain signal to obtain multiple frames of noise-reduced time-domain signals; and calculate the short-term energy of each frame of the noise-reduced time-domain signal, obtaining the short-term energy of all frames of the noise-reduced speech signal.
In some implementations, when executing the process of performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal, the processor 503 may: determine the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or determine the position of the speech end point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
In some implementations, when executing the process of determining the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal, the processor 503 may: if no speech is detected according to the spectral entropy ratios and the short-term energy of a first number of frames of the noise-reduced speech signal, and speech is detected according to the spectral entropy ratios and the short-term energy of a second number of frames of the noise-reduced speech signal, determine that the position of the first frame of the second number of frames is the position of the speech start point.
In some implementations, when executing the process of determining the position of the speech end point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal, the processor 503 may: if speech is detected according to the spectral entropy ratios and the short-term energy of a third number of frames of the noise-reduced speech signal, and no speech is detected according to the spectral entropy ratios and the short-term energy of a fourth number of frames of the noise-reduced speech signal, determine that the position of the first frame of the fourth number of frames is the position of the speech end point.
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见上文针对语音端点检测方法的详细描述,此处不再赘述。In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to the detailed description of the voice endpoint detection method above, which will not be repeated here.
本申请实施例提供的所述语音端点检测装置与上文实施例中的语音端点 检测方法属于同一构思,在所述语音端点检测装置上可以运行所述语音端点检测方法实施例中提供的任一方法,其具体实现过程详见所述语音端点检测方法实施例,此处不再赘述。The voice endpoint detection device provided by the embodiment of the present application and the voice endpoint detection method in the above embodiments belong to the same concept, and any of the voice endpoint detection method embodiments provided on the voice endpoint detection device can be run on the voice endpoint detection device For the method and the specific implementation process, please refer to the embodiments of the voice endpoint detection method, which will not be repeated here.
It should be noted that, for the voice endpoint detection method described in the embodiments of the present application, a person of ordinary skill in the art will understand that all or part of the process of implementing the method can be completed by a computer program controlling the relevant hardware. The computer program may be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor; its execution may include the processes of the embodiments of the voice endpoint detection method. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
For the voice endpoint detection device of the embodiments of the present application, its functional modules may be integrated into one processing chip, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The voice endpoint detection method, device, storage medium, and electronic device provided by the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. At the same time, those skilled in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (20)

  1. A voice endpoint detection method, comprising:
    acquiring a noisy speech signal;
    performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
    calculating a spectral entropy ratio of the noise-reduced speech signal, and calculating a short-term energy of the noise-reduced speech signal;
    performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  2. The voice endpoint detection method according to claim 1, wherein performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal comprises:
    performing framing and windowing on the noisy speech signal to obtain multiple frames of windowed time-domain signals;
    performing a Fourier transform on each frame of the multiple frames of windowed time-domain signals to obtain multiple frames of frequency-domain signals;
    estimating the Fourier coefficients of each frame of frequency-domain signal;
    performing noise reduction processing on each frame of frequency-domain signal according to the Fourier coefficients of that frame, to obtain multiple frames of noise-reduced frequency-domain signals.
  3. The voice endpoint detection method according to claim 2, wherein calculating the spectral entropy ratio of the noise-reduced speech signal comprises:
    calculating the energy spectrum of each frame of noise-reduced frequency-domain signal;
    calculating the spectral entropy of each frame of noise-reduced frequency-domain signal;
    calculating the spectral entropy ratio of each frame of noise-reduced frequency-domain signal according to the energy spectrum and the spectral entropy of that frame, to obtain the spectral entropy ratios of all frames of the noise-reduced speech signal.
  4. The voice endpoint detection method according to claim 3, wherein calculating the energy spectrum of each frame of noise-reduced frequency-domain signal comprises:
    acquiring frequency band information of each frame of noise-reduced frequency-domain signal;
    dividing each frame of noise-reduced frequency-domain signal according to the frequency band information, to obtain multiple sub-band noise-reduced frequency-domain signals corresponding to that frame;
    calculating the energy spectrum of each of the multiple sub-band noise-reduced frequency-domain signals;
    calculating the energy spectrum of each frame of noise-reduced frequency-domain signal according to the energy spectrum of each sub-band noise-reduced frequency-domain signal.
  5. The voice endpoint detection method according to claim 4, wherein calculating the spectral entropy of each frame of noise-reduced frequency-domain signal comprises:
    calculating the normalized probability density of each sub-band noise-reduced frequency-domain signal according to the energy spectrum of that sub-band signal and the energy spectrum of the corresponding frame of noise-reduced frequency-domain signal;
    calculating the spectral entropy of each frame of noise-reduced frequency-domain signal according to the normalized probability density of each sub-band noise-reduced frequency-domain signal.
  6. The voice endpoint detection method according to claim 3, wherein calculating the short-term energy of the noise-reduced speech signal comprises:
    performing an inverse Fourier transform on each frame of noise-reduced frequency-domain signal to obtain multiple frames of noise-reduced time-domain signals;
    calculating the short-term energy of each frame of noise-reduced time-domain signal, to obtain the short-term energies of all frames of the noise-reduced speech signal.
  7. The voice endpoint detection method according to claim 6, wherein performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal comprises:
    determining the position of a speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal; and/or
    determining the position of a speech termination point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal.
  8. The voice endpoint detection method according to claim 7, wherein determining the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal comprises:
    if no speech is detected according to the spectral entropy ratios of a first number of frames of the noise-reduced speech signal and the short-term energies of the first number of frames of the noise-reduced speech signal, and speech is detected according to the spectral entropy ratios of a second number of frames of the noise-reduced speech signal and the spectral entropy ratios of the second number of frames of the noise-reduced speech signal, determining the position of the first frame of the second number of frames as the position of the speech start point.
  9. The voice endpoint detection method according to claim 7, wherein determining the position of the speech termination point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal comprises:
    if speech is detected according to the spectral entropy ratios of a third number of frames of the noise-reduced speech signal and the short-term energies of the third number of frames of the noise-reduced speech signal, and no speech is detected according to the spectral entropy ratios of a fourth number of frames of the noise-reduced speech signal and the short-term energies of the fourth number of frames of the noise-reduced speech signal, determining the position of the first frame of the fourth number of frames as the position of the speech termination point.
  10. A voice endpoint detection device, comprising:
    an acquisition module, configured to acquire a noisy speech signal;
    a noise reduction module, configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
    a calculation module, configured to calculate a spectral entropy ratio of the noise-reduced speech signal and to calculate a short-term energy of the noise-reduced speech signal;
    a detection module, configured to perform voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  11. A storage medium, wherein a computer program is stored in the storage medium, and when the computer program runs on a computer, the computer is caused to perform the voice endpoint detection method according to any one of claims 1 to 9.
  12. An electronic device, wherein the electronic device comprises a processor and a memory, a computer program is stored in the memory, and the processor, by invoking the computer program stored in the memory, is configured to perform:
    acquiring a noisy speech signal;
    performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;
    calculating a spectral entropy ratio of the noise-reduced speech signal, and calculating a short-term energy of the noise-reduced speech signal;
    performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  13. The electronic device according to claim 12, wherein the processor is configured to perform:
    performing framing and windowing on the noisy speech signal to obtain multiple frames of windowed time-domain signals;
    performing a Fourier transform on each frame of the multiple frames of windowed time-domain signals to obtain multiple frames of frequency-domain signals;
    estimating the Fourier coefficients of each frame of frequency-domain signal;
    performing noise reduction processing on each frame of frequency-domain signal according to the Fourier coefficients of that frame, to obtain multiple frames of noise-reduced frequency-domain signals.
  14. The electronic device according to claim 13, wherein the processor is configured to perform:
    calculating the energy spectrum of each frame of noise-reduced frequency-domain signal;
    calculating the spectral entropy of each frame of noise-reduced frequency-domain signal;
    calculating the spectral entropy ratio of each frame of noise-reduced frequency-domain signal according to the energy spectrum and the spectral entropy of that frame, to obtain the spectral entropy ratios of all frames of the noise-reduced speech signal.
  15. The electronic device according to claim 14, wherein the processor is configured to perform:
    acquiring frequency band information of each frame of noise-reduced frequency-domain signal;
    dividing each frame of noise-reduced frequency-domain signal according to the frequency band information, to obtain multiple sub-band noise-reduced frequency-domain signals corresponding to that frame;
    calculating the energy spectrum of each of the multiple sub-band noise-reduced frequency-domain signals;
    calculating the energy spectrum of each frame of noise-reduced frequency-domain signal according to the energy spectrum of each sub-band noise-reduced frequency-domain signal.
  16. The electronic device according to claim 15, wherein the processor is configured to perform:
    calculating the normalized probability density of each sub-band noise-reduced frequency-domain signal according to the energy spectrum of that sub-band signal and the energy spectrum of the corresponding frame of noise-reduced frequency-domain signal;
    calculating the spectral entropy of each frame of noise-reduced frequency-domain signal according to the normalized probability density of each sub-band noise-reduced frequency-domain signal.
  17. The electronic device according to claim 14, wherein the processor is configured to perform:
    performing an inverse Fourier transform on each frame of noise-reduced frequency-domain signal to obtain multiple frames of noise-reduced time-domain signals;
    calculating the short-term energy of each frame of noise-reduced time-domain signal, to obtain the short-term energies of all frames of the noise-reduced speech signal.
  18. The electronic device according to claim 17, wherein the processor is configured to perform:
    determining the position of a speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal; and/or
    determining the position of a speech termination point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal.
  19. The electronic device according to claim 18, wherein the processor is configured to perform:
    if no speech is detected according to the spectral entropy ratios of a first number of frames of the noise-reduced speech signal and the short-term energies of the first number of frames of the noise-reduced speech signal, and speech is detected according to the spectral entropy ratios of a second number of frames of the noise-reduced speech signal and the spectral entropy ratios of the second number of frames of the noise-reduced speech signal, determining the position of the first frame of the second number of frames as the position of the speech start point.
  20. The electronic device according to claim 18, wherein the processor is configured to perform:
    if speech is detected according to the spectral entropy ratios of a third number of frames of the noise-reduced speech signal and the short-term energies of the third number of frames of the noise-reduced speech signal, and no speech is detected according to the spectral entropy ratios of a fourth number of frames of the noise-reduced speech signal and the short-term energies of the fourth number of frames of the noise-reduced speech signal, determining the position of the first frame of the fourth number of frames as the position of the speech termination point.
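
The per-frame feature computations recited in claims 3 to 6 (sub-band energy spectrum, normalized probability density, spectral entropy, spectral entropy ratio, and short-term energy) can be illustrated with a minimal Python sketch. It makes several assumptions that the claims do not state: sub-bands are equal-width groups of spectral bins, the frame energy is the sum of the sub-band energies, and the ratio takes the form sqrt(1 + |E / H|). All function names and the `eps` guard are hypothetical.

```python
import math


def subband_energies(spectrum, num_bands):
    """Sum |X(k)|^2 over equal-width sub-bands of one noise-reduced frame.
    Trailing bins that do not fill a whole band are ignored (an assumption)."""
    band_size = len(spectrum) // num_bands
    return [
        sum(abs(x) ** 2 for x in spectrum[b * band_size:(b + 1) * band_size])
        for b in range(num_bands)
    ]


def spectral_entropy(band_energy, eps=1e-12):
    """Entropy of the normalized sub-band energy distribution."""
    total = sum(band_energy) + eps
    probs = [e / total for e in band_energy]  # normalized probability density
    return -sum(p * math.log(p + eps) for p in probs)


def entropy_ratio(frame_energy, entropy, eps=1e-12):
    # Assumed form of the ratio; the claims only say it is computed from
    # the frame's energy spectrum and its spectral entropy.
    return math.sqrt(1.0 + abs(frame_energy / max(entropy, eps)))


def short_time_energy(time_frame):
    """Sum of squared samples of one noise-reduced time-domain frame."""
    return sum(s * s for s in time_frame)


def frame_features(spectrum, time_frame, num_bands=4):
    """Return (spectral entropy ratio, short-term energy) for one frame."""
    bands = subband_energies(spectrum, num_bands)
    frame_energy = sum(bands)
    entropy = spectral_entropy(bands)
    return entropy_ratio(frame_energy, entropy), short_time_energy(time_frame)
```

Under these assumptions, a flat spectrum gives the maximum entropy (log of the number of bands) and a low ratio, while a spectrum concentrated in one band gives an entropy near zero and a high ratio. That contrast is what makes the ratio usable as a speech indicator.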
PCT/CN2018/115601 2018-11-15 2018-11-15 Voice activity detection method and apparatus, storage medium and electronic device WO2020097841A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880097699.4A CN112955951A (en) 2018-11-15 2018-11-15 Voice endpoint detection method and device, storage medium and electronic equipment
PCT/CN2018/115601 WO2020097841A1 (en) 2018-11-15 2018-11-15 Voice activity detection method and apparatus, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/115601 WO2020097841A1 (en) 2018-11-15 2018-11-15 Voice activity detection method and apparatus, storage medium and electronic device

Publications (1)

Publication Number Publication Date
WO2020097841A1 true WO2020097841A1 (en) 2020-05-22

Family

ID=70731178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/115601 WO2020097841A1 (en) 2018-11-15 2018-11-15 Voice activity detection method and apparatus, storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN112955951A (en)
WO (1) WO2020097841A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050216261A1 (en) * 2004-03-26 2005-09-29 Canon Kabushiki Kaisha Signal processing apparatus and method
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN107731223A (en) * 2017-11-22 2018-02-23 腾讯科技(深圳)有限公司 Voice activity detection method, relevant apparatus and equipment
CN107910017A (en) * 2017-12-19 2018-04-13 河海大学 A kind of method that threshold value is set in noisy speech end-point detection

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5732976B2 (en) * 2011-03-31 2015-06-10 沖電気工業株式会社 Speech segment determination device, speech segment determination method, and program
CN104810024A (en) * 2014-01-28 2015-07-29 上海力声特医学科技有限公司 Double-path microphone speech noise reduction treatment method and system
CN105023572A (en) * 2014-04-16 2015-11-04 王景芳 Noised voice end point robustness detection method
CN105825871B (en) * 2016-03-16 2019-07-30 大连理工大学 A kind of end-point detecting method without leading mute section of voice
CN106653062A (en) * 2017-02-17 2017-05-10 重庆邮电大学 Spectrum-entropy improvement based speech endpoint detection method in low signal-to-noise ratio environment
CN108428456A (en) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 Voice de-noising algorithm

Also Published As

Publication number Publication date
CN112955951A (en) 2021-06-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 18939865; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 18939865; Country of ref document: EP; Kind code of ref document: A1)
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.09.2021))
122 Ep: pct application non-entry in european phase (Ref document number: 18939865; Country of ref document: EP; Kind code of ref document: A1)