WO2020097841A1 - Voice activity detection method and apparatus, storage medium, and electronic device - Google Patents

Voice activity detection method and apparatus, storage medium, and electronic device

Info

Publication number
WO2020097841A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
frame
signal
frequency domain
noise reduction
Application number
PCT/CN2018/115601
Other languages
English (en)
Chinese (zh)
Inventor
陈岩
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to CN201880097699.4A (CN112955951A)
Priority to PCT/CN2018/115601 (WO2020097841A1)
Publication of WO2020097841A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection

Definitions

  • the present application belongs to the technical field of terminals, and particularly relates to a voice endpoint detection method, device, storage medium, and electronic equipment.
  • voice processing technologies such as voiceprint wake-up and voice recognition have become increasingly mature.
  • speech endpoint detection technology greatly affects the performance of speech processing technology.
  • the detection premise of the voice endpoint detection technology is to assume that the voice signal is a short-term stationary signal, which leads to a low accuracy of the voice endpoint detection technology in different non-stationary noise environments.
  • Embodiments of the present application provide a voice endpoint detection method, device, storage medium, and electronic equipment, which can improve the accuracy of voice endpoint detection.
  • an embodiment of the present application provides a voice endpoint detection method, including:
  • an embodiment of the present application provides a voice endpoint detection device, including:
  • Acquisition module for acquiring noisy speech signals
  • a noise reduction module configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal
  • a calculation module used to calculate the spectral entropy ratio of the noise-reduced speech signal, and calculate the short-term energy of the noise-reduced speech signal;
  • the detection module is configured to perform speech endpoint detection based on the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  • an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice endpoint detection method provided in this embodiment.
  • an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor, by calling the computer program stored in the memory, is configured to execute:
  • in this embodiment, since noise in the noisy speech signal affects the accuracy of voice endpoint detection, the noisy speech signal is first subjected to noise reduction processing to obtain a noise-reduced speech signal.
  • the spectral entropy ratio and short-term energy of the noise-reduced speech signal, which can improve the accuracy of voice endpoint detection, are then used to detect the voice endpoints of the noise-reduced speech signal, which effectively improves the accuracy of voice endpoint detection.
  • FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 2 is a second schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 3 is a third schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 4 is a fourth schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice endpoint detection device provided by an embodiment of the present application.
  • FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.
  • the process of the voice endpoint detection method may include:
  • Voice endpoint detection refers to detecting the presence or absence of voice in a noisy environment. It is usually used in voice processing systems such as voice coding and voice enhancement, where it helps to reduce the voice coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.
  • the noisy speech signal may refer to a speech signal captured in different non-stationary noise environments.
  • the noisy speech signal is subjected to noise reduction processing to obtain a noise-reduced speech signal.
  • the electronic device may perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.
  • the frequency of the noise signal is lower than the frequency of the speech signal, and the lower the frequency of the noise signal, the smaller the impact on the accuracy of the detection of speech endpoints. Therefore, the electronic device can increase the frequency of the voice signal in the noisy voice signal and reduce the frequency of the noise signal in the noisy voice signal, so as to reduce the influence of the noise signal on the accuracy of voice endpoint detection.
  • noise reduction processing of the noisy speech signal is not limited to the above method, but may be other methods as long as the purpose of reducing the noise of the noisy speech signal can be achieved.
  • the spectral entropy ratio of the noise-reduced speech signal is calculated, and the short-term energy of the noise-reduced speech signal is calculated.
  • the electronic device can calculate the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal, so that speech endpoint detection can be performed according to these two features of the noise-reduced speech signal.
  • speech endpoint detection is performed according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  • the electronic device can use the spectral entropy ratio of the noise-reduced speech signal and the short-term energy value of the noise-reduced speech signal as threshold parameters to perform speech endpoint detection, that is, to detect the presence or absence of speech, thereby detecting a valid speech segment.
  • the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that when the value of the spectral entropy ratio is A, it can be detected that there is speech, and when the value of the spectral entropy ratio is B, it can be detected that there is no speech. It can be set that when the short-term energy value is C, it can detect the presence of voice; when the short-term energy value is D, it can detect that there is no voice.
  • in this embodiment, since noise in the noisy speech signal affects the accuracy of voice endpoint detection, the noisy speech signal is first subjected to noise reduction processing to obtain a noise-reduced speech signal.
  • the spectral entropy ratio and short-term energy of the noise-reduced speech signal, which can improve the accuracy of voice endpoint detection, are then used to detect the voice endpoints of the noise-reduced speech signal, which effectively improves the accuracy of voice endpoint detection.
  • FIG. 2 is a second schematic flowchart of a voice endpoint detection method according to an embodiment of the present application.
  • the process of the voice endpoint detection method may include:
  • the electronic device acquires a noisy speech signal.
  • a noisy speech signal may refer to a speech signal captured in different non-stationary noise environments.
  • the noisy speech signal is a time-domain signal.
  • the electronic device performs framing and windowing on the noisy speech signal to obtain a multi-frame windowed time domain signal.
  • the electronic device performs framing and windowing on the noisy speech signal y(n) to obtain a multi-frame windowed time domain signal.
  • the electronic device can typically use a frame length of 20 ms and a frame shift of 10 ms to frame the noisy speech signal.
  • the windowed time domain signal is a time domain signal, and the windowed time domain signal of each frame may include a noise signal part and a speech signal part.
  • the noise signal and speech signal here are both time-domain signals.
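The framing and windowing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the 16 kHz sample rate, the Hamming window, and the function name `frame_and_window` are assumptions; only the 20 ms frame length and 10 ms frame shift come from the text.

```python
import numpy as np

def frame_and_window(y, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Split a 1-D signal into overlapping frames and window each frame.

    sample_rate and the Hamming window are assumptions for illustration;
    the text only states the 20 ms frame length and 10 ms frame shift.
    """
    y = np.asarray(y, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)  # 320 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)      # 160 samples at 16 kHz
    if len(y) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(y) - frame_len) // shift
    window = np.hamming(frame_len)
    # One row per frame; consecutive rows overlap by frame_len - shift samples.
    frames = np.stack(
        [y[i * shift : i * shift + frame_len] for i in range(n_frames)]
    )
    return frames * window
```

With these assumed settings, one second of audio yields 99 frames of 320 samples each; trailing samples that do not fill a whole frame are dropped.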
  • the electronic device performs Fourier transform on each frame of the windowed time domain signal of the multi-frame windowed time domain signal to obtain a multi-frame frequency domain signal.
  • each frame frequency domain signal may include a noise signal part and a voice signal part.
  • the noise signal and speech signal here are both frequency domain signals.
  • f is the frequency component
  • i is the frame index.
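The per-frame Fourier transform can be sketched with NumPy's real FFT. The FFT length of 512, the 16 kHz sample rate, and the function names are assumptions chosen for illustration; the result is indexed as `Y[i, f]`, with `i` the frame index and `f` the frequency bin.

```python
import numpy as np

def frames_to_spectrum(frames, n_fft=512):
    """Return Y[i, f]: one row of complex Fourier coefficients per frame.

    n_fft is an assumed FFT length; frames shorter than n_fft are zero-padded.
    """
    return np.fft.rfft(frames, n=n_fft, axis=1)

def bin_frequencies(n_fft=512, sample_rate=16000):
    """Centre frequency in Hz of each bin returned by frames_to_spectrum."""
    return np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
```

Because the input frames are real-valued, `rfft` keeps only the non-negative frequencies, so each frame yields `n_fft // 2 + 1` coefficients.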
  • the electronic device estimates the Fourier coefficients of the frequency domain signal for each frame.
  • the Fourier coefficient of the frequency domain signal of the i-th frame can be estimated using the following formula:
  • the two signal-to-noise terms at (f, i) in the formula are the estimated prior signal-to-noise ratio and the estimated posterior signal-to-noise ratio
  • p(f, i) represents the probability that speech is present
  • q(f, i) represents the probability that speech is absent.
  • the electronic device performs noise reduction processing on the frequency domain signal of each frame according to the Fourier coefficients of the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.
  • in order to reduce the influence of the noise signal on voice endpoint detection, the electronic device can reduce the Fourier coefficients of the noise signal part in the frequency domain signal of each frame by using an appropriate G_0.
  • the electronic device can also increase the Fourier coefficient of the speech signal part in the frequency domain signal of each frame.
  • the electronic device calculates the energy spectrum of the noise-reduced frequency domain signal for each frame.
  • the process 206 may be implemented by the processes 2061, 2062, 2063, and 2064, which may be:
  • the electronic device acquires the frequency band information of the noise-reduced frequency domain signal for each frame.
  • the electronic device can acquire the frequency range of the noise-reduced frequency domain signal of the i-th frame.
  • the electronic device divides the noise reduction frequency domain signal of each frame according to the frequency band information to obtain multiple sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame.
  • for example, the frequency band of the noise reduction frequency domain signal of the i-th frame is 500 Hz to 1400 Hz.
  • the electronic device may divide the noise reduction frequency domain signal of the i-th frame into multiple sub-noise reduction frequency domain signals according to the frequency band range. For example, assuming that the noise reduction frequency domain signal of the i-th frame is divided into 3 sub-noise reduction frequency domain signals, the first sub-noise reduction frequency domain signal may cover the frequency range 500 Hz to 800 Hz, the second 800 Hz to 1100 Hz, and the third 1100 Hz to 1400 Hz.
  • how the noise reduction frequency domain signal of each frame is divided, and into how many sub-noise reduction frequency domain signals, can be determined according to actual needs, and no specific limitation is made here.
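The band division in the example above can be sketched as an equal-width split. Equal widths match the 500 Hz to 1400 Hz example, but the text leaves the division scheme open, so the helper `split_band` is only one possible choice.

```python
import numpy as np

def split_band(f_low, f_high, n_sub):
    """Divide [f_low, f_high] into n_sub equal-width sub-bands.

    Returns a list of (low, high) tuples in Hz. Equal widths are an
    assumption taken from the worked example, not a requirement.
    """
    edges = np.linspace(f_low, f_high, n_sub + 1)
    return [(float(lo), float(hi)) for lo, hi in zip(edges[:-1], edges[1:])]

bands = split_band(500, 1400, 3)
# bands covers 500-800 Hz, 800-1100 Hz, and 1100-1400 Hz
```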
  • the electronic device calculates the energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals.
  • the formula for calculating the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame can be:
  • E(w, i) represents the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame
  • N_b represents the total number of sub-noise reduction frequency domain signals
  • N can be set according to actual needs and is usually set to a power of 2. For example, N can be set to 256, 512, 1024, etc.
  • the electronic device calculates the energy spectrum of the noise reduction frequency domain signal of each frame according to the energy spectrum of each sub-noise reduction frequency domain signal.
  • the energy spectrum of the noise reduction frequency domain signal in the i-th frame is the sum of the energy spectrums of all sub-noise reduction frequency domain signals divided in the frame.
  • the formula for calculating the energy spectrum of the noise-reduced frequency domain signal in the i-th frame can be: $E(i) = \sum_{w=1}^{N_b} E(w, i)$
  • N_b is the total number of sub-noise reduction frequency domain signals
  • E(i) is the energy spectrum of the i-th frame noise reduction frequency domain signal
  • E(w, i) is the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame.
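A sketch of the sub-band and frame energy computation for one frame follows. The text states that the frame energy is the sum of the sub-band energies; the sub-band energy itself is taken here as the summed squared magnitude of the Fourier coefficients in the band, which is an assumption, since the original sub-band formula is not reproduced in the text. The function names are hypothetical.

```python
import numpy as np

def subband_energies(spectrum, freqs, bands):
    """E[w] for one frame, assumed to be the sum of |Y(f)|^2 over band w.

    spectrum: complex Fourier coefficients of one frame
    freqs:    frequency in Hz of each coefficient
    bands:    list of (low, high) tuples in Hz
    """
    power = np.abs(np.asarray(spectrum)) ** 2
    freqs = np.asarray(freqs, dtype=float)
    energies = []
    for k, (lo, hi) in enumerate(bands):
        is_last = k == len(bands) - 1
        # Half-open bands so a shared edge is counted once; the last band
        # also includes its upper edge.
        upper = (freqs <= hi) if is_last else (freqs < hi)
        energies.append(power[(freqs >= lo) & upper].sum())
    return np.array(energies)

def frame_energy(sub_energies):
    """E(i): sum of the sub-band energies of the frame."""
    return float(np.sum(sub_energies))
```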
  • the electronic device calculates the spectral entropy of the noise-reduced frequency domain signal for each frame.
  • the process 207 may be implemented through the process 2071 and the process 2072, which may be:
  • the electronic device calculates the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of each frame of the denoise frequency domain signal.
  • the calculation formula of the normalized probability density of the w-th sub-noise reduction frequency domain signal in the i-th frame can be: $p(w, i) = E(w, i) / E(i)$, for $w = 1, \dots, N_b$
  • w represents the w-th sub-noise reduction frequency domain signal
  • N_b represents the total number of sub-noise reduction frequency domain signals
  • p(w, i) represents the normalized probability density of the w-th sub-noise reduction frequency domain signal in the i-th frame
  • E(w, i) represents the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame
  • E(i) represents the energy spectrum of the i-th frame noise reduction frequency domain signal.
  • the electronic device calculates the spectral entropy of the noise reduction frequency domain signal for each frame according to the normalized probability density of each sub-noise reduction frequency domain signal.
  • the formula for calculating the spectral entropy of the noise-reduced frequency domain signal in the i-th frame can be: $H(i) = -\sum_{w=1}^{N_b} p(w, i) \log p(w, i)$
  • w represents the w-th sub-noise reduction frequency domain signal
  • N_b represents the total number of sub-noise reduction frequency domain signals
  • p(w, i) represents the normalized probability density of the w-th sub-noise reduction frequency domain signal in the i-th frame
  • H(i) represents the spectral entropy of the noise-reduced frequency domain signal of the i-th frame.
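The spectral entropy of one frame can be sketched from its sub-band energies using the standard Shannon entropy of the normalized sub-band distribution. The natural logarithm and the small constant `eps` (added to avoid division by zero and log of zero) are assumptions; the text does not state a log base.

```python
import numpy as np

def spectral_entropy(sub_energies, eps=1e-12):
    """Shannon entropy of the normalized sub-band energy distribution.

    eps and the natural logarithm are illustrative assumptions.
    """
    e = np.asarray(sub_energies, dtype=float)
    p = e / (e.sum() + eps)  # normalized density of each sub-band
    return float(-np.sum(p * np.log(p + eps)))
```

Energy spread evenly over all sub-bands gives the maximum entropy log(N_b), while energy concentrated in a single sub-band gives an entropy close to zero, which is why the measure helps separate flat noise spectra from structured speech spectra.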
  • the electronic device calculates the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the energy spectrum of the noise reduction frequency domain signal of each frame and the spectral entropy of the noise reduction frequency domain signal of each frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
  • the formula for calculating the spectral entropy ratio of the noise-reduced frequency domain signal in the i-th frame can be:
  • EER_TEO(i) represents the spectral entropy ratio of the noise reduction frequency domain signal of the i-th frame
  • E(i) represents the energy spectrum of the noise reduction frequency domain signal of the i-th frame
  • H(i) represents the spectral entropy of the noise reduction frequency domain signal of the i-th frame.
  • the electronic device can calculate the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the above formula, so as to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
  • the electronic device performs an inverse Fourier transform on the noise-reduced frequency domain signal of each frame to obtain a multi-frame noise-reduced time domain signal.
  • performing an inverse Fourier transform on a frequency domain signal converts it into a time domain signal. Therefore, the electronic device performs an inverse Fourier transform on the noise-reduced frequency domain signal of each frame to obtain a multi-frame noise-reduced time domain signal.
  • the electronic device calculates the short-term energy of the noise reduction time-domain signal of each frame to obtain the short-term energy of all frames of the noise reduction speech signal.
  • the calculation formula of the short-term energy of the noise reduction time-domain signal of the i-th frame is:
  • the terms of the formula represent the short-term energy of the noise reduction time-domain signal of the i-th frame and the noise reduction time-domain signal of the i-th frame.
  • the electronic device can calculate the short-term energy of the noise reduction time-domain signal of each frame according to the above formula, so as to obtain the short-term energy of all frames of the noise-reduced speech signal.
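The inverse transform and short-term energy steps can be sketched as below. Short-term energy is taken as the sum of squared samples in a frame, which is the usual definition but an assumption here, since the original formula is not reproduced in the text. The FFT length, frame length, and function names are likewise illustrative.

```python
import numpy as np

def spectrum_to_frames(spectrum, frame_len=320, n_fft=512):
    """Inverse real FFT of each row, trimmed back to the frame length.

    frame_len and n_fft are assumed values and must match the forward
    transform that produced the spectrum.
    """
    return np.fft.irfft(spectrum, n=n_fft, axis=1)[:, :frame_len]

def short_term_energy(frames):
    """Per-frame energy, assumed to be the sum of squared samples."""
    frames = np.asarray(frames, dtype=float)
    return np.sum(frames ** 2, axis=1)
```

Applying `spectrum_to_frames` to the output of a forward `np.fft.rfft(frames, n=512, axis=1)` recovers the original frames up to floating-point error, so the energy is computed on the noise-reduced time-domain frames.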
  • the electronic device determines the position of the voice start point based on the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or the electronic device determines the position of the voice termination point based on the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
  • the process 211 may be: if the electronic device detects that no speech exists according to the spectral entropy ratio of a first number of frames of the noise-reduced speech signal and the short-term energy of the first number of frames, and detects that speech exists according to the spectral entropy ratio of a second number of frames of the noise-reduced speech signal and the short-term energy of the second number of frames, the electronic device determines that the position of the first frame of the second number of frames is the position of the voice start point.
  • the process 212 may be: if the electronic device detects that speech exists according to the spectral entropy ratio of a third number of frames of the noise-reduced speech signal and the short-term energy of the third number of frames, and detects that no speech exists according to the spectral entropy ratio of a fourth number of frames of the noise-reduced speech signal and the short-term energy of the fourth number of frames, the electronic device determines that the position of the first frame of the fourth number of frames is the position of the voice termination point.
  • the electronic device can determine that the position of the first frame of the subsequent frames is the position of the voice start point.
  • the electronic device may determine that the position of the first frame of the subsequent frames is the position of the voice termination point.
  • for example, the noise-reduced speech signal is divided into 20 frames, namely frame 1, frame 2, frame 3, ..., frame 19, and frame 20.
  • if the electronic device detects that no speech exists according to the spectral entropy ratio and short-term energy of frames 1 to 5, and detects that speech exists according to the spectral entropy ratio and short-term energy of frames 6 to 10, the electronic device determines that the position of frame 6 is the position of the voice start point.
  • if the electronic device detects that speech exists according to the spectral entropy ratio and short-term energy of frames 11 to 15, and detects that no speech exists according to the spectral entropy ratio and short-term energy of frames 16 to 20, the electronic device determines that the position of frame 16 is the position of the voice termination point.
  • the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that when the value of the spectral entropy ratio is A, it can be detected that there is speech, and when the value of the spectral entropy ratio is B, it can be detected that there is no speech. It can be set that when the short-term energy value is C, it can detect the presence of voice, and when the short-term energy value is D, it can detect that there is no voice.
  • the above is only one example of determining the position of the voice start point and the position of the voice end point proposed in this embodiment. It can be understood that, within the protection scope of the embodiments of the present application, the position of the voice start point and the position of the voice end point may also be determined in other ways, and no specific limitation is made here.
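The 20-frame example can be sketched as a search over per-frame speech decisions. Several details are assumptions made for illustration: a frame is treated as speech when both features reach their thresholds, the run length is fixed at 5 frames as in the example, and the names `frame_decisions` and `find_endpoints` are hypothetical. The text leaves the threshold values and the exact decision rule open.

```python
def frame_decisions(ratios, energies, ratio_thr, energy_thr):
    """Per-frame speech flag; requiring both thresholds is an assumption."""
    return [r >= ratio_thr and e >= energy_thr
            for r, e in zip(ratios, energies)]

def find_endpoints(is_speech, run=5):
    """Return 0-based (start, end) frame indices, or None where not found.

    start: first frame of a run of speech frames that directly follows
           a run of non-speech frames.
    end:   first frame of a run of non-speech frames that directly
           follows a run of speech frames, searched after start.
    """
    start = end = None
    for i in range(run, len(is_speech) - run + 1):
        before = is_speech[i - run : i]
        after = is_speech[i : i + run]
        if start is None and not any(before) and all(after):
            start = i
        elif start is not None and end is None and all(before) and not any(after):
            end = i
    return start, end

# Frames 1-5 silent, 6-15 speech, 16-20 silent (1-based numbering).
features = [0.1] * 5 + [0.9] * 10 + [0.1] * 5
flags = frame_decisions(features, features, ratio_thr=0.5, energy_thr=0.5)
start, end = find_endpoints(flags)
print(start + 1, end + 1)  # 6 16
```

Requiring a full run of consistent frames on both sides of a boundary makes the decision less sensitive to a single misclassified frame than switching state on every flag change.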
  • FIG. 5 is a schematic structural diagram of a voice endpoint detection device 300 according to an embodiment of the present application.
  • the voice endpoint detection device 300 may include: an acquisition module 301, a noise reduction module 302, a calculation module 303, and a detection module 304.
  • the acquisition module 301 is used to acquire a noisy speech signal.
  • the noise reduction module 302 is configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.
  • the calculation module 303 is configured to calculate the spectral entropy ratio of the noise-reduced speech signal and calculate the short-term energy of the noise-reduced speech signal.
  • the detection module 304 is configured to perform speech endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
  • the noise reduction module 302 may be used to: perform framing and windowing on the noisy speech signal to obtain a multi-frame windowed time-domain signal; perform a Fourier transform on the windowed time-domain signal of each frame to obtain a multi-frame frequency domain signal; estimate the Fourier coefficients of the frequency domain signal of each frame; and, according to the Fourier coefficients of the frequency domain signal of each frame, perform noise reduction processing on the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.
  • the calculation module 303 may be used to: calculate the energy spectrum of the noise reduction frequency domain signal of each frame; calculate the spectral entropy of the noise reduction frequency domain signal of each frame; and calculate the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the energy spectrum and the spectral entropy of the noise reduction frequency domain signal of each frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
  • the calculation module 303 may also be used to: obtain frequency band information of the noise reduction frequency domain signal of each frame; divide the noise reduction frequency domain signal of each frame according to the frequency band information to obtain a plurality of sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame; calculate the energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals; and calculate the energy spectrum of the noise reduction frequency domain signal of each frame according to the energy spectrum of each sub-noise reduction frequency domain signal.
  • the calculation module 303 may be further used to: calculate the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of the noise reduction frequency domain signal of each frame; and calculate the spectral entropy of the noise reduction frequency domain signal of each frame according to the normalized probability density of each sub-noise reduction frequency domain signal.
  • the calculation module 303 can also be used to: perform an inverse Fourier transform on the noise reduction frequency domain signal of each frame to obtain a multi-frame noise reduction time domain signal; and calculate the short-term energy of the noise reduction time domain signal of each frame to obtain the short-term energy of all frames of the noise-reduced speech signal.
  • the detection module 304 may be used to: determine the position of the voice start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or determine the position of the voice termination point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
  • the detection module 304 may be further configured to: if no speech is detected according to the spectral entropy ratio of a first number of frames of the noise-reduced speech signal and the short-term energy of the first number of frames, and speech is detected according to the spectral entropy ratio of a second number of frames of the noise-reduced speech signal and the short-term energy of the second number of frames, determine that the position of the first frame of the second number of frames is the position of the voice start point.
  • the detection module 304 may be further configured to: if speech is detected according to the spectral entropy ratio of a third number of frames of the noise-reduced speech signal and the short-term energy of the third number of frames, and no speech is detected according to the spectral entropy ratio of a fourth number of frames of the noise-reduced speech signal and the short-term energy of the fourth number of frames, determine that the position of the first frame of the fourth number of frames is the position of the voice termination point.
  • An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed on a computer, the computer is caused to execute the process in the voice endpoint detection method provided in this embodiment.
  • An embodiment of the present application also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor, by calling the computer program stored in the memory, executes the process in the voice endpoint detection method provided in this embodiment.
  • the aforementioned electronic device may be a mobile terminal such as a tablet computer or a smart phone.
  • FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the mobile terminal 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art may understand that the structure of the mobile terminal shown in FIG. 6 does not constitute a limitation on the mobile terminal, and may include more or fewer components than those illustrated, or combine certain components, or arrange different components.
  • the microphone 401 can be used to pick up the voice uttered by the user and the like.
  • the memory 402 may be used to store application programs and data.
  • the application program stored in the memory 402 contains executable code.
  • the application program can form various functional modules.
  • the processor 403 executes application programs stored in the memory 402 to execute various functional applications and data processing.
  • the processor 403 is the control center of the mobile terminal. It uses various interfaces and lines to connect the various parts of the entire mobile terminal, and performs the various functions of the mobile terminal and processes data by running or executing application programs stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the mobile terminal as a whole.
  • the processor 403 in the mobile terminal loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing the process:
  • FIG. 7 is a second schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the mobile terminal 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, a speaker 506, and the like.
  • the microphone 501 can be used to pick up the voice uttered by the user and the like.
  • the memory 502 may be used to store application programs and data.
  • the application program stored in the memory 502 contains executable code.
  • the application program can form various functional modules.
  • the processor 503 executes application programs stored in the memory 502 to execute various functional applications and data processing.
  • the processor 503 is the control center of the mobile terminal, and uses various interfaces and lines to connect the various parts of the entire mobile terminal, and executes the mobile terminal by running or executing application programs stored in the memory 502 and calling data stored in the memory 502 Various functions and processing data to monitor the mobile terminal as a whole.
  • the input unit 504 may be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
  • the output unit 505 can be used to display information input by the user or provided to the user and various graphical user interfaces of the mobile terminal. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof.
  • the output unit may include a display panel.
  • the processor 503 in the mobile terminal loads the executable code corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 503 runs the application programs stored in the memory 502, thereby implementing the process:
  • when the processor 503 executes the process of performing noise reduction processing on the noise-containing speech signal to obtain a noise-reduced speech signal, it may perform: framing and windowing the noise-containing speech signal to obtain multi-frame windowed time-domain signals; performing a Fourier transform on each frame of the multi-frame windowed time-domain signals to obtain multi-frame frequency-domain signals; estimating the Fourier coefficients of each frame of frequency-domain signal; and performing noise reduction processing on each frame of frequency-domain signal according to its Fourier coefficients to obtain multi-frame noise-reduced frequency-domain signals.
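The framing, windowing, Fourier transform and per-frame noise reduction steps above can be sketched as follows. This is only an illustration under stated assumptions: the Hamming window, the naive DFT, and all function and parameter names are hypothetical, and plain spectral subtraction is swapped in for the coefficient-estimation step, whose exact form this passage does not specify.

```python
import cmath
import math


def frame_and_window(signal, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window.

    The Hamming window is an assumption; the text only says "windowing".
    Requires frame_len > 1.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames


def dft(frame):
    """Naive O(N^2) discrete Fourier transform, adequate for short frames."""
    n_pts = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]


def spectral_subtraction(spectra, noise_frames=2, floor=0.01):
    """Stand-in noise reduction (NOT necessarily the method of the application).

    Assumes the first `noise_frames` frames contain noise only, averages their
    magnitudes per bin, subtracts that estimate from every frame, and keeps
    the noisy phase. `floor` prevents negative magnitudes.
    """
    n_bins = len(spectra[0])
    noise_mag = [sum(abs(s[k]) for s in spectra[:noise_frames]) / noise_frames
                 for k in range(n_bins)]
    cleaned = []
    for spec in spectra:
        out = []
        for k, coeff in enumerate(spec):
            mag = abs(coeff)
            new_mag = max(mag - noise_mag[k], floor * mag)
            out.append(cmath.rect(new_mag, cmath.phase(coeff)))
        cleaned.append(out)
    return cleaned
```

A production implementation would normally use an FFT routine rather than the quadratic-time transform shown here; the naive version is used only to keep the sketch free of dependencies.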
  • when the processor 503 executes the process of calculating the spectral entropy ratio of the noise-reduced speech signal, it may perform: calculating the energy spectrum of each frame of noise-reduced frequency-domain signal; calculating the spectral entropy of each frame of noise-reduced frequency-domain signal; and calculating the spectral entropy ratio of each frame of noise-reduced frequency-domain signal according to its energy spectrum and its spectral entropy, to obtain the spectral entropy ratios of all frames of the noise-reduced speech signal.
  • when the processor 503 executes the process of calculating the energy spectrum of each frame of noise-reduced frequency-domain signal, it may perform: acquiring frequency band information of each frame of noise-reduced frequency-domain signal; dividing each frame of noise-reduced frequency-domain signal according to the frequency band information to obtain a plurality of sub-band noise-reduced frequency-domain signals corresponding to that frame; calculating the energy spectrum of each of the plurality of sub-band noise-reduced frequency-domain signals; and calculating the energy spectrum of each frame of noise-reduced frequency-domain signal according to the energy spectra of its sub-band noise-reduced frequency-domain signals.
  • when the processor 503 executes the process of calculating the spectral entropy of each frame of noise-reduced frequency-domain signal, it may perform: calculating the normalized probability density of each sub-band noise-reduced frequency-domain signal according to the energy spectrum of that sub-band signal and the energy spectrum of the frame it belongs to; and calculating the spectral entropy of each frame of noise-reduced frequency-domain signal according to the normalized probability densities of its sub-band noise-reduced frequency-domain signals.
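The sub-band energy, normalized probability density and spectral entropy steps above can be sketched as follows. The equal-width band split and all names are hypothetical. This passage also does not give the formula that combines energy and entropy into the ratio, so `energy_entropy_ratio` uses sqrt(1 + |E / H|) purely as an assumed placeholder.

```python
import math


def subband_energies(spectrum, n_bands):
    """Sum |X(k)|^2 over equal-width bands of one frame's spectrum.

    Equal-width bands are an assumption; bins beyond n_bands * width
    are ignored in this sketch.
    """
    width = len(spectrum) // n_bands
    return [sum(abs(c) ** 2 for c in spectrum[b * width:(b + 1) * width])
            for b in range(n_bands)]


def spectral_entropy(band_energy, eps=1e-12):
    """Shannon entropy of the normalized sub-band energy distribution."""
    total = sum(band_energy) + eps  # eps avoids division by zero on silence
    probs = [e / total for e in band_energy]
    return -sum(p * math.log(p) for p in probs if p > 0)


def energy_entropy_ratio(band_energy, eps=1e-12):
    """Assumed combination of frame energy E and spectral entropy H.

    sqrt(1 + |E / H|) is a placeholder, not a formula taken from the text.
    """
    frame_energy = sum(band_energy)
    entropy = max(spectral_entropy(band_energy), eps)
    return math.sqrt(1.0 + abs(frame_energy / entropy))
```

The intuition is that speech concentrates energy in a few bands (low entropy, high energy), while broadband noise spreads it evenly (high entropy), so a ratio of this kind tends to be larger for speech frames than for noise frames of equal energy.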
  • when the processor 503 executes the process of calculating the short-term energy of the noise-reduced speech signal, it may perform: performing an inverse Fourier transform on each frame of noise-reduced frequency-domain signal to obtain multi-frame noise-reduced time-domain signals; and calculating the short-term energy of each frame of noise-reduced time-domain signal to obtain the short-term energies of all frames of the noise-reduced speech signal.
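The inverse transform and short-term energy steps above can be sketched as follows. The function names are hypothetical, and short-term energy is taken here to be the sum of squared samples in a frame, which is the usual definition but is not spelled out in this passage.

```python
import cmath
import math


def idft(spectrum):
    """Naive inverse discrete Fourier transform returning real samples.

    The imaginary part is discarded, which is valid when the spectrum
    comes from a real-valued frame.
    """
    n_pts = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_pts)
                 for k in range(n_pts)) / n_pts).real
            for n in range(n_pts)]


def short_time_energy(frame):
    """Sum of squared samples of one time-domain frame (assumed definition)."""
    return sum(x * x for x in frame)


def short_time_energies(cleaned_spectra):
    """Short-term energy of every noise-reduced frame."""
    return [short_time_energy(idft(spec)) for spec in cleaned_spectra]
```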
  • when the processor 503 executes the process of performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal, it may perform: determining the position of the speech start point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal; and/or determining the position of the speech termination point according to the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal.
  • when the processor 503 executes the process of determining the position of the speech start point based on the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal, it may perform: if it is detected, according to the spectral entropy ratios of a first number of frames of the noise-reduced speech signal and the short-term energies of the first number of frames, that no speech is present, and it is detected, according to the spectral entropy ratios of a second number of frames of the noise-reduced speech signal and the short-term energies of the second number of frames, that speech is present, determining the position of the first frame of the second number of frames as the position of the speech start point.
  • when the processor 503 executes the process of determining the position of the speech termination point based on the spectral entropy ratios of all frames of the noise-reduced speech signal and the short-term energies of all frames of the noise-reduced speech signal, it may perform: if it is detected, according to the spectral entropy ratios of a third number of frames of the noise-reduced speech signal and the short-term energies of the third number of frames, that speech is present, and it is detected, according to the spectral entropy ratios of a fourth number of frames of the noise-reduced speech signal and the short-term energies of the fourth number of frames, that no speech is present, determining the position of the first frame of the fourth number of frames as the position of the speech termination point.
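The start-point and termination-point rules above can be sketched as a search for a run of frames in one state followed by a run in the opposite state. How the two features are combined into a per-frame speech decision is not stated in this passage; the sketch assumes a frame counts as speech when both its spectral entropy ratio and its short-term energy exceed fixed thresholds. All names, thresholds and default frame counts are hypothetical.

```python
def first_transition(flags, before, n_before, n_after, begin=0):
    """Index of the first frame where n_before frames equal to `before`
    are immediately followed by n_after frames of the opposite value."""
    for i in range(max(begin, n_before), len(flags) - n_after + 1):
        if (all(f == before for f in flags[i - n_before:i])
                and all(f != before for f in flags[i:i + n_after])):
            return i
    return None


def detect_endpoints(ratios, energies, ratio_thr, energy_thr,
                     n1=2, n2=2, n3=2, n4=2):
    """Return (start_frame, end_frame); either may be None.

    n1..n4 play the role of the first to fourth numbers of frames.
    """
    # Assumed per-frame decision: both features must exceed their thresholds.
    flags = [r > ratio_thr and e > energy_thr
             for r, e in zip(ratios, energies)]
    # Start: n1 non-speech frames followed by n2 speech frames.
    start = first_transition(flags, False, n1, n2)
    end = None
    if start is not None:
        # End: n3 speech frames followed by n4 non-speech frames.
        end = first_transition(flags, True, n3, n4, begin=start + 1)
    return start, end
```

For example, with per-frame decisions `F F T T T T F F F` and all four counts set to 2, the sketch reports frame 2 as the start point and frame 6 as the termination point.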
  • the voice endpoint detection device provided by the embodiments of the present application and the voice endpoint detection method in the above embodiments belong to the same concept, and any of the methods provided in the voice endpoint detection method embodiments can be run on the voice endpoint detection device.
  • for the voice endpoint detection method described in the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the method can be carried out under the control of a computer program.
  • the computer program may be stored in a computer-readable storage medium, for example in a memory, and executed by at least one processor; its execution may include the processes of the embodiments of the voice endpoint detection method.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), and so on.
  • each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above integrated modules may be implemented in the form of hardware or in the form of software function modules. If an integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.

Abstract

Disclosed are a voice activity detection method and apparatus, a storage medium, and an electronic device. The method comprises: obtaining a noisy speech signal (101); performing noise reduction on the noisy speech signal to obtain a noise-reduced speech signal (102); calculating a spectral entropy ratio of the noise-reduced speech signal and calculating a short-term energy of the noise-reduced speech signal (103); and performing voice activity detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal (104).
PCT/CN2018/115601 2018-11-15 2018-11-15 Procédé et appareil de détection d'activité vocale, support d'informations et dispositif électronique WO2020097841A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880097699.4A CN112955951A (zh) 2018-11-15 2018-11-15 语音端点检测方法、装置、存储介质及电子设备
PCT/CN2018/115601 WO2020097841A1 (fr) 2018-11-15 2018-11-15 Procédé et appareil de détection d'activité vocale, support d'informations et dispositif électronique

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/115601 WO2020097841A1 (fr) 2018-11-15 2018-11-15 Procédé et appareil de détection d'activité vocale, support d'informations et dispositif électronique

Publications (1)

Publication Number Publication Date
WO2020097841A1 (fr) 2020-05-22

Family

ID=70731178

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/115601 WO2020097841A1 (fr) 2018-11-15 2018-11-15 Procédé et appareil de détection d'activité vocale, support d'informations et dispositif électronique

Country Status (2)

Country Link
CN (1) CN112955951A (fr)
WO (1) WO2020097841A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050216261A1 (en) * 2004-03-26 2005-09-29 Canon Kabushiki Kaisha Signal processing apparatus and method
CN101599269A (zh) * 2009-07-02 2009-12-09 中国农业大学 语音端点检测方法及装置
CN107731223A (zh) * 2017-11-22 2018-02-23 腾讯科技(深圳)有限公司 语音活性检测方法、相关装置和设备
CN107910017A (zh) * 2017-12-19 2018-04-13 河海大学 一种带噪语音端点检测中阈值设定的方法

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5732976B2 (ja) * 2011-03-31 2015-06-10 沖電気工業株式会社 音声区間判定装置、音声区間判定方法、及びプログラム
CN104810024A (zh) * 2014-01-28 2015-07-29 上海力声特医学科技有限公司 一种双路麦克风语音降噪处理方法及系统
CN105023572A (zh) * 2014-04-16 2015-11-04 王景芳 一种含噪语音端点鲁棒检测方法
CN105825871B (zh) * 2016-03-16 2019-07-30 大连理工大学 一种无前导静音段语音的端点检测方法
CN106653062A (zh) * 2017-02-17 2017-05-10 重庆邮电大学 一种低信噪比环境下基于谱熵改进的语音端点检测方法
CN108428456A (zh) * 2018-03-29 2018-08-21 浙江凯池电子科技有限公司 语音降噪算法

Also Published As

Publication number Publication date
CN112955951A (zh) 2021-06-11

Similar Documents

Publication Publication Date Title
EP3828885B1 (fr) Procédé et appareil de débruitage vocal, dispositif informatique et support de stockage lisible par ordinateur
WO2019101123A1 (fr) Procédé de détection d'activité vocale, dispositif associé et appareil
US10504539B2 (en) Voice activity detection systems and methods
US9640194B1 (en) Noise suppression for speech processing based on machine-learning mask estimation
US20230298610A1 (en) Noise suppression method and apparatus for quickly calculating speech presence probability, and storage medium and terminal
WO2021139327A1 (fr) Procédé de positionnement de signal audio, procédé d'apprentissage de modèle et appareil associé
WO2012158156A1 (fr) Procédé de suppression de bruit et appareil utilisant une modélisation de caractéristiques multiples pour une vraisemblance voix/bruit
US9374651B2 (en) Sensitivity calibration method and audio device
WO2022105570A1 (fr) Procédé, appareil et dispositif de détection de point final de parole, et support de stockage lisible par ordinateur
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
CN110648687B (zh) 一种活动语音检测方法及系统
CN110875049B (zh) 语音信号的处理方法及装置
WO2022218254A1 (fr) Procédé et appareil d'amélioration de signal vocal, et dispositif électronique
CN110503973B (zh) 音频信号瞬态噪音抑制方法、系统以及存储介质
US11915718B2 (en) Position detection method, apparatus, electronic device and computer readable storage medium
WO2024041512A1 (fr) Procédé et appareil de réduction de bruit audio, dispositif électronique et support d'enregistrement lisible
WO2017128910A1 (fr) Procédé, appareil et dispositif électronique pour déterminer une probabilité de présence de parole
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
WO2020097841A1 (fr) Procédé et appareil de détection d'activité vocale, support d'informations et dispositif électronique
US20230186933A1 (en) Voice noise reduction method, electronic device, non-transitory computer-readable storage medium
US11922933B2 (en) Voice processing device and voice processing method
CN112216285A (zh) 多人会话检测方法、系统、移动终端及存储介质
CN113470621B (zh) 语音检测方法、装置、介质及电子设备
CN113299308A (zh) 一种语音增强方法、装置、电子设备及存储介质
TWI756817B (zh) 語音活動偵測裝置與方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18939865

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18939865

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.09.2021)
