WO2020097841A1

WO2020097841A1 - Voice activity detection method and apparatus, storage medium and electronic device

Info

Publication number: WO2020097841A1
Application number: PCT/CN2018/115601
Authority: WO
Inventors: 陈岩
Original assignee: 深圳市欢太科技有限公司; Oppo广东移动通信有限公司
Priority date: 2018-11-15
Filing date: 2018-11-15
Publication date: 2020-05-22
Also published as: CN112955951A

Abstract

A voice activity detection method and apparatus, a storage medium and an electronic device. The method comprises: obtaining a noisy speech signal (101); performing noise reduction on the noisy speech signal to obtain a noise-reduced speech signal (102); calculating a spectral entropy ratio of the noise-reduced speech signal, and calculating short-time energy of the noise-reduced speech signal (103); and performing voice activity detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-time energy of the noise-reduced speech signal (104).

Description

Voice endpoint detection method, device, storage medium and electronic equipment

Technical field

The present application belongs to the technical field of terminals, and particularly relates to a voice endpoint detection method, device, storage medium, and electronic equipment.

Background

With the rapid development of terminal technology, voice processing technologies such as voiceprint wake-up and voice recognition have become increasingly mature. As an important part of speech preprocessing, voice endpoint detection greatly affects the performance of voice processing. In the related art, voice endpoint detection assumes that the voice signal is a short-term stationary signal, which leads to low detection accuracy in different non-stationary noise environments.

Summary of the invention

Embodiments of the present application provide a voice endpoint detection method, device, storage medium, and electronic equipment, which can improve the accuracy of voice endpoint detection.

In a first aspect, an embodiment of the present application provides a voice endpoint detection method, including:

obtaining a noisy speech signal;

performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;

calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal; and

performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.

In a second aspect, an embodiment of the present application provides a voice endpoint detection device, including:

an acquisition module, configured to acquire a noisy speech signal;

a noise reduction module, configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;

a calculation module, configured to calculate the spectral entropy ratio of the noise-reduced speech signal and to calculate the short-term energy of the noise-reduced speech signal; and

a detection module, configured to perform voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.

In a third aspect, an embodiment of the present application provides a storage medium on which a computer program is stored, wherein, when the computer program is executed on a computer, the computer is caused to execute the voice endpoint detection method provided in this embodiment.

In a fourth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured, by calling the computer program stored in the memory, to execute:

obtaining a noisy speech signal;

performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;

calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal; and

performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.

In the embodiments of the present application, because noise in the noisy speech signal affects the accuracy of voice endpoint detection, noise reduction processing is first performed on the noisy speech signal to obtain a noise-reduced speech signal. Voice endpoint detection is then performed using the spectral entropy ratio and the short-term energy of the noise-reduced speech signal, two features that can improve detection accuracy, which effectively improves the accuracy of voice endpoint detection.

Brief description of the drawings

The technical solutions and beneficial effects of the present application will be apparent through the detailed description of the specific implementation of the present application in conjunction with the accompanying drawings.

FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.

FIG. 2 is a second schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.

FIG. 3 is a third schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.

FIG. 4 is a fourth schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application.

FIG. 5 is a schematic structural diagram of a voice endpoint detection device provided by an embodiment of the present application.

FIG. 6 is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.

FIG. 7 is a second schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

Please refer to the drawings, in which the same reference numerals denote the same components. The principle of the present application is illustrated as implemented in a suitable computing environment. The following description is based on the illustrated specific embodiments of the present application and should not be regarded as limiting other specific embodiments not detailed herein.

Please refer to FIG. 1. FIG. 1 is a first schematic flowchart of a voice endpoint detection method provided by an embodiment of the present application. The process of the voice endpoint detection method may include:

Voice endpoint detection refers to detecting the presence or absence of voice in a noisy environment. It is commonly used in voice processing systems such as voice coding and voice enhancement, where it helps to reduce the voice coding rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.

In 101, a noisy speech signal is obtained.

The noisy speech signal may be a speech signal captured in different non-stationary noise environments. It can be represented as y(n) = s(n) + u(n), where s(n) is the speech signal and u(n) is the noise signal.

In 102, noise reduction processing is performed on the noisy speech signal to obtain a noise-reduced speech signal.

In this embodiment, since the noise signal in the noisy speech signal will affect the accuracy of the detection of the voice endpoint, the electronic device may perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.

In general, the frequency of the noise signal is lower than the frequency of the speech signal, and the lower the frequency of the noise signal, the smaller the impact on the accuracy of the detection of speech endpoints. Therefore, the electronic device can increase the frequency of the voice signal in the noisy voice signal and reduce the frequency of the noise signal in the noisy voice signal, so as to reduce the influence of the noise signal on the accuracy of voice endpoint detection.

It should be noted that the noise reduction processing of the noisy speech signal is not limited to the above method, but may be other methods as long as the purpose of reducing the noise of the noisy speech signal can be achieved.

In 103, the spectral entropy ratio of the noise-reduced speech signal is calculated, and the short-term energy of the noise-reduced speech signal is calculated.

In this embodiment, because two features of the noise-reduced speech signal, namely its spectral entropy ratio and its short-term energy, can improve the accuracy of voice endpoint detection, the electronic device can calculate both of them so that voice endpoint detection can be performed according to these two features.

In 104, speech endpoint detection is performed according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.

For example, the electronic device can use the spectral entropy ratio of the noise-reduced speech signal and the short-term energy value of the noise-reduced speech signal as threshold parameters to perform speech endpoint detection, that is, to detect the presence or absence of speech, thereby detecting a valid speech segment.

In this embodiment, the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that when the value of the spectral entropy ratio is A, it can be detected that there is speech, and when the value of the spectral entropy ratio is B, it can be detected that there is no speech. It can be set that when the short-term energy value is C, it can detect the presence of voice; when the short-term energy value is D, it can detect that there is no voice.
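The threshold idea described above can be sketched as a per-frame decision. This is only a minimal illustration: the function name `frame_has_speech`, the threshold parameters, and the rule that both features must reach their thresholds are assumptions made for the example, since the embodiment leaves the actual values (A, B, C, D) and the combination rule to be set according to the actual situation.

```python
def frame_has_speech(entropy_ratio, energy, ratio_threshold, energy_threshold):
    """Decide whether one frame contains speech.

    Assumption for illustration: a frame counts as speech only when both the
    spectral entropy ratio and the short-term energy reach their thresholds.
    """
    return entropy_ratio >= ratio_threshold and energy >= energy_threshold


# Hypothetical feature values and thresholds, chosen only for the example.
loud_frame = frame_has_speech(entropy_ratio=2.0, energy=0.5,
                              ratio_threshold=1.5, energy_threshold=0.1)
quiet_frame = frame_has_speech(entropy_ratio=1.0, energy=0.01,
                               ratio_threshold=1.5, energy_threshold=0.1)
```

Requiring both features to agree is one conservative choice; other combinations (for example, either feature alone) are equally compatible with the description.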

It can be understood that, in this embodiment, because noise in the noisy speech signal affects the accuracy of voice endpoint detection, noise reduction processing is first performed on the noisy speech signal to obtain a noise-reduced speech signal. Voice endpoint detection is then performed using the spectral entropy ratio and the short-term energy of the noise-reduced speech signal, two features that can improve detection accuracy, which effectively improves the accuracy of voice endpoint detection.

Please refer to FIG. 2, which is a second schematic flowchart of a voice endpoint detection method according to an embodiment of the present application. The process of the voice endpoint detection method may include:

In 201, the electronic device acquires a noisy speech signal.

For example, the noisy speech signal may be a speech signal captured in different non-stationary noise environments. The noisy speech signal can be represented as y(n) = s(n) + u(n), where s(n) is the speech signal and u(n) is the noise signal. The noisy speech signal is a time-domain signal.

In 202, the electronic device performs framing and windowing on the noisy speech signal to obtain a multi-frame windowed time domain signal.

For example, the electronic device performs framing and windowing on the noisy speech signal y(n) to obtain a multi-frame windowed time domain signal. The electronic device can typically frame the noisy speech signal with a frame length of 20 ms and a frame shift of 10 ms. Preferably, and without limitation, the window function applied to the noisy speech signal may be a rectangular window, that is, w(n) = 1.

The windowed time domain signal is a time domain signal, and the windowed time domain signal of each frame may include a noise signal part and a speech signal part. The noise signal and speech signal here are both time-domain signals.
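The framing and windowing step can be sketched as follows, using the 20 ms frame length, 10 ms frame shift, and rectangular window mentioned above. The function name `frame_signal`, the 16 kHz sampling rate, and the synthetic input are assumptions made for the example.

```python
import numpy as np


def frame_signal(y, sample_rate, frame_ms=20, shift_ms=10):
    """Split a 1-D signal into overlapping frames and apply a rectangular window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    if len(y) < frame_len:
        return np.empty((0, frame_len))
    num_frames = 1 + (len(y) - frame_len) // shift
    window = np.ones(frame_len)  # rectangular window, w(n) = 1
    frames = np.stack(
        [y[k * shift: k * shift + frame_len] for k in range(num_frames)]
    )
    return frames * window


# One second of synthetic noise at an assumed 16 kHz sampling rate.
y = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(y, sample_rate=16000)
```

With these settings each frame holds 320 samples and consecutive frames overlap by half; trailing samples that do not fill a whole frame are dropped in this sketch.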

In 203, the electronic device performs Fourier transform on each frame of the windowed time domain signal of the multi-frame windowed time domain signal to obtain a multi-frame frequency domain signal.

It can be understood that performing a Fourier transform on a time domain signal converts it into a frequency domain signal. Therefore, the electronic device performs a Fourier transform on the windowed time domain signal of each frame to obtain a multi-frame frequency domain signal. The frequency domain signal of each frame may include a noise signal part and a voice signal part, both of which are frequency domain signals.

In this embodiment, the frequency domain signal of the i-th frame can be expressed as Y(f, i) = S(f, i) + U(f, i), where f is the frequency component and i is the frame index.
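The per-frame transform, and the additive relation Y(f, i) = S(f, i) + U(f, i), can be illustrated with a short sketch. The function name `frames_to_spectra`, the FFT length of 512, and the random test frames are assumptions made for the example.

```python
import numpy as np


def frames_to_spectra(frames, n_fft=512):
    """Apply a real-input FFT to every frame (one frame per row)."""
    return np.fft.rfft(frames, n=n_fft, axis=1)


rng = np.random.default_rng(0)
s_frames = rng.standard_normal((4, 320))        # stand-in for the speech part
u_frames = 0.1 * rng.standard_normal((4, 320))  # stand-in for the noise part

Y = frames_to_spectra(s_frames + u_frames)
S = frames_to_spectra(s_frames)
U = frames_to_spectra(u_frames)
```

Because the Fourier transform is linear, the spectrum of the noisy frame equals the sum of the speech and noise spectra, which is what the expression for the i-th frame states.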

In 204, the electronic device estimates the Fourier coefficients of the frequency domain signal for each frame.

For example, the Fourier coefficient of the frequency domain signal of the i-th frame can be estimated using the following formula:

[formula not reproduced in this text]

where the estimated quantity is the Fourier coefficient of the frequency domain signal of the i-th frame, ζ(f, i) is the estimated a priori signal-to-noise ratio, γ(f, i) is the estimated a posteriori signal-to-noise ratio, p(f, i) represents the probability that speech is present, and q(f, i) represents the probability that speech is absent.

When speech is present, p(f, i) = 1 and q(f, i) = 0, and the gain is calculated by a formula that is likewise not reproduced in this text.

When speech is absent, p(f, i) = 0 and q(f, i) = 1, and the gain involves G₀, where G₀ is a constant.

In 205, the electronic device performs noise reduction processing on the frequency domain signal of each frame according to the Fourier coefficients of the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.

In this embodiment, in order to reduce the influence of the noise signal on the detection of the voice endpoint, the electronic device can reduce the Fourier coefficient of the noise signal portion in the frequency domain signal of each frame by using an appropriate G ₀ . Correspondingly, the electronic device can also increase the Fourier coefficient of the speech signal part in the frequency domain signal of each frame.
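The gain-based noise reduction described above can be sketched under strong assumptions. The gain used when speech is present is not available in this text, so the sketch substitutes the familiar Wiener-style gain ζ / (1 + ζ) as a stand-in; the function name `denoise_spectrum`, the binary `speech_present` mask, and the value of `g0` are likewise hypothetical.

```python
import numpy as np


def denoise_spectrum(Y, xi, speech_present, g0=0.1):
    """Scale each frequency bin by a gain that depends on speech presence.

    Y              : complex spectrum of one frame
    xi             : estimated a priori SNR per bin
    speech_present : boolean mask per bin (p = 1 where True, q = 1 where False)
    g0             : small constant gain applied where speech is absent
    """
    wiener_gain = xi / (1.0 + xi)  # assumed stand-in, not the embodiment's formula
    gain = np.where(speech_present, wiener_gain, g0)
    return gain * Y


Y = np.array([1.0 + 1.0j, 2.0 + 0.0j])
xi = np.array([3.0, 3.0])
speech_present = np.array([True, False])
S_hat = denoise_spectrum(Y, xi, speech_present)
```

Bins judged to contain speech keep most of their amplitude, while bins judged to contain only noise are attenuated by the constant, which mirrors the role the description gives to G₀.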

In 206, the electronic device calculates the energy spectrum of the noise-reduced frequency domain signal for each frame.

As shown in FIG. 3, in some embodiments, the process 206 may be implemented by the processes 2061, 2062, 2063, and 2064, which may be:

In 2061, the electronic device acquires the frequency band information of the noise-reduced frequency domain signal for each frame.

For example, the electronic device can acquire the frequency range of the noise-reduced frequency domain signal of the i-th frame.

In 2062, the electronic device divides the noise reduction frequency domain signal of each frame according to the frequency band information to obtain multiple sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame.

For example, suppose the frequency band of the noise reduction frequency domain signal of the i-th frame is 500 Hz to 1400 Hz. The electronic device may divide this signal into multiple sub-noise reduction frequency domain signals according to the frequency band range. For example, if the noise reduction frequency domain signal of the i-th frame is divided into 3 sub-noise reduction frequency domain signals, the first may cover 500 Hz to 800 Hz, the second 800 Hz to 1100 Hz, and the third 1100 Hz to 1400 Hz.

It should be noted that how the noise reduction frequency domain signal of each frame is divided, and into how many sub-noise reduction frequency domain signals, can be determined according to actual needs; no specific limitation is made here.
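The band division in the example above can be reproduced with a few lines. The function name `split_band` and the choice of equal-width sub-bands are assumptions; the embodiment allows any division.

```python
import numpy as np


def split_band(f_low, f_high, num_subbands):
    """Divide [f_low, f_high] into equal-width sub-bands, returned as (low, high) pairs in Hz."""
    edges = np.linspace(f_low, f_high, num_subbands + 1)
    return [(float(lo), float(hi)) for lo, hi in zip(edges[:-1], edges[1:])]


# The 500 Hz to 1400 Hz band of the example, divided into 3 sub-bands.
subbands = split_band(500.0, 1400.0, 3)
```

For these inputs the three sub-bands are 500 Hz to 800 Hz, 800 Hz to 1100 Hz, and 1100 Hz to 1400 Hz, matching the division described in the text.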

In 2063, the electronic device calculates the energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals.

For example, the formula for calculating the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame can be:

[formula not reproduced in this text]

where E(w, i) represents the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame, and N_b represents the total number of sub-noise reduction frequency domain signals. N can be set according to actual needs and is usually a power of 2; for example, N can be set to 256, 512, or 1024.

In 2064, the electronic device calculates the energy spectrum of the noise reduction frequency domain signal of each frame according to the energy spectrum of each sub-noise reduction frequency domain signal.

The energy spectrum of the noise reduction frequency domain signal in the i-th frame is the sum of the energy spectrums of all sub-noise reduction frequency domain signals divided in the frame.

That is, the formula for calculating the energy spectrum of the noise reduction frequency domain signal in the i-th frame can be:

$$E(i) = \sum_{w=1}^{N_b} E(w, i)$$

where w indexes the w-th sub-noise reduction frequency domain signal, N_b is the total number of sub-noise reduction frequency domain signals, E(i) is the energy spectrum of the noise reduction frequency domain signal of the i-th frame, and E(w, i) is the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame.
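The relation between sub-band energies and frame energy can be sketched as follows. The function name `subband_energies`, the definition of a sub-band's energy as the sum of squared magnitudes of its bins, and the equal split of bins are assumptions made for the example; only the final summation over sub-bands is taken from the description.

```python
import numpy as np


def subband_energies(spectrum, num_subbands):
    """Return E(w, i) for one frame: energy of each sub-band of the spectrum."""
    power = np.abs(spectrum) ** 2
    bands = np.array_split(power, num_subbands)  # assumed equal split of bins
    return np.array([band.sum() for band in bands])


# Toy spectrum of one denoised frame (6 bins, 3 sub-bands of 2 bins each).
spectrum = np.array([1.0, 2.0, 2.0, 1.0, 3.0, 0.0])
E_w = subband_energies(spectrum, 3)  # E(w, i) for w = 1..3
E_i = E_w.sum()                      # E(i): sum over all sub-bands
```

Here the sub-band energies are 5, 5, and 9, so the frame energy is 19.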

In 207, the electronic device calculates the spectral entropy of the noise-reduced frequency domain signal for each frame.

As shown in FIG. 4, in some embodiments, the process 207 may be implemented through the process 2071 and the process 2072, which may be:

In 2071, the electronic device calculates the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of each frame of the denoise frequency domain signal.

For example, the calculation formula of the normalized probability density of the w-th sub-noise reduction frequency domain signal in the i-th frame can be:

$$p(w, i) = \frac{E(w, i)}{E(i)}, \quad w = 1, \dots, N_b$$

where w indexes the w-th sub-noise reduction frequency domain signal, N_b represents the total number of sub-noise reduction frequency domain signals, p(w, i) represents the normalized probability density of the w-th sub-noise reduction frequency domain signal in the i-th frame, E(w, i) represents the energy spectrum of the w-th sub-noise reduction frequency domain signal in the i-th frame, and E(i) represents the energy spectrum of the noise reduction frequency domain signal of the i-th frame.

In 2072, the electronic device calculates the spectral entropy of the noise reduction frequency domain signal for each frame according to the normalized probability density of each sub-noise reduction frequency domain signal.

For example, the formula for calculating the spectral entropy of the noise reduction frequency domain signal of the i-th frame can be:

$$H(i) = -\sum_{w=1}^{N_b} p(w, i) \log p(w, i)$$

where w indexes the w-th sub-noise reduction frequency domain signal, N_b represents the total number of sub-noise reduction frequency domain signals, p(w, i) represents the normalized probability density of the w-th sub-noise reduction frequency domain signal in the i-th frame, and H(i) represents the spectral entropy of the noise reduction frequency domain signal of the i-th frame.
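Processes 2071 and 2072 can be sketched together: normalize the sub-band energies of a frame into a probability-like density, then compute its entropy. The function name `spectral_entropy`, the natural logarithm, and the small constant `eps` that guards against division by zero and log(0) are assumptions made for the example.

```python
import numpy as np


def spectral_entropy(subband_energy, eps=1e-12):
    """Spectral entropy H(i) of one frame from its sub-band energies E(w, i)."""
    subband_energy = np.asarray(subband_energy, dtype=float)
    p = subband_energy / (subband_energy.sum() + eps)  # normalized density p(w, i)
    return float(-np.sum(p * np.log(p + eps)))


flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])    # energy spread evenly
peaked = spectral_entropy([1.0, 0.0, 0.0, 0.0])  # energy in a single sub-band
```

A flat distribution of energy gives the maximum entropy (log of the number of sub-bands), while energy concentrated in one sub-band gives an entropy close to zero, which is why the feature helps separate noise-like frames from frames with spectral structure.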

In 208, the electronic device calculates the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the energy spectrum and the spectral entropy of the noise reduction frequency domain signal of that frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.

For example, the formula for calculating the spectral entropy ratio of the noise reduction frequency domain signal of the i-th frame can be:

[formula not reproduced in this text]

where EER_TEO(i) represents the spectral entropy ratio of the noise reduction frequency domain signal of the i-th frame, E(i) represents the energy spectrum of the noise reduction frequency domain signal of the i-th frame, and H(i) represents the spectral entropy of the noise reduction frequency domain signal of the i-th frame.

It can be understood that the electronic device can calculate the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the above formula for the i-th frame, so as to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.

In 209, the electronic device performs an inverse Fourier transform on the noise-reduced frequency domain signal of each frame to obtain a multi-frame noise-reduced time domain signal.

It can be understood that performing an inverse Fourier transform on a frequency domain signal converts it into a time domain signal. Therefore, the electronic device performs an inverse Fourier transform on the noise reduction frequency domain signal of each frame to obtain a multi-frame noise reduction time domain signal.

In 210, the electronic device calculates the short-term energy of the noise reduction time-domain signal of each frame to obtain the short-term energy of all frames of the noise reduction speech signal.

For example, the calculation formula of the short-term energy of the noise reduction time-domain signal of the i-th frame is:

[formula not reproduced in this text]

where the symbols of the formula denote, respectively, the short-term energy of the noise reduction time-domain signal of the i-th frame, the noise reduction time-domain signal of the i-th frame, the noise reduction time-domain signal of the (i + 1)-th frame, and the noise reduction time-domain signal of the (i - 1)-th frame.

It can be understood that the electronic device can calculate the short-term energy of the noise reduction time-domain signal of each frame according to the above formula for the i-th frame, so as to obtain the short-term energy of all frames of the noise-reduced speech signal.
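Processes 209 and 210 can be sketched as an inverse transform followed by an energy computation. The embodiment's own energy formula, which refers to frames i - 1, i, and i + 1, is not available in this text, so the sketch uses the conventional definition of short-time energy (the sum of squared samples in a frame) as a stand-in. The function name `short_time_energy` and the toy frames are assumptions.

```python
import numpy as np


def short_time_energy(denoised_spectra, frame_len):
    """Inverse-transform each frame and return its sum of squared samples.

    Stand-in definition: conventional short-time energy, not the
    embodiment's own formula.
    """
    time_frames = np.fft.irfft(denoised_spectra, n=frame_len, axis=1)
    return np.sum(time_frames ** 2, axis=1)


# Two toy frames of 4 samples each: silence and a short burst.
frames = np.array([[0.0, 0.0, 0.0, 0.0],
                   [1.0, -2.0, 2.0, 0.0]])
spectra = np.fft.rfft(frames, axis=1)
energy = short_time_energy(spectra, frame_len=4)
```

The silent frame has zero energy and the burst has energy 1 + 4 + 4 = 9, so a threshold between the two would separate them.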

In 211, the electronic device determines the position of the voice start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or the electronic device determines the position of the voice termination point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.

In some embodiments, the process 211 may be: if the electronic device detects that no speech exists according to the spectral entropy ratio and the short-term energy of a first number of frames of the noise-reduced speech signal, and detects that speech exists according to the spectral entropy ratio and the short-term energy of a second number of frames of the noise-reduced speech signal, the electronic device determines that the position of the first frame of the second number of frames is the position of the voice start point.

In some embodiments, the process 211 may be: if the electronic device detects that speech exists according to the spectral entropy ratio and the short-term energy of a third number of frames of the noise-reduced speech signal, and detects that no speech exists according to the spectral entropy ratio and the short-term energy of a fourth number of frames of the noise-reduced speech signal, the electronic device determines that the position of the first frame of the fourth number of frames is the position of the voice termination point.

For example, if the electronic device detects the absence of speech based on the spectral entropy ratio and short-term energy of several consecutive frames, and detects the presence of speech based on the spectral entropy ratio and short-term energy of the frames that follow, the electronic device can determine that the position of the first of the following frames is the position of the voice start point.

For example, if the electronic device detects the presence of speech based on the spectral entropy ratio and short-term energy of several consecutive frames, and detects the absence of speech based on the spectral entropy ratio and short-term energy of the frames that follow, the electronic device can determine that the position of the first of the following frames is the position of the voice termination point.

For example, suppose that the noise-reduced speech signal is divided into 20 frames, namely the first frame, the second frame, the third frame ... the 19th frame, the 20th frame.

If the electronic device detects that no speech exists according to the spectral entropy ratio and the short-term energy of the 1st to 5th frames, and detects that speech exists according to the spectral entropy ratio and the short-term energy of the 6th to 10th frames, the electronic device determines that the position of the 6th frame is the position of the voice start point.

If the electronic device detects that speech exists according to the spectral entropy ratio and the short-term energy of the 11th to 15th frames, and detects that no speech exists according to the spectral entropy ratio and the short-term energy of the 16th to 20th frames, the electronic device determines that the position of the 16th frame is the position of the voice termination point.

In this embodiment, the values of the spectral entropy ratio and the short-term energy can be set according to the actual situation. For example, it can be set that when the value of the spectral entropy ratio is A, it can be detected that there is speech, and when the value of the spectral entropy ratio is B, it can be detected that there is no speech. It can be set that when the short-term energy value is C, it can detect the presence of voice, and when the short-term energy value is D, it can detect that there is no voice.

It should be noted that the above is only one example of determining the position of the voice start point and the position of the voice end point proposed in this embodiment. It can be understood that, within the protection scope of the embodiments of the present application, the position of the voice start point and the position of the voice end point may also be determined in other ways, and no specific limitation is made here.
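The 20-frame example above can be sketched as a search over per-frame speech decisions. The function name `find_endpoints`, the boolean `speech_flags` input (one decision per frame, however it was obtained from the two features), and the run length of 5 frames are assumptions made for the example.

```python
def find_endpoints(speech_flags, min_run=5):
    """Locate the voice start and termination points.

    Returns (start_frame, end_frame) as 1-based frame numbers; an entry is
    None if the corresponding transition is not found.
    """
    start = end = None
    n = len(speech_flags)
    for k in range(min_run, n - min_run + 1):
        before = speech_flags[k - min_run:k]
        after = speech_flags[k:k + min_run]
        if start is None and not any(before) and all(after):
            start = k + 1  # first frame of the run in which speech is present
        elif start is not None and end is None and all(before) and not any(after):
            end = k + 1    # first frame of the run in which speech is absent
    return start, end


# Frames 1-5: no speech, frames 6-15: speech, frames 16-20: no speech.
flags = [False] * 5 + [True] * 10 + [False] * 5
endpoints = find_endpoints(flags)
```

For this input the sketch reports frame 6 as the voice start point and frame 16 as the voice termination point, in line with the example in the text.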

Please refer to FIG. 5, which is a schematic structural diagram of a voice endpoint detection device 300 according to an embodiment of the present application. The voice endpoint detection device 300 may include: an acquisition module 301, a noise reduction module 302, a calculation module 303, and a detection module 304.

The acquisition module 301 is configured to acquire a noisy speech signal.

The noise reduction module 302 is configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal.

The calculation module 303 is configured to calculate the spectral entropy ratio of the noise-reduced speech signal and calculate the short-term energy of the noise-reduced speech signal.

The detection module 304 is configured to perform speech endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.

In some embodiments, the noise reduction module 302 may be used to: perform framing and windowing on the noisy speech signal to obtain a multi-frame windowed time domain signal; perform a Fourier transform on each frame of the multi-frame windowed time domain signal to obtain a multi-frame frequency domain signal; estimate the Fourier coefficients of the frequency domain signal of each frame; and perform noise reduction processing on the frequency domain signal of each frame according to the Fourier coefficients of the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.

In some embodiments, the calculation module 303 may be used to: calculate the energy spectrum of the noise reduction frequency domain signal of each frame; calculate the spectral entropy of the noise reduction frequency domain signal of each frame; and calculate the spectral entropy ratio of the noise reduction frequency domain signal of each frame according to the energy spectrum and the spectral entropy of the noise reduction frequency domain signal of that frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.

In some embodiments, the calculation module 303 may also be used to: obtain frequency band information of the noise reduction frequency domain signal of each frame; divide the noise reduction frequency domain signal of each frame according to the frequency band information to obtain multiple sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame; calculate the energy spectrum of each of the multiple sub-noise reduction frequency domain signals; and calculate the energy spectrum of the noise reduction frequency domain signal of each frame according to the energy spectrum of each sub-noise reduction frequency domain signal.

In some embodiments, the calculation module 303 may also be used to: calculate the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of the noise reduction frequency domain signal of each frame; and calculate the spectral entropy of the noise reduction frequency domain signal of each frame according to the normalized probability density of each sub-noise reduction frequency domain signal.

In some embodiments, the calculation module 303 may also be used to: perform an inverse Fourier transform on the noise reduction frequency domain signal of each frame to obtain a multi-frame noise reduction time domain signal; and calculate the short-term energy of the noise reduction time domain signal of each frame to obtain the short-term energy of all frames of the noise-reduced speech signal.

In some embodiments, the detection module 304 may be used to: determine the position of the voice start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or determine the position of the voice termination point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.

In some embodiments, the detection module 304 may be further configured to: if no speech is detected according to the spectral entropy ratio and the short-term energy of a first number of frames of the noise-reduced speech signal, and the presence of speech is detected according to the spectral entropy ratio and the short-term energy of a second number of frames of the noise-reduced speech signal, determine that the position of the first frame of the second number of frames is the position of the voice start point.

In some embodiments, the detection module 304 may be further configured to: if the presence of speech is detected according to the spectral entropy ratio and the short-term energy of a third number of frames of the noise-reduced speech signal, and no speech is detected according to the spectral entropy ratio and the short-term energy of a fourth number of frames of the noise-reduced speech signal, determine that the position of the first frame of the fourth number of frames is the position of the voice termination point.

An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed on a computer, the computer is caused to execute the process in the voice endpoint detection method provided in this embodiment.

An embodiment of the present application also provides an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to execute the process in the voice endpoint detection method provided in this embodiment by calling the computer program stored in the memory.

For example, the aforementioned electronic device may be a mobile terminal such as a tablet computer or a smart phone. Please refer to FIG. 6, which is a first schematic structural diagram of an electronic device provided by an embodiment of the present application.

The mobile terminal 400 may include components such as a microphone 401, a memory 402, and a processor 403. Those skilled in the art may understand that the structure shown in FIG. 6 does not constitute a limitation on the mobile terminal, which may include more or fewer components than those illustrated, combine certain components, or arrange the components differently.

The microphone 401 can be used to pick up the voice uttered by the user and the like.

The memory 402 may be used to store application programs and data. The application program stored in the memory 402 contains executable code. The application program can form various functional modules. The processor 403 executes application programs stored in the memory 402 to execute various functional applications and data processing.

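The processor 403 is the control center of the mobile terminal. It uses various interfaces and lines to connect the parts of the entire mobile terminal, and performs the various functions of the mobile terminal and processes data by running or executing the application programs stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the mobile terminal as a whole.

In this embodiment, the processor 403 in the mobile terminal loads the executable code corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 403 runs the application programs stored in the memory 402, thereby implementing the following process:

Obtain a noisy speech signal;

Perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;

Calculate the spectral entropy ratio of the noise-reduced speech signal, and calculate the short-term energy of the noise-reduced speech signal;

Perform voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.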

Please refer to FIG. 7, which is a second schematic structural diagram of an electronic device according to an embodiment of the present application.

The mobile terminal 500 may include components such as a microphone 501, a memory 502, a processor 503, an input unit 504, an output unit 505, a speaker 506, and the like.

The microphone 501 can be used to pick up the voice uttered by the user and the like.

The memory 502 may be used to store application programs and data. The application program stored in the memory 502 contains executable code. The application program can form various functional modules. The processor 503 executes application programs stored in the memory 502 to execute various functional applications and data processing.

The processor 503 is the control center of the mobile terminal. It uses various interfaces and lines to connect the parts of the entire mobile terminal, and performs the various functions of the mobile terminal and processes data by running or executing the application programs stored in the memory 502 and calling the data stored in the memory 502, thereby monitoring the mobile terminal as a whole.

The input unit 504 may be used to receive input numbers, character information, or user characteristic information (such as fingerprints), and generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.

The output unit 505 can be used to display information input by the user or provided to the user and various graphical user interfaces of the mobile terminal. These graphical user interfaces can be composed of graphics, text, icons, videos, and any combination thereof. The output unit may include a display panel.

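In this embodiment, the processor 503 in the mobile terminal loads the executable code corresponding to the processes of one or more application programs into the memory 502 according to the following instructions, and the processor 503 runs the application programs stored in the memory 502, thereby implementing the following process:

Obtain a noisy speech signal;

Perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;

Calculate the spectral entropy ratio of the noise-reduced speech signal, and calculate the short-term energy of the noise-reduced speech signal;

Perform voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.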

In some embodiments, when the processor 503 executes the process of performing noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal, it may perform: framing and windowing the noisy speech signal to obtain a multi-frame windowed time domain signal; performing a Fourier transform on each frame of the multi-frame windowed time domain signal to obtain a multi-frame frequency domain signal; estimating the Fourier coefficients of each frame of the frequency domain signal; and performing noise reduction processing on each frame of the frequency domain signal according to its Fourier coefficients to obtain a multi-frame noise-reduced frequency domain signal.
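The framing, windowing, and Fourier transform steps above can be sketched as follows. This is a minimal illustration rather than the implementation of this application: the Hamming window, the frame length, the hop size, and every function name are assumptions chosen for the example, and the coefficient estimation and noise reduction steps are not covered here.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split samples into overlapping frames of frame_len, advancing by hop."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def hamming(frame_len):
    """Hamming window coefficients (the window type is an assumption)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
            for n in range(frame_len)]


def dft(frame):
    """Naive O(N^2) discrete Fourier transform; an FFT would be used in practice."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def to_frequency_domain(samples, frame_len=8, hop=4):
    """Return one complex spectrum per windowed frame."""
    window = hamming(frame_len)
    spectra = []
    for frame in frame_signal(samples, frame_len, hop):
        windowed = [s * w for s, w in zip(frame, window)]
        spectra.append(dft(windowed))
    return spectra
```

With 16 samples, a frame length of 8, and a hop of 4, the sketch yields three overlapping frames, each transformed into 8 complex coefficients.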

In some embodiments, when the processor 503 executes the process of calculating the spectral entropy ratio of the noise-reduced speech signal, it may perform: calculating the energy spectrum of each frame of the noise-reduced frequency domain signal; calculating the spectral entropy of each frame of the noise-reduced frequency domain signal; and calculating the spectral entropy ratio of each frame of the noise-reduced frequency domain signal according to the energy spectrum and the spectral entropy of that frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
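As a rough illustration of combining a frame's energy with its spectral entropy, the sketch below uses the form sqrt(1 + |E / H|). This formula is an assumption made for the example and is not stated in this passage; the function name and the eps guard against a zero entropy are likewise hypothetical.

```python
import math


def spectral_entropy_ratio(frame_energy, entropy, eps=1e-12):
    """Assumed ratio sqrt(1 + |E / H|): grows with energy, shrinks with entropy."""
    # eps avoids division by zero when the entropy of a frame is 0
    return math.sqrt(1.0 + abs(frame_energy / max(entropy, eps)))
```

Under this assumed form, a frame with high energy and low entropy (typical of voiced speech) gets a larger value than a frame with low energy and a flat, high-entropy spectrum.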

In some embodiments, when the processor 503 executes the process of calculating the energy spectrum of each frame of the noise-reduced frequency domain signal, it may perform: acquiring frequency band information of each frame of the noise-reduced frequency domain signal; dividing each frame of the noise-reduced frequency domain signal according to the frequency band information to obtain a plurality of sub-noise reduction frequency domain signals corresponding to that frame; calculating the energy spectrum of each of the plurality of sub-noise reduction frequency domain signals; and calculating the energy spectrum of each frame of the noise-reduced frequency domain signal according to the energy spectrum of each of its sub-noise reduction frequency domain signals.
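The sub-band division can be sketched as below. The sketch assumes equal-width bands over the non-negative frequency bins and takes the frame energy as the sum of the band energies; both choices, and the function names, are assumptions for illustration, since the frequency band information itself is not specified in this passage.

```python
def subband_energies(spectrum, num_bands):
    """Split one frame's spectrum into equal-width bands; return each band's energy."""
    half = spectrum[:len(spectrum) // 2 + 1]  # keep non-negative frequencies only
    band_size = len(half) // num_bands
    if band_size == 0:
        raise ValueError("num_bands exceeds the number of frequency bins")
    energies = []
    for b in range(num_bands):
        start = b * band_size
        # the last band absorbs any leftover bins
        end = len(half) if b == num_bands - 1 else start + band_size
        energies.append(sum(abs(c) ** 2 for c in half[start:end]))
    return energies


def frame_energy(band_energies):
    """Energy of the whole frame, taken here as the sum of its band energies."""
    return sum(band_energies)
```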

In some embodiments, when the processor 503 executes the process of calculating the spectral entropy of each frame of the noise-reduced frequency domain signal, it may perform: calculating the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of that sub-noise reduction frequency domain signal and the energy spectrum of the frame of the noise-reduced frequency domain signal to which it belongs; and calculating the spectral entropy of each frame of the noise-reduced frequency domain signal according to the normalized probability density of each of its sub-noise reduction frequency domain signals.
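A minimal sketch of these two steps, assuming the normalized probability density of a sub-band is its energy divided by the total frame energy and the spectral entropy is the Shannon entropy of those densities (natural logarithm). The function names and the handling of an all-zero frame are assumptions for illustration.

```python
import math


def band_probabilities(band_energies):
    """Normalize band energies so that they sum to 1 (all zeros if the frame is silent)."""
    total = sum(band_energies)
    if total <= 0.0:
        return [0.0 for _ in band_energies]
    return [e / total for e in band_energies]


def spectral_entropy(band_energies):
    """Shannon entropy of the normalized band energies, in nats."""
    entropy = 0.0
    for p in band_probabilities(band_energies):
        if p > 0.0:  # a band with zero probability contributes nothing
            entropy -= p * math.log(p)
    return entropy
```

A flat spectrum over four bands gives the maximum entropy log(4), while a spectrum concentrated in a single band gives an entropy of 0, which is why lower entropy tends to indicate structured, speech-like frames.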

In some embodiments, when the processor 503 executes the process of calculating the short-term energy of the noise-reduced speech signal, it may perform: performing an inverse Fourier transform on each frame of the noise-reduced frequency domain signal to obtain a multi-frame noise-reduced time domain signal; and calculating the short-term energy of each frame of the noise-reduced time domain signal to obtain the short-term energy of all frames of the noise-reduced speech signal.
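The inverse transform and energy computation can be sketched as follows. The sketch assumes short-term energy means the sum of squared samples in a frame and uses a naive inverse DFT in place of an inverse FFT; the function names are hypothetical.

```python
import cmath
import math


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def short_term_energy(frame):
    """Sum of squared samples in one time domain frame."""
    return sum(s * s for s in frame)


def short_term_energies(spectra):
    """Short-term energy of every frame, given one spectrum per frame."""
    return [short_term_energy(idft(spec)) for spec in spectra]
```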

In some embodiments, when the processor 503 executes the process of performing voice endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal, it may perform: determining the position of the speech start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and/or determining the position of the speech termination point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.

In some embodiments, when the processor 503 executes the process of determining the position of the speech start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal, it may perform: if it is detected, according to the spectral entropy ratio of a first number of frames of the noise-reduced speech signal and the short-term energy of the first number of frames of the noise-reduced speech signal, that no speech is present, and it is detected, according to the spectral entropy ratio of a second number of frames of the noise-reduced speech signal and the short-term energy of the second number of frames of the noise-reduced speech signal, that speech is present, determining that the position of the first frame of the second number of frames is the position of the speech start point.

In some embodiments, when the processor 503 executes the process of determining the position of the speech termination point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal, it may perform: if it is detected, according to the spectral entropy ratio of a third number of frames of the noise-reduced speech signal and the short-term energy of the third number of frames of the noise-reduced speech signal, that speech is present, and it is detected, according to the spectral entropy ratio of a fourth number of frames of the noise-reduced speech signal and the short-term energy of the fourth number of frames of the noise-reduced speech signal, that no speech is present, determining that the position of the first frame of the fourth number of frames is the position of the speech termination point.
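The start point and termination point rules can be sketched as a search for a run of frames of one kind directly followed by a run of the other kind. The per-frame decision rule (both features above fixed thresholds), the threshold values, the frame counts, and all names below are assumptions for illustration only.

```python
def frame_flags(ratios, energies, ratio_thr, energy_thr):
    """Mark a frame as speech when both features exceed their (assumed) thresholds."""
    return [r > ratio_thr and e > energy_thr for r, e in zip(ratios, energies)]


def find_transition(flags, before_count, after_count, before_value):
    """Index of the first frame of a run of after_count frames that differ from
    before_value and directly follow before_count frames equal to before_value."""
    for i in range(before_count, len(flags) - after_count + 1):
        before = flags[i - before_count:i]
        after = flags[i:i + after_count]
        if all(f == before_value for f in before) and all(f != before_value for f in after):
            return i
    return None


def find_endpoints(flags, n1=2, n2=2, n3=2, n4=2):
    """Return (start, end) frame indices, or None where no endpoint is found."""
    # start: n1 non-speech frames followed by n2 speech frames
    start = find_transition(flags, n1, n2, before_value=False)
    end = None
    if start is not None:
        # end: n3 speech frames followed by n4 non-speech frames, searched after start
        rel = find_transition(flags[start:], n3, n4, before_value=True)
        end = None if rel is None else start + rel
    return start, end
```

For the flag sequence three non-speech frames, four speech frames, three non-speech frames, the sketch reports frame 3 as the start point and frame 7 as the termination point, that is, the first frame of the second and of the fourth group respectively.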

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed in an embodiment, you can refer to the detailed description of the voice endpoint detection method above, which will not be repeated here.

The voice endpoint detection device provided by the embodiments of the present application and the voice endpoint detection method in the above embodiments belong to the same concept. Any of the methods provided in the voice endpoint detection method embodiments can be run on the voice endpoint detection device. For the specific implementation process, please refer to the embodiments of the voice endpoint detection method, which will not be repeated here.

It should be noted that, for the voice endpoint detection method described in the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the method can be completed by controlling related hardware through a computer program. The computer program may be stored in a computer-readable storage medium, such as a memory, and executed by at least one processor, and the execution process may include the processes of the embodiments of the voice endpoint detection method. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM, Read Only Memory), a random access memory (RAM, Random Access Memory), and so on.

For the voice endpoint detection device of the embodiments of the present application, the functional modules may be integrated into one processing chip, each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated modules may be implemented in the form of hardware or in the form of software function modules. If an integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disk.

The voice endpoint detection method, device, storage medium, and electronic equipment provided in the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may, according to the ideas of the present application, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the present application.

Claims

A voice endpoint detection method, including:

Obtain noisy speech signals;

Performing noise reduction processing on the noise-containing voice signal to obtain a noise-reduced voice signal;

Calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal;

Perform speech endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
The method for detecting a voice endpoint according to claim 1, wherein the performing noise reduction processing on the noise-containing speech signal to obtain a noise-reduced speech signal includes:

Performing frame-by-frame windowing on the noisy speech signal to obtain a multi-frame windowed time-domain signal;

Performing a Fourier transform on each frame of the multi-frame windowed time-domain signal to obtain a multi-frame frequency domain signal;

Estimating the Fourier coefficients of each frame of the frequency domain signal;

According to the Fourier coefficients of the frequency domain signal of each frame, performing noise reduction processing on the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.
The method for detecting a voice endpoint according to claim 2, wherein the calculating the spectral entropy ratio of the noise-reduced voice signal includes:

Calculate the energy spectrum of the noise-reduced frequency domain signal per frame;

Calculate the spectral entropy of the noise-reduced frequency domain signal per frame;

Calculate the spectral entropy ratio of the noise-reduced frequency domain signal of each frame according to the energy spectrum of the noise-reduced frequency domain signal of each frame and the spectral entropy of the noise-reduced frequency domain signal of each frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
The voice endpoint detection method according to claim 3, wherein the calculating the energy spectrum of the noise-reduced frequency domain signal per frame includes:

Obtain the frequency band information of the noise reduction frequency domain signal of each frame;

Dividing the noise reduction frequency domain signal of each frame according to the frequency band information to obtain multiple sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame;

Calculating the energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals;

The energy spectrum of the noise reduction frequency domain signal of each frame is calculated according to the energy spectrum of each sub-noise reduction frequency domain signal.
The speech endpoint detection method according to claim 4, wherein the calculating the spectral entropy of the noise-reduced frequency domain signal per frame includes:

Calculating the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of the per-frame noise reduction frequency domain signal;

According to the normalized probability density of each sub-noise reduction frequency domain signal, the spectral entropy of the noise reduction frequency domain signal of each frame is calculated.
The voice endpoint detection method according to claim 3, wherein the calculation of the short-term energy of the noise-reduced voice signal includes:

Perform inverse Fourier transform on each frame of noise reduction frequency domain signal to obtain multiframe noise reduction time domain signal;

Calculate the short-term energy of the noise reduction time-domain signal of each frame to obtain the short-term energy of all frames of the noise reduction speech signal.
The speech endpoint detection method according to claim 6, wherein the speech endpoint detection based on the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal includes:

Determine the position of the speech starting point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and / or

The position of the speech termination point is determined according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
The method for detecting a voice endpoint according to claim 7, wherein the determining the position of the speech start point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal includes:

If it is detected, according to the spectral entropy ratio of a first number of frames of the noise-reduced speech signal and the short-term energy of the first number of frames of the noise-reduced speech signal, that no speech is present, and it is detected, according to the spectral entropy ratio of a second number of frames of the noise-reduced speech signal and the short-term energy of the second number of frames of the noise-reduced speech signal, that speech is present, determining that the position of the first frame of the second number of frames is the position of the speech start point.
The method for detecting a voice endpoint according to claim 7, wherein the determining the position of the speech termination point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal includes:

If it is detected, according to the spectral entropy ratio of a third number of frames of the noise-reduced speech signal and the short-term energy of the third number of frames of the noise-reduced speech signal, that speech is present, and it is detected, according to the spectral entropy ratio of a fourth number of frames of the noise-reduced speech signal and the short-term energy of the fourth number of frames of the noise-reduced speech signal, that no speech is present, determining that the position of the first frame of the fourth number of frames is the position of the speech termination point.
A voice endpoint detection device, including:

An acquisition module, configured to acquire a noisy speech signal;

A noise reduction module, configured to perform noise reduction processing on the noisy speech signal to obtain a noise-reduced speech signal;

A calculation module, used to calculate the spectral entropy ratio of the noise-reduced speech signal, and calculate the short-term energy of the noise-reduced speech signal;

A detection module, configured to perform speech endpoint detection based on the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
A storage medium, wherein a computer program is stored in the storage medium, and when the computer program is run on a computer, the computer is caused to execute the voice endpoint detection method according to any one of claims 1 to 9.
An electronic device, wherein the electronic device includes a processor and a memory, a computer program is stored in the memory, and the processor is configured to execute the following by calling the computer program stored in the memory:

Obtain noisy speech signals;

Performing noise reduction processing on the noise-containing voice signal to obtain a noise-reduced voice signal;

Calculating the spectral entropy ratio of the noise-reduced speech signal, and calculating the short-term energy of the noise-reduced speech signal;

Perform speech endpoint detection according to the spectral entropy ratio of the noise-reduced speech signal and the short-term energy of the noise-reduced speech signal.
The electronic device according to claim 12, wherein the processor is configured to execute:

Performing frame-by-frame windowing on the noisy speech signal to obtain a multi-frame windowed time-domain signal;

Performing a Fourier transform on each frame of the multi-frame windowed time-domain signal to obtain a multi-frame frequency domain signal;

Estimating the Fourier coefficients of each frame of the frequency domain signal;

According to the Fourier coefficients of the frequency domain signal of each frame, performing noise reduction processing on the frequency domain signal of each frame to obtain a multi-frame noise reduction frequency domain signal.
The electronic device according to claim 13, wherein the processor is configured to execute:

Calculate the energy spectrum of the noise-reduced frequency domain signal per frame;

Calculate the spectral entropy of the noise-reduced frequency domain signal per frame;

Calculate the spectral entropy ratio of the noise-reduced frequency domain signal of each frame according to the energy spectrum of the noise-reduced frequency domain signal of each frame and the spectral entropy of the noise-reduced frequency domain signal of each frame, to obtain the spectral entropy ratio of all frames of the noise-reduced speech signal.
The electronic device according to claim 14, wherein the processor is configured to execute:

Obtain the frequency band information of the noise reduction frequency domain signal of each frame;

Dividing the noise reduction frequency domain signal of each frame according to the frequency band information to obtain multiple sub-noise reduction frequency domain signals corresponding to the noise reduction frequency domain signal of each frame;

Calculating the energy spectrum of each sub-noise reduction frequency domain signal of the plurality of sub-noise reduction frequency domain signals;

The energy spectrum of the noise reduction frequency domain signal of each frame is calculated according to the energy spectrum of each sub-noise reduction frequency domain signal.
The electronic device according to claim 15, wherein the processor is configured to execute:

Calculating the normalized probability density of each sub-noise reduction frequency domain signal according to the energy spectrum of each sub-noise reduction frequency domain signal and the energy spectrum of the per-frame noise reduction frequency domain signal;

According to the normalized probability density of each sub-noise reduction frequency domain signal, the spectral entropy of the noise reduction frequency domain signal of each frame is calculated.
The electronic device according to claim 14, wherein the processor is configured to execute:

Perform inverse Fourier transform on each frame of noise reduction frequency domain signal to obtain multiframe noise reduction time domain signal;

Calculate the short-term energy of the noise reduction time-domain signal of each frame to obtain the short-term energy of all frames of the noise reduction speech signal.
The electronic device according to claim 17, wherein the processor is configured to execute:

Determine the position of the speech starting point according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal; and / or

The position of the speech termination point is determined according to the spectral entropy ratio of all frames of the noise-reduced speech signal and the short-term energy of all frames of the noise-reduced speech signal.
The electronic device according to claim 18, wherein the processor is configured to execute:

If it is detected, according to the spectral entropy ratio of a first number of frames of the noise-reduced speech signal and the short-term energy of the first number of frames of the noise-reduced speech signal, that no speech is present, and it is detected, according to the spectral entropy ratio of a second number of frames of the noise-reduced speech signal and the short-term energy of the second number of frames of the noise-reduced speech signal, that speech is present, determining that the position of the first frame of the second number of frames is the position of the speech start point.
The electronic device according to claim 18, wherein the processor is configured to execute:

If it is detected, according to the spectral entropy ratio of a third number of frames of the noise-reduced speech signal and the short-term energy of the third number of frames of the noise-reduced speech signal, that speech is present, and it is detected, according to the spectral entropy ratio of a fourth number of frames of the noise-reduced speech signal and the short-term energy of the fourth number of frames of the noise-reduced speech signal, that no speech is present, determining that the position of the first frame of the fourth number of frames is the position of the speech termination point.