CN112435691A - On-line voice endpoint detection post-processing method, device, equipment and storage medium - Google Patents

On-line voice endpoint detection post-processing method, device, equipment and storage medium Download PDF

Info

Publication number
CN112435691A
CN112435691A
Authority
CN
China
Prior art keywords
audio frame
voice
state
current audio
door
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011083235.3A
Other languages
Chinese (zh)
Other versions
CN112435691B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN202011083235.3A
Publication of CN112435691A
Application granted
Publication of CN112435691B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L2025/783 - Detection of presence or absence of voice signals based on threshold decision

Abstract

The invention belongs to the technical field of audio processing and provides an online voice endpoint detection post-processing method, apparatus, device, and storage medium. The method comprises: acquiring the gate state of the audio frame preceding the current audio frame; judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and determining the audio frame type of the current audio frame according to its gate state. By converting the speech/non-speech decision in online endpoint-detection post-processing into an open-gate/closed-gate decision, the accuracy of the speech/non-speech decision is improved, and with it the performance of speech recognition.

Description

On-line voice endpoint detection post-processing method, device, equipment and storage medium
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a method, a device, equipment and a storage medium for on-line voice endpoint detection post-processing.
Background
Voice endpoint detection, also known as voice activity detection (VAD), can be divided by application scenario into offline VAD and online VAD. The main task of offline VAD is to accurately locate the start and end points of speech within noisy audio, with the entire recording available before the decision is made. The main task of online VAD is to judge whether the current moment belongs to a speech or non-speech segment, which imposes strict real-time requirements.
To better distinguish speech from non-speech, many deep-neural-network-based VAD algorithms have been proposed in recent years. The mainstream structures are CRNN, i.e., CNN (convolutional neural network) + RNN (recurrent neural network) + DNN (deep neural network), and CLDNN, i.e., CNN + LSTM (long short-term memory network) + DNN, which treat the speech/non-speech judgment as a binary classification problem. The common idea is: first use a CNN for feature extraction; then, because a speech signal, unlike an image, is a sequence with temporal order, use an RNN/LSTM/GRU (gated recurrent unit) to model the sequence; finally use a DNN with a softmax layer for the output. However, the model output can still misjudge speech frames as non-speech frames and vice versa, which degrades speech recognition performance.
Disclosure of Invention
The invention aims to provide an online voice endpoint detection post-processing method, apparatus, device, and storage medium, so as to solve the problem in the prior art that misjudgment of speech/non-speech frames lowers speech recognition performance.
In one aspect, the present invention provides an online voice endpoint detection post-processing method, including the following steps:
acquiring the gate state of the audio frame preceding the current audio frame;
judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and
determining the audio frame type of the current audio frame according to the gate state of the current audio frame.
Preferably, the step of judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame further includes:
if the gate state of the previous audio frame is the open state, comparing the acquired first speech probability average with a preset first speech probability average threshold, where the first speech probability average is the average of the speech probability values of all audio frames within a continuous first preset speech length before the current audio frame; and
if the first speech probability average is greater than or equal to the first speech probability average threshold, judging that the gate state of the current audio frame is the open state.
Preferably, after the step of comparing the acquired first speech probability average with the preset first speech probability average threshold, the method further includes:
if the first speech probability average is smaller than the first speech probability average threshold, comparing the acquired second speech probability average with a preset second speech probability average threshold, where the second speech probability average is the average of the speech probability values of all audio frames within a continuous second preset speech length before the current audio frame, the second preset speech length being shorter than the first; and
if the second speech probability average is greater than or equal to the second speech probability average threshold, judging that the gate state of the current audio frame is the open state.
Preferably, after the step of comparing the acquired second speech probability average with the preset second speech probability average threshold, the method further includes:
if the second speech probability average is smaller than the second speech probability average threshold, comparing the acquired speech probability value of the current audio frame with a third speech probability average, where the third speech probability average is the average of the speech probability values of the audio frames over the continuous open-gate run before the current audio frame; and
if the speech probability value of the current audio frame is greater than or equal to the third speech probability average, judging that the gate state of the current audio frame is the open state.
Preferably, after the step of comparing the speech probability value of the current audio frame with the third speech probability average, the method further includes:
if the speech probability value of the current audio frame is smaller than the third speech probability average, judging whether the current audio frame is an initial audio frame;
if the current audio frame is an initial audio frame, judging that the gate state of the current audio frame is the open state; and
if the current audio frame is not an initial audio frame, judging that the gate state of the current audio frame is the closed state.
Preferably, the step of judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame further includes:
if the gate state of the previous audio frame is the closed state, comparing the speech probability value of the current audio frame with a preset speech probability threshold;
if the speech probability value of the current audio frame is greater than or equal to the speech probability threshold, judging that the gate state of the current audio frame is the open state; and
if the speech probability value of the current audio frame is smaller than the speech probability threshold, judging that the gate state of the current audio frame is the closed state.
Preferably, the step of determining the audio frame type of the current audio frame according to the gate state of the current audio frame includes:
if the gate state of the current audio frame is the open state, determining that the current audio frame is a speech frame; and
if the gate state of the current audio frame is the closed state, determining that the current audio frame is a non-speech frame.
In another aspect, the present invention provides an online voice endpoint detection post-processing apparatus, including:
a gate state acquisition unit, configured to acquire the gate state of the audio frame preceding the current audio frame;
a gate state judgment unit, configured to judge the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and
an audio frame type determination unit, configured to determine the audio frame type of the current audio frame according to the gate state of the current audio frame.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
The invention acquires the gate state of the audio frame preceding the current audio frame, judges the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame, and determines the audio frame type of the current audio frame from its gate state. By converting the speech/non-speech decision in online endpoint-detection post-processing into a decision between the two gate states, open and closed, the accuracy of the speech/non-speech decision is improved, and with it the performance of speech recognition.
Drawings
Fig. 1 is a flowchart of the implementation of the online voice endpoint detection post-processing method according to a first embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for determining a gate state of a current audio frame according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an online voice endpoint detection post-processing apparatus according to a third embodiment of the present invention; and
fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
the first embodiment is as follows:
Fig. 1 shows the implementation flow of the online voice endpoint detection post-processing method provided in the first embodiment of the present invention. For convenience of description, only the parts relevant to this embodiment are shown, detailed as follows:
in step S101, the gate state of the audio frame immediately preceding the current audio frame is acquired.
The embodiment of the invention is suitable for online voice endpoint detection and can in particular be applied to electronic devices with computing capability, such as mobile phones, wristbands, tablet computers, laptop computers, and desktop computers. The invention divides voice endpoint detection into three parts: pre-processing, model, and post-processing. The pre-processing part extracts audio features, which serve as the model input; it usually includes windowing, framing, STFT (short-time Fourier transform), and similar steps. The model part predicts and outputs the probability that the current audio frame is a speech frame; its input is generally an M-dimensional mel spectrum of N frames, where the buffer (cache register) corresponding to the N frames holds, for example, 200 ms of audio data and M is typically 40 or 64. The post-processing part is the method described in this embodiment. Because online endpoint detection requires the output delay relative to the input to be kept within 200 ms (in line with the buffer size), i.e., the real-time requirement is strict, the method converts the speech/non-speech judgment in online endpoint-detection post-processing into a judgment between two gate states, open and closed, and makes a comprehensive decision by combining multiple factors such as the speech probability value of the current audio frame, the speech probability values of preceding audio frames, and the gate state, thereby improving speech recognition performance. The gate state is either the open state or the closed state and can be represented by 1 and 0 in a concrete implementation.
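As an illustration of that front end, a minimal NumPy sketch of windowing, framing, and STFT magnitude might look as follows. The 25 ms frame and 10 ms hop are common assumptions rather than values from the patent, and the projection to an M-band mel spectrum (M = 40 or 64) would be applied to the returned spectrogram as a final step.

```python
import numpy as np

def stft_frames(signal: np.ndarray, sr: int = 16000,
                frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Windowing + framing + STFT magnitude, the pre-processing steps
    listed above. Frame/hop sizes are typical assumptions."""
    frame_len = sr * frame_ms // 1000          # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000                  # 160 samples at 16 kHz
    window = np.hanning(frame_len)             # analysis window
    n_frames = max(0, (len(signal) - frame_len) // hop + 1)
    spec = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len] * window
        spec[i] = np.abs(np.fft.rfft(frame))   # magnitude spectrum per frame
    return spec  # a mel filterbank would be applied here to get M bands
```

One second of 16 kHz audio yields 98 frames of 201 frequency bins, which would then be reduced to the M mel bands the model consumes.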
In step S102, the gate state of the current audio frame is determined according to the gate state determination manner matched with the gate state of the previous audio frame.
In the embodiment of the present invention, for scenarios such as chatting via voice input, short pauses occur between speech segments when a speaker talks for a long time. Preferably, therefore, when the gate state of the current audio frame is judged according to the mode matched with the gate state of the previous audio frame: if the gate state of the previous audio frame is the open state, the acquired first speech probability average is compared with a preset first speech probability average threshold, and if the first speech probability average is greater than or equal to that threshold, the gate state of the current audio frame is judged to be the open state. In this case no state transition is needed and the gate simply stays open, ensuring the continuity and integrity of the speaker's speech segments. Here, the speech probability value generally refers to the probability that an audio frame is predicted to be a speech frame, typically produced by a neural network such as a CRNN or CLDNN; the first speech probability average threshold can be set flexibly according to the actual acoustic environment; and the first speech probability average is the average of the speech probability values of all audio frames within a continuous first preset speech length before the current audio frame.
For scenarios such as voice wake-up, the speaker usually talks only briefly. Preferably, therefore, after the first speech probability average has been compared with the preset first threshold: if the first speech probability average is below the first threshold, the acquired second speech probability average is compared with a preset second speech probability average threshold, and if the second average is greater than or equal to the second threshold, the gate state of the current audio frame is judged to be the open state and the gate stays open, further reducing the frame misjudgment rate. The second speech probability average is the average of the speech probability values of all audio frames within a continuous second preset speech length before the current audio frame, the second preset speech length being shorter than the first.
It should be noted that the first and second preset speech lengths may be set empirically for the actual application scenario.
The probability that an audio frame is predicted to be a speech frame is not fixed across speakers and environments: in a quiet environment the predicted speech probability is generally high, while in a complex environment, especially at a low signal-to-noise ratio, it is generally low. Preferably, therefore, after the second speech probability average has been compared with the preset second threshold: if the second average is below the second threshold, the acquired speech probability value of the current audio frame is compared with a third speech probability average, and if the current frame's speech probability value is greater than or equal to the third average, the gate state of the current audio frame is judged to be the open state. This improves the accuracy of the decision at low signal-to-noise ratios and hence the speech recognition performance in such environments. The third speech probability average is the average of the speech probability values of the audio frames over the continuous open-gate run before the current audio frame. Note that if the gate state before the current audio frame is the closed state, there is no such continuous run, and the third speech probability average is zero.
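As a small illustration of this bookkeeping, the third average can be maintained incrementally and reset whenever the gate closes. The class name and interface below are hypothetical; only the reset-to-zero behavior is taken from the text.

```python
class OpenRunAverage:
    """Running mean of P(speech) over the current continuous open-gate run
    (the third speech probability average). Resets to zero when the gate
    closes, matching the note that the average is zero after a closed state."""

    def __init__(self) -> None:
        self.total = 0.0   # sum of probabilities in the current open run
        self.count = 0     # number of frames in the current open run

    def update(self, prob: float, gate_open: bool) -> float:
        if gate_open:
            self.total += prob
            self.count += 1
        else:              # gate closed: the run is broken, start over
            self.total, self.count = 0.0, 0
        return self.total / self.count if self.count else 0.0
```

Each frame's probability is folded in only while the gate stays open, so the average always reflects the current uninterrupted speech segment.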
When the system has just started, little speech information is available and the gate state obtained by the above method may be unstable. Preferably, therefore, after the speech probability value has been compared with the third speech probability average: if the current frame's speech probability value is below the third average, it is judged whether the current audio frame is an initial audio frame. If it is, the gate state of the current audio frame is judged to be the open state; if it is not, the gate state is judged to be the closed state, and the gate transitions from open to closed. This improves the accuracy of the speech/non-speech decision and thus the speech recognition performance.
When the gate state of the previous audio frame is the closed state, the goal is to filter out non-speech frames as far as possible without losing speech frames. Preferably, therefore, when the gate state of the current audio frame is judged according to the mode matched with the gate state of the previous audio frame: if the previous frame's gate state is the closed state, the speech probability value of the current audio frame is compared with a preset speech probability threshold. If the current frame's speech probability value is greater than or equal to the threshold, its gate state is judged to be the open state and the gate transitions from closed to open; if it is below the threshold, the gate state is judged to be the closed state and the gate stays closed. This simplifies the speech/non-speech decision process and improves speech recognition performance.
Of course, in practical applications the probability value may instead be the probability that the audio frame is a non-speech frame; the comparison parameters and the specific gate-state decisions then need to be adjusted accordingly, but the basic decision procedure remains essentially the same as described in this embodiment.
In step S103, the audio frame type of the current audio frame is determined according to the gate state of the current audio frame.
In the embodiment of the present invention, different gate states may represent different audio frame types, which can be set according to actual needs. Here the audio frame types are speech frames and non-speech frames: the open state indicates that the current audio frame is a speech frame, and the closed state indicates that it is a non-speech frame. Preferably, therefore, if the gate state of the current audio frame is the open state, the current audio frame is determined to be a speech frame, and if it is the closed state, a non-speech frame. This reduces the speech/non-speech misjudgment rate and thus improves speech recognition performance.
In the embodiment of the invention, the gate state of the audio frame preceding the current audio frame is acquired, the gate state of the current audio frame is judged according to a gate-state judgment mode matched with the previous frame's gate state, and the audio frame type of the current frame is determined from its gate state. By converting the speech/non-speech decision in online endpoint-detection post-processing into a decision between the two gate states, open and closed, the speech/non-speech misjudgment rate is reduced and speech recognition performance is improved.
Example two:
Fig. 2 is a flowchart of the method for judging the gate state of the current audio frame according to the second embodiment of the present invention. For convenience of description, only the parts relevant to this embodiment are shown, detailed as follows:
in fig. 2, the door state of the current audio frame is represented by door _ state, and the door state of the previous audio frame of the current audio frame is represented by last _ state, where 1 represents the door-open state and 0 represents the door-closed state; the first speech probability average is represented by S1, and the first speech probability average threshold is represented by T1; the second speech probability average is represented by S2, the second speech probability average threshold is represented by T2, and the third speech probability average is represented by S3; the voice probability value of the current audio frame is represented by label, and the voice probability threshold value is represented by thred; the relationship between the current audio frame and the initial audio frame is represented by begin _ flag, where begin _ flag =1 represents that the current audio frame is the initial audio frame, and begin _ flag =0 represents that the current audio frame is the non-initial audio frame.
In step S201, determining whether a door state of a previous audio frame of a current audio frame is a door-closed state, if not, performing S202, and if so, performing S206;
in step S202, determining whether the first speech probability average value is greater than or equal to a first speech probability average threshold, if not, performing S203, and if so, performing S207;
in step S203, determining whether the second speech probability average value is greater than or equal to a second speech probability average threshold, if not, performing S204, and if so, performing S207;
in step S204, determining whether the speech probability value of the current audio frame is greater than or equal to the third speech probability average value, if not, performing S205, and if so, performing S207;
in step S205, it is determined whether the current audio frame is an initial audio frame, if not, step S208 is executed, and if so, step S207 is executed;
in step S206, it is determined whether the speech probability value of the current audio frame is greater than or equal to a preset speech probability threshold, if not, step S208 is executed, and if so, step S207 is executed
In step S207, it is determined that the door state of the current audio frame is the door open state;
in step S208, it is determined that the door state of the current audio frame is the closed door state.
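Steps S201 through S208 amount to a small per-frame state machine. The Python sketch below mirrors that flow using the symbol names from fig. 2 (door_state, label, S1/T1, S2/T2, S3, thred, begin_flag). The threshold values and window lengths are illustrative assumptions, and the running-sum bookkeeping for S1, S2, and S3 is one plausible way to maintain the averages the flowchart reads, not the patent's prescribed implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class GateVAD:
    # Thresholds and window lengths below are illustrative assumptions.
    t1: float = 0.5        # T1: first speech-probability average threshold
    t2: float = 0.6        # T2: second threshold, over a shorter window
    thred: float = 0.8     # per-frame threshold used when the gate is closed
    long_win: int = 50     # first preset speech length, in frames
    short_win: int = 10    # second preset speech length, in frames

    door_state: int = 0    # 1 = open (speech), 0 = closed (non-speech)
    begin_flag: int = 1    # 1 until the first frame has been processed
    history: deque = field(default_factory=lambda: deque(maxlen=50))  # = long_win
    open_sum: float = 0.0  # running sum over the current continuous open-gate run
    open_cnt: int = 0

    def _avg(self, n: int) -> float:
        recent = list(self.history)[-n:]       # frames before the current one
        return sum(recent) / len(recent) if recent else 0.0

    def step(self, label: float) -> bool:
        """label = P(speech) for the current frame; True means speech frame."""
        if self.door_state == 1:               # previous gate open: S202-S205
            s1 = self._avg(self.long_win)      # S1 over the long window
            s2 = self._avg(self.short_win)     # S2 over the short window
            s3 = self.open_sum / self.open_cnt if self.open_cnt else 0.0  # S3
            open_now = (s1 >= self.t1 or s2 >= self.t2
                        or label >= s3 or self.begin_flag == 1)
        else:                                  # previous gate closed: S206
            open_now = label >= self.thred
        self.door_state = 1 if open_now else 0

        # Bookkeeping: S3 averages only the current continuous open-gate run,
        # and resets to zero once the gate closes.
        if open_now:
            self.open_sum += label
            self.open_cnt += 1
        else:
            self.open_sum, self.open_cnt = 0.0, 0
        self.history.append(label)
        self.begin_flag = 0
        return open_now
```

Feeding a stream of per-frame speech probabilities through step() yields a speech/non-speech label per frame: the gate opens on a single confident frame while closed, and once open it closes only when the long-window, short-window, and open-run evidence all fail, which is what keeps short intra-speech pauses from fragmenting a segment.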
Example three:
Fig. 3 shows the structure of the online voice endpoint detection post-processing apparatus provided in the third embodiment of the present invention. For convenience of description, only the parts relevant to this embodiment are shown, including:
a gate state acquisition unit 31, configured to acquire the gate state of the audio frame preceding the current audio frame;
a gate state judgment unit 32, configured to judge the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and
an audio frame type determination unit 33, configured to determine the audio frame type of the current audio frame according to the gate state of the current audio frame.
Preferably, the apparatus further comprises:
a first comparison unit, configured to compare the acquired first speech probability average with a preset first speech probability average threshold if the gate state of the previous audio frame is the open state, where the first speech probability average is the average of the speech probability values of all audio frames within a continuous first preset speech length before the current audio frame; and
a first state determination unit, configured to judge that the gate state of the current audio frame is the open state if the first speech probability average is greater than or equal to the first speech probability average threshold.
Preferably, the apparatus further comprises:
a second comparison unit, configured to compare the acquired second speech probability average with a preset second speech probability average threshold if the first speech probability average is smaller than the first speech probability average threshold, where the second speech probability average is the average of the speech probability values of all audio frames within a continuous second preset speech length before the current audio frame, the second preset speech length being shorter than the first; and
a second state determination unit, configured to judge that the gate state of the current audio frame is the open state if the second speech probability average is greater than or equal to the second speech probability average threshold.
Preferably, the apparatus further comprises:
a third comparison unit, configured to compare the acquired speech probability value of the current audio frame with a third speech probability average if the second speech probability average is smaller than the second speech probability average threshold, where the third speech probability average is the average of the speech probability values of the audio frames over the continuous open-gate run before the current audio frame; and
a third state determination unit, configured to judge that the gate state of the current audio frame is the open state if the speech probability value of the current audio frame is greater than or equal to the third speech probability average.
Preferably, the apparatus further comprises:
the fourth comparison unit is configured to determine whether the current audio frame is the initial audio frame if the voice probability value of the current audio frame is smaller than the third voice probability average value;
the fourth state determining unit is configured to determine that the gate state of the current audio frame is the open state if the current audio frame is the initial audio frame; and
the fifth state determining unit is configured to determine that the gate state of the current audio frame is the closed state if the current audio frame is not the initial audio frame.
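The open-gate decision cascade formed by the above units (long-window average, then short-window average, then the running average of the current open-gate run, then the initial-frame fallback) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, threshold values, and window lengths are assumptions, since the embodiment does not fix concrete values.

```python
def decide_gate_when_open(p_cur, recent_probs, open_run_probs, is_initial,
                          len1=20, len2=5, t1=0.5, t2=0.6):
    """Gate decision for the case where the previous frame's gate was open.

    p_cur          -- voice probability of the current audio frame
    recent_probs   -- voice probabilities of the frames before the current one
    open_run_probs -- probabilities of frames in the continuous open-gate run
    is_initial     -- True if the current frame is the initial audio frame
    len1, len2     -- first/second preset voice lengths (len2 < len1), assumed values
    t1, t2         -- first/second voice probability average thresholds, assumed values
    """
    # First comparison: long-window average vs. the first average threshold.
    win1 = recent_probs[-len1:]
    if win1 and sum(win1) / len(win1) >= t1:
        return "open"
    # Second comparison: short-window average vs. the second average threshold.
    win2 = recent_probs[-len2:]
    if win2 and sum(win2) / len(win2) >= t2:
        return "open"
    # Third comparison: current probability vs. the open-run running average.
    if open_run_probs and p_cur >= sum(open_run_probs) / len(open_run_probs):
        return "open"
    # Fallback: the initial audio frame keeps the gate open; otherwise close it.
    return "open" if is_initial else "closed"
```

The cascade only closes the gate when every check fails and the frame is not the initial one, which gives the open state a deliberate bias against brief probability dips.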
Preferably, the apparatus further comprises:
the fifth comparison unit is configured to compare the voice probability value of the current audio frame with a preset voice probability threshold if the gate state of the previous audio frame is the closed state;
the sixth state determining unit is configured to determine that the gate state of the current audio frame is the open state if the voice probability value of the current audio frame is greater than or equal to the voice probability threshold; and
the seventh state determining unit is configured to determine that the gate state of the current audio frame is the closed state if the voice probability value of the current audio frame is smaller than the voice probability threshold.
Preferably, the frame type determining unit includes:
the first determining subunit is configured to determine that the current audio frame is a voice frame if the gate state of the current audio frame is the open state; and
the second determining subunit is configured to determine that the current audio frame is a non-voice frame if the gate state of the current audio frame is the closed state.
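Taken together, the units above form a small two-state machine driven by per-frame voice probabilities. A self-contained sketch of such a post-processor is given below; the class name, default thresholds, and window lengths are illustrative assumptions only and are not fixed by the embodiment.

```python
class VadGatePostProcessor:
    """Two-state (open/closed gate) post-processor over per-frame voice
    probabilities, sketching the scheme described above."""

    def __init__(self, len1=20, len2=5, t1=0.5, t2=0.6, t_open=0.7):
        self.len1, self.len2 = len1, len2   # first/second preset voice lengths (assumed)
        self.t1, self.t2 = t1, t2           # first/second average thresholds (assumed)
        self.t_open = t_open                # voice probability threshold for a closed gate
        self.history = []                   # probabilities of all past frames
        self.open_run = []                  # probabilities of the current open-gate run
        self.gate = "open"                  # the initial frame is evaluated in the open branch

    def process(self, p_cur):
        """Classify one frame; returns 'voice' or 'non-voice'."""
        if self.gate == "open":
            new_gate = self._decide_open(p_cur)
        else:
            # Closed gate: a single probability threshold decides whether it reopens.
            new_gate = "open" if p_cur >= self.t_open else "closed"
        # Bookkeeping for the running averages.
        self.history.append(p_cur)
        self.open_run = self.open_run + [p_cur] if new_gate == "open" else []
        self.gate = new_gate
        # The gate state maps directly onto the frame type.
        return "voice" if new_gate == "open" else "non-voice"

    def _decide_open(self, p_cur):
        # Cascade of checks while the gate is open (see the units above).
        win1 = self.history[-self.len1:]
        if win1 and sum(win1) / len(win1) >= self.t1:
            return "open"
        win2 = self.history[-self.len2:]
        if win2 and sum(win2) / len(win2) >= self.t2:
            return "open"
        if self.open_run and p_cur >= sum(self.open_run) / len(self.open_run):
            return "open"
        # Only the initial frame (empty history) keeps the gate open here.
        return "open" if not self.history else "closed"
```

Feeding the processor a stream of probabilities yields one voice/non-voice label per frame; the asymmetry between the two branches acts as hysteresis, suppressing spurious flips that a single fixed threshold would produce.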
In the embodiment of the present invention, each unit of the online voice endpoint detection post-processing apparatus may be implemented by corresponding hardware or software; each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not limited herein. For the specific implementation of each unit, reference may be made to the description of the foregoing method embodiment, which is not repeated here.
Example four:
Fig. 4 shows the structure of an electronic device according to the fourth embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown.
The electronic device 4 of this embodiment of the invention comprises a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When executing the computer program 42, the processor 40 implements the steps of the above-described method embodiment, such as steps S101 to S103 shown in fig. 1, or implements the functions of the units of the above-described apparatus embodiment, such as the functions of units 31 to 33 shown in fig. 3.
In the embodiment of the invention, the gate state of the audio frame preceding the current audio frame is obtained, the gate state of the current audio frame is decided according to a gate-state decision mode matching the gate state of the previous audio frame, and the audio frame type of the current audio frame is determined according to the gate state of the current audio frame. By converting the voice/non-voice decision in online endpoint-detection post-processing into a decision between two gate states (open/closed), the voice/non-voice misjudgment rate is reduced and speech recognition performance is improved.
Example five:
In an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-described method embodiment, such as steps S101 to S103 shown in fig. 1, or implements the functions of the units of the above-described apparatus embodiment, such as the functions of units 31 to 33 shown in fig. 3.
In the embodiment of the invention, the gate state of the audio frame preceding the current audio frame is obtained, the gate state of the current audio frame is decided according to a gate-state decision mode matching the gate state of the previous audio frame, and the audio frame type of the current audio frame is determined according to the gate state of the current audio frame. By converting the voice/non-voice decision in online endpoint-detection post-processing into a decision between two gate states (open/closed), the voice/non-voice misjudgment rate is reduced and speech recognition performance is improved.
The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium such as a ROM/RAM, a magnetic disk, an optical disc, or a flash memory.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An online voice endpoint detection post-processing method, characterized in that the method comprises the following steps:
acquiring the gate state of the audio frame preceding the current audio frame;
deciding the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame; and
determining the audio frame type of the current audio frame according to the gate state of the current audio frame.
2. The method as claimed in claim 1, wherein the step of deciding the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame further comprises:
if the gate state of the previous audio frame is the open state, comparing an obtained first voice probability average value with a preset first voice probability average threshold, wherein the first voice probability average value represents the average of the voice probability values of all audio frames within a continuous first preset voice length before the current audio frame; and
if the first voice probability average value is greater than or equal to the first voice probability average threshold, determining that the gate state of the current audio frame is the open state.
3. The method as claimed in claim 2, wherein the step of comparing the obtained first voice probability average value with a preset first voice probability average threshold further comprises:
if the first voice probability average value is smaller than the first voice probability average threshold, comparing an obtained second voice probability average value with a preset second voice probability average threshold, wherein the second voice probability average value represents the average of the voice probability values of all audio frames within a continuous second preset voice length before the current audio frame, and the second preset voice length is shorter than the first preset voice length; and
if the second voice probability average value is greater than or equal to the second voice probability average threshold, determining that the gate state of the current audio frame is the open state.
4. The method as claimed in claim 3, wherein the step of comparing the obtained second voice probability average value with a preset second voice probability average threshold further comprises:
if the second voice probability average value is smaller than the second voice probability average threshold, comparing the acquired voice probability value of the current audio frame with a third voice probability average value, wherein the third voice probability average value represents the average of the voice probability values of the audio frames during the continuous open-gate run before the current audio frame; and
if the voice probability value of the current audio frame is greater than or equal to the third voice probability average value, determining that the gate state of the current audio frame is the open state.
5. The method as claimed in claim 4, wherein the step of comparing the voice probability value of the current audio frame with a third voice probability average value further comprises:
if the voice probability value of the current audio frame is smaller than the third voice probability average value, determining whether the current audio frame is the initial audio frame;
if the current audio frame is the initial audio frame, determining that the gate state of the current audio frame is the open state; and
if the current audio frame is not the initial audio frame, determining that the gate state of the current audio frame is the closed state.
6. The method as claimed in claim 4, wherein the step of deciding the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame further comprises:
if the gate state of the previous audio frame is the closed state, comparing the voice probability value of the current audio frame with a preset voice probability threshold;
if the voice probability value of the current audio frame is greater than or equal to the voice probability threshold, determining that the gate state of the current audio frame is the open state; and
if the voice probability value of the current audio frame is smaller than the voice probability threshold, determining that the gate state of the current audio frame is the closed state.
7. The method as claimed in claim 1, wherein the step of determining the audio frame type of the current audio frame according to the gate state of the current audio frame comprises:
if the gate state of the current audio frame is the open state, determining that the current audio frame is a voice frame; and
if the gate state of the current audio frame is the closed state, determining that the current audio frame is a non-voice frame.
8. An online voice endpoint detection post-processing apparatus, characterized in that the apparatus comprises:
a gate state acquisition unit, configured to acquire the gate state of the audio frame preceding the current audio frame;
a gate state decision unit, configured to decide the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame; and
a frame type determining unit, configured to determine the audio frame type of the current audio frame according to the gate state of the current audio frame.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, the steps of the method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202011083235.3A 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium Active CN112435691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083235.3A CN112435691B (en) 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083235.3A CN112435691B (en) 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112435691A true CN112435691A (en) 2021-03-02
CN112435691B CN112435691B (en) 2024-03-12

Family

ID=74690571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083235.3A Active CN112435691B (en) 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112435691B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421595A (en) * 2021-08-25 2021-09-21 成都启英泰伦科技有限公司 Voice activity detection method using neural network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
US20120173234A1 (en) * 2009-07-21 2012-07-05 Nippon Telegraph And Telephone Corp. Voice activity detection apparatus, voice activity detection method, program thereof, and recording medium
CN108766418A (en) * 2018-05-24 2018-11-06 百度在线网络技术(北京)有限公司 Sound end recognition methods, device and equipment
CN109119070A (en) * 2018-10-19 2019-01-01 科大讯飞股份有限公司 A kind of sound end detecting method, device, equipment and storage medium
CN109935241A (en) * 2017-12-18 2019-06-25 上海智臻智能网络科技股份有限公司 Voice information processing method
WO2019149108A1 (en) * 2018-01-31 2019-08-08 腾讯科技(深圳)有限公司 Identification method and device for voice keywords, computer-readable storage medium, and computer device
CN110634507A (en) * 2018-06-06 2019-12-31 英特尔公司 Speech classification of audio for voice wakeup
CN110648687A (en) * 2019-09-26 2020-01-03 广州三人行壹佰教育科技有限公司 Activity voice detection method and system
CN110827858A (en) * 2019-11-26 2020-02-21 苏州思必驰信息科技有限公司 Voice endpoint detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HEGDE, R.: "Voice Activity Detection Using Novel Teager Energy Based Band Spectral Entropy", 2019 International Conference on Communication and Electronics Systems (ICCES) *
张敏 (Zhang, Min): "Research on Voice Endpoint Detection Based on ACAM and Traditional Classification Models", China Master's Theses Full-text Database *


Also Published As

Publication number Publication date
CN112435691B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
CN108010515B (en) Voice endpoint detection and awakening method and device
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
WO2019149108A1 (en) Identification method and device for voice keywords, computer-readable storage medium, and computer device
US9202462B2 (en) Key phrase detection
US9818407B1 (en) Distributed endpointing for speech recognition
EP2994911B1 (en) Adaptive audio frame processing for keyword detection
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN109360572B (en) Call separation method and device, computer equipment and storage medium
US11676625B2 (en) Unified endpointer using multitask and multidomain learning
CN110232933B (en) Audio detection method and device, storage medium and electronic equipment
WO2021082572A1 (en) Wake-up model generation method, smart terminal wake-up method, and devices
US20130080165A1 (en) Model Based Online Normalization of Feature Distribution for Noise Robust Speech Recognition
WO2014182460A2 (en) Method and apparatus for detecting a target keyword
US11100932B2 (en) Robust start-end point detection algorithm using neural network
EP3739583B1 (en) Dialog device, dialog method, and dialog computer program
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
US20200312305A1 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN111128174A (en) Voice information processing method, device, equipment and medium
CN112435691B (en) Online voice endpoint detection post-processing method, device, equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant