CN112435691A - On-line voice endpoint detection post-processing method, device, equipment and storage medium - Google Patents

On-line voice endpoint detection post-processing method, device, equipment and storage medium Download PDF

Info

Publication number
CN112435691A
CN112435691A
Authority
CN
China
Prior art keywords
audio frame
voice
state
current audio
door
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011083235.3A
Other languages
Chinese (zh)
Other versions
CN112435691B (en)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Eeasy Electronic Tech Co ltd
Original Assignee
Zhuhai Eeasy Electronic Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Eeasy Electronic Tech Co ltd filed Critical Zhuhai Eeasy Electronic Tech Co ltd
Priority to CN202011083235.3A
Publication of CN112435691A
Application granted
Publication of CN112435691B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L2025/783 - Detection of presence or absence of voice signals based on threshold decision

Abstract

The invention belongs to the technical field of audio processing and provides an online voice endpoint detection post-processing method, apparatus, device, and storage medium. The method comprises: acquiring the gate state of the audio frame preceding the current audio frame; judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and determining the audio frame type of the current audio frame according to its gate state. By converting the speech/non-speech decision in online endpoint-detection post-processing into an open-gate/closed-gate decision, the accuracy of the speech/non-speech decision is improved, and with it the performance of speech recognition.

Description

On-line voice endpoint detection post-processing method, device, equipment and storage medium
Technical Field
The invention belongs to the technical field of audio processing, and particularly relates to a method, a device, equipment and a storage medium for on-line voice endpoint detection post-processing.
Background
Voice endpoint detection, also known as voice activity detection (VAD), can be divided by application scenario into offline VAD and online VAD. The main task of offline VAD is to accurately locate the start and end points of speech within noisy audio, with the entire recording available before the decision is made. The main task of online VAD is to judge whether the current moment belongs to a speech or non-speech segment, which imposes strict real-time requirements.
To better distinguish speech from non-speech, many deep-neural-network-based VAD algorithms have been proposed in recent years. The mainstream structures are CRNN, i.e., CNN (convolutional neural network) + RNN (recurrent neural network) + DNN (deep neural network), and CLDNN, i.e., CNN + LSTM (long short-term memory network) + DNN, which treat the speech/non-speech judgment as a binary classification problem. The common idea is: first use a CNN for feature extraction; then, because a speech signal, unlike an image, is a sequence with temporal order, use an RNN/LSTM/GRU (gated recurrent unit) to model the sequence; finally use a DNN with a softmax layer for the output. However, the model output can still misjudge speech frames as non-speech frames and vice versa, which degrades speech recognition performance.
Disclosure of Invention
The invention aims to provide an online voice endpoint detection post-processing method, apparatus, device, and storage medium, so as to solve the problem in the prior art that misjudgment of speech/non-speech frames lowers speech recognition performance.
In one aspect, the present invention provides an online voice endpoint detection post-processing method, including the following steps:
acquiring the gate state of the audio frame preceding the current audio frame;
judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and
determining the audio frame type of the current audio frame according to the gate state of the current audio frame.
Preferably, the step of judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame further includes:
if the gate state of the previous audio frame is the open state, comparing the acquired first speech probability average with a preset first speech probability average threshold, where the first speech probability average is the average of the speech probability values of all audio frames within a continuous first preset speech length before the current audio frame; and
if the first speech probability average is greater than or equal to the first speech probability average threshold, judging that the gate state of the current audio frame is the open state.
Preferably, after the step of comparing the acquired first speech probability average with the preset first speech probability average threshold, the method further includes:
if the first speech probability average is smaller than the first speech probability average threshold, comparing the acquired second speech probability average with a preset second speech probability average threshold, where the second speech probability average is the average of the speech probability values of all audio frames within a continuous second preset speech length before the current audio frame, the second preset speech length being shorter than the first; and
if the second speech probability average is greater than or equal to the second speech probability average threshold, judging that the gate state of the current audio frame is the open state.
Preferably, after the step of comparing the acquired second speech probability average with the preset second speech probability average threshold, the method further includes:
if the second speech probability average is smaller than the second speech probability average threshold, comparing the acquired speech probability value of the current audio frame with a third speech probability average, where the third speech probability average is the average of the speech probability values of the audio frames over the continuous open-gate run before the current audio frame; and
if the speech probability value of the current audio frame is greater than or equal to the third speech probability average, judging that the gate state of the current audio frame is the open state.
Preferably, after the step of comparing the speech probability value of the current audio frame with the third speech probability average, the method further includes:
if the speech probability value of the current audio frame is smaller than the third speech probability average, judging whether the current audio frame is an initial audio frame;
if the current audio frame is an initial audio frame, judging that the gate state of the current audio frame is the open state; and
if the current audio frame is not an initial audio frame, judging that the gate state of the current audio frame is the closed state.
Preferably, the step of judging the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame further includes:
if the gate state of the previous audio frame is the closed state, comparing the speech probability value of the current audio frame with a preset speech probability threshold;
if the speech probability value of the current audio frame is greater than or equal to the speech probability threshold, judging that the gate state of the current audio frame is the open state; and
if the speech probability value of the current audio frame is smaller than the speech probability threshold, judging that the gate state of the current audio frame is the closed state.
Preferably, the step of determining the audio frame type of the current audio frame according to the gate state of the current audio frame includes:
if the gate state of the current audio frame is the open state, determining that the current audio frame is a speech frame; and
if the gate state of the current audio frame is the closed state, determining that the current audio frame is a non-speech frame.
In another aspect, the present invention provides an online voice endpoint detection post-processing apparatus, including:
a gate state acquisition unit, configured to acquire the gate state of the audio frame preceding the current audio frame;
a gate state judgment unit, configured to judge the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and
an audio frame type determination unit, configured to determine the audio frame type of the current audio frame according to the gate state of the current audio frame.
In another aspect, the present invention also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
In another aspect, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
The invention acquires the gate state of the audio frame preceding the current audio frame, judges the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame, and determines the audio frame type of the current audio frame from its gate state. By converting the speech/non-speech decision in online endpoint-detection post-processing into a decision between the two gate states, open and closed, the accuracy of the speech/non-speech decision is improved, and with it the performance of speech recognition.
Drawings
Fig. 1 is a flowchart of the implementation of the online voice endpoint detection post-processing method according to a first embodiment of the present invention;
fig. 2 is a flowchart illustrating a method for determining a gate state of a current audio frame according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of an online voice endpoint detection post-processing apparatus according to a third embodiment of the present invention; and
fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The following detailed description of specific implementations of the present invention is provided in conjunction with specific embodiments:
the first embodiment is as follows:
Fig. 1 shows the implementation flow of the online voice endpoint detection post-processing method provided in the first embodiment of the present invention. For convenience of description, only the parts relevant to this embodiment are shown, detailed as follows:
in step S101, the gate state of the audio frame immediately preceding the current audio frame is acquired.
The embodiment of the invention is suitable for online voice endpoint detection and can in particular be applied to electronic devices with computing capability, such as mobile phones, wristbands, tablet computers, laptop computers, and desktop computers. The invention divides voice endpoint detection into three parts: pre-processing, model, and post-processing. The pre-processing part extracts audio features, which serve as the model input; it usually includes windowing, framing, STFT (short-time Fourier transform), and similar steps. The model part predicts and outputs the probability that the current audio frame is a speech frame; its input is generally an M-dimensional mel spectrum of N frames, where the buffer (cache register) corresponding to the N frames holds, for example, 200 ms of audio data and M is typically 40 or 64. The post-processing part is the method described in this embodiment. Because online endpoint detection requires the output delay relative to the input to be kept within 200 ms (in line with the buffer size), i.e., the real-time requirement is strict, the method converts the speech/non-speech judgment in online endpoint-detection post-processing into a judgment between two gate states, open and closed, and makes a comprehensive decision by combining multiple factors such as the speech probability value of the current audio frame, the speech probability values of preceding audio frames, and the gate state, thereby improving speech recognition performance. The gate state is either the open state or the closed state and can be represented by 1 and 0 in a concrete implementation.
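As an illustration of that front end, a minimal NumPy sketch of windowing, framing, and STFT magnitude might look as follows. The 25 ms frame and 10 ms hop are common assumptions rather than values from the patent, and the projection to an M-band mel spectrum (M = 40 or 64) would be applied to the returned spectrogram as a final step.

```python
import numpy as np

def stft_frames(signal: np.ndarray, sr: int = 16000,
                frame_ms: int = 25, hop_ms: int = 10) -> np.ndarray:
    """Windowing + framing + STFT magnitude, the pre-processing steps
    listed above. Frame/hop sizes are typical assumptions."""
    frame_len = sr * frame_ms // 1000          # 400 samples at 16 kHz
    hop = sr * hop_ms // 1000                  # 160 samples at 16 kHz
    window = np.hanning(frame_len)             # analysis window
    n_frames = max(0, (len(signal) - frame_len) // hop + 1)
    spec = np.empty((n_frames, frame_len // 2 + 1))
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len] * window
        spec[i] = np.abs(np.fft.rfft(frame))   # magnitude spectrum per frame
    return spec  # a mel filterbank would be applied here to get M bands
```

One second of 16 kHz audio yields 98 frames of 201 frequency bins, which would then be reduced to the M mel bands the model consumes.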
In step S102, the gate state of the current audio frame is determined according to the gate state determination manner matched with the gate state of the previous audio frame.
In the embodiment of the present invention, for scenarios such as chatting via voice input, short pauses occur between speech segments when a speaker talks for a long time. Preferably, therefore, when the gate state of the current audio frame is judged according to the mode matched with the gate state of the previous audio frame: if the gate state of the previous audio frame is the open state, the acquired first speech probability average is compared with a preset first speech probability average threshold, and if the first speech probability average is greater than or equal to that threshold, the gate state of the current audio frame is judged to be the open state. In this case no state transition is needed and the gate simply stays open, ensuring the continuity and integrity of the speaker's speech segments. Here, the speech probability value generally refers to the probability that an audio frame is predicted to be a speech frame, typically produced by a neural network such as a CRNN or CLDNN; the first speech probability average threshold can be set flexibly according to the actual acoustic environment; and the first speech probability average is the average of the speech probability values of all audio frames within a continuous first preset speech length before the current audio frame.
For scenarios such as voice wake-up, the speaker usually talks only briefly. Preferably, therefore, after the first speech probability average has been compared with the preset first threshold: if the first speech probability average is below the first threshold, the acquired second speech probability average is compared with a preset second speech probability average threshold, and if the second average is greater than or equal to the second threshold, the gate state of the current audio frame is judged to be the open state and the gate stays open, further reducing the frame misjudgment rate. The second speech probability average is the average of the speech probability values of all audio frames within a continuous second preset speech length before the current audio frame, the second preset speech length being shorter than the first.
It should be noted that the first and second preset speech lengths may be set empirically for the actual application scenario.
The probability that an audio frame is predicted to be a speech frame is not fixed across speakers and environments: in a quiet environment the predicted speech probability is generally high, while in a complex environment, especially at a low signal-to-noise ratio, it is generally low. Preferably, therefore, after the second speech probability average has been compared with the preset second threshold: if the second average is below the second threshold, the acquired speech probability value of the current audio frame is compared with a third speech probability average, and if the current frame's speech probability value is greater than or equal to the third average, the gate state of the current audio frame is judged to be the open state. This improves the accuracy of the decision at low signal-to-noise ratios and hence the speech recognition performance in such environments. The third speech probability average is the average of the speech probability values of the audio frames over the continuous open-gate run before the current audio frame. Note that if the gate state before the current audio frame is the closed state, there is no such continuous run, and the third speech probability average is zero.
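As a small illustration of this bookkeeping, the third average can be maintained incrementally and reset whenever the gate closes. The class name and interface below are hypothetical; only the reset-to-zero behavior is taken from the text.

```python
class OpenRunAverage:
    """Running mean of P(speech) over the current continuous open-gate run
    (the third speech probability average). Resets to zero when the gate
    closes, matching the note that the average is zero after a closed state."""

    def __init__(self) -> None:
        self.total = 0.0   # sum of probabilities in the current open run
        self.count = 0     # number of frames in the current open run

    def update(self, prob: float, gate_open: bool) -> float:
        if gate_open:
            self.total += prob
            self.count += 1
        else:              # gate closed: the run is broken, start over
            self.total, self.count = 0.0, 0
        return self.total / self.count if self.count else 0.0
```

Each frame's probability is folded in only while the gate stays open, so the average always reflects the current uninterrupted speech segment.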
When the system has just started, little speech information is available and the gate state obtained by the above method may be unstable. Preferably, therefore, after the speech probability value has been compared with the third speech probability average: if the current frame's speech probability value is below the third average, it is judged whether the current audio frame is an initial audio frame. If it is, the gate state of the current audio frame is judged to be the open state; if it is not, the gate state is judged to be the closed state, and the gate transitions from open to closed. This improves the accuracy of the speech/non-speech decision and thus the speech recognition performance.
When the gate state of the previous audio frame is the closed state, the goal is to filter out non-speech frames as far as possible without losing speech frames. Preferably, therefore, when the gate state of the current audio frame is judged according to the mode matched with the gate state of the previous audio frame: if the previous frame's gate state is the closed state, the speech probability value of the current audio frame is compared with a preset speech probability threshold. If the current frame's speech probability value is greater than or equal to the threshold, its gate state is judged to be the open state and the gate transitions from closed to open; if it is below the threshold, the gate state is judged to be the closed state and the gate stays closed. This simplifies the speech/non-speech decision process and improves speech recognition performance.
Of course, in practical applications the probability value may instead be the probability that the audio frame is a non-speech frame; the comparison parameters and the specific gate-state decisions then need to be adjusted accordingly, but the basic decision procedure remains essentially the same as described in this embodiment.
In step S103, the audio frame type of the current audio frame is determined according to the gate state of the current audio frame.
In the embodiment of the present invention, different gate states may represent different audio frame types, which can be set according to actual needs. Here the audio frame types are speech frames and non-speech frames: the open state indicates that the current audio frame is a speech frame, and the closed state indicates that it is a non-speech frame. Preferably, therefore, if the gate state of the current audio frame is the open state, the current audio frame is determined to be a speech frame, and if it is the closed state, a non-speech frame. This reduces the speech/non-speech misjudgment rate and thus improves speech recognition performance.
In the embodiment of the invention, the gate state of the audio frame preceding the current audio frame is acquired, the gate state of the current audio frame is judged according to a gate-state judgment mode matched with the previous frame's gate state, and the audio frame type of the current frame is determined from its gate state. By converting the speech/non-speech decision in online endpoint-detection post-processing into a decision between the two gate states, open and closed, the speech/non-speech misjudgment rate is reduced and speech recognition performance is improved.
Example two:
Fig. 2 is a flowchart of the method for judging the gate state of the current audio frame according to the second embodiment of the present invention. For convenience of description, only the parts relevant to this embodiment are shown, detailed as follows:
in fig. 2, the door state of the current audio frame is represented by door _ state, and the door state of the previous audio frame of the current audio frame is represented by last _ state, where 1 represents the door-open state and 0 represents the door-closed state; the first speech probability average is represented by S1, and the first speech probability average threshold is represented by T1; the second speech probability average is represented by S2, the second speech probability average threshold is represented by T2, and the third speech probability average is represented by S3; the voice probability value of the current audio frame is represented by label, and the voice probability threshold value is represented by thred; the relationship between the current audio frame and the initial audio frame is represented by begin _ flag, where begin _ flag =1 represents that the current audio frame is the initial audio frame, and begin _ flag =0 represents that the current audio frame is the non-initial audio frame.
In step S201, determining whether a door state of a previous audio frame of a current audio frame is a door-closed state, if not, performing S202, and if so, performing S206;
in step S202, determining whether the first speech probability average value is greater than or equal to a first speech probability average threshold, if not, performing S203, and if so, performing S207;
in step S203, determining whether the second speech probability average value is greater than or equal to a second speech probability average threshold, if not, performing S204, and if so, performing S207;
in step S204, determining whether the speech probability value of the current audio frame is greater than or equal to the third speech probability average value, if not, performing S205, and if so, performing S207;
in step S205, it is determined whether the current audio frame is an initial audio frame, if not, step S208 is executed, and if so, step S207 is executed;
in step S206, it is determined whether the speech probability value of the current audio frame is greater than or equal to a preset speech probability threshold, if not, step S208 is executed, and if so, step S207 is executed
In step S207, it is determined that the door state of the current audio frame is the door open state;
in step S208, it is determined that the door state of the current audio frame is the closed door state.
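Steps S201 through S208 amount to a small per-frame state machine. The Python sketch below mirrors that flow using the symbol names from fig. 2 (door_state, label, S1/T1, S2/T2, S3, thred, begin_flag). The threshold values and window lengths are illustrative assumptions, and the running-sum bookkeeping for S1, S2, and S3 is one plausible way to maintain the averages the flowchart reads, not the patent's prescribed implementation.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class GateVAD:
    # Thresholds and window lengths below are illustrative assumptions.
    t1: float = 0.5        # T1: first speech-probability average threshold
    t2: float = 0.6        # T2: second threshold, over a shorter window
    thred: float = 0.8     # per-frame threshold used when the gate is closed
    long_win: int = 50     # first preset speech length, in frames
    short_win: int = 10    # second preset speech length, in frames

    door_state: int = 0    # 1 = open (speech), 0 = closed (non-speech)
    begin_flag: int = 1    # 1 until the first frame has been processed
    history: deque = field(default_factory=lambda: deque(maxlen=50))  # = long_win
    open_sum: float = 0.0  # running sum over the current continuous open-gate run
    open_cnt: int = 0

    def _avg(self, n: int) -> float:
        recent = list(self.history)[-n:]       # frames before the current one
        return sum(recent) / len(recent) if recent else 0.0

    def step(self, label: float) -> bool:
        """label = P(speech) for the current frame; True means speech frame."""
        if self.door_state == 1:               # previous gate open: S202-S205
            s1 = self._avg(self.long_win)      # S1 over the long window
            s2 = self._avg(self.short_win)     # S2 over the short window
            s3 = self.open_sum / self.open_cnt if self.open_cnt else 0.0  # S3
            open_now = (s1 >= self.t1 or s2 >= self.t2
                        or label >= s3 or self.begin_flag == 1)
        else:                                  # previous gate closed: S206
            open_now = label >= self.thred
        self.door_state = 1 if open_now else 0

        # Bookkeeping: S3 averages only the current continuous open-gate run,
        # and resets to zero once the gate closes.
        if open_now:
            self.open_sum += label
            self.open_cnt += 1
        else:
            self.open_sum, self.open_cnt = 0.0, 0
        self.history.append(label)
        self.begin_flag = 0
        return open_now
```

Feeding a stream of per-frame speech probabilities through step() yields a speech/non-speech label per frame: the gate opens on a single confident frame while closed, and once open it closes only when the long-window, short-window, and open-run evidence all fail, which is what keeps short intra-speech pauses from fragmenting a segment.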
Example three:
Fig. 3 shows the structure of the online voice endpoint detection post-processing apparatus provided in the third embodiment of the present invention. For convenience of description, only the parts relevant to this embodiment are shown, including:
a gate state acquisition unit 31, configured to acquire the gate state of the audio frame preceding the current audio frame;
a gate state judgment unit 32, configured to judge the gate state of the current audio frame according to a gate-state judgment mode matched with the gate state of the previous audio frame; and
an audio frame type determination unit 33, configured to determine the audio frame type of the current audio frame according to the gate state of the current audio frame.
Preferably, the apparatus further comprises:
a first comparison unit, configured to compare the acquired first speech probability average with a preset first speech probability average threshold if the gate state of the previous audio frame is the open state, where the first speech probability average is the average of the speech probability values of all audio frames within a continuous first preset speech length before the current audio frame; and
a first state determination unit, configured to judge that the gate state of the current audio frame is the open state if the first speech probability average is greater than or equal to the first speech probability average threshold.
Preferably, the apparatus further comprises:
a second comparison unit, configured to compare the acquired second speech probability average with a preset second speech probability average threshold if the first speech probability average is smaller than the first speech probability average threshold, where the second speech probability average is the average of the speech probability values of all audio frames within a continuous second preset speech length before the current audio frame, the second preset speech length being shorter than the first; and
a second state determination unit, configured to judge that the gate state of the current audio frame is the open state if the second speech probability average is greater than or equal to the second speech probability average threshold.
Preferably, the apparatus further comprises:
a third comparison unit, configured to compare the acquired speech probability value of the current audio frame with a third speech probability average if the second speech probability average is smaller than the second speech probability average threshold, where the third speech probability average is the average of the speech probability values of the audio frames over the continuous open-gate run before the current audio frame; and
a third state determination unit, configured to judge that the gate state of the current audio frame is the open state if the speech probability value of the current audio frame is greater than or equal to the third speech probability average.
Preferably, the apparatus further comprises:
the fourth comparison unit is configured to determine whether the current audio frame is the initial audio frame if the voice probability value of the current audio frame is smaller than the third voice probability average value;
the fourth state determining unit is configured to determine that the gate state of the current audio frame is the open state if the current audio frame is the initial audio frame; and
the fifth state determining unit is configured to determine that the gate state of the current audio frame is the closed state if the current audio frame is not the initial audio frame.
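The open-gate decision cascade formed by the above units (long-window average, then short-window average, then the running average of the current open-gate run, then the initial-frame fallback) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, threshold values, and window lengths are assumptions, since the embodiment does not fix concrete values.

```python
def decide_gate_when_open(p_cur, recent_probs, open_run_probs, is_initial,
                          len1=20, len2=5, t1=0.5, t2=0.6):
    """Gate decision for the case where the previous frame's gate was open.

    p_cur          -- voice probability of the current audio frame
    recent_probs   -- voice probabilities of the frames before the current one
    open_run_probs -- probabilities of frames in the continuous open-gate run
    is_initial     -- True if the current frame is the initial audio frame
    len1, len2     -- first/second preset voice lengths (len2 < len1), assumed values
    t1, t2         -- first/second voice probability average thresholds, assumed values
    """
    # First comparison: long-window average vs. the first average threshold.
    win1 = recent_probs[-len1:]
    if win1 and sum(win1) / len(win1) >= t1:
        return "open"
    # Second comparison: short-window average vs. the second average threshold.
    win2 = recent_probs[-len2:]
    if win2 and sum(win2) / len(win2) >= t2:
        return "open"
    # Third comparison: current probability vs. the open-run running average.
    if open_run_probs and p_cur >= sum(open_run_probs) / len(open_run_probs):
        return "open"
    # Fallback: the initial audio frame keeps the gate open; otherwise close it.
    return "open" if is_initial else "closed"
```

The cascade only closes the gate when every check fails and the frame is not the initial one, which gives the open state a deliberate bias against brief probability dips.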
Preferably, the apparatus further comprises:
the fifth comparison unit is configured to compare the voice probability value of the current audio frame with a preset voice probability threshold if the gate state of the previous audio frame is the closed state;
the sixth state determining unit is configured to determine that the gate state of the current audio frame is the open state if the voice probability value of the current audio frame is greater than or equal to the voice probability threshold; and
the seventh state determining unit is configured to determine that the gate state of the current audio frame is the closed state if the voice probability value of the current audio frame is smaller than the voice probability threshold.
Preferably, the frame type determining unit includes:
the first determining subunit is configured to determine that the current audio frame is a voice frame if the gate state of the current audio frame is the open state; and
the second determining subunit is configured to determine that the current audio frame is a non-voice frame if the gate state of the current audio frame is the closed state.
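Taken together, the units above form a small two-state machine driven by per-frame voice probabilities. A self-contained sketch of such a post-processor is given below; the class name, default thresholds, and window lengths are illustrative assumptions only and are not fixed by the embodiment.

```python
class VadGatePostProcessor:
    """Two-state (open/closed gate) post-processor over per-frame voice
    probabilities, sketching the scheme described above."""

    def __init__(self, len1=20, len2=5, t1=0.5, t2=0.6, t_open=0.7):
        self.len1, self.len2 = len1, len2   # first/second preset voice lengths (assumed)
        self.t1, self.t2 = t1, t2           # first/second average thresholds (assumed)
        self.t_open = t_open                # voice probability threshold for a closed gate
        self.history = []                   # probabilities of all past frames
        self.open_run = []                  # probabilities of the current open-gate run
        self.gate = "open"                  # the initial frame is evaluated in the open branch

    def process(self, p_cur):
        """Classify one frame; returns 'voice' or 'non-voice'."""
        if self.gate == "open":
            new_gate = self._decide_open(p_cur)
        else:
            # Closed gate: a single probability threshold decides whether it reopens.
            new_gate = "open" if p_cur >= self.t_open else "closed"
        # Bookkeeping for the running averages.
        self.history.append(p_cur)
        self.open_run = self.open_run + [p_cur] if new_gate == "open" else []
        self.gate = new_gate
        # The gate state maps directly onto the frame type.
        return "voice" if new_gate == "open" else "non-voice"

    def _decide_open(self, p_cur):
        # Cascade of checks while the gate is open (see the units above).
        win1 = self.history[-self.len1:]
        if win1 and sum(win1) / len(win1) >= self.t1:
            return "open"
        win2 = self.history[-self.len2:]
        if win2 and sum(win2) / len(win2) >= self.t2:
            return "open"
        if self.open_run and p_cur >= sum(self.open_run) / len(self.open_run):
            return "open"
        # Only the initial frame (empty history) keeps the gate open here.
        return "open" if not self.history else "closed"
```

Feeding the processor a stream of probabilities yields one voice/non-voice label per frame; the asymmetry between the two branches acts as hysteresis, suppressing spurious flips that a single fixed threshold would produce.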
In the embodiment of the present invention, each unit of the online voice endpoint detection post-processing apparatus may be implemented by corresponding hardware or software; each unit may be an independent software or hardware unit, or the units may be integrated into a single software or hardware unit, which is not limited herein. For the specific implementation of each unit, reference may be made to the description of the foregoing method embodiment, which is not repeated here.
Example four:
Fig. 4 shows the structure of an electronic device according to the fourth embodiment of the present invention; for convenience of description, only the parts related to this embodiment are shown.
The electronic device 4 of this embodiment of the invention comprises a processor 40, a memory 41, and a computer program 42 stored in the memory 41 and executable on the processor 40. When executing the computer program 42, the processor 40 implements the steps of the above-described method embodiment, such as steps S101 to S103 shown in fig. 1, or implements the functions of the units of the above-described apparatus embodiment, such as the functions of units 31 to 33 shown in fig. 3.
In the embodiment of the invention, the gate state of the audio frame preceding the current audio frame is obtained, the gate state of the current audio frame is decided according to a gate-state decision mode matching the gate state of the previous audio frame, and the audio frame type of the current audio frame is determined according to the gate state of the current audio frame. By converting the voice/non-voice decision in online endpoint-detection post-processing into a decision between two gate states (open/closed), the voice/non-voice misjudgment rate is reduced and speech recognition performance is improved.
Example five:
In an embodiment of the present invention, a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the steps of the above-described method embodiment, such as steps S101 to S103 shown in fig. 1, or implements the functions of the units of the above-described apparatus embodiment, such as the functions of units 31 to 33 shown in fig. 3.
In the embodiment of the invention, the gate state of the audio frame preceding the current audio frame is obtained, the gate state of the current audio frame is decided according to a gate-state decision mode matching the gate state of the previous audio frame, and the audio frame type of the current audio frame is determined according to the gate state of the current audio frame. By converting the voice/non-voice decision in online endpoint-detection post-processing into a decision between two gate states (open/closed), the voice/non-voice misjudgment rate is reduced and speech recognition performance is improved.
The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium such as a ROM/RAM, a magnetic disk, an optical disc, or a flash memory.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, and improvements made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. An online voice endpoint detection post-processing method, characterized in that the method comprises the following steps:
acquiring the gate state of the audio frame preceding the current audio frame;
deciding the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame; and
determining the audio frame type of the current audio frame according to the gate state of the current audio frame.
2. The method as claimed in claim 1, wherein the step of deciding the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame further comprises:
if the gate state of the previous audio frame is the open state, comparing an obtained first voice probability average value with a preset first voice probability average threshold, wherein the first voice probability average value represents the average of the voice probability values of all audio frames within a continuous first preset voice length before the current audio frame; and
if the first voice probability average value is greater than or equal to the first voice probability average threshold, determining that the gate state of the current audio frame is the open state.
3. The method as claimed in claim 2, wherein the step of comparing the obtained first voice probability average value with a preset first voice probability average threshold further comprises:
if the first voice probability average value is smaller than the first voice probability average threshold, comparing an obtained second voice probability average value with a preset second voice probability average threshold, wherein the second voice probability average value represents the average of the voice probability values of all audio frames within a continuous second preset voice length before the current audio frame, and the second preset voice length is shorter than the first preset voice length; and
if the second voice probability average value is greater than or equal to the second voice probability average threshold, determining that the gate state of the current audio frame is the open state.
4. The method as claimed in claim 3, wherein the step of comparing the obtained second voice probability average value with a preset second voice probability average threshold further comprises:
if the second voice probability average value is smaller than the second voice probability average threshold, comparing the acquired voice probability value of the current audio frame with a third voice probability average value, wherein the third voice probability average value represents the average of the voice probability values of the audio frames during the continuous open-gate run before the current audio frame; and
if the voice probability value of the current audio frame is greater than or equal to the third voice probability average value, determining that the gate state of the current audio frame is the open state.
5. The method as claimed in claim 4, wherein the step of comparing the voice probability value of the current audio frame with a third voice probability average value further comprises:
if the voice probability value of the current audio frame is smaller than the third voice probability average value, determining whether the current audio frame is the initial audio frame;
if the current audio frame is the initial audio frame, determining that the gate state of the current audio frame is the open state; and
if the current audio frame is not the initial audio frame, determining that the gate state of the current audio frame is the closed state.
6. The method as claimed in claim 4, wherein the step of deciding the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame further comprises:
if the gate state of the previous audio frame is the closed state, comparing the voice probability value of the current audio frame with a preset voice probability threshold;
if the voice probability value of the current audio frame is greater than or equal to the voice probability threshold, determining that the gate state of the current audio frame is the open state; and
if the voice probability value of the current audio frame is smaller than the voice probability threshold, determining that the gate state of the current audio frame is the closed state.
7. The method as claimed in claim 1, wherein the step of determining the audio frame type of the current audio frame according to the gate state of the current audio frame comprises:
if the gate state of the current audio frame is the open state, determining that the current audio frame is a voice frame; and
if the gate state of the current audio frame is the closed state, determining that the current audio frame is a non-voice frame.
8. An online voice endpoint detection post-processing apparatus, characterized in that the apparatus comprises:
a gate state acquisition unit, configured to acquire the gate state of the audio frame preceding the current audio frame;
a gate state decision unit, configured to decide the gate state of the current audio frame according to a gate-state decision mode matching the gate state of the previous audio frame; and
a frame type determining unit, configured to determine the audio frame type of the current audio frame according to the gate state of the current audio frame.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, the steps of the method according to any one of claims 1 to 7 are implemented.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202011083235.3A 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium Active CN112435691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011083235.3A CN112435691B (en) 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011083235.3A CN112435691B (en) 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112435691A true CN112435691A (en) 2021-03-02
CN112435691B CN112435691B (en) 2024-03-12

Family

ID=74690571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011083235.3A Active CN112435691B (en) 2020-10-12 2020-10-12 Online voice endpoint detection post-processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112435691B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421595A (en) * 2021-08-25 2021-09-21 成都启英泰伦科技有限公司 Voice activity detection method using neural network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6873953B1 (en) * 2000-05-22 2005-03-29 Nuance Communications Prosody based endpoint detection
US20120173234A1 (en) * 2009-07-21 2012-07-05 Nippon Telegraph And Telephone Corp. Voice activity detection apparatus, voice activity detection method, program thereof, and recording medium
CN108766418A (en) * 2018-05-24 2018-11-06 百度在线网络技术(北京)有限公司 Sound end recognition methods, device and equipment
CN109119070A (en) * 2018-10-19 2019-01-01 科大讯飞股份有限公司 A kind of sound end detecting method, device, equipment and storage medium
CN109935241A (en) * 2017-12-18 2019-06-25 上海智臻智能网络科技股份有限公司 Voice information processing method
WO2019149108A1 (en) * 2018-01-31 2019-08-08 腾讯科技(深圳)有限公司 Identification method and device for voice keywords, computer-readable storage medium, and computer device
CN110634507A (en) * 2018-06-06 2019-12-31 英特尔公司 Speech classification of audio for voice wakeup
CN110648687A (en) * 2019-09-26 2020-01-03 广州三人行壹佰教育科技有限公司 Activity voice detection method and system
CN110827858A (en) * 2019-11-26 2020-02-21 苏州思必驰信息科技有限公司 Voice endpoint detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HEGDE, R.: "Voice Activity Detection Using Novel Teager Energy Based Band Spectral Entropy", 2019 International Conference on Communication and Electronics Systems (ICCES) *
张敏 (Zhang, Min): "Research on Voice Endpoint Detection Based on ACAM and Traditional Classification Models", China Master's Theses Full-text Database *


Also Published As

Publication number Publication date
CN112435691B (en) 2024-03-12

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
JP6903129B2 (en) Whispering conversion methods, devices, devices and readable storage media
CN108010515B (en) Voice endpoint detection and awakening method and device
CN110136749B (en) Method and device for detecting end-to-end voice endpoint related to speaker
WO2019149108A1 (en) Identification method and device for voice keywords, computer-readable storage medium, and computer device
US9202462B2 (en) Key phrase detection
US9818407B1 (en) Distributed endpointing for speech recognition
EP2994911B1 (en) Adaptive audio frame processing for keyword detection
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN109360572B (en) Call separation method and device, computer equipment and storage medium
US11676625B2 (en) Unified endpointer using multitask and multidomain learning
CN110232933B (en) Audio detection method and device, storage medium and electronic equipment
WO2021082572A1 (en) Wake-up model generation method, smart terminal wake-up method, and devices
US20130080165A1 (en) Model Based Online Normalization of Feature Distribution for Noise Robust Speech Recognition
WO2014182460A2 (en) Method and apparatus for detecting a target keyword
US11100932B2 (en) Robust start-end point detection algorithm using neural network
EP3739583B1 (en) Dialog device, dialog method, and dialog computer program
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
US20200312305A1 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN112802498B (en) Voice detection method, device, computer equipment and storage medium
CN111128174A (en) Voice information processing method, device, equipment and medium
CN112435691B (en) Online voice endpoint detection post-processing method, device, equipment and storage medium
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
US20230223014A1 (en) Adapting Automated Speech Recognition Parameters Based on Hotword Properties
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant