CN114255742A - Method, device, equipment and storage medium for voice endpoint detection

Info

Publication number: CN114255742A
Application number: CN202111401623.6A
Authority: CN (China)
Prior art keywords: detection model, detected, detection, audio, label
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李良斌, 陈孝良
Current Assignee: Beijing SoundAI Technology Co Ltd
Original Assignee: Beijing SoundAI Technology Co Ltd
Priority date / filing date: 2021-11-19
Publication date: 2022-03-29
Application filed by Beijing SoundAI Technology Co Ltd

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Abstract

The invention discloses a method, a device, equipment and a storage medium for voice endpoint detection, wherein the method comprises the following steps: acquiring a plurality of normal dialogue corpora and training on them to obtain a first detection model; traversing all dialogue corpora and training on them to obtain a second detection model; detecting the audio to be detected based on the first detection model and outputting a first detection result; detecting the audio to be detected through the second detection model based on the first detection result and outputting a second detection result; and determining the end point of the audio to be detected based on the first detection result and the second detection result. With this technical scheme, the first detection model and the second detection model together allow the VAD model to be cut off quickly once the user has spoken a complete sentence, which reduces interaction latency and improves user experience and satisfaction.

Description

Method, device, equipment and storage medium for voice endpoint detection
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to a method, an apparatus, a device and a storage medium for speech endpoint detection.
Background
In speech recognition systems, voice endpoint detection, commonly referred to as Voice Activity Detection (VAD), is a very important technology. Voice endpoint detection refers to finding the starting point and the end point of the voice part within a continuous sound signal.
VAD model detection is an important link in speech recognition. Especially in an intelligent voice dialogue system, the end point at the tail of the voice stream can be found through VAD model detection, and the complete voice can then be delivered to the recognition engine to be processed as a whole. The accuracy of VAD model detection directly influences the speech recognition effect: if the VAD model cuts off early, the voice is truncated, the recognition result is incomplete, and natural language understanding suffers; if the VAD model cuts off late, the interaction delay increases, and a large amount of silence or noise is mixed into the tail of the voice signal, which wastes computation and degrades recognition.
However, conventional VAD model detection is usually performed by smoothing the energy of the audio signal and declaring the VAD model ended once the energy stays below a threshold for a period of time. The problem with using only the acoustic signal to determine the end of the VAD model is twofold: on the one hand, the decision is easily affected by noise; on the other hand, if the speaker hesitates or pauses, the VAD model cuts off early.
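For concreteness, below is a minimal sketch of this conventional energy-smoothing baseline; the energy threshold, smoothing window, and hangover length are illustrative assumptions, not values from this disclosure:

```python
import numpy as np
from typing import Optional

def energy_vad_tail(frames: np.ndarray,
                    energy_threshold: float = 1e-3,
                    hangover_frames: int = 30) -> Optional[int]:
    """Baseline tail-point detection: declare the endpoint once the
    smoothed short-time energy stays below a threshold for a run of
    `hangover_frames` consecutive frames.  `frames` is a 2-D array of
    shape (num_frames, samples_per_frame)."""
    energy = np.mean(frames ** 2, axis=1)                # short-time energy
    kernel = np.ones(5) / 5.0
    smoothed = np.convolve(energy, kernel, mode="same")  # simple smoothing

    run = 0
    for i, e in enumerate(smoothed):
        run = run + 1 if e < energy_threshold else 0
        if run >= hangover_frames:
            return i - hangover_frames + 1               # first frame of the silent run
    return None                                          # no endpoint found yet
```

As the background notes, a hesitating speaker produces exactly such a low-energy run, which is why this baseline cuts off early.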
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art. Therefore, an object of the present invention is to provide a method, an apparatus, a device and a storage medium for voice endpoint detection.
In order to solve the above technical problem, an embodiment of the present invention provides the following technical solutions:
a method of voice endpoint detection, comprising:
acquiring a plurality of normal dialogue corpora, and training the normal dialogue corpora to acquire a first detection model;
traversing all dialogue corpora, and training all the dialogue corpora to obtain a second detection model;
detecting the audio to be detected based on the first detection model, and outputting a first detection result; wherein, the detecting the audio to be detected based on the first detection model comprises: preprocessing the audio to be detected to obtain a text to be detected;
based on the first detection result, detecting the audio to be detected through the second detection model, and outputting a second detection result;
and determining the end point of the audio to be detected based on the first detection result and the second detection result.
Optionally, the preprocessing the audio to be detected to obtain a text to be detected comprises:
acquiring the audio to be detected;
identifying the audio to be detected to obtain an identification result;
and acquiring the text to be detected based on the identification result.
Optionally, the obtaining a plurality of normal dialog corpora, training the plurality of normal dialog corpora, and obtaining a first detection model includes:
acquiring a positive sample set, wherein the positive sample set comprises a plurality of normal dialogue corpora;
removing the tail of each normal dialogue corpus to obtain a plurality of abnormal dialogue corpuses;
acquiring a negative sample set based on a plurality of abnormal dialogue corpora;
and acquiring the first detection model based on the positive sample set and the negative sample set.
Optionally, after detecting the audio to be detected based on the first detection model and outputting a first detection result, the method further includes:
if the text to be detected output by the first detection model is a negative example, the first detection model waits for a preset fixed time delay;
and if the text to be detected output by the first detection model is a positive example, acquiring an output result of the second detection model.
Optionally, the traversing all the dialog corpuses, and training all the dialog corpuses to obtain a second detection model, including:
processing all the dialogue corpora based on a preset principle to obtain training corpora;
labeling each training corpus to obtain labeled corpuses;
and acquiring a second detection model based on the labeling corpus.
Optionally, the labeling each training corpus to obtain a labeled corpus includes:
determining a starting part, a middle part and an ending part of each training corpus;
and labeling the starting part, the middle part and the ending part to obtain a labeled corpus.
Optionally, the labeling the starting part, the middle part and the ending part to obtain a labeled corpus includes:
labeling the starting part to obtain a first label;
labeling the middle part to obtain a second label;
and labeling the ending part to obtain a third label.
Optionally, the detecting, based on the first detection result, the audio to be detected through the second detection model and outputting a second detection result includes:
the second detection model outputs the first label, the second label or the third label.
Optionally, after the second detection model outputs the first label, the second label, or the third label, the method further includes:
if the second detection model outputs the first label or the second label, waiting for a preset time delay; determining the end point of the text to be detected based on the waiting result;
and if the third label is output by the second detection model, determining the end point of the text to be detected.
The embodiment of the present invention further provides a device for voice endpoint detection, including:
the first training module is used for acquiring a plurality of normal dialogue corpora and training the normal dialogue corpora to acquire a first detection model;
the second training module is used for traversing all dialogue corpora and training all the dialogue corpora to obtain a second detection model;
the first detection module is used for detecting the audio to be detected based on the first detection model and outputting a first detection result; wherein, the detecting the audio to be detected based on the first detection model comprises: preprocessing the audio to be detected to obtain a text to be detected;
the second detection module is used for detecting the audio to be detected through the second detection model based on the first detection result and outputting a second detection result;
and the determining module is used for determining the end point of the audio to be detected based on the first detection result and the second detection result.
Embodiments of the present invention also provide an electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method as described above when executing the computer program.
Embodiments of the present invention also provide a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method as described above.
The embodiment of the invention has the following technical effects:
According to the technical scheme, the first detection model and the second detection model respectively detect whether the current recognition result is a complete sentence and, in the case of a complete sentence, whether it may be a prefix of a longer sentence; the two results are judged comprehensively and combined with acoustic signal detection to optimize the VAD model. The VAD model is therefore cut off quickly once the user has spoken a complete sentence, reducing interaction delay; when the user hesitates or pauses, the VAD model is not cut off, giving the user the full time to finish the expression; and when the sentence spoken by the user is complete but may be a prefix, the waiting delay is determined according to the proportion of such sentences, so that, in the statistical sense, speech recognition accuracy and interaction speed are optimized simultaneously, improving user experience and satisfaction.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a flowchart illustrating a method for detecting a voice endpoint according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a device for detecting a voice endpoint according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
In order to facilitate understanding of the embodiments by those skilled in the art, some terms of the embodiments are explained:
(1) VAD: voice activity detection.
(2) NLP: natural Language Processing.
(3) textCNN model: a convolutional neural network (CNN) applied to text analysis.
(4) seq2seq model: sequence-to-sequence model, used in particular when the length of the output is uncertain.
(5) lstm + crf model: with the LSTM + CRF model, the output is no longer a set of mutually independent tags but the optimal tag sequence.
(6) ASR: Automatic Speech Recognition, a technique for converting human speech into text.
As shown in fig. 1, an embodiment of the present invention provides a method for detecting a voice endpoint, including:
step S1: acquiring a plurality of normal dialogue corpora, and training the normal dialogue corpora to acquire a first detection model;
Specifically, the VAD model comprises a processor, and the processor is connected to the first detection model and the second detection model respectively. After the VAD model obtains the text to be detected, the first detection model and the second detection model detect the text to be detected simultaneously and send their detection results to the processor for processing, and the processor controls the VAD model based on the received first detection result and second detection result.
Wherein, the obtaining a plurality of normal dialogue corpora and training a plurality of normal dialogue corpora to obtain a first detection model includes: acquiring a positive sample set, wherein the positive sample set comprises a plurality of normal dialogue corpora; removing the tail of each normal dialogue corpus to obtain a plurality of abnormal dialogue corpuses; acquiring a negative sample set based on a plurality of abnormal dialogue corpora; and acquiring the first detection model based on the positive sample set and the negative sample set.
In an actual application scenario, a specific process of training to obtain the first detection model includes:
taking all normal dialogue corpora as positive samples to generate a positive sample set;
then, characters are removed one by one from the tail of each normal dialogue corpus in the positive sample set: if the new dialogue corpus left after removing one or more characters is an abnormal or incomplete dialogue corpus, it is classified as a negative sample and stored separately in the negative sample set; if the new dialogue corpus left after removing one or more characters is still a normal dialogue corpus, it is likewise placed into the positive sample set.
Wherein, a textCNN model can be adopted to train on the normal dialogue corpora.
For example: "Play the Romance of the Three Kingdoms theme song" is a positive sample and is put into the positive sample set; "Play the Romance of the Three Kingdoms theme" and "Play the Romance of the Three Kingdoms th-" are negative samples and are put into the negative sample set; "Play the Romance of the Three Kingdoms" is again a positive sample and is put back into the positive sample set.
And acquiring a first detection model based on the positive sample set and the negative sample set, namely detecting the audio to be detected by the first detection model based on the positive sample set and the negative sample set.
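The sample-generation procedure above can be sketched as follows; `is_complete` is a hypothetical stand-in for the judgment of whether a truncated string is still a normal dialogue corpus, which this disclosure does not spell out:

```python
from typing import Callable, List, Tuple

def build_sample_sets(corpora: List[str],
                      is_complete: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Generate the positive and negative sets for the first detection model
    by stripping characters one at a time from the tail of each normal
    dialogue corpus, as described above."""
    positives: List[str] = list(corpora)     # every normal corpus is a positive sample
    negatives: List[str] = []
    for sentence in corpora:
        for cut in range(1, len(sentence)):
            truncated = sentence[:-cut]
            if is_complete(truncated):
                positives.append(truncated)  # still a normal dialogue corpus
            else:
                negatives.append(truncated)  # abnormal / incomplete corpus
    return positives, negatives
```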
Step S2: traversing all dialogue corpora, and training all the dialogue corpora to obtain a second detection model;
specifically, the traversing all the dialogue corpora, and training all the dialogue corpora to obtain a second detection model, including: processing all the dialogue corpora based on a preset principle to obtain training corpora; labeling each training corpus to obtain labeled corpuses; and acquiring a second detection model based on the labeling corpus.
Wherein, to every the training corpus label, obtain the labeling corpus, include:
determining a starting part, a middle part and an ending part of each training corpus; labeling the starting part, the middle part and the ending part to obtain a labeled corpus;
the labeling the starting part, the middle part and the ending part to obtain a labeled corpus comprises: labeling the starting part to obtain a first label; labeling the middle part to obtain a second label; and labeling the ending part to obtain a third label.
Wherein, the first label can be represented by B, the second label can be represented by I, and the third label can be represented by E.
The specific process of training to obtain the second detection model comprises the following steps:
for example: and traversing all dialogue linguistic data (including normal dialogue linguistic data and abnormal dialogue linguistic data), and if a certain sentence x is a prefix of another sentence y, removing the x from the training linguistic data.
The finally retained dialog corpus which is not removed can be regarded as a sentence which can not be expanded for a longer time again.
The corpus of dialogues obtained from these training sessions are labeled with B, I, E schemes, wherein B, I, E represents the beginning, middle and end of the sentence, respectively.
In a practical application scenario, a seq2seq model structure may be adopted, for example the lstm + crf model, to train on the labeled corpora, and the second detection model is obtained through this training.
For example: in "Play the Romance of the Three Kingdoms theme song", "Play" is labeled B, "the Romance of the Three Kingdoms" is labeled I, and "theme song" is labeled E.
Step S3: detecting the audio to be detected based on the first detection model, and outputting a first detection result; wherein, the detecting the audio to be detected based on the first detection model comprises: preprocessing the audio to be detected to obtain a text to be detected;
specifically, the acquiring the text to be detected includes: acquiring audio; identifying the audio to obtain an identification result; and acquiring the text to be detected based on the identification result.
In an actual application scene, firstly, various intelligent terminals perform voice interaction with a user; after receiving the voice, the intelligent terminal performs framing processing on the voice, and converts the voice into a text based on an ASR technology.
For example: the intelligent terminal can be an intelligent device such as an intelligent mobile phone and an intelligent television which can perform voice interaction with a user.
The present application may detect the start endpoint in VAD based on any one of the following six methods.
(1) Double thresholds: detection by short-time average energy combined with detection by short-time average zero-crossing rate (the zero-crossing rate of consonants is higher); considering the influence of noise, smoothing is generally performed first (a sketch of this method follows the list).
For example: and (6) median filtering.
(2) Correlation: the algorithm mainly exploits the difference between the correlation coefficients of signal and noise by computing the correlation coefficients of the signal.
For example: 1) normalizing the correlation function and taking the ratio of the main peak to the secondary peak; 2) the correlation function of audio has a certain periodicity, so endpoint detection can also be based on the cosine angle value of the autocorrelation function.
(3) Variance: speech and noise differ greatly in the spectral domain; for speech frames the variance across frequency bands is large, while for noise frames it is small.
For example: uniform sub-band division.
(4) Spectral entropy: entropy is a measure of uncertainty; noise is distributed more uniformly across the spectrum, so its entropy is larger, while the spectral distribution of speech is uneven, so its entropy is smaller. The probability density is obtained from the normalized energy, and the entropy is calculated from it.
(5) Energy to zero ratio: the ratio of the short-term energy to the short-term zero-crossing rate;
(6) Energy-entropy ratio: the ratio of short-time energy to spectral entropy.
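Below is a minimal sketch of method (1), assuming the audio arrives as a 2-D array of frames; the thresholds and the median-filter kernel size are illustrative, not values from this disclosure:

```python
import numpy as np
from scipy.signal import medfilt

def double_threshold_start(frames: np.ndarray,
                           energy_thresh: float = 1e-3,
                           zcr_thresh: float = 0.1) -> int:
    """Flag the speech start at the first frame whose median-filtered
    short-time energy OR zero-crossing rate exceeds its threshold (the
    zero-crossing rate helps catch low-energy consonants).  Returns the
    frame index, or -1 if no start frame is found."""
    energy = np.mean(frames ** 2, axis=1)
    # Zero-crossing rate: fraction of adjacent-sample sign changes per frame.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)

    # Median filtering smooths out isolated noise spikes in both curves.
    energy = medfilt(energy, kernel_size=5)
    zcr = medfilt(zcr, kernel_size=5)

    hits = np.flatnonzero((energy > energy_thresh) | (zcr > zcr_thresh))
    return int(hits[0]) if hits.size else -1
```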
The detecting the audio to be detected based on the first detection model and outputting a first detection result includes:
if the text to be detected output by the first detection model is a negative example, the first detection model waits for a preset fixed time delay;
and if the text to be detected output by the first detection model is a positive example, acquiring an output result of the second detection model.
In an actual application scene, the intelligent terminal performs ASR recognition processing on each frame of audio to be detected to acquire a text to be detected;
inputting the text to be detected into the first detection model, and waiting for the first detection model to output a first detection result;
the first detection model queries a positive sample set or a negative sample set based on the text to be detected, and outputs a positive example when a positive sample matched with the text to be detected is found in the positive sample set;
when a negative sample matching the text to be detected is found in the negative sample set, the first detection model outputs a negative example, indicating that the end point of the audio to be detected has not been reached; the model then waits for a period of time, namely a preset fixed time delay (for example, 5 ms or 10 ms), and the preset fixed time delay can be modified or reset according to the actual detection requirement;
in the process of waiting for the preset fixed time delay, if a new text to be detected is received or the original text to be detected is updated, the first detection model detects the newly received text, and may then output a first detection result of a positive example or a negative example;
and if, after waiting for the preset fixed time delay (for example, 5 ms), the first detection model still has not received a new text to be detected, or the original text to be detected has not been updated, the VAD model is cut off immediately.
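The waiting logic around the first detection model can be sketched as follows; the callables and the 5 ms default are hypothetical stand-ins for the model query and the ASR text stream, which this disclosure leaves abstract:

```python
import time
from typing import Callable, Optional

def first_model_gate(text: str,
                     is_positive: Callable[[str], bool],
                     latest_text: Callable[[], str],
                     fixed_delay_s: float = 0.005) -> Optional[str]:
    """Decision logic around the first detection model.  A positive example
    hands the text to the second model; a negative example waits the preset
    fixed delay (e.g. 5 ms) and re-checks for newer ASR text.  Returns the
    text to pass on, or None to cut the VAD model off."""
    while True:
        if is_positive(text):
            return text                    # positive example: consult second model
        time.sleep(fixed_delay_s)          # preset fixed time delay
        updated = latest_text()
        if updated == text:
            return None                    # nothing new arrived: stop the VAD model
        text = updated                     # newer text arrived: detect it again
```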
According to the embodiment of the invention, the first detection model allows the user to interact with the intelligent terminal in a hesitation or pause state, so that the user experience is improved.
Step S4: based on the first detection result, detecting the audio to be detected through the second detection model, and outputting a second detection result;
specifically, the detecting, based on the first detection result, multiple frames of the audio to be detected through the second detection model, and outputting a second detection result includes:
the second detection model outputs the first label, the second label or the third label.
For example: "play the three kingdoms rehearsal theme song", output B when detecting to play, label I when detecting the three kingdoms rehearsal, output E when detecting the theme song.
Step S5: and determining the end point of the audio to be detected based on the first detection result and the second detection result.
Specifically, the determining the end point of the audio to be detected based on the first detection result and the second detection result includes:
if the second detection model outputs the first label or the second label, waiting for a preset time delay; determining the end point of the text to be detected based on the waiting result; and if the second detection model outputs the third label, determining the end point of the text to be detected.
In an actual application scene, if the first detection result output by the first detection model is a positive example, the VAD model acquires a second detection result of the second detection model;
if the second detection model outputs B or I, the end point of the text to be detected may have appeared: the current text to be detected is a complete sentence, but it may be a prefix or a part of another sentence. At this time, the system waits for a preset time delay. The preset time delay can be determined according to the probability that a suffix is appended to the current sentence, and can be modified or reset according to actual needs (for example, it can be derived from the statistical speaking habits of many users, or from the speaking habits of this user over a period of time). Then, it is checked whether the text to be detected remains un-updated within the preset time delay, i.e. whether it continues to conform to the VAD model pattern. If the text to be detected is not updated within the preset time delay, the text to be detected is complete and the VAD model is cut off. If the text to be detected is updated within the preset time delay, the text to be detected was incomplete and belonged to a prefix or a part of another sentence; the detection process is repeated, and the end point of the text to be detected is judged again.
And if the second detection model outputs E, the text to be detected is complete and the sentence cannot be extended any further, so the VAD model is cut off immediately.
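Putting the above together, here is a hedged sketch of the end-point decision driven by the second detection model's B/I/E output; all callables, including the adaptive delay, are hypothetical stand-ins:

```python
import time
from typing import Callable

def endpoint_decision(text: str,
                      second_model: Callable[[str], str],   # returns "B", "I" or "E"
                      latest_text: Callable[[], str],
                      adaptive_delay_s: Callable[[str], float]) -> bool:
    """Combine the second model's B/I/E output with an adaptive wait.
    E means the sentence cannot be extended, so stop at once; B or I means
    the text may be a prefix, so wait a delay sized to how likely a suffix
    is, and stop only if no newer text arrives within it."""
    while True:
        if second_model(text) == "E":
            return True                        # complete and non-extendable
        time.sleep(adaptive_delay_s(text))     # statistically chosen preset delay
        updated = latest_text()
        if updated == text:
            return True                        # no update within the delay: stop
        text = updated                         # the prefix grew: judge again
```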
In the embodiment of the invention, two NLP models (the first detection model and the second detection model) respectively detect whether the current recognition result is a complete sentence and, in the case of a complete sentence, whether it may be a prefix of a longer sentence. The two results are judged comprehensively and combined with detection of the acoustic signal (the user's voice) to optimize the VAD model. The VAD model is therefore cut off quickly once the user has spoken a complete sentence, reducing interaction delay; when the user hesitates or pauses, the VAD model is not cut off, giving the user the full time to finish the expression; and when the sentence spoken by the user is complete but may be a prefix, the waiting delay is determined according to the proportion of such sentences, so that, in the statistical sense, speech recognition accuracy and interaction speed are optimized simultaneously, improving user experience and satisfaction.
As shown in fig. 2, an embodiment of the present invention further provides an apparatus 200 for voice endpoint detection, including:
the first training module 201 is configured to obtain a plurality of normal dialogue corpora, train the plurality of normal dialogue corpora, and obtain a first detection model;
the second training module 202 is configured to traverse all the dialogue corpora, train all the dialogue corpora, and obtain a second detection model;
the first detection module 203 is configured to detect an audio to be detected based on the first detection model, and output a first detection result; wherein, the detecting the audio to be detected based on the first detection model comprises: preprocessing the audio to be detected to obtain a text to be detected;
the second detection module 204 is configured to detect the audio to be detected through the second detection model based on the first detection result, and output a second detection result;
a determining module 205, configured to determine an end point of the audio to be detected based on the first detection result and the second detection result.
In an optional embodiment of the present invention, the apparatus further includes a processing module, wherein the processing module is connected to the first detection module 203 and the second detection module 204 respectively; after the text to be detected is sent to the VAD model, the first detection module 203 and the second detection module 204 detect the text to be detected simultaneously and send the detection results to the processing module for processing, and the processing module controls the VAD model based on the received first detection result and second detection result.
Optionally, the preprocessing the audio to be detected to obtain a text to be detected comprises:
acquiring the audio to be detected;
identifying the audio to be detected to obtain an identification result;
and acquiring the text to be detected based on the identification result.
Optionally, the obtaining a plurality of normal dialog corpora, training the plurality of normal dialog corpora, and obtaining a first detection model includes:
acquiring a positive sample set, wherein the positive sample set comprises a plurality of normal dialogue corpora;
removing the tail of each normal dialogue corpus to obtain a plurality of abnormal dialogue corpuses;
acquiring a negative sample set based on a plurality of abnormal dialogue corpora;
and acquiring the first detection model based on the positive sample set and the negative sample set.
Optionally, after detecting the audio to be detected based on the first detection model and outputting a first detection result, the method further includes:
if the text to be detected output by the first detection model is a negative example, the first detection model waits for a preset fixed time delay;
and if the text to be detected output by the first detection model is a positive example, acquiring an output result of the second detection model.
Optionally, the traversing all the dialog corpuses, and training all the dialog corpuses to obtain a second detection model, including:
processing all the dialogue corpora based on a preset principle to obtain training corpora;
labeling each training corpus to obtain labeled corpuses;
and acquiring a second detection model based on the labeling corpus.
Optionally, the labeling each training corpus to obtain a labeled corpus includes:
determining a starting part, a middle part and an ending part of each training corpus;
and labeling the starting part, the middle part and the ending part to obtain a labeled corpus.
Optionally, the labeling the starting part, the middle part and the ending part to obtain a labeled corpus includes:
labeling the starting part to obtain a first label;
labeling the middle part to obtain a second label;
and labeling the ending part to obtain a third label.
Optionally, the detecting, based on the first detection result, the audio to be detected through the second detection model and outputting a second detection result includes:
the second detection model outputs the first label, the second label or the third label.
Optionally, after the second detection model outputs the first label, the second label, or the third label, the method further includes:
if the second detection model outputs the first label or the second label, waiting for a preset time delay; determining the end point of the text to be detected based on the waiting result;
and if the third label is output by the second detection model, determining the end point of the text to be detected.
Embodiments of the present invention also provide an electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method as described above when executing the computer program.
Embodiments of the present invention also provide a computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method as described above.
In addition, other configurations and functions of the device according to the embodiment of the present invention are known to those skilled in the art, and are not described herein in detail to reduce redundancy.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, the first feature "on" or "under" the second feature may be directly contacting the first and second features or indirectly contacting the first and second features through an intermediate. Also, a first feature "on," "over," and "above" a second feature may be directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," and "beneath" a second feature may be directly under or obliquely under the first feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (12)

1. A method of voice endpoint detection, comprising:
acquiring a plurality of normal dialogue corpora, and training the normal dialogue corpora to acquire a first detection model;
traversing all dialogue corpora, and training all the dialogue corpora to obtain a second detection model;
detecting the audio to be detected based on the first detection model, and outputting a first detection result; wherein, the detecting the audio to be detected based on the first detection model comprises: preprocessing the audio to be detected to obtain a text to be detected;
based on the first detection result, detecting the audio to be detected through the second detection model, and outputting a second detection result;
and determining the end point of the audio to be detected based on the first detection result and the second detection result.
2. The method according to claim 1, wherein the preprocessing the audio to be detected to obtain the text to be detected comprises:
acquiring the audio to be detected;
identifying the audio to be detected to obtain an identification result;
and acquiring the text to be detected based on the identification result.
3. The method according to claim 1, wherein the obtaining a plurality of normal dialog corpuses and training the plurality of normal dialog corpuses to obtain a first detection model comprises:
acquiring a positive sample set, wherein the positive sample set comprises a plurality of normal dialogue corpora;
removing the tail of each normal dialogue corpus to obtain a plurality of abnormal dialogue corpuses;
acquiring a negative sample set based on a plurality of abnormal dialogue corpora;
and acquiring the first detection model based on the positive sample set and the negative sample set.
4. The method according to claim 3, wherein after detecting the audio to be detected based on the first detection model and outputting the first detection result, the method further comprises:
if the text to be detected output by the first detection model is a negative example, the first detection model waits for a preset fixed time delay;
and if the text to be detected output by the first detection model is a positive example, acquiring an output result of the second detection model.
5. The method of claim 1, wherein the traversing all the corpus of dialogues and training all the corpus of dialogues to obtain a second detection model comprises:
processing all the dialogue corpora based on a preset principle to obtain training corpora;
labeling each training corpus to obtain labeled corpuses;
and acquiring a second detection model based on the labeling corpus.
6. The method according to claim 5, wherein the labeling each corpus to obtain a labeled corpus comprises:
determining a starting part, a middle part and an ending part of each training corpus;
and labeling the starting part, the middle part and the ending part to obtain a labeled corpus.
7. The method according to claim 6, wherein the labeling the starting part, the middle part and the ending part to obtain a labeled corpus comprises:
labeling the starting part to obtain a first label;
labeling the middle part to obtain a second label;
and labeling the ending part to obtain a third label.
8. The method according to claim 7, wherein the detecting a plurality of frames of the audio to be detected through a second detection model based on the first detection result and outputting a second detection result comprises:
the second detection model outputs the first label, the second label or the third label.
9. The method of claim 8, wherein after the second detection model outputs the first label, the second label, or the third label, the method further comprises:
if the second detection model outputs the first label or the second label, waiting for a preset time delay; determining the end point of the text to be detected based on the waiting result;
and if the third label is output by the second detection model, determining the end point of the text to be detected.
10. An apparatus for voice endpoint detection, comprising:
the first training module is used for acquiring a plurality of normal dialogue corpora and training the normal dialogue corpora to acquire a first detection model;
the second training module is used for traversing all dialogue corpora and training all the dialogue corpora to obtain a second detection model;
the first detection module is used for detecting the audio to be detected based on the first detection model and outputting a first detection result; wherein, the detecting the audio to be detected based on the first detection model comprises: preprocessing the audio to be detected to obtain a text to be detected;
the second detection module is used for detecting the audio to be detected through the second detection model based on the first detection result and outputting a second detection result;
and the determining module is used for determining the end point of the audio to be detected based on the first detection result and the second detection result.
11. An electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the method of any of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method of any of claims 1-9.
Priority application: CN202111401623.6A (Method, device, equipment and storage medium for voice endpoint detection), filed 2021-11-19, priority date 2021-11-19, status Pending

Publication: CN114255742A, published 2022-03-29

Family ID: 80793077

Country: CN (China)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination