CN114155839A - Voice endpoint detection method, device, equipment and storage medium

Voice endpoint detection method, device, equipment and storage medium

Info

Publication number
CN114155839A
CN114155839A (application CN202111535332.6A)
Authority
CN
China
Prior art keywords
mute
frame
voice
audio
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111535332.6A
Other languages
Chinese (zh)
Inventor
张儒瑞
李永超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111535332.6A priority Critical patent/CN114155839A/en
Publication of CN114155839A publication Critical patent/CN114155839A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L 15/26 - Speech to text systems
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/78 - Detection of presence or absence of voice signals
    • G10L 2025/783 - Detection of presence or absence of voice signals based on threshold decision

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a voice endpoint detection method, apparatus, device and storage medium. The voice endpoint detection method can determine whether an audio frame contained in the audio data to be detected is a mute frame, a noise frame or a voice frame; that is, the application can detect the attribute of each audio frame more accurately, detect the voice front endpoint and the voice rear endpoint on this basis, and thereby obtain a more accurate detection result. On the basis of voice endpoint detection, the method can further obtain the recognition text of the voice segment, determine the semantic scene of the recognition text according to its semantics, and set an appropriate post-mute timeout threshold according to that semantic scene, so that the post-mute timeout event is triggered based on an appropriate post-mute timeout threshold and the user experience is improved.

Description

Voice endpoint detection method, device, equipment and storage medium
Technical Field
The present application relates to the field of voice endpoint detection technologies, and in particular, to a method, an apparatus, a device, and a storage medium for voice endpoint detection.
Background
Voice Activity Detection (VAD) is an important part of speech recognition. Not all audio frames in the audio data are voice frames; by detecting the voice endpoints, the voice segments in the audio data can be located.
Most current voice endpoint detection schemes work as follows: predict whether an audio frame contained in the audio data to be detected is a mute frame or a non-mute frame; if it is a mute frame, determine it to be a non-voice frame, and if it is a non-mute frame, determine it to be a voice frame; after obtaining the result that each audio frame is a voice frame or a non-voice frame, detect the voice endpoints according to that result.
However, in practical application scenarios a non-mute frame is not necessarily a voice frame, and treating every non-mute frame as a voice frame makes it difficult to obtain an accurate voice endpoint detection result.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, a device and a storage medium for detecting a voice endpoint, so as to solve the problem of low detection accuracy of the existing voice endpoint detection scheme, and the technical scheme is as follows:
a voice endpoint detection method, comprising:
after audio data to be detected are obtained, first information and second information corresponding to audio frames contained in the audio data are obtained, wherein the first information can indicate whether the corresponding audio frames are mute frames or non-mute frames, and the second information is pronunciation information of the corresponding audio frames;
classifying the audio frames contained in the audio data as mute frames, noise frames or voice frames, based on the first information and the second information corresponding to the audio frames contained in the audio data;
and detecting a voice front end point and a voice rear end point according to a judgment result corresponding to an audio frame contained in the audio data.
Optionally, the voice endpoint detection method further includes:
after the voice front end point is detected, recognizing a voice section starting from the voice front end point as a text to obtain a recognition text and a confidence coefficient corresponding to the recognition text;
setting a post-mute timeout threshold according to the semantics of the recognized text and the confidence corresponding to the recognized text, and triggering a post-mute timeout event based on the set post-mute timeout threshold.
Optionally, the obtaining first information and second information corresponding to audio frames included in the audio data includes:
predicting, by using a pre-established multitask joint model, the acoustic scores of each audio frame contained in the audio data being a mute frame and a non-mute frame respectively, and the full-phoneme acoustic score corresponding to each audio frame contained in the audio data;
taking the acoustic scores of an audio frame being a mute frame and a non-mute frame respectively as the first information corresponding to the audio frame, and taking the full-phoneme acoustic score corresponding to the audio frame as the second information corresponding to the audio frame, wherein the full-phoneme acoustic score comprises the acoustic scores corresponding to all phonemes contained in the language to which the audio data belongs.
Optionally, the classifying, based on the first information and the second information corresponding to the audio frames contained in the audio data, the audio frames contained in the audio data as mute frames, noise frames and voice frames includes:
for an audio frame to be classified, determining whether the audio frame is a mute frame or a non-mute frame according to the first information corresponding to the audio frame;
and when the audio frame is determined to be a non-mute frame, determining whether the audio frame is a noise frame or a voice frame according to the second information corresponding to the audio frame.
Optionally, the determining whether the audio frame is a noise frame or a voice frame according to the second information corresponding to the audio frame includes:
if the maximum acoustic score in the full-phoneme acoustic scores corresponding to the audio frame is smaller than the preset acoustic score threshold value, judging the audio frame to be a noise frame;
and if the maximum acoustic score in the full-phoneme acoustic scores corresponding to the audio frame is greater than or equal to a preset acoustic score threshold value, judging that the audio frame is a voice frame.
Optionally, the recognizing, as a text, a speech segment starting from the speech front-end point to obtain a recognition text and a confidence corresponding to the recognition text includes:
decoding, starting from the voice front end point, the second information corresponding to each voice frame through a pre-constructed phoneme-level network, and ending the decoding after the voice rear end point is detected and the second information corresponding to the voice rear end point has been decoded, wherein the decoding is performed synchronously with the voice rear end point detection, the phoneme-level network is constructed according to first corpora in a first corpus set and second corpora in a second corpus set, a first corpus is a corpus whose semantics are incomplete and which needs to wait a longer time before triggering the post-mute timeout event, and a second corpus is a corpus whose semantics are complete and which can trigger the post-mute timeout event without waiting a long time;
and obtaining an optimal decoding result and a confidence coefficient corresponding to the optimal decoding result by backtracking the optimal decoding path, wherein the optimal decoding result and the confidence coefficient corresponding to the optimal decoding result are used as the recognition text and the confidence coefficient corresponding to the recognition text.
Optionally, constructing a phoneme-level network according to a first corpus in the first corpus set and a second corpus in the second corpus set, including:
connecting a first corpus in the first corpus and a second corpus in the second corpus in parallel to obtain a sentence-level network, wherein each corpus is a node in the sentence-level network;
expanding each corpus in the sentence-level network into a single word to obtain an initial word-level network, wherein each single word is a node in the initial word-level network;
merging the nodes and paths in the initial word-level network to obtain a final word-level network;
expanding each single character in the final word-level network into a phoneme to obtain an initial phoneme-level network, wherein each phoneme is a node in the initial phoneme-level network;
and combining the nodes and the paths in the initial phoneme level network to obtain a final phoneme level network.
Optionally, the setting a post-mute timeout threshold according to the semantics of the recognized text and the confidence corresponding to the recognized text includes:
determining a semantic scene of the recognition text from set semantic scenes according to the semantics of the recognition text and the confidence corresponding to the recognition text;
setting a mute timeout threshold according to the semantic scene of the recognized text;
the set semantic scenes comprise a first scene, a second scene and a default scene, each scene has a corresponding post-mute timeout threshold, the post-mute timeout threshold corresponding to the first scene is greater than the default post-mute timeout threshold corresponding to the default scene, and the post-mute timeout threshold corresponding to the second scene is less than the default post-mute timeout threshold corresponding to the default scene.
Optionally, determining the semantic scene of the recognition text from the set semantic scenes according to the semantics of the recognition text and the confidence corresponding to the recognition text, including:
determining whether the recognition text is credible or not according to the confidence corresponding to the recognition text;
if the recognition text is not credible, determining that the semantic scene of the recognition text is a default scene;
if the recognition text is credible, determining the semantic similarity between the recognition text and a first corpus in the first corpus set and the semantic similarity between the recognition text and a second corpus in the second corpus set;
and determining the semantic scene of the recognition text from the set semantic scenes according to the determined semantic similarity.
Optionally, the determining the semantic scene of the recognition text from the set semantic scenes according to the determined semantic similarity includes:
if the maximum semantic similarity in the determined semantic similarities is greater than or equal to a preset similarity threshold and the maximum semantic similarity is the similarity between the recognition text and the first corpus, determining that the semantic scene of the recognition text is the first scene;
if the maximum semantic similarity is greater than or equal to the preset similarity threshold and the maximum semantic similarity is the semantic similarity between the recognition text and a second corpus, determining that the semantic scene of the recognition text is the second scene;
and if the maximum semantic similarity is smaller than the preset similarity threshold, determining that the semantic scene of the recognition text is the default scene.
Optionally, the setting of the post-mute timeout threshold according to the semantic scene of the recognized text includes:
if the semantic scene of the recognized text is the first scene, setting a post-mute timeout threshold as a post-mute timeout threshold corresponding to the first scene;
if the semantic scene of the recognized text is the second scene, setting a post-mute timeout threshold as a post-mute timeout threshold corresponding to the second scene;
and if the semantic scene of the recognized text is the default scene, setting the post-mute overtime threshold as the default post-mute overtime threshold corresponding to the default scene.
A voice endpoint detection apparatus comprising: the device comprises a discrimination information acquisition module, an audio frame discrimination module and a voice endpoint detection module;
the judgment information acquisition module is used for acquiring first information and second information corresponding to audio frames contained in the audio data after the audio data to be detected is acquired, wherein the first information can indicate whether the corresponding audio frames are mute frames or non-mute frames, and the second information is pronunciation information of the corresponding audio frames;
the audio frame judging module is used for judging a mute frame, a noise frame and a voice frame of the audio frame contained in the audio data according to the first information and the second information corresponding to the audio frame contained in the audio data;
and the voice endpoint detection module is used for detecting a voice front endpoint and a voice rear endpoint according to the judgment result corresponding to the audio frame contained in the audio data.
Optionally, the voice endpoint detecting apparatus further includes: the device comprises a voice section identification module, a post-mute overtime threshold setting module and a post-mute overtime event triggering module;
the voice segment recognition module is configured to, after the voice endpoint detection module detects the voice front endpoint, recognize a voice segment starting from the voice front endpoint as a text to obtain a recognition text and a confidence corresponding to the recognition text;
the post-mute overtime threshold setting module is used for setting a post-mute overtime threshold according to the semantics of the recognized text and the confidence coefficient corresponding to the recognized text;
and the post-mute timeout event triggering module is used for triggering the post-mute timeout event based on the post-mute timeout threshold set by the post-mute timeout threshold setting module.
A voice endpoint detection apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the voice endpoint detection method described in any one of the above.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the voice endpoint detection method of any of the above.
After the audio data to be detected is obtained, first information (which can indicate whether the corresponding audio frame is a mute frame or a non-mute frame) and second information (the pronunciation information of the corresponding audio frame) corresponding to the audio frames contained in the audio data are first obtained; then the audio frames contained in the audio data are classified as mute frames, noise frames or voice frames based on the first information and the second information corresponding to them; and finally a voice front end point and a voice rear end point are detected according to the classification results corresponding to the audio frames contained in the audio data. The voice endpoint detection method provided by the application can determine whether an audio frame contained in the audio data is a mute frame, a noise frame or a voice frame; that is, it can detect the attribute of each audio frame more accurately, and detects the voice front end point and the voice rear end point on this basis, so a more accurate detection result can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a voice endpoint detection method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of another voice endpoint detection method according to an embodiment of the present application;
FIG. 3 is a diagram illustrating an example of a sentence-level network provided by an embodiment of the present application;
fig. 4 is a schematic diagram of an initial word-level network obtained by expanding each corpus in the sentence-level network shown in fig. 3 into a single word according to an embodiment of the present application;
fig. 5 is a result of node and path merging performed on the initial word-level network shown in fig. 4 according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a voice endpoint detection apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a voice endpoint detection device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the process of implementing the application, the applicant found that most existing voice endpoint detection schemes are feature-based detection methods or model-based detection methods. The feature-based detection method starts from finding a feature that can represent the difference between mute frames and non-mute frames and uses it to distinguish them, while the model-based detection method distinguishes whether an audio frame is a mute frame or a non-mute frame by modeling mute frames and non-mute frames. It can be seen that, whether the feature-based or the model-based detection method is used, an audio frame in the audio data can only be detected as a mute frame or a non-mute frame. However, a non-mute frame is not necessarily a voice frame, and using every non-mute frame as a voice frame for voice endpoint detection causes the detected voice endpoints to be inaccurate. In other words, the existing voice endpoint detection schemes cannot accurately detect the attribute of an audio frame, so it is difficult to obtain an accurate voice endpoint detection result.
In view of the low accuracy of existing voice endpoint detection, the applicant tried to provide a voice endpoint detection method with higher detection accuracy and researched it. Through continuous research, a voice endpoint detection method was finally proposed that can determine whether an audio frame contained in the audio data to be detected is a mute frame, a noise frame or a voice frame; that is, the attribute of each audio frame can be determined more accurately, the voice endpoints are detected on this basis, and a more accurate detection result can be obtained. On the basis of voice endpoint detection, the applicant further proposes recognizing the voice segment from the voice front end point to the voice rear end point as text, setting an appropriate post-mute timeout threshold according to the semantics of the recognized text, and triggering the post-mute timeout event based on that threshold, thereby improving the user experience.
The voice endpoint detection method provided by the application can be applied to electronic equipment with data processing capacity, the electronic equipment can be a terminal used by a user side, such as a smart phone, a PC, a notebook, a PAD, an intelligent household appliance, a vehicle-mounted terminal and the like, the electronic equipment can also be a server (a single server or a plurality of servers or a server cluster) on a network side, and the electronic equipment can detect a more accurate voice endpoint according to the voice endpoint detection method provided by the application. The following embodiments are provided to describe the voice endpoint detection method provided in the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a voice endpoint detection method provided in an embodiment of the present application is shown, where the method may include:
step S101: after the audio data to be detected are obtained, first information and second information corresponding to audio frames contained in the audio data are obtained.
The first information can indicate whether the corresponding audio frame is a mute frame or a non-mute frame, and the second information is pronunciation information of the corresponding audio frame.
Optionally, the first information corresponding to an audio frame may include the scores of the audio frame being a mute frame and a non-mute frame, respectively (i.e., the acoustic score of the audio frame being a mute frame and the acoustic score of the audio frame being a non-mute frame).
Optionally, the second information may be the full-phoneme acoustic score corresponding to the audio frame, where the full-phoneme acoustic score refers to the acoustic scores corresponding to all phonemes contained in the language to which the audio data belongs; assuming the language contains N phonemes in total, the second information includes N acoustic scores, one for each phoneme.
Optionally, the process of acquiring the first information and the second information corresponding to the audio frames included in the audio data may include: predicting, by using a pre-established multi-task joint model, the acoustic scores of each audio frame contained in the audio data being a mute frame and a non-mute frame respectively, and the full-phoneme acoustic score corresponding to each audio frame contained in the audio data.
Specifically, the process of predicting, by using the pre-established multitask joint model, the acoustic scores of each audio frame being a mute frame and a non-mute frame respectively, and the full-phoneme acoustic score corresponding to each audio frame contained in the audio data, may include:
step a1, obtaining audio features of audio frames contained in the audio data.
The audio features may be, but are not limited to, filterbank features, mfcc features, and the like.
Step a2, predicting, by using the pre-established multitask joint model and based on the audio features of the audio frames contained in the audio data, the acoustic scores of each audio frame being a mute frame and a non-mute frame respectively, and the full-phoneme acoustic score corresponding to each audio frame contained in the audio data.
Specifically, the audio features of the audio frames included in the audio data are input into the multitask joint model, and the multitask joint model predicts acoustic scores of the corresponding audio frames, namely the mute frames and the non-mute frames, and full-phoneme acoustic scores corresponding to the corresponding audio frames according to the input audio features.
The multitask joint model comprises an input layer, a hidden layer, a first output layer and a second output layer, wherein the first output layer and the second output layer share the input layer and the hidden layer, the input layer is used for inputting audio features of audio frames contained in audio data, the hidden layer is used for processing the audio features input by the input layer, the first output layer is used for predicting acoustic scores of the corresponding audio frames which are mute frames and non-mute frames respectively according to the output of the hidden layer, and the second output layer is used for predicting full-phoneme acoustic scores corresponding to the corresponding audio frames according to the output of the hidden layer.
It should be noted that the first output layer of the multitask joint model predicts, according to the output of the hidden layer, the probabilities that the corresponding audio frame is a mute frame and a non-mute frame respectively, and then determines the acoustic scores of the audio frame being a mute frame and a non-mute frame respectively according to these probabilities. Similarly, the second output layer predicts, according to the output of the hidden layer, the probability that the phoneme corresponding to the audio frame is each phoneme in the full phoneme set, and then determines the acoustic score of the audio frame for each phoneme in the full phoneme set, that is, the full-phoneme acoustic score.
The multitask joint model is obtained by training with training audio data and a first class label and a second class label corresponding to each audio frame contained in the training audio data, wherein the first class label corresponding to an audio frame indicates whether the audio frame is a mute frame or a non-mute frame, and the second class label indicates which phoneme in the full phoneme set the audio frame corresponds to. During training, the multitask joint model has two tasks: one is to learn the classification of mute and non-mute, and the other is to learn the classification of the full phoneme set. Two loss functions are set for the two tasks, and the model parameters are updated according to both loss functions, wherein one loss function is determined according to the probabilities that an audio frame predicted by the multitask joint model is a mute frame and a non-mute frame and the first class label corresponding to the audio frame, and the other loss function is determined according to the probability that the phoneme corresponding to the audio frame predicted by the multitask joint model is each phoneme in the full phoneme set and the second class label corresponding to the audio frame.
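For illustration only, a minimal sketch of such a multitask joint model is given below, assuming PyTorch; the layer sizes, feature dimension and phoneme inventory size are placeholder values, not taken from this application:

    import torch
    import torch.nn as nn

    class MultiTaskJointModel(nn.Module):
        """Shared input/hidden layers with two output heads:
        head 1 scores mute vs. non-mute, head 2 scores the full phoneme set."""
        def __init__(self, feat_dim=40, hidden_dim=256, num_phonemes=100):
            super().__init__()
            self.hidden = nn.Sequential(                 # shared hidden layers
                nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.mute_head = nn.Linear(hidden_dim, 2)               # mute / non-mute
            self.phoneme_head = nn.Linear(hidden_dim, num_phonemes) # full-phoneme scores

        def forward(self, feats):                        # feats: (frames, feat_dim)
            h = self.hidden(feats)
            return self.mute_head(h), self.phoneme_head(h)

    def joint_loss(mute_logits, phoneme_logits, mute_labels, phoneme_labels):
        # one loss per task, summed so that gradients from both tasks
        # update the shared hidden layers
        ce = nn.CrossEntropyLoss()
        return ce(mute_logits, mute_labels) + ce(phoneme_logits, phoneme_labels)

Because the two heads share the hidden layers, the summed loss trains both classification tasks jointly, which corresponds to the two-task training described above.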
Step S102: classifying the audio frames contained in the audio data as mute frames, noise frames or voice frames, based on the first information and the second information corresponding to the audio frames contained in the audio data.
Specifically, the process of determining the silence frame, the noise frame, and the speech frame in the audio frame included in the audio data according to the first information and the second information corresponding to the audio frame included in the audio data may include:
step b1, for each audio frame to be distinguished, distinguishing the audio frame from a mute frame and a non-mute frame according to the first information corresponding to the audio frame.
In view of the above, for the audio frame to be determined, whether the audio frame is a mute frame or a non-mute frame may be determined according to the acoustic scores of the audio frame that is a mute frame and a non-mute frame, specifically, if the acoustic score of the audio frame that is a mute frame is greater than the acoustic score of the audio frame that is a non-mute frame, the audio frame is determined as a mute frame, otherwise, the audio frame is determined as a non-mute frame.
Optionally, the first information may further include frame energy of a corresponding audio frame, and in view of this, for an audio frame to be determined, it may be determined whether the audio frame is a silent frame or a non-silent frame according to acoustic scores of the audio frame and the frame energy of the audio frame, respectively.
Step b2, when the audio frame is determined to be a non-mute frame, determining whether the audio frame is a noise frame or a voice frame according to the second information corresponding to the audio frame.
As described above, the second information corresponding to an audio frame may be the full-phoneme acoustic score; therefore, for the audio frame to be classified, whether it is a noise frame or a voice frame can be determined according to the full-phoneme acoustic score corresponding to the audio frame. Specifically, if the maximum acoustic score among the full-phoneme acoustic scores corresponding to the audio frame is less than a preset acoustic score threshold, the audio frame is determined to be a noise frame, and if the maximum acoustic score among the full-phoneme acoustic scores corresponding to the audio frame is greater than or equal to the preset acoustic score threshold, the audio frame is determined to be a voice frame.
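A per-frame decision following steps b1 and b2 could be sketched as below; the function name and the threshold value are illustrative assumptions:

    def classify_frame(mute_score, non_mute_score, phoneme_scores, score_threshold=0.5):
        """Classify one audio frame as 'mute', 'noise' or 'speech'.

        mute_score / non_mute_score: acoustic scores from the first output layer.
        phoneme_scores: full-phoneme acoustic scores from the second output layer.
        score_threshold: illustrative preset acoustic score threshold.
        """
        # Step b1: mute vs. non-mute, from the first information
        if mute_score > non_mute_score:
            return "mute"
        # Step b2: noise vs. speech, from the second information
        if max(phoneme_scores) < score_threshold:
            return "noise"
        return "speech"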
Step S103: and detecting a voice front end point and a voice rear end point according to the judgment result corresponding to the audio frame contained in the audio data.
Specifically, when a first preset frame number of consecutive voice frames appears in the audio data, it is determined that the voice front end point is detected, and the first voice frame among these consecutive voice frames is determined to be the voice front end point. After the voice front end point is detected, if a second preset frame number of consecutive non-voice frames appears in the audio data, it is determined that the voice rear end point is detected, and the voice frame immediately preceding these consecutive non-voice frames is determined to be the voice rear end point. Optionally, the first preset frame number may be, but is not limited to, an integer in the interval [10, 20], and the second preset frame number may be, but is not limited to, an integer in the interval [30, 40].
Illustratively, the first preset frame number is 20 and the second preset frame number is 40. Assuming that the 11th to 30th audio frames of the audio data are determined to be voice frames, it can be determined that the voice front end point is detected and the 11th audio frame is the voice front end point; after the voice front end point is detected, assuming that the 40 consecutive audio frames after the 60th audio frame are detected to be non-voice frames, it can be determined that the voice rear end point is detected and the 60th audio frame is the voice rear end point.
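The endpoint decision described above can be sketched as simple run-length counting over the per-frame judgments; the function below is an illustrative assumption that reproduces the example (frame indices are 0-based in code, so the 11th frame of the text is index 10):

    def detect_endpoints(frame_labels, first_preset=20, second_preset=40):
        """Return (front_endpoint, back_endpoint) frame indices, or None where not found.

        frame_labels: per-frame judgments, e.g. ["mute", "speech", "noise", ...].
        """
        front, back = None, None
        speech_run, non_speech_run = 0, 0
        for i, label in enumerate(frame_labels):
            if front is None:
                speech_run = speech_run + 1 if label == "speech" else 0
                if speech_run == first_preset:
                    front = i - first_preset + 1      # first frame of the speech run
            else:
                non_speech_run = non_speech_run + 1 if label != "speech" else 0
                if non_speech_run == second_preset:
                    back = i - second_preset          # last speech frame before the run
                    break
        return front, back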
It should be noted that, if the voice front end point is not detected, it is determined whether the front mute time length exceeds a preset front mute timeout threshold, and if the front mute time length exceeds the preset front mute timeout threshold, the detection is ended.
According to the voice endpoint detection method provided by the embodiment of the application, after audio data to be detected is obtained, first information (the first information can indicate whether the corresponding audio frame is a mute frame or a non-mute frame) and second information (the second information is pronunciation information of the corresponding audio frame) corresponding to the audio frame contained in the audio data are obtained, then the audio frame contained in the audio data is judged according to the first information and the second information corresponding to the audio frame contained in the audio data, and finally a voice front endpoint and a voice rear endpoint are detected according to a judgment result corresponding to the audio frame contained in the audio data. The voice endpoint detection method provided by the embodiment of the application can judge whether the audio frame contained in the audio data is a mute frame, a noise frame or a voice frame, namely, the embodiment of the application can detect the more accurate attribute of the audio frame contained in the audio data, and on the basis, the detection of the voice front endpoint and the voice rear endpoint is carried out, so that the more accurate voice endpoint detection result can be obtained. Compared with the existing voice endpoint detection method, the voice endpoint detection method provided by the embodiment of the application improves the detection accuracy of the voice endpoint and lays a solid foundation for the use of subsequent voice sections.
Second embodiment
In some application scenarios, after a voice back end point is detected, a back mute time duration needs to be detected, and when the detected back mute time duration is greater than a set back mute time-out threshold, a back mute time-out event is triggered, so that some operation is executed after the application receives the back mute time-out event.
For example, in a recording scenario, a voice assistant on an electronic device generally needs to be able to automatically stop recording after a user finishes a sentence with complete semantics, so as to recognize what the user said and then perform semantic recognition and subsequent operations on the recognized content. Statistics on current user behavior in the live system show that, with a post-mute timeout event triggered after 800 ms, about 95% of user utterances have complete semantics and about 4% have incomplete semantics. If the post-mute timeout event can be triggered more quickly for the majority of users whose utterances already have complete semantics, and a longer post-mute timeout threshold can be set for users who have not yet expressed complete semantics so as to give them more time to think, the user experience can be improved.
In view of the above, the present application provides another method for detecting a voice endpoint, please refer to fig. 2, which shows a flowchart of the method for detecting a voice endpoint, and the method may include:
step S201: after the audio data to be detected are obtained, first information and second information corresponding to audio frames contained in the audio data are obtained.
The first information can indicate whether the corresponding audio frame is a mute frame or a non-mute frame, and the second information is pronunciation information of the corresponding audio frame. Optionally, the first information may be scores of a mute frame and a non-mute frame corresponding to the audio frame, respectively, and the second information may be a full-phoneme acoustic score corresponding to the corresponding audio frame.
Step S202: and judging the audio frame contained in the audio data by taking the first information and the second information corresponding to the audio frame contained in the audio data as a basis.
Step S203: and detecting the voice front end point according to the judgment result corresponding to the audio frame contained in the audio data.
It should be noted that, if the voice front end point is not detected, it is determined whether the pre-mute time length exceeds a preset pre-mute timeout threshold, if so, the detection is ended, and if the voice front end point is detected, the subsequent steps are executed.
Step S204: after the voice front end point is detected, recognizing a voice section starting from the voice front end point as a text, synchronously detecting a voice rear end point according to a judgment result corresponding to an audio frame contained in audio data in a recognition process, and after the recognition from the voice front end point to the voice rear end point is completed, finishing the recognition to obtain a recognition text and a confidence coefficient corresponding to the recognition text.
It should be noted that, in steps S201 to S203 and S204, the specific implementation process and the related explanation of "detecting the rear end point of the voice according to the determination result corresponding to the audio frame included in the audio data" may refer to the specific implementation process and the related explanation of steps S101 to S103 in the foregoing embodiment, which is not described herein again.
The process of recognizing a speech segment starting from a speech front end point as a text to obtain a recognition text and a confidence corresponding to the recognition text may include:
and c1, starting from the voice front end point, decoding the second information corresponding to the voice frame (such as the whole phoneme score corresponding to the voice frame) through the pre-constructed phoneme level network.
Alternatively, the decoding may be performed using any one of decoding algorithms such as Viterbi (Viterbi), DTW, and the like.
It should be noted that the voice rear end point is detected synchronously during the decoding process, and the decoding ends after the voice rear end point is detected and the second information corresponding to the voice rear end point (for example, the full-phoneme acoustic score of that frame) has been decoded.
The phoneme-level network in step c1 is constructed according to the first corpora in a first corpus set and the second corpora in a second corpus set, where the first corpus set includes a plurality of first corpora, a first corpus being a corpus whose semantics are incomplete and which needs to wait a longer time before triggering the post-mute timeout event, such as "I want to watch" or "I want to find a", and the second corpus set includes a plurality of second corpora, a second corpus being a corpus whose semantics are complete and which can trigger the post-mute timeout event without waiting a long time, such as "call the police" or "shut down immediately".
Specifically, the process of constructing the phoneme-level network according to the first corpus in the first corpus and the second corpus in the second corpus may include:
and d1, connecting the first language material in the first language material set and the second language material in the second language material set in parallel to obtain a sentence-level network.
Each corpus is a node in the sentence-level network.
It should be noted that, in step d1, all the corpora in the two corpora sets are connected in parallel, and the first corpus and the second corpus do not need to be processed separately.
Referring to fig. 3, a schematic diagram of an example sentence-level network is shown. The sentence-level network shown in fig. 3 is obtained by connecting the corpus "call Zhang San" and the corpus "call Li Si" in parallel; the sentence-level network begins with a mute node, and each corpus in the sentence-level network also ends with a mute node.
And d2, expanding each corpus in the sentence-level network into a single word to obtain an initial word-level network.
Wherein, each single character is a node in the initial word-level network.
Referring to fig. 4, a schematic diagram of the initial word-level network obtained by expanding each corpus in the sentence-level network shown in fig. 3 into single characters is shown. As shown in fig. 4, "call Zhang San" and "call Li Si" are each expanded into the single characters that make them up, and every node in the initial word-level network except the mute nodes is a single character of a corpus.
And d3, merging the nodes and paths in the initial word-level network to obtain a final word-level network.
Since the same node exists in some paths in the initial word-level network, the present embodiment merges the same nodes of different paths, and path merging is inevitably performed when the nodes are merged. Alternatively, a directed graph merging algorithm may be used to merge nodes and paths in the initial word-level network.
As shown in fig. 4, the two paths share the single-character nodes of the common prefix "call" as well as the mute node "sil"; for this reason, path and node merging needs to be performed on the initial word-level network. Fig. 5 shows the result of performing node and path merging on the initial word-level network shown in fig. 4. As shown in fig. 5, redundant nodes and paths are removed by the merging of nodes and paths, thereby simplifying the network structure of the word-level network.
And d4, expanding each single character in the final word-level network into phonemes to obtain an initial phoneme-level network.
Wherein each phoneme is a node in the initial phoneme level network.
Optionally, each single character in the final word-level network may be expanded into monophones. Taking the node "打" (da, the first character of "call" in fig. 5) as an example, it is expanded into the monophones "d" and "a". This embodiment is not limited thereto; each single character in the final word-level network may also be expanded into multi-phones, such as diphones and triphones. Taking the same node as an example, it can be expanded into the diphones "sil-d" and "d+a", or into the triphones "sil-d+a" and "d-a+d".
And d5, merging the nodes and paths in the initial phoneme-level network to obtain a final phoneme-level network.
Similar to the initial word-level network, the same nodes may exist on different paths in the initial phoneme-level network, and for this reason, the present embodiment combines the same nodes on different paths of the initial phoneme-level network, and inevitably performs path combination when combining the nodes. Alternatively, a directed graph merging algorithm may be used to merge nodes and paths in the initial phoneme level network.
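As a rough illustration of the construction steps described above, the sketch below builds the network as a directed graph in which identical prefixes of different corpora collapse into shared nodes; the lexicon mapping each character to its phonemes is a placeholder, and merging of common suffixes, which a full directed-graph merging algorithm would also perform, is omitted here:

    def build_phoneme_network(corpora, lexicon):
        """Build a prefix-merged phoneme-level network from a list of corpora.

        corpora: all first and second corpora, e.g. ["call Zhang San", "call Li Si"].
        lexicon: placeholder dict mapping a single character to its phoneme list.
        Returns a nested dict acting as a trie: phoneme label -> children.
        """
        root = {}
        for sentence in corpora:                  # d1/d2: one path per corpus, split into characters
            phonemes = ["sil"]                    # each corpus starts and ends with a mute node
            for char in sentence:                 # d4: expand each character into phonemes
                phonemes.extend(lexicon[char])
            phonemes.append("sil")
            node = root
            for p in phonemes:                    # d3/d5: shared prefixes collapse into shared nodes
                node = node.setdefault(p, {})
        return root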
And c2, obtaining the optimal decoding result and the confidence coefficient corresponding to the optimal decoding result by backtracking the optimal decoding path, taking the optimal decoding result as the recognition text, and taking the confidence coefficient corresponding to the optimal decoding result as the confidence coefficient corresponding to the recognition text.
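A frame-synchronous Viterbi decoding over such a network, with the best path recovered by back-tracing (mapping the path back to the corpus text is omitted), could be sketched as follows; the data layout (node table, edge table, log-domain scores, "sil" scored from the mute acoustic score) is an assumption for illustration rather than the application's actual decoder:

    def viterbi_decode(frame_scores, nodes, edges, start, finals):
        """Frame-synchronous Viterbi over a phoneme-level network.

        frame_scores: list of dicts, one per frame, mapping phoneme -> log-domain
                      acoustic score (including an entry for "sil").
        nodes: dict mapping node id -> phoneme label, e.g. {0: "sil", 1: "d", ...}.
        edges: dict mapping node id -> list of successor node ids.
        start: id of the initial mute node; finals: set of admissible final node ids.
        Returns (best_score, best_node_sequence).
        """
        tokens = {start: (frame_scores[0][nodes[start]], [start])}  # node -> (score, path)
        for scores in frame_scores[1:]:
            new_tokens = {}
            for node, (score, path) in tokens.items():
                for nxt in [node] + edges.get(node, []):            # stay in the node or advance
                    cand = score + scores[nodes[nxt]]
                    if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                        new_tokens[nxt] = (cand, path + [nxt])
            tokens = new_tokens
        best = max((t for n, t in tokens.items() if n in finals), key=lambda t: t[0])
        return best

The best score can then serve as the basis of the confidence corresponding to the recognition text, as described in step c2.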
Step S205: setting a post-mute timeout threshold according to the semantics of the recognized text and the confidence corresponding to the recognized text, and triggering a post-mute timeout event based on the set post-mute timeout threshold.
Specifically, according to the semantics of the recognized text and the confidence corresponding to the recognized text, the process of setting the post-mute timeout threshold may include:
and e1, determining the semantic scene of the recognized text from the set semantic scenes according to the semantics of the recognized text and the confidence corresponding to the recognized text.
The set semantic scenes may include a first scene, a second scene and a default scene, each scene has a corresponding post-mute timeout threshold, the post-mute timeout threshold corresponding to the first scene is greater than the default post-mute timeout threshold corresponding to the default scene, and the post-mute timeout threshold corresponding to the second scene is less than the default post-mute timeout threshold corresponding to the default scene. It should be noted that the default post-mute timeout threshold corresponding to the default scene may be set according to a specific service scene, for example, the recording scene described above, and the default post-mute timeout threshold corresponding to the default scene may be set to 800ms, based on which, the post-mute timeout threshold corresponding to the first scene is set to be greater than 800ms, and the post-mute timeout threshold corresponding to the second scene is set to be less than 800 ms.
Specifically, the process of determining the semantic scene of the recognized text from the set semantic scenes according to the semantics of the recognized text and the confidence corresponding to the recognized text may include:
and e1-1, determining whether the recognized text is credible according to the confidence corresponding to the recognized text.
Specifically, if the confidence corresponding to the recognized text is greater than or equal to the preset confidence threshold, the recognized text is determined to be trusted, and if the confidence corresponding to the recognized text is less than the preset confidence threshold, the recognized text is determined to be not trusted.
And e1-2a, if the recognized text is not credible, determining that the semantic scene of the recognized text is a default scene.
And e1-2b-1, if the recognized text is credible, determining the semantic similarity between the recognized text and the first corpus in the first corpus set and the semantic similarity between the recognized text and the second corpus in the second corpus set.
Specifically, when determining the similarity between the recognized text and a corpus, the expression vector of each word in the recognized text and the expression vector of each word in the corpus may first be obtained; the expression vectors of the words in the recognized text are then encoded to obtain the semantic representation vector of the recognized text, and the expression vectors of the words in the corpus are encoded in the same way to obtain the semantic representation vector of the corpus; finally, the semantic similarity between the recognized text and the corpus is determined according to the two semantic representation vectors. Optionally, any one of MLP, CNN, RNN, Self-Attention, Transformer encoder, BERT, and the like may be used to encode the expression vectors of the words in the recognized text and in the corpus. When determining the similarity between the recognized text and the corpus according to their semantic representation vectors, the cosine similarity, Gaussian distance, or the like between the two semantic representation vectors can be calculated.
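A minimal sketch of the similarity computation is shown below; mean pooling of the word expression vectors stands in for whichever encoder is chosen, and cosine similarity is used as the distance measure:

    import numpy as np

    def semantic_similarity(text_vectors, corpus_vectors):
        """Cosine similarity between two texts represented by per-word vectors.

        text_vectors / corpus_vectors: arrays of shape (num_words, dim).
        Mean pooling is an illustrative stand-in for the encoder
        (MLP / CNN / RNN / Self-Attention / Transformer encoder / BERT).
        """
        text_repr = np.mean(text_vectors, axis=0)      # semantic representation of the text
        corpus_repr = np.mean(corpus_vectors, axis=0)  # semantic representation of the corpus
        return float(np.dot(text_repr, corpus_repr) /
                     (np.linalg.norm(text_repr) * np.linalg.norm(corpus_repr)))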
And e1-2b-2, determining the semantic scene of the recognized text from the set semantic scenes according to the determined semantic similarity.
Specifically, according to the determined semantic similarity, the process of determining the semantic scene for recognizing the text from the set semantic scenes may include: if the maximum semantic similarity in the determined semantic similarities is greater than or equal to a preset similarity threshold, determining that the semantic scene of the recognized text is one of a first scene and a second scene, further, if the maximum semantic similarity is the similarity between the recognized text and the first corpus, determining that the semantic scene of the recognized text is the first scene, and if the maximum semantic similarity is the semantic similarity between the recognized text and the second corpus, determining that the semantic scene of the recognized text is the second scene; and if the maximum semantic similarity is smaller than a preset similarity threshold, determining that the semantic scene of the text recognition result is a default scene.
And e2, setting a mute timeout threshold according to the semantic scene of the recognized text.
Specifically, the process of setting the post-mute timeout threshold according to the semantic scene of the recognized text may include: if the semantic scene of the recognized text is the first scene, setting the post-mute timeout threshold to the post-mute timeout threshold corresponding to the first scene; if the semantic scene of the recognized text is the second scene, setting the post-mute timeout threshold to the post-mute timeout threshold corresponding to the second scene; and if the semantic scene of the recognized text is the default scene, setting the post-mute timeout threshold to the default post-mute timeout threshold corresponding to the default scene. It should be noted that when the semantic scene of the recognized text is the second scene, the post-mute timeout threshold is set to the smaller threshold corresponding to the second scene, so that the post-mute timeout event can be triggered more quickly, while the larger threshold of the first scene gives the user more time to continue speaking.
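Putting steps e1 and e2 together, a sketch of the threshold selection is given below; the confidence threshold, similarity threshold and per-scene timeouts are illustrative assumptions, with only the 800 ms default taken from the recording-scene example above:

    TIMEOUTS_MS = {"first": 1500, "second": 500, "default": 800}   # illustrative values

    def post_mute_timeout(text_confidence, sims_first, sims_second,
                          conf_threshold=0.7, sim_threshold=0.8):
        """Pick the post-mute timeout threshold from the semantic scene.

        sims_first / sims_second: semantic similarities between the recognized
        text and each corpus of the first / second corpus set.
        """
        if text_confidence < conf_threshold:            # e1-1 / e1-2a: untrusted text
            return TIMEOUTS_MS["default"]
        best_first, best_second = max(sims_first), max(sims_second)
        best = max(best_first, best_second)
        if best < sim_threshold:                        # e1-2b-2: no sufficiently close corpus
            return TIMEOUTS_MS["default"]
        scene = "first" if best == best_first else "second"
        return TIMEOUTS_MS[scene]                       # e2: scene-specific threshold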
After the post-mute timeout threshold is set according to the semantic scene of the recognized text, the post-mute timeout event can be triggered based on the set threshold. Specifically, it is determined whether the post-mute duration exceeds the set post-mute timeout threshold; if so, the detection ends and the post-mute timeout event is triggered. It should be noted that if the post-mute duration does not exceed the set post-mute timeout threshold, detection of the voice front end point continues: if a voice front end point is detected, step S204 is executed again, and if no voice front end point is detected, it continues to be determined whether the post-mute duration exceeds the set post-mute timeout threshold, until the post-mute duration exceeds the set threshold or there is no more audio data to detect.
The voice endpoint detection method provided by the embodiment of the application can judge whether the audio frame contained in the audio data is a mute frame, a noise frame or a voice frame, namely, the embodiment of the application can detect the more accurate attribute of the audio frame contained in the audio data, and on the basis, the detection of the voice front endpoint and the voice rear endpoint can be carried out, so that the more accurate detection result can be obtained. On the basis of realizing voice endpoint detection, the method and the device can acquire the recognition text of the voice segment, determine the semantic scene of the recognition text according to the semantics of the recognition text, and further set a proper post-mute timeout threshold according to the semantic scene of the recognition text, so that a post-mute timeout event is triggered based on the proper post-mute timeout threshold, and user experience is improved.
Third embodiment
The following describes the voice endpoint detection apparatus provided in the embodiments of the present application, and the voice endpoint detection apparatus described below and the voice endpoint detection method described above may be referred to in correspondence with each other.
Referring to fig. 6, a schematic structural diagram of a voice endpoint detection apparatus provided in the embodiment of the present application is shown, which may include: a discrimination information acquisition module 601, an audio frame discrimination module 602, and a voice endpoint detection module 603.
The judgment information obtaining module 601 is configured to obtain first information and second information corresponding to audio frames included in audio data after the audio data to be detected is obtained.
The first information can indicate whether the corresponding audio frame is a mute frame or a non-mute frame, and the second information is pronunciation information of the corresponding audio frame.
The audio frame determination module 602 is configured to determine a mute frame, a noise frame, and a speech frame for an audio frame included in the audio data according to the first information and the second information corresponding to the audio frame included in the audio data.
The voice endpoint detection module 603 is configured to detect a voice front endpoint and a voice back endpoint according to a determination result corresponding to an audio frame included in the audio data.
Optionally, the voice endpoint detection apparatus provided in this embodiment of the present application may further include: the device comprises a voice section identification module, a post-mute overtime threshold setting module and a post-mute overtime event triggering module.
The voice segment recognition module is configured to, after the voice endpoint detection module detects the voice front endpoint, recognize a voice segment starting from the voice front endpoint as a text to obtain a recognition text and a confidence corresponding to the recognition text.
And the post-mute timeout threshold setting module is used for setting a post-mute timeout threshold according to the semantics of the recognized text and the confidence coefficient corresponding to the recognized text.
And the post-mute timeout event triggering module is used for triggering the post-mute timeout event based on the post-mute timeout threshold set by the post-mute timeout threshold setting module.
Optionally, when the discrimination information obtaining module 601 obtains the first information and the second information corresponding to the audio frame included in the audio data, the discrimination information obtaining module is specifically configured to:
and predicting acoustic scores of audio frames contained in the audio data, namely a mute frame and an un-mute frame respectively, and a full-phoneme acoustic score corresponding to the audio frames contained in the audio data by using a pre-established multitask joint model.
The method comprises the steps that an audio frame is used as first information corresponding to the audio frame, acoustic scores of a mute frame and a non-mute frame are respectively used as second information corresponding to the audio frame, a total phoneme acoustic score corresponding to the audio frame is used as second information corresponding to the audio frame, and the total phoneme acoustic score comprises acoustic scores corresponding to all phonemes contained in a language to which audio data belongs.
Optionally, the audio frame determination module 602 includes a first determination sub-module and a second determination sub-module.
The first discrimination submodule is configured to discriminate, for an audio frame to be discriminated, whether the audio frame is a mute frame or a non-mute frame according to the first information corresponding to the audio frame.
The second discrimination submodule is configured to discriminate, when the audio frame is determined to be a non-mute frame, whether the audio frame is a noise frame or a speech frame according to the second information corresponding to the audio frame.
Optionally, when discriminating whether the audio frame is a noise frame or a speech frame according to the second information corresponding to the audio frame, the second discrimination submodule is specifically configured to:
determine the audio frame to be a noise frame if the maximum acoustic score among the full-phoneme acoustic scores corresponding to the audio frame is smaller than a preset acoustic score threshold; and determine the audio frame to be a speech frame if the maximum acoustic score among the full-phoneme acoustic scores corresponding to the audio frame is greater than or equal to the preset acoustic score threshold.
Optionally, the speech segment recognition module includes: the device comprises a decoding module and a decoding result acquisition module.
The decoding module is configured to decode, starting from the voice front endpoint, the second information corresponding to each speech frame through a pre-constructed phoneme-level network, and to end decoding after the voice back endpoint is detected and the second information corresponding to the voice back endpoint has been decoded, wherein decoding is performed synchronously with voice back endpoint detection; the phoneme-level network is constructed from the first corpora in a first corpus set and the second corpora in a second corpus set, a first corpus being a corpus that is semantically incomplete and requires a long wait before the post-mute timeout event is triggered, and a second corpus being a corpus that is semantically complete and can trigger the post-mute timeout event without a long wait.
The decoding result obtaining module is configured to obtain an optimal decoding result and a confidence corresponding to the optimal decoding result by backtracking the optimal decoding path, and to use the optimal decoding result and its confidence as the recognized text and the confidence corresponding to the recognized text.
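A simplified, non-limiting sketch of decoding the per-frame full-phoneme acoustic scores through a phoneme-level network is shown below. The graph representation, the virtual start node, the floor value 1e-8 and the per-frame geometric-mean confidence are assumptions of the example; a practical decoder would keep backpointers and back-trace them rather than storing whole paths.

import math

def decode_phoneme_network(phoneme_scores, network):
    """phoneme_scores: list of dicts, one per speech frame, mapping a
    phoneme id to its acoustic score (the second information).
    network: {node_id: (phoneme_id or None, [successor node ids])};
    node 0 is a virtual start node whose phoneme is None."""
    best = {0: (0.0, [])}                      # node -> (log score, path so far)
    for frame in phoneme_scores:
        new_best = {}
        for node, (logp, path) in best.items():
            phoneme, successors = network[node]
            # stay on the current phoneme (self-loop) or advance in the graph
            moves = ([(node, False)] if phoneme is not None else []) \
                    + [(s, True) for s in successors]
            for nxt, advanced in moves:
                ph = network[nxt][0]
                cand = logp + math.log(frame.get(ph, 1e-8))
                new_path = path + [nxt] if advanced else path
                if nxt not in new_best or cand > new_best[nxt][0]:
                    new_best[nxt] = (cand, new_path)
        best = new_best
    _, (logp, path) = max(best.items(), key=lambda kv: kv[1][0])
    confidence = math.exp(logp / max(len(phoneme_scores), 1))
    return path, confidence                    # best node path and its confidence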
Optionally, the voice endpoint detection apparatus provided in this embodiment of the present application may further include a phoneme-level network construction module.
The phoneme-level network construction module is configured to construct the phoneme-level network according to the first corpora in the first corpus set and the second corpora in the second corpus set.
Optionally, when constructing the phoneme-level network according to the first corpora in the first corpus set and the second corpora in the second corpus set, the phoneme-level network construction module is specifically configured to:
connect the first corpora in the first corpus set and the second corpora in the second corpus set in parallel to obtain a sentence-level network, wherein each corpus is a node in the sentence-level network; expand each corpus in the sentence-level network into single words to obtain an initial word-level network, wherein each single word is a node in the initial word-level network; merge the nodes and paths in the initial word-level network to obtain a final word-level network; expand each single word in the final word-level network into phonemes to obtain an initial phoneme-level network, wherein each phoneme is a node in the initial phoneme-level network; and merge the nodes and paths in the initial phoneme-level network to obtain the final phoneme-level network.
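A simplified sketch of this construction order is given below, assuming the corpora are character strings, that `lexicon` maps each single character to its phoneme sequence, and that node/path merging is approximated with a simple prefix trie; these choices are assumptions of the example, not requirements of this embodiment.

def build_phoneme_level_network(first_corpora, second_corpora, lexicon):
    # 1) sentence-level network: every corpus is one parallel branch
    sentences = list(first_corpora) + list(second_corpora)

    # 2) expand each corpus into single characters (word-level paths)
    word_paths = [list(sentence) for sentence in sentences]

    # 3) expand each single character into phonemes (phoneme-level paths)
    phoneme_paths = []
    for path in word_paths:
        phonemes = []
        for char in path:
            phonemes.extend(lexicon.get(char, ["<unk>"]))
        phoneme_paths.append(phonemes)

    # 4) merge shared nodes and paths: here, a prefix trie over phoneme paths
    trie = {}
    for phonemes in phoneme_paths:
        node = trie
        for ph in phonemes:
            node = node.setdefault(ph, {})
        node["<end>"] = {}                     # marks the end of one corpus
    return trie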
Optionally, the post-mute timeout threshold setting module includes: a semantic scene determining submodule and a post-mute timeout threshold setting submodule.
The semantic scene determining submodule is configured to determine the semantic scene of the recognized text from the set semantic scenes according to the semantics of the recognized text and the confidence corresponding to the recognized text.
The set semantic scenes comprise a first scene, a second scene and a default scene, each scene has a corresponding post-mute timeout threshold, the post-mute timeout threshold corresponding to the first scene is greater than the default post-mute timeout threshold corresponding to the default scene, and the post-mute timeout threshold corresponding to the second scene is less than the default post-mute timeout threshold corresponding to the default scene.
The post-mute timeout threshold setting submodule is configured to set the post-mute timeout threshold according to the semantic scene of the recognized text.
Optionally, when determining the semantic scene of the recognized text from the set semantic scenes according to the semantics of the recognized text and the confidence corresponding to the recognized text, the semantic scene determining submodule is specifically configured to:
determining whether the recognition text is credible or not according to the confidence corresponding to the recognition text; if the recognition text is not credible, determining that the semantic scene of the recognition text is a default scene; if the recognition text is credible, determining the semantic similarity between the recognition text and a first corpus in the first corpus set and the semantic similarity between the recognition text and a second corpus in the second corpus set; and determining the semantic scene of the recognition text from the set semantic scenes according to the determined semantic similarity.
Optionally, when determining the semantic scene of the recognition text from the set semantic scenes according to the determined semantic similarity, the semantic scene determining sub-module is specifically configured to:
if the maximum semantic similarity among the determined semantic similarities is greater than or equal to a preset similarity threshold and the maximum semantic similarity is the semantic similarity between the recognition text and a first corpus, determining that the semantic scene of the recognition text is the first scene; if the maximum semantic similarity is greater than or equal to the preset similarity threshold and the maximum semantic similarity is the semantic similarity between the recognition text and a second corpus, determining that the semantic scene of the recognition text is the second scene; and if the maximum semantic similarity is smaller than the preset similarity threshold, determining that the semantic scene of the recognition text is the default scene.
Optionally, the post-mute timeout threshold setting sub-module is specifically configured to, when setting the post-mute timeout threshold according to the semantic scene of the recognized text:
if the semantic scene of the recognized text is the first scene, setting a post-mute timeout threshold as a post-mute timeout threshold corresponding to the first scene; if the semantic scene of the recognized text is the second scene, setting a post-mute timeout threshold as a post-mute timeout threshold corresponding to the second scene; and if the semantic scene of the recognized text is the default scene, setting the post-mute overtime threshold as the default post-mute overtime threshold corresponding to the default scene.
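The scene decision and threshold selection described above can be sketched as follows; `similarity` stands for an arbitrary text-similarity measure, and the confidence threshold, similarity threshold and millisecond values are illustrative assumptions only.

def set_post_mute_timeout_threshold(text, confidence, first_corpora,
                                    second_corpora, similarity,
                                    conf_threshold=0.6, sim_threshold=0.8,
                                    thresholds=None):
    # post-mute timeout thresholds per scene, in milliseconds (assumed values)
    thresholds = thresholds or {"first": 1500, "second": 300, "default": 800}

    if confidence < conf_threshold:            # recognized text is not credible
        return thresholds["default"]

    sims = [(similarity(text, c), "first") for c in first_corpora] + \
           [(similarity(text, c), "second") for c in second_corpora]
    if not sims:
        return thresholds["default"]

    best_sim, scene = max(sims)
    if best_sim < sim_threshold:               # no corpus is close enough
        scene = "default"
    return thresholds[scene]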
The voice endpoint detection device provided by the embodiment of the application can judge whether the audio frame contained in the audio data is a mute frame, a noise frame or a voice frame according to the first information (the first information can indicate whether the corresponding audio frame is a mute frame or a non-mute frame) and the second information (the second information is pronunciation information of the corresponding audio frame) corresponding to the audio frame contained in the audio data to be detected. On the basis of realizing voice endpoint detection, the voice endpoint detection device provided by the embodiment of the application can acquire the recognition text of the voice segment, determine the semantic scene of the recognition text according to the semantics of the recognition text, and further set a proper post-mute timeout threshold according to the semantic scene of the recognition text, so that a post-mute timeout event is triggered based on the proper post-mute timeout threshold, and user experience is improved.
Fourth embodiment
An embodiment of the present application further provides a voice endpoint detection device, please refer to fig. 7, which shows a schematic structural diagram of the voice endpoint detection device, where the voice endpoint detection device may include: at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704;
in the embodiment of the present application, the number of the processor 701, the communication interface 702, the memory 703 and the communication bus 704 is at least one, and the processor 701, the communication interface 702 and the memory 703 complete mutual communication through the communication bus 704;
the processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 703 may include a high-speed RAM and may also include a non-volatile memory, such as at least one disk memory;
wherein the memory stores a program, and the processor may call the program stored in the memory, the program being configured to:
after audio data to be detected are obtained, first information and second information corresponding to audio frames contained in the audio data are obtained, wherein the first information can indicate whether the corresponding audio frames are mute frames or non-mute frames, and the second information is pronunciation information of the corresponding audio frames;
determining, according to the first information and the second information corresponding to the audio frames contained in the audio data, whether each audio frame contained in the audio data is a mute frame, a noise frame, or a voice frame;
and detecting a voice front end point and a voice rear end point according to a judgment result corresponding to an audio frame contained in the audio data.
Alternatively, the detailed function and the extended function of the program may be as described above.
Fifth embodiment
Embodiments of the present application further provide a computer-readable storage medium, which may store a program adapted to be executed by a processor, where the program is configured to:
after audio data to be detected are obtained, first information and second information corresponding to audio frames contained in the audio data are obtained, wherein the first information can indicate whether the corresponding audio frames are mute frames or non-mute frames, and the second information is pronunciation information of the corresponding audio frames;
determining, according to the first information and the second information corresponding to the audio frames contained in the audio data, whether each audio frame contained in the audio data is a mute frame, a noise frame, or a voice frame;
and detecting a voice front end point and a voice rear end point according to a judgment result corresponding to an audio frame contained in the audio data.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. A method for voice endpoint detection, comprising:
after audio data to be detected are obtained, first information and second information corresponding to audio frames contained in the audio data are obtained, wherein the first information can indicate whether the corresponding audio frames are mute frames or non-mute frames, and the second information is pronunciation information of the corresponding audio frames;
determining, according to the first information and the second information corresponding to the audio frames contained in the audio data, whether each audio frame contained in the audio data is a mute frame, a noise frame, or a voice frame;
and detecting a voice front end point and a voice rear end point according to a judgment result corresponding to an audio frame contained in the audio data.
2. The voice endpoint detection method according to claim 1, further comprising:
after the voice front end point is detected, recognizing a voice section starting from the voice front end point as a text to obtain a recognition text and a confidence coefficient corresponding to the recognition text;
setting a post-mute timeout threshold according to the semantics of the recognized text and the confidence corresponding to the recognized text, and triggering a post-mute timeout event based on the set post-mute timeout threshold.
3. The method according to claim 1 or 2, wherein the obtaining the first information and the second information corresponding to the audio frames included in the audio data comprises:
predicting, using a pre-built multitask joint model, the acoustic scores of each audio frame contained in the audio data being a mute frame and a non-mute frame respectively, and the full-phoneme acoustic scores corresponding to each audio frame contained in the audio data;
taking the acoustic scores of an audio frame being a mute frame and a non-mute frame respectively as the first information corresponding to that audio frame, and taking the full-phoneme acoustic scores corresponding to that audio frame as the second information corresponding to that audio frame, wherein the full-phoneme acoustic scores include the acoustic scores corresponding to all phonemes contained in the language to which the audio data belongs.
4. The method according to claim 3, wherein the determining, according to the first information and the second information corresponding to the audio frames contained in the audio data, whether each audio frame contained in the audio data is a mute frame, a noise frame, or a speech frame comprises:
for an audio frame to be discriminated, discriminating whether the audio frame is a mute frame or a non-mute frame according to the first information corresponding to the audio frame;
and when the audio frame is determined to be a non-mute frame, discriminating whether the audio frame is a noise frame or a speech frame according to the second information corresponding to the audio frame.
5. The method of claim 4, wherein the discriminating whether the audio frame is a noise frame or a speech frame according to the second information corresponding to the audio frame comprises:
determining the audio frame to be a noise frame if the maximum acoustic score among the full-phoneme acoustic scores corresponding to the audio frame is smaller than a preset acoustic score threshold;
and determining the audio frame to be a speech frame if the maximum acoustic score among the full-phoneme acoustic scores corresponding to the audio frame is greater than or equal to the preset acoustic score threshold.
6. The method according to claim 2, wherein the recognizing the speech segment starting from the voice front end point as text to obtain a recognized text and a confidence corresponding to the recognized text comprises:
decoding, starting from the voice front end point, the second information corresponding to each voice frame through a pre-constructed phoneme-level network, and ending decoding after the voice rear end point is detected and the second information corresponding to the voice rear end point has been decoded, wherein the decoding is performed synchronously with the voice rear end point detection, the phoneme-level network is constructed according to the first corpora in a first corpus set and the second corpora in a second corpus set, a first corpus is a corpus that is semantically incomplete and requires a long wait before the post-mute timeout event is triggered, and a second corpus is a corpus that is semantically complete and can trigger the post-mute timeout event without a long wait;
and obtaining an optimal decoding result and a confidence coefficient corresponding to the optimal decoding result by backtracking the optimal decoding path, wherein the optimal decoding result and the confidence coefficient corresponding to the optimal decoding result are used as the recognition text and the confidence coefficient corresponding to the recognition text.
7. The method according to claim 6, wherein constructing the phone level network according to the first corpus in the first corpus and the second corpus in the second corpus comprises:
connecting a first corpus in the first corpus and a second corpus in the second corpus in parallel to obtain a sentence-level network, wherein each corpus is a node in the sentence-level network;
expanding each corpus in the sentence-level network into a single word to obtain an initial word-level network, wherein each single word is a node in the initial word-level network;
merging the nodes and paths in the initial word-level network to obtain a final word-level network;
expanding each single character in the final word-level network into a phoneme to obtain an initial phoneme-level network, wherein each phoneme is a node in the initial phoneme-level network;
and combining the nodes and the paths in the initial phoneme level network to obtain a final phoneme level network.
8. The method according to claim 2, wherein the setting a post-mute timeout threshold according to the semantics of the recognized text and the corresponding confidence level of the recognized text comprises:
determining a semantic scene of the recognition text from set semantic scenes according to the semantics of the recognition text and the confidence corresponding to the recognition text;
setting the post-mute timeout threshold according to the semantic scene of the recognized text;
the set semantic scenes comprise a first scene, a second scene and a default scene, each scene has a corresponding post-mute timeout threshold, the post-mute timeout threshold corresponding to the first scene is greater than the default post-mute timeout threshold corresponding to the default scene, and the post-mute timeout threshold corresponding to the second scene is less than the default post-mute timeout threshold corresponding to the default scene.
9. The method according to claim 8, wherein determining semantic scenes of the recognized text from the set semantic scenes according to the semantics of the recognized text and the corresponding confidence of the recognized text comprises:
determining whether the recognition text is credible or not according to the confidence corresponding to the recognition text;
if the recognition text is not credible, determining that the semantic scene of the recognition text is a default scene;
if the recognition text is credible, determining the semantic similarity between the recognition text and a first corpus in the first corpus set and the semantic similarity between the recognition text and a second corpus in the second corpus set;
and determining the semantic scene of the recognition text from the set semantic scenes according to the determined semantic similarity.
10. The method according to claim 9, wherein the determining semantic scenes of the recognized text from the set semantic scenes according to the determined semantic similarity comprises:
if the maximum semantic similarity in the determined semantic similarities is greater than or equal to a preset similarity threshold and the maximum semantic similarity is the similarity between the recognition text and the first corpus, determining that the semantic scene of the recognition text is the first scene;
if the maximum semantic similarity is greater than or equal to the preset similarity threshold and the maximum semantic similarity is the semantic similarity between the recognition text and a second corpus, determining that the semantic scene of the recognition text is the second scene;
and if the maximum semantic similarity is smaller than the preset similarity threshold, determining that the semantic scene of the recognition text is the default scene.
11. The method according to claim 8, wherein the setting a post-mute timeout threshold according to the semantic scene of the recognized text comprises:
if the semantic scene of the recognized text is the first scene, setting a post-mute timeout threshold as a post-mute timeout threshold corresponding to the first scene;
if the semantic scene of the recognized text is the second scene, setting a post-mute timeout threshold as a post-mute timeout threshold corresponding to the second scene;
and if the semantic scene of the recognized text is the default scene, setting the post-mute overtime threshold as the default post-mute overtime threshold corresponding to the default scene.
12. A voice endpoint detection apparatus, comprising: the device comprises a discrimination information acquisition module, an audio frame discrimination module and a voice endpoint detection module;
the discrimination information acquisition module is configured to acquire, after the audio data to be detected is obtained, first information and second information corresponding to the audio frames contained in the audio data, wherein the first information can indicate whether the corresponding audio frame is a mute frame or a non-mute frame, and the second information is pronunciation information of the corresponding audio frame;
the audio frame discrimination module is configured to determine, according to the first information and the second information corresponding to the audio frames contained in the audio data, whether each audio frame contained in the audio data is a mute frame, a noise frame, or a voice frame;
and the voice endpoint detection module is used for detecting a voice front endpoint and a voice rear endpoint according to the judgment result corresponding to the audio frame contained in the audio data.
13. The voice endpoint detection apparatus according to claim 12, further comprising: a voice segment recognition module, a post-mute timeout threshold setting module, and a post-mute timeout event triggering module;
the voice segment recognition module is configured to, after the voice endpoint detection module detects the voice front endpoint, recognize a voice segment starting from the voice front endpoint as a text to obtain a recognition text and a confidence corresponding to the recognition text;
the post-mute timeout threshold setting module is configured to set a post-mute timeout threshold according to the semantics of the recognized text and the confidence corresponding to the recognized text;
and the post-mute timeout event triggering module is used for triggering the post-mute timeout event based on the post-mute timeout threshold set by the post-mute timeout threshold setting module.
14. A voice endpoint detection device, comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the voice endpoint detection method according to any one of claims 1 to 11.
15. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method for detecting a voice endpoint according to any one of claims 1 to 11.
CN202111535332.6A 2021-12-15 2021-12-15 Voice endpoint detection method, device, equipment and storage medium Pending CN114155839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111535332.6A CN114155839A (en) 2021-12-15 2021-12-15 Voice endpoint detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111535332.6A CN114155839A (en) 2021-12-15 2021-12-15 Voice endpoint detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114155839A true CN114155839A (en) 2022-03-08

Family

ID=80451117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111535332.6A Pending CN114155839A (en) 2021-12-15 2021-12-15 Voice endpoint detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114155839A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111768800A (en) * 2020-06-23 2020-10-13 中兴通讯股份有限公司 Voice signal processing method, apparatus and storage medium
CN115132231A (en) * 2022-08-31 2022-09-30 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium
CN115132231B (en) * 2022-08-31 2022-12-13 安徽讯飞寰语科技有限公司 Voice activity detection method, device, equipment and readable storage medium
CN115620720A (en) * 2022-11-30 2023-01-17 零犀(北京)科技有限公司 Method and device for muting session, electronic equipment and computer-readable storage medium

Similar Documents

Publication Publication Date Title
CN108509619B (en) Voice interaction method and device
US10937448B2 (en) Voice activity detection method and apparatus
WO2019149108A1 (en) Identification method and device for voice keywords, computer-readable storage medium, and computer device
US10917758B1 (en) Voice-based messaging
CN109493850B (en) Growing type dialogue device
JP6751658B2 (en) Voice recognition device, voice recognition system
CN114155839A (en) Voice endpoint detection method, device, equipment and storage medium
JP3284832B2 (en) Speech recognition dialogue processing method and speech recognition dialogue device
KR20190004495A (en) Method, Apparatus and System for processing task using chatbot
CN111797632B (en) Information processing method and device and electronic equipment
JP3886024B2 (en) Voice recognition apparatus and information processing apparatus using the same
CN110689877A (en) Voice end point detection method and device
CN106875936B (en) Voice recognition method and device
US20140337024A1 (en) Method and system for speech command detection, and information processing system
CN109036471B (en) Voice endpoint detection method and device
US9595261B2 (en) Pattern recognition device, pattern recognition method, and computer program product
US11195522B1 (en) False invocation rejection for speech processing systems
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN112002349B (en) Voice endpoint detection method and device
CN112825248A (en) Voice processing method, model training method, interface display method and equipment
CN112614514A (en) Valid voice segment detection method, related device and readable storage medium
CN114155854B (en) Voice data processing method and device
CN112863496B (en) Voice endpoint detection method and device
Rose et al. Integration of utterance verification with statistical language modeling and spoken language understanding
CN111145748B (en) Audio recognition confidence determining method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230506

Address after: 230026 Jinzhai Road, Baohe District, Hefei, Anhui Province, No. 96

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: NO.666, Wangjiang West Road, hi tech Zone, Hefei City, Anhui Province

Applicant before: IFLYTEK Co.,Ltd.