CN114495981A - Method, apparatus, device, storage medium and product for determining a voice endpoint

Method, apparatus, device, storage medium and product for determining a voice endpoint

Info

Publication number
CN114495981A
CN114495981A (application CN202111596502.1A)
Authority
CN
China
Prior art keywords: voice, speech, content, user, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111596502.1A
Other languages
Chinese (zh)
Inventor
公评
甄泽阳
陶健
莫骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guang Dong Ming Chuang Software Technology Corp ltd
Original Assignee
Guang Dong Ming Chuang Software Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Guang Dong Ming Chuang Software Technology Corp ltd
Priority to CN202111596502.1A
Publication of CN114495981A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L 25/93: Discriminating between voiced and unvoiced parts of speech signals
    • G10L 2025/937: Signal energy in various frequency bands

Abstract

Embodiments of this application disclose a method, apparatus, device, storage medium and product for determining a voice endpoint, belonging to the technical field of speech recognition. The method comprises the following steps: acquiring audio collected by a microphone; determining the input integrity of a user's voice in the audio when the voice energy of the user's voice is below an energy threshold; and, when the input integrity does not satisfy the end-of-speech endpoint determination condition, determining the end-of-speech endpoint based on a speech-rate feature and a content feature. With the scheme provided by the embodiments of this application, the end-of-speech endpoint can be determined dynamically based on the integrity of the input content, the voice input habits of different users, and the currently input voice content; compared with determining the end-of-speech endpoint from a fixed VAD duration, this helps improve the accuracy of end-of-speech endpoint determination.

Description

Method, apparatus, device, storage medium and product for determining a voice endpoint
Technical Field
Embodiments of this application relate to the technical field of speech recognition, and in particular to a method, apparatus, device, storage medium and product for determining a voice endpoint.
Background
Voice Activity Detection (VAD) is an important link in the speech recognition process and is used to locate the speech start endpoint and the speech end endpoint in noisy speech.
In the speech recognition process, after the electronic device locates the speech start endpoint and the speech end endpoint through VAD, the speech signal between the two endpoints is used as the speech recognition input, and the corresponding voice control function is then realized based on the speech recognition result. When the end-of-speech endpoint derived by VAD is inaccurate, the speech signal fed to speech recognition may not contain the complete voice control information, resulting in voice control failure.
Disclosure of Invention
Embodiments of this application provide a method, apparatus, device, storage medium and product for determining a voice endpoint. The technical scheme is as follows:
In one aspect, an embodiment of this application provides a method for determining a voice endpoint, the method comprising:
acquiring audio collected by a microphone;
determining the input integrity of a user's voice in the audio when the voice energy of the user's voice is below an energy threshold;
and, when the input integrity does not satisfy the end-of-speech endpoint determination condition, determining the end-of-speech endpoint based on a speech-rate feature and a content feature.
In another aspect, an embodiment of the present application provides an apparatus for determining a voice endpoint, where the apparatus includes:
the audio acquisition module is used for acquiring audio collected by the microphone;
the integrity determination module is used for determining the input integrity of the user voice under the condition that the voice energy of the user voice in the audio is lower than an energy threshold value;
and the judging module is used for judging the voice ending endpoint based on the speech speed characteristic and the content characteristic under the condition that the input integrity does not meet the judgment condition of the voice ending endpoint.
In another aspect, an embodiment of the present application provides a computer device, which includes a processor and a memory; the memory stores at least one instruction for execution by the processor to implement a method of determining a voice endpoint as described in the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, where the storage medium stores at least one instruction for execution by a processor to implement the method for determining a voice endpoint according to the foregoing aspect.
In another aspect, a computer program product is provided, which includes at least one instruction loaded and executed by a processor to implement the method for determining a voice endpoint as described in the above aspect.
In the embodiments of this application, when the voice energy is below the energy threshold, the end-of-speech endpoint is determined based on the input integrity of the user voice, and when the input integrity does not satisfy the end-of-speech endpoint determination condition, the end-of-speech endpoint is determined based on the speech-rate feature of the current user and the content feature of the user voice. With this scheme, the end-of-speech endpoint can be determined dynamically based on the integrity of the input content, the voice input habits of different users, and the currently input voice content; compared with determining the end-of-speech endpoint from a fixed VAD duration, this helps improve the accuracy of end-of-speech endpoint determination.
Drawings
FIG. 1 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart illustrating a method for determining a voice endpoint provided by an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram illustrating an implementation of a voice interaction process according to an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of an input integrity determination process provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of an implementation of an input integrity determination process provided by an exemplary embodiment of the present application;
FIG. 6 is a flow chart illustrating a dynamic VAD arbitration process according to an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of an implementation of a dynamic VAD arbitration process according to an exemplary embodiment of the present application;
FIG. 8 is a flow chart illustrating a decision process based on decision confidence according to an exemplary embodiment of the present application;
fig. 9 is a block diagram illustrating a structure of a speech endpoint determination apparatus according to an embodiment of the present application;
fig. 10 shows a block diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Reference herein to "a plurality" means two or more. "And/or" describes the association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Referring to fig. 1, a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application is shown. The implementation environment includes a terminal 110 and a server 120. The terminal 110 and the server 120 perform data communication through a communication network, which may be at least one of a local area network, a metropolitan area network, and a wide area network, and may alternatively be a wired network or a wireless network.
The terminal 110 is an electronic device with a voice interaction function, the electronic device may be a smart phone, a tablet computer, a smart home device (such as a smart television, a smart speaker, a smart refrigerator, etc.), a wearable device (such as a smart watch, a smart band, smart glasses, etc.), and the voice interaction function may be implemented as a voice assistant in the terminal 110.
In the voice interaction process, the terminal 110 collects audio through the microphone and recognizes the user voice from the collected audio, so as to perform intention recognition on the user voice, and further implement a corresponding function according to a recognition result. In order to improve the recognition accuracy, a voice starting end point and a voice ending end point of the voice of the user need to be determined in the audio acquisition process, so that the audio between the voice starting end point and the voice ending end point is recognized.
The server 120 is a single server, a server cluster formed by a plurality of servers, or a cloud computing center. In this embodiment, the server 120 is a background server supporting the voice interaction function.
In some embodiments, before performing voice interaction with the terminal 110, the user first triggers the terminal 110 to enable the voice interaction function by using a wake-up word. After the voice interaction function of the terminal 110 is enabled, a data connection is established with the server 120, the collected audio is transmitted to the server 120 through the data connection, and the server 120 performs the voice endpoint determination. When the server 120 determines the end-of-speech endpoint, it sends a stop-capture instruction to the terminal 110 instructing the terminal 110 to stop audio capture.
A brief pause may occur in the user's speech without meaning that the utterance has ended. To improve the integrity of the collected user voice, a VAD detection duration is usually set: if no user voice is detected within the VAD detection duration after the pause, the end-of-speech endpoint is determined and audio collection stops. However, when this waiting duration is set too short, the collected user voice may be incomplete, so that subsequent intent recognition fails and the corresponding function cannot be realized; when it is set too long, the integrity of the collected user voice can be ensured, but the terminal keeps recording after the user has already finished the voice instruction, so the functional response lags and the user experience suffers.
It can be seen that the accuracy of end-of-speech endpoint determination directly affects the quality of the voice interaction function. Considering that different users have different speaking habits and speak different content, in the embodiments of this application the end-of-speech endpoint is determined dynamically from semantic completeness, speech-rate features and content features of the user, rather than from a fixed VAD detection duration, thereby improving the accuracy of end-of-speech endpoint determination.
Note that, in fig. 1, the server 120 is described as an example of determining the end point of speech, but in another possible embodiment, the terminal 110 may determine the end point of speech, and this embodiment is not limited to this. For convenience of description, the following embodiments describe an example in which the method for determining a voice endpoint is executed by a computer device.
Referring to fig. 2, a flowchart of a method for determining a voice endpoint according to an exemplary embodiment of the present application is shown. The process comprises the following steps:
step 201, acquiring audio collected by a microphone.
In some embodiments, when the terminal performs the voice endpoint determination, the terminal acquires audio collected by a microphone; when the server determines the voice endpoint, the terminal transmits the audio collected by the microphone to the server, and accordingly, the server receives the audio transmitted by the terminal.
Step 202, determining the input integrity of the user voice under the condition that the voice energy of the user voice in the audio is lower than an energy threshold value.
In some embodiments, the user speech is speech of a specific user, and the computer device recognizes the user speech of the specific user based on a voiceprint, or the user speech is speech of an arbitrary user, which is not limited in this embodiment.
Optionally, in the audio acquisition process, the terminal performs real-time voice energy detection on the audio, and instructs the server to perform voice end endpoint determination when detecting that the voice energy of the voice of the user is lower than an energy threshold; or the server detects the voice energy of the received audio and judges the end point of the voice when detecting that the voice energy of the voice of the user is lower than an energy threshold value.
In the embodiment of the present application, the determination of the voice ending endpoint is divided into two stages. In the first phase, the computer device first determines the input integrity of the user's speech, and thereby makes an end-of-speech endpoint determination based on the input integrity.
In general, when a user merely pauses, the semantics of the user voice before the pause point are usually incomplete, whereas when the user has finished speaking, the semantics of the user voice before the stop point are complete. In addition, there are fairly obvious acoustic differences between complete user speech and incomplete speech. Therefore, in one possible implementation, the computer device determines the input integrity of the user's speech based on at least one of the semantic and acoustic dimensions.
Optionally, the computer device detects whether the input integrity of the user voice satisfies the end-of-speech endpoint determination condition; if not, it continues with the second-stage end-of-speech endpoint determination through step 203 described below. If it does, the end-of-speech endpoint is determined directly. Satisfying the determination condition indicates that the input integrity of the user voice meets the integrity requirement (that is, the probability that the user voice input is complete is high).
Optionally, after determining the end point of the voice based on the input integrity, the computer device instructs the microphone to stop audio acquisition, and performs a functional response based on the voice of the user in time. When the computer equipment is a server, the server sends a collection stopping instruction to the terminal.
Step 203, when the input integrity does not satisfy the end-of-speech endpoint determination condition, determining the end-of-speech endpoint based on the speech-rate feature and the content feature.
Since different users have different speaking habits (e.g., some users have slower speech speed and some users have faster speech speed), and there are differences in the expression of different contents, in the second stage, the computer device determines the speech speed characteristics of the current user based on the speech of the user, determines the content characteristics based on the speech content of the speech of the user, and thereby performs the end-of-speech determination based on the speech speed characteristics and the content characteristics.
Optionally, the speech rate feature is used to characterize the speech rate of the current user, and the content feature is used to characterize the domain to which the dialog content belongs, the contextual relevance of the dialog content, and the like.
In the end-of-speech endpoint determination process, the speech rate of the current user and the dialog content of the current dialog scenario are considered together. This more realistically simulates how, in a human-to-human conversation, one judges whether the other party has finished speaking (in a real conversation, a listener likewise relies on the other party's speaking habits and the current dialog content), which improves the naturalness of the voice interaction between the user and the terminal. In addition, compared with end-of-speech endpoint determination based on a fixed VAD detection duration, dynamically determining the endpoint for different users and different dialog contents is more accurate, avoids both voice interaction failures caused by determining the endpoint too early and long user waits caused by determining it too late, and improves the quality of voice interaction.
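As an illustrative, non-limiting sketch of the two-stage flow of steps 201 to 203, the decision logic may be expressed as follows; the function and parameter names, and the energy threshold value, are assumptions introduced here for illustration (only the 0.85 integrity threshold follows the example given later in the text).

```python
def is_end_of_speech(voice_energy: float,
                     integrity_prob: float,
                     stage2_decision: bool,
                     energy_threshold: float = 0.01,
                     integrity_threshold: float = 0.85) -> bool:
    """Two-stage end-of-speech decision (sketch).

    voice_energy    -- energy of the current user speech segment
    integrity_prob  -- completeness probability from the classifier (step 202)
    stage2_decision -- result of the speech-rate / content based decision (step 203)
    """
    if voice_energy >= energy_threshold:
        return False                # user is still speaking; keep collecting audio
    if integrity_prob > integrity_threshold:
        return True                 # stage 1: input integrity alone is decisive
    return stage2_decision          # stage 2: dynamic VAD decision
```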
In summary, in the embodiments of this application, when the voice energy is below the energy threshold, the end-of-speech endpoint is determined based on the input integrity of the user voice, and when the input integrity does not satisfy the end-of-speech endpoint determination condition, the end-of-speech endpoint is determined based on the speech-rate feature of the current user and the content feature of the user voice. With this scheme, the end-of-speech endpoint can be determined dynamically based on the integrity of the input content, the voice input habits of different users, and the currently input voice content; compared with determining the end-of-speech endpoint from a fixed VAD duration, this helps improve the accuracy of end-of-speech endpoint determination.
In an illustrative example, as shown in fig. 3, when a user needs to set a reminder by voice, the user first wakes up the voice assistant of the terminal 31 by voice (the wake-up word "Xiao Bu" in fig. 3), and then instructs the terminal 31 to set the reminder with the voice "remind me to go to the hospital".
The terminal 31 sends the collected audio to the server 32, and when the server 32 detects that the voice energy of the user voice "remind me to go to the hospital" is below the energy threshold, it performs an input integrity determination on that user voice. When the input integrity of the user voice satisfies the end-of-speech endpoint determination condition, the server 32 sends a stop-capture instruction to the terminal 31, instructing the terminal 31 to stop collecting audio and to respond to the current user voice; in fig. 3, the terminal 31 asks the question "When should I remind you?".
When the user answers the question, the terminal 31 continues audio collection and transmits the audio to the server 32. When the server 32 detects that the voice energy drops below the energy threshold after the user finishes saying "9 a.m.", it performs a semantic integrity determination on the user voice "9 a.m.". When the input integrity of the user voice does not satisfy the end-of-speech endpoint determination condition, the server 32 further makes a determination based on the speech-rate feature of the current user and the content feature of the dialog content, and when the end-of-speech endpoint is determined, it instructs the terminal 31 to stop collecting audio and to respond to the current user voice; in fig. 3, the terminal 31 sets a reminder based on the event information and the reminder time.
In one possible implementation, the computer device determines the input integrity of the user voice based on both the text features and the audio features of the user voice, that is, based on multi-dimensional features, so as to improve the accuracy of the input integrity determination. This is described below using exemplary embodiments.
Referring to fig. 4, a flow chart of an input integrity determination process provided by an exemplary embodiment of the present application is shown. The process comprises the following steps:
step 401, performing voice recognition on the user voice to obtain a voice text corresponding to the user voice.
In one possible implementation, the computer device converts the user speech into the speech text through Automatic Speech Recognition (ASR) technology. Of course, the computer device may also implement speech-to-text conversion through other technologies, which is not limited in this embodiment.
In other possible embodiments, a terminal that collects the audio may also perform speech-to-text conversion, and send the converted speech text to the server, which is not limited in this embodiment.
Further, the computer device determines the input integrity of the user's voice based on the audio characteristics of the user's voice and the text characteristics of the speech text (steps 402 to 405).
And step 402, performing feature extraction on the user voice through an audio feature extraction model to obtain audio features.
Optionally, an audio feature extraction model for performing feature extraction on the user speech in the acoustic dimension is provided in the computer device. In some embodiments, before performing feature extraction on the user speech through the audio feature extraction model, the computer device first pre-processes the user speech to obtain speech feature parameters of the user speech, where the pre-processing may be Mel-Frequency Cepstral Coefficient (MFCC) extraction and the obtained speech feature parameters are MFCC parameters.
Regarding the model structure of the audio feature extraction model, in one possible implementation the audio feature extraction model is composed of SincNet, a first Long Short-Term Memory (LSTM) network, and a first fully connected layer. The first LSTM network performs long short-term memory processing on the features extracted by SincNet, and the first fully connected layer performs full-connection processing on the output of the first LSTM network. As shown in fig. 5, after the user voice is processed by MFCC extraction to obtain the MFCC parameters, the MFCC parameters pass through SincNet, the first LSTM network and the first fully connected layer in sequence to obtain the audio features.
The model structure of the audio feature extraction model is merely an exemplary illustration, and is not limited to a specific model structure.
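A minimal PyTorch-style sketch of the SincNet, LSTM and fully connected structure described above is given below. A plain Conv1d stands in for the parameterised sinc-filter layer of the real SincNet, and all layer sizes (40 MFCCs, 128 hidden units, 64-dimensional output) are assumptions rather than values from the disclosure.

```python
import torch
import torch.nn as nn

class AudioFeatureExtractor(nn.Module):
    """Sketch: SincNet-like layer -> first LSTM network -> first fully connected layer."""

    def __init__(self, n_mfcc: int = 40, hidden: int = 128, feat_dim: int = 64):
        super().__init__()
        self.sinc_like = nn.Conv1d(n_mfcc, hidden, kernel_size=5, padding=2)  # stand-in for SincNet
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)                 # first LSTM network
        self.fc = nn.Linear(hidden, feat_dim)                                 # first fully connected layer

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, frames, n_mfcc) MFCC parameters of the user speech
        x = self.sinc_like(mfcc.transpose(1, 2)).transpose(1, 2)  # (batch, frames, hidden)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])                                # one audio feature per utterance
```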
And 403, performing feature extraction on the voice text through a text feature extraction model to obtain text features.
Optionally, a text feature extraction model for performing feature extraction on the speech text from a text dimension is set in the computer device. In some embodiments, before performing feature extraction on a voice text by using a text feature extraction model, a computer device first performs a labeling process on the voice text to obtain a labeled text corresponding to the voice text, where the labeling process includes labeling IDs (based on a mapping relationship between words and IDs) of words in the voice text, positions of the labeled words, sentences to which the labeled words belong, and so on.
Regarding the model structure of the text feature extraction model, in one possible implementation, the text feature extraction model is composed of a word embedding (word embedding) network, a second LSTM network, and a second fully-connected layer. The word embedding network is used for carrying out word embedding processing on the marked text to obtain a word embedding vector, the second LSTM network is used for carrying out long-term and short-term memory processing on the word embedding vector output by the word embedding network, and the second full-connection layer is used for carrying out full-connection processing on a result output by the second LSTM network. As shown in fig. 5, after the user speech is converted into the speech text by ASR, the speech text is marked to obtain a marked text, and the marked text sequentially passes through the word embedding network, the second LSTM network, and the second full connection layer to obtain text features.
Optionally, in the training process of the text feature extraction model, the output of the RoBERTa model is used to supervise the text feature extraction model, so that the output of the text feature extraction model is fitted with the RoBERTa model, thereby achieving the purpose of learning the representation of the RoBERTa model on the text and improving the feature extraction quality of the text feature extraction model.
The model structure of the text feature extraction model is merely an example, and is not limited to a specific model structure.
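A corresponding sketch of the word-embedding, second LSTM and second fully connected structure for the text branch is shown below; the vocabulary size, embedding size and hidden size are assumed values introduced for illustration.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """Sketch: word embedding network -> second LSTM network -> second fully connected layer."""

    def __init__(self, vocab_size: int = 10000, embed_dim: int = 128,
                 hidden: int = 128, feat_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)      # word embedding network
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)  # second LSTM network
        self.fc = nn.Linear(hidden, feat_dim)                     # second fully connected layer

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) word IDs from the marked text
        out, _ = self.lstm(self.embedding(token_ids))
        return self.fc(out[:, -1])                                # one text feature per utterance
```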
And step 404, performing feature fusion on the audio features and the text features to obtain fusion features.
Further, the computer equipment performs feature fusion on the extracted audio features and text features to obtain fusion features, and subsequently, input integrity judgment is performed according to the fusion features. Regarding the manner of feature fusion, in one possible implementation, the computer device splices (concat) the audio feature and the text feature, and determines the spliced feature as the fusion feature.
In other possible embodiments, in order to further improve the accuracy of the subsequent input integrity determination, the computer device extracts the high-order features of the user speech, so as to perform feature fusion on the high-order features, the audio features and the text features to obtain fusion features. The high-order feature may include a fundamental frequency, a signal-to-noise ratio, and the like, which is not limited in this embodiment.
Schematically, as shown in fig. 5, the computer device fuses the high-order features obtained by feature extraction, the audio features output by the audio feature extraction model, and the text features output by the text feature extraction model.
Step 405, determining the complete probability of the user's voice based on the fused features.
Further, the computer device inputs the fused features into a classifier to obtain the completeness probability of the user voice output by the classifier, where the completeness probability ranges from 0 to 1. Illustratively, as shown in fig. 5, the computer device inputs the fused features into the classifier to obtain the completeness probability output by the classifier.
Optionally, when the completeness probability is greater than a probability threshold (for example, 0.85), the computer device determines that the input integrity of the user voice satisfies the end-of-speech endpoint determination condition; when the completeness probability is below the probability threshold, it determines that the input integrity of the user voice does not satisfy the end-of-speech endpoint determination condition.
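A sketch of the fusion-and-classification step is given below; the concatenation order, the hidden layer size, the sigmoid output and the dimensionality of the high-order features are assumptions introduced for illustration, and only the 0.85 threshold follows the example above.

```python
import torch
import torch.nn as nn

class CompletenessClassifier(nn.Module):
    """Sketch: concatenate audio, text and high-order features, then output a probability in [0, 1]."""

    def __init__(self, audio_dim: int = 64, text_dim: int = 64, high_dim: int = 2):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + text_dim + high_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid(),
        )

    def forward(self, audio_feat, text_feat, high_feat) -> torch.Tensor:
        fused = torch.cat([audio_feat, text_feat, high_feat], dim=-1)  # feature fusion by concatenation
        return self.classifier(fused).squeeze(-1)                      # completeness probability

# Usage sketch with the assumed 0.85 threshold:
# prob = model(audio_feat, text_feat, high_feat)
# integrity_ok = prob > 0.85
```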
Regarding the training process of the audio feature extraction model and the text feature extraction model, in a possible implementation manner, the computer device obtains the training user speech including the integrity label (for example, label 1 represents that the training user speech is complete, and label 2 represents that the training user speech is incomplete) and the corresponding training speech text, so as to take the training user speech and the training speech text as model training inputs to obtain a predicted integrity probability, and then take the integrity label as supervision of the predicted integrity probability to train the audio feature extraction model and the text feature extraction model.
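Reusing the sketch classes above, a hypothetical training step supervised by 0/1 completeness labels might look as follows; the optimiser, learning rate and label encoding are assumptions, and the RoBERTa-based supervision mentioned earlier is omitted for brevity.

```python
import torch
import torch.nn as nn

audio_net, text_net, clf = AudioFeatureExtractor(), TextFeatureExtractor(), CompletenessClassifier()
params = list(audio_net.parameters()) + list(text_net.parameters()) + list(clf.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
bce = nn.BCELoss()

def train_step(mfcc, token_ids, high_feat, completeness_label):
    prob = clf(audio_net(mfcc), text_net(token_ids), high_feat)
    loss = bce(prob, completeness_label.float())  # completeness label supervises the predicted probability
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```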
In this embodiment, the computer device performs audio and text feature extraction through the pre-trained audio feature extraction model and text feature extraction model, fuses the extracted features, and predicts the completeness probability of the user speech based on the fused features, thereby improving the accuracy of the determination of the integrity of the user speech input. In addition, high-order features of the user voice are further fused in, which increases the feature dimensionality and further improves the accuracy of the input integrity determination.
In the embodiment of the present application, the computer device further obtains the speech speed characteristics of the current user and the content characteristics of the dialog content in the dialog process, and dynamically determines the VAD duration, thereby performing dynamic determination based on the dynamic VAD duration. As shown in fig. 6, the process of the computer device performing the dynamic VAD decision may include the following steps.
Step 601, determining a speech rate characteristic based on the current speech rate of the user speech and the historical speech rate of the user corresponding to the user speech.
In one possible implementation, the computer device counts the speech rate of the user during each conversation, and updates the historical speech rate of the user based on the statistical result, wherein the historical speech rate is used for representing the speaking habit of the user.
When the voice ending endpoint is judged, the computer equipment acquires the historical speech rate of the current user and determines the speech rate characteristics in the current round of conversation process by combining the current speech rate of the user.
The speech rate may be expressed as v = text word count / audio duration, where the text word count is determined from the speech recognition result.
Optionally, the speech rate feature is a weighted speech rate determined by the historical speech rate and the current speech rate. In one possible implementation, the computer device calculates the speech rate feature based on a first speech rate weight corresponding to the current speech rate and a second speech rate weight corresponding to the historical speech rate.
In one illustrative example, the current speech rate of the current user is v1 and the historical speech rate is v2; the speech rate feature is Fv = α·v1 + β·v2, where α and β are preset parameters.
Schematically, as shown in fig. 7, the computer device performs ASR recognition on the user speech to obtain a speech text, calculates a current speech rate based on the speech text and an audio duration corresponding to the text, and performs weighted calculation on the current speech rate and a historical speech rate to obtain a speech rate characteristic of the current user.
In some possible cases, if the computer device cannot obtain the historical speech rate of the current user (for example, the current user uses the voice interaction function for the first time), the computer device may determine the speech rate feature only according to the current speech rate, or determine the speech rate feature based on the current speech rate and the big data speech rate statistical result, which is not limited in this embodiment.
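The sketch below illustrates step 601 as described above; the weight values and the fallback behaviour for users without a historical speech rate are assumptions introduced for illustration.

```python
from typing import Optional

def speech_rate_feature(text_word_count: int,
                        audio_duration_s: float,
                        historical_rate: Optional[float],
                        alpha: float = 0.6,
                        beta: float = 0.4) -> float:
    """Sketch of step 601: v = word count / duration, then a weighted
    combination with the user's historical speech rate."""
    current_rate = text_word_count / audio_duration_s      # v = text word count / audio duration
    if historical_rate is None:                            # e.g. first-time user of the voice assistant
        return current_rate
    return alpha * current_rate + beta * historical_rate   # Fv = alpha * v1 + beta * v2
```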
Step 602, in case of multiple rounds of dialog, inputting the historical dialog contents and the voice contents of the user voice into the dialog system to obtain the content characteristics.
Generally, when a user has multiple rounds of dialog with the terminal, there is usually some correlation or continuity between the dialog contents of those rounds. In the embodiments of this application, the computer device determines the content feature of the dialog content from the historical dialog content and the voice content of the user voice currently being evaluated, and uses the content feature as one of the bases for the end-of-speech endpoint determination. Optionally, the computer device performs domain, intent and slot prediction on the historical dialog content and the voice content through the dialog system, so as to determine the content feature based on the predictions.
In one possible implementation, the computer device performs slot position prediction on the historical conversation content through the conversation system to obtain a predicted slot position value. Further, the computer device determines a first content feature based on the actual slot position value corresponding to the voice content and the predicted slot position value, wherein the first content feature is used for representing the matching degree of the actual slot position value and the predicted slot position value.
Optionally, the computer device determines, through the dialog system, a domain to which a previous dialog in the historical dialog content belongs, and performs slot value prediction based on the domain. When the actual slot position value is matched with the predicted slot position value, the computer equipment sets a first content characteristic as slot position matching; when the actual slot value does not match the predicted slot value, the computer device sets a first content characteristic as a slot mismatch.
In one illustrative example, the computer device determines that the last round of the conversation "remind me to go to hospital" determined the field as a schedule and asks the slot as time through the conversation system to determine the predicted slot value as time information. When the speech content of the user's speech is "today", the computer device determines that the actual slot location value matches the predicted slot location value, setting the first content characteristic to "1".
When the dialog system determines that the previous round of dialog "play the song xxx" belongs to the music domain and that the queried slot is the song name, the predicted slot value is determined to be song-name information. Since the actual slot value corresponding to the voice content "today" (time information) does not match the predicted slot value, the computer device sets the first content feature to "0".
In some scenarios, the actual slot value corresponding to the voice content may not exactly match the predicted slot value, yet there is still some continuity between the two, i.e., the intent can transition correctly between the historical dialog content and the voice content. In one possible implementation, in addition to the slot prediction, the computer device performs intent recognition on the historical dialog content and the voice content through the dialog system to obtain a first intent corresponding to the historical dialog content and a second intent corresponding to the voice content. Further, the computer device determines, from the first intent and the second intent, whether the dialog state can transition correctly, and obtains the second content feature accordingly.
Illustratively, when the dialog state supports transition, the computer device sets the second content characteristic to "1"; when the dialog state does not support a transition, the computer device sets the second content characteristic to "0".
Further, the computer device takes the first content feature and the second content feature as the content feature for the subsequent content-based VAD determination. Optionally, the content feature may be represented as a feature vector; for example, the content feature vector (1, 1) indicates that the slot values match and that the intent supports a dialog state transition.
Illustratively, as shown in fig. 7, the computer device inputs the speech content (which may be converted text) of the user's speech and the historical dialogue content into the dialogue system, resulting in the content characteristics output by the dialogue system.
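A simple sketch of how the two content features of step 602 could be assembled is shown below; representing slots by type strings and passing the dialog-state-transition check in as a boolean are assumptions made for illustration.

```python
from typing import Tuple

def content_features(predicted_slot_type: str,
                     actual_slot_type: str,
                     supports_state_transition: bool) -> Tuple[int, int]:
    """Sketch of step 602: first feature marks slot-value matching,
    second feature marks whether the dialog state can transition correctly."""
    first = 1 if actual_slot_type == predicted_slot_type else 0   # slot value match
    second = 1 if supports_state_transition else 0                # intent / dialog state transition
    return first, second

# e.g. content_features("time", "time", True) -> (1, 1)
```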
Step 603, determining the end point of the speech based on the speech rate characteristics and the content characteristics.
Since different users have different speaking habits, a uniform, fixed VAD duration may not fit all users. For example, the same VAD duration may be too long for a user who speaks quickly, delaying the end-of-speech endpoint determination, and too short for a user who speaks slowly, causing the endpoint to be determined prematurely and affecting the integrity of subsequent recognition. Therefore, in the embodiments of this application, the computer device dynamically determines the VAD duration based on the speech-rate feature of the current user and performs dynamic end-of-speech endpoint determination based on that dynamic VAD duration. As shown in fig. 8, this step may include the following sub-steps.
Step 603A, determining the dynamic VAD duration based on the speech-rate feature, where the dynamic VAD duration is negatively correlated with the speech-rate feature; and determining a first determination confidence based on the audio within the dynamic VAD duration.
In one possible implementation, the computer device calculates the dynamic VAD duration from the speech-rate feature and a default VAD duration. This process can be denoted as VAD_dynamic = VAD_default × Fv. For example, when VAD_default = 600 ms and Fv = 0.8, the computer device determines VAD_dynamic = 480 ms.
It can be seen from the above determination of the dynamic VAD duration that the faster the current user speaks (i.e., the larger the speech-rate feature), the shorter the dynamic VAD duration and the sooner the end-of-speech endpoint is determined; the slower the current user speaks (i.e., the smaller the speech-rate feature), the longer the dynamic VAD duration and the later the end-of-speech endpoint is determined. Dynamic determination based on the user's speaking habits is thereby realized.
Because the domain of the input speech also affects the user's voice input habits to some extent (for example, an alarm-clock-related instruction requires the user to input time information, and the user's pauses while inputting time information may be longer, whereas when the user requests a song, the pauses while inputting the singer name or song name are relatively short), the computer device can, to further improve the accuracy of the dynamic VAD duration, determine the dynamic VAD duration based on the speech-rate feature and then correct it according to the domain feature. In one possible implementation, this process may be denoted as VAD_dynamic = VAD_default × Fv + VAD_domain, where VAD_domain is determined from actual user speech in different domains and may be positive or negative.
For example, when VAD_default = 600 ms, Fv = 0.8, and the current speech content belongs to the schedule domain, the computer device determines that the VAD_schedule correction corresponding to the schedule domain is +50 ms, and the final calculation yields VAD_dynamic = 530 ms.
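The corrected formula above can be sketched as follows; the 600 ms default and the +50 ms schedule-domain correction follow the examples in the text, while the function signature itself is an assumption.

```python
def dynamic_vad_duration(speech_rate_feature: float,
                         default_vad_ms: float = 600.0,
                         domain_correction_ms: float = 0.0) -> float:
    """Sketch of step 603A: VAD_dynamic = VAD_default * Fv + VAD_domain."""
    return default_vad_ms * speech_rate_feature + domain_correction_ms

# Example from the text: dynamic_vad_duration(0.8, 600.0, 50.0) -> 530.0 ms (schedule domain)
```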
After the dynamic VAD time length is determined, the computer device extracts audio based on the dynamic VAD time length (namely, extracts audio of the dynamic VAD time length before the current time), so that the first judgment confidence level of judgment at the current time is obtained based on the extracted window audio. In a possible implementation manner, the computer device determines the first confidence level of the determination based on the short-term energy and the short-term zero-crossing rate of the window audio, which is not described in this embodiment.
Illustratively, as shown in fig. 7, after determining the duration of the dynamic VAD based on the speech rate feature, the computer device extracts a window audio, and performs speech end point determination on the window audio to obtain a first confidence of determination.
Step 603B, determining a second confidence level of the arbitration based on the content characteristics.
In addition to making decisions based on windowed audio, computer devices also need to make decisions in conjunction with content features. In one possible implementation, the computer device determines the second confidence level as the first value in the event that the content feature indicates that the slot values match, or that a session state transition is supported. In a case where the content features indicate that the slot location values do not match and the dialog state transition is not supported, the computer device determines the second confidence level as a second numerical value, the first numerical value being greater than the second numerical value.
For example, when the previous round of conversation is "remind me to go to hospital", and the slot position is queried as time, if the voice of the user is "today" in the current round of conversation, the computer device determines that the slot position values are matched, and further determines that the second judgment confidence level is 0.3; when the previous round of conversation is 'play birthday song of xxx', the computer equipment determines that the slot position values are not matched and do not support conversation state transition, and further determines that the second judgment confidence level is 0.
Illustratively, as shown in fig. 7, the computer device determines to obtain the second confidence level of the arbitration based on the content feature output by the dialog system.
Step 603C, determining a total arbitration confidence level based on the first arbitration confidence level and the second arbitration confidence level.
In a possible implementation manner, the computer device performs weighted summation on the first confidence level and the second confidence level to obtain a total confidence level, wherein confidence weights corresponding to the respective confidence levels are preset.
In other possible embodiments, to further improve the determination accuracy, the computer device performs voice type recognition on the audio within the duration of the dynamic VAD to obtain a voice type recognition result. The computer device determines a third confidence level of the arbitration based on the voice type recognition result, and further determines a total confidence level of the arbitration based on the first confidence level of the arbitration, the second confidence level of the arbitration, and the third confidence level of the arbitration.
Optionally, the sound type recognition result may include human voice, background sound, noise, and the like, and the computer device presets a third confidence for each sound type, for example third confidence(noise) > third confidence(background sound) > third confidence(human voice).
Illustratively, as shown in fig. 7, the computer device recognizes the extracted window audio to obtain the sound type recognition result and determines the third determination confidence based on that result. Further, the computer device performs a weighted calculation on the first determination confidence, the second determination confidence and the third determination confidence to obtain the total determination confidence.
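The confidence fusion of steps 603B and 603C can be sketched as below; the weight values are assumptions (the text only states that the weights are preset), and the 0.3 / 0 mapping for the content-based confidence follows the example above.

```python
def second_confidence(slot_match: bool, supports_transition: bool) -> float:
    """Content-based confidence: higher when the slot values match or a
    dialog state transition is supported (0.3 and 0.0 follow the example)."""
    return 0.3 if (slot_match or supports_transition) else 0.0

def total_confidence(first_conf: float,
                     second_conf: float,
                     third_conf: float,
                     weights: tuple = (0.5, 0.3, 0.2)) -> float:
    """Sketch of step 603C: weighted combination of the window-audio confidence,
    the content-feature confidence and the sound-type confidence."""
    w1, w2, w3 = weights
    return w1 * first_conf + w2 * second_conf + w3 * third_conf
```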
Step 603D, determining the voice ending endpoint when the total confidence level of the judgment is higher than the confidence threshold.
Optionally, the computer device detects whether the total determination confidence is higher than a confidence threshold (for example, 0.8). If it is higher than the confidence threshold, the current time is determined to be the end-of-speech endpoint; if it is lower than the confidence threshold, the total determination confidence is computed again in the next detection period (since the first determination confidence keeps increasing as the user's silence lasts longer, the total determination confidence also keeps increasing) and compared with the confidence threshold again. For example, the computer device performs a round of endpoint detection every 20 ms.
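A sketch of this periodic detection loop is shown below; the callback is a placeholder for the per-period confidence computation, and only the 0.8 threshold and 20 ms period follow the examples in the text.

```python
import time

def wait_for_end_of_speech(compute_total_confidence,
                           confidence_threshold: float = 0.8,
                           period_s: float = 0.02) -> None:
    """Sketch of step 603D: re-run the decision every detection period
    until the total confidence exceeds the threshold."""
    while compute_total_confidence() <= confidence_threshold:
        time.sleep(period_s)   # wait for the next detection period (e.g. 20 ms)
    # total confidence exceeded the threshold: the current time is the end-of-speech endpoint
```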
In this embodiment, the computer device dynamically determines the VAD duration based on the speech-rate feature of the current user, thereby realizing dynamic VAD determination that adapts to different users and improving determination accuracy. In addition, the computer device determines the total determination confidence at the current time from determination confidences of multiple dimensions, which further improves the accuracy of the end-of-speech endpoint determination.
Referring to fig. 9, a block diagram of a device for determining a voice endpoint according to an embodiment of the present application is shown. The apparatus may be implemented as all or part of a computer device in software, hardware, or a combination of both. The device includes:
an audio acquiring module 901, configured to acquire an audio acquired by a microphone;
an integrity determination module 902, configured to determine input integrity of a user voice in audio when voice energy of the user voice is lower than an energy threshold;
a determining module 903, configured to determine, when the input integrity does not satisfy a determination condition of an end-of-speech endpoint, the end-of-speech endpoint based on the speech rate feature and the content feature.
Optionally, the method further includes:
a speech rate feature determination module, configured to determine the speech rate feature based on a current speech rate of the user speech and a historical speech rate of a user corresponding to the user speech;
and the content characteristic determining module is used for inputting the historical conversation content and the voice content of the user voice into the conversation system under the condition that multiple rounds of conversations exist, so as to obtain the content characteristic.
Optionally, the speech rate feature determining module is configured to:
and calculating the speech rate characteristics based on a first speech rate weight corresponding to the current speech rate and a second speech rate weight corresponding to the historical speech rate.
Optionally, the content feature determination module is configured to:
performing slot position prediction on the historical conversation content through the conversation system to obtain a predicted slot position value; determining a first content feature based on the predicted slot position value and an actual slot position value corresponding to the voice content, wherein the first content feature is used for representing the matching degree of the actual slot position value and the predicted slot position value;
performing intention recognition on the historical conversation content and the voice content through the conversation system to obtain a first intention corresponding to the historical conversation content and a second intention corresponding to the voice content; determining a second content feature based on the first intention and the second intention, wherein the second content feature is used for representing the support condition of the dialogue state transition between the second intention and the first intention;
determining the first content characteristic and the second content characteristic as the content characteristic.
Optionally, the determining module 903 includes:
the first confidence coefficient determining unit is used for determining the duration of the dynamic VAD based on the speech speed characteristics, and the duration of the dynamic VAD and the speech speed characteristics are in a negative correlation relationship; determining a first arbitration confidence level based on audio within the dynamic VAD duration;
a second confidence determining unit configured to determine a second confidence of the decision based on the content feature;
a total confidence determining unit configured to determine a total arbitration confidence level based on the first arbitration confidence level and the second arbitration confidence level;
and the judging unit is used for judging the voice ending endpoint under the condition that the total judgment confidence level is higher than a confidence threshold.
Optionally, the second confidence determining unit is configured to:
determining the second confidence as a first numerical value under the condition that the content features indicate that slot position values are matched or conversation state transition is supported;
determining the second confidence level as a second numerical value if the content features indicate slot value mismatches and the dialog state transition is not supported, the first numerical value being greater than the second numerical value.
Optionally, the determining module 903 further includes:
the type identification unit is used for carrying out voice type identification on the audio in the duration of the dynamic VAD to obtain a voice type identification result;
a third confidence determining unit configured to determine a third confidence of the decision based on the voice type recognition result;
the total confidence determining unit is further configured to determine the total confidence based on the first confidence, the second confidence, and the third confidence.
Optionally, the integrity determining module 902 includes:
the voice recognition unit is used for carrying out voice recognition on the user voice to obtain a voice text corresponding to the user voice;
an integrity determination unit configured to determine the input integrity of the user speech based on an audio feature of the user speech and a text feature of the speech text.
Optionally, the integrity determination unit is configured to:
carrying out feature extraction on the user voice through an audio feature extraction model to obtain the audio features;
performing feature extraction on the voice text through a text feature extraction model to obtain text features;
performing feature fusion on the audio features and the text features to obtain fusion features;
determining a complete probability of the user speech based on the fused features.
Optionally, the audio feature extraction model is composed of a sincenet, a first LSTM network, and a first full-link layer, and the text feature extraction model is composed of a word-embedded network, a second LSTM network, and a second full-link layer;
the integrity determination unit is configured to:
splicing the audio features output by the first full connection layer and the text features output by the second full connection layer to obtain the fusion features;
and inputting the fusion features into a classifier to obtain the complete probability output by the classifier.
Optionally, the apparatus further comprises:
the feature extraction module is used for extracting high-order features of the user voice;
the integrity determination unit is further configured to:
and performing feature fusion on the high-order features, the audio features and the text features to obtain the fusion features.
Optionally, the determining module 903 is further configured to:
and in the case that the input integrity meets the judgment condition of the voice end point, judging the voice end point.
In summary, in the embodiments of this application, when the voice energy is below the energy threshold, the end-of-speech endpoint is determined based on the input integrity of the user voice, and when the input integrity does not satisfy the end-of-speech endpoint determination condition, the end-of-speech endpoint is determined based on the speech-rate feature of the current user and the content feature of the user voice. With this scheme, the end-of-speech endpoint can be determined dynamically based on the integrity of the input content, the voice input habits of different users, and the currently input voice content; compared with determining the end-of-speech endpoint from a fixed VAD duration, this helps improve the accuracy of end-of-speech endpoint determination.
Referring to fig. 10, a block diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer device 1000 in the present application may include one or more of the following components: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores. The processor 1001 connects various parts of the computer device using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 1002 and calling data stored in the memory 1002. Alternatively, the processor 1001 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1001 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Neural-network Processing Unit (NPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU renders and draws the content to be displayed on the touch display screen; the NPU implements Artificial Intelligence (AI) functions; and the modem handles wireless communication. It is understood that the modem may not be integrated into the processor 1001 and may instead be implemented by a separate chip.
The Memory 1002 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1002 includes a non-transitory computer-readable medium. The memory 1002 may be used to store instructions, programs, code sets or instruction sets. The memory 1002 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the data storage area may store data created according to the use of the computer device (such as audio data and a phonebook), and the like.
Optionally, when the computer device is a terminal, the terminal may further include a display screen, a microphone, and a speaker. The display screen is a display component for presenting a user interface. Optionally, the display screen also has a touch function, through which a user can perform touch operations on the display screen with any suitable object such as a finger or a stylus. The microphone is used for collecting audio, and the speaker is used for playing speech, so as to implement voice interaction with the user.
In addition, those skilled in the art will appreciate that the structure of the computer device shown in the figure above does not constitute a limitation on the computer device, and the computer device may include more or fewer components than those shown, combine some components, or use a different arrangement of components. For example, the computer device may further include a camera module, a radio frequency circuit, an input unit, sensors (such as an acceleration sensor, an angular velocity sensor, and a light sensor), a Wireless Fidelity (WiFi) module, a power supply, a Bluetooth module, and other components, which are not described herein again.
The embodiment of the present application further provides a computer-readable medium, where at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for determining a voice endpoint according to the above embodiments.
The embodiment of the present application further provides a computer program product comprising at least one instruction, and the at least one instruction is loaded and executed by a processor to implement the method for determining a voice endpoint according to the above embodiments.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media, the latter including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (16)

1. A method for determining a voice endpoint, the method comprising:
acquiring audio collected by a microphone;
determining input integrity of a user's voice in audio in the event that voice energy of the user's voice is below an energy threshold;
and in the case that the input integrity does not meet the determination condition of the end-of-speech endpoint, determining the end-of-speech endpoint based on a speech rate feature and a content feature.
2. The method according to claim 1, wherein before determining the end-of-speech endpoint based on the speech rate feature and the content feature, the method comprises:
determining the speech rate feature based on a current speech rate of the user voice and a historical speech rate of the user corresponding to the user voice;
and in the case that a multi-turn dialog exists, inputting historical dialog content and voice content of the user voice into a dialog system to obtain the content feature.
3. The method of claim 2, wherein the determining the speech rate feature based on the current speech rate and the historical speech rate comprises:
calculating the speech rate feature based on a first speech rate weight corresponding to the current speech rate and a second speech rate weight corresponding to the historical speech rate.
4. The method of claim 2, wherein the inputting the historical dialog content and the voice content of the user voice into the dialog system to obtain the content feature comprises:
performing slot prediction on the historical dialog content through the dialog system to obtain a predicted slot value; and determining a first content feature based on the predicted slot value and an actual slot value corresponding to the voice content, wherein the first content feature is used for representing a degree of matching between the actual slot value and the predicted slot value;
performing intention recognition on the historical dialog content and the voice content through the dialog system to obtain a first intention corresponding to the historical dialog content and a second intention corresponding to the voice content; and determining a second content feature based on the first intention and the second intention, wherein the second content feature is used for representing whether the dialog state transition between the first intention and the second intention is supported;
and determining the first content feature and the second content feature as the content feature.
5. The method of claim 2, wherein the determining the end-of-speech endpoint based on the speech rate feature and the content feature comprises:
determining a dynamic VAD duration based on the speech rate feature, wherein the dynamic VAD duration is negatively correlated with the speech rate feature; and determining a first decision confidence based on audio within the dynamic VAD duration;
determining a second decision confidence based on the content feature;
determining a total decision confidence based on the first decision confidence and the second decision confidence;
and determining the end-of-speech endpoint in the case that the total decision confidence is higher than a confidence threshold.
6. The method of claim 5, wherein the determining the second decision confidence based on the content feature comprises:
determining the second decision confidence as a first numerical value in the case that the content feature indicates that the slot values are matched or the dialog state transition is supported;
and determining the second decision confidence as a second numerical value in the case that the content feature indicates that the slot values are not matched and the dialog state transition is not supported, wherein the first numerical value is greater than the second numerical value.
7. The method of claim 5, further comprising:
performing voice type recognition on the audio within the dynamic VAD duration to obtain a voice type recognition result;
determining a third decision confidence based on the voice type recognition result;
and determining the total decision confidence based on the first decision confidence, the second decision confidence, and the third decision confidence.
8. The method of claim 1, wherein the determining the input integrity of the user voice comprises:
performing voice recognition on the user voice to obtain a voice text corresponding to the user voice;
determining the input integrity of the user voice based on audio features of the user voice and text features of the voice text.
9. The method of claim 8, wherein the determining the input integrity of the user voice based on the audio features of the user voice and the text features of the voice text comprises:
carrying out feature extraction on the user voice through an audio feature extraction model to obtain the audio features;
performing feature extraction on the voice text through a text feature extraction model to obtain text features;
performing feature fusion on the audio features and the text features to obtain fusion features;
determining a completeness probability of the user voice based on the fusion features.
10. The method of claim 9, wherein the audio feature extraction model consists of a SincNet network, a first LSTM network, and a first fully connected layer, and wherein the text feature extraction model consists of a word embedding network, a second LSTM network, and a second fully connected layer;
the performing feature fusion on the audio features and the text features to obtain the fusion features comprises:
splicing the audio features output by the first fully connected layer and the text features output by the second fully connected layer to obtain the fusion features;
the determining the completeness probability of the user voice based on the fusion features comprises:
and inputting the fusion features into a classifier to obtain the completeness probability output by the classifier.
11. The method of claim 9, further comprising:
extracting high-order features of the user voice;
the performing feature fusion on the audio features and the text features to obtain fusion features includes:
and performing feature fusion on the high-order features, the audio features and the text features to obtain the fusion features.
12. The method according to any one of claims 1 to 11, further comprising:
and in the case that the input integrity meets the determination condition of the end-of-speech endpoint, determining the end-of-speech endpoint.
13. An apparatus for determining a voice endpoint, the apparatus comprising:
the audio acquisition module is used for acquiring audio collected by the microphone;
the integrity determination module is used for determining the input integrity of the user voice under the condition that the voice energy of the user voice in the audio is lower than an energy threshold value;
and the determining module is used for determining the end-of-speech endpoint based on a speech rate feature and a content feature in the case that the input integrity does not meet the determination condition of the end-of-speech endpoint.
14. A computer device, wherein the computer device comprises a processor and a memory; the memory stores at least one instruction for execution by the processor to implement a method of determining a voice endpoint as claimed in any one of claims 1 to 12.
15. A computer-readable storage medium having stored thereon at least one instruction for execution by a processor to implement a method for determining a voice endpoint as claimed in any one of claims 1 to 12.
16. A computer program product comprising at least one instruction which is loaded and executed by a processor to implement a method for speech endpoint determination as claimed in any one of claims 1 to 12.
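As an illustration of the two-branch integrity model outlined in claims 9 to 11 (an audio feature extraction model, a text feature extraction model, feature fusion by splicing, and a classifier that outputs the completeness probability), the following PyTorch sketch shows one possible arrangement. A plain Conv1d front end stands in for the SincNet network named in claim 10, the high-order features of claim 11 are passed in as a precomputed vector, and all layer sizes, dimensions, and names are assumptions made for this example rather than the filed implementation.

```python
import torch
import torch.nn as nn

class IntegrityModel(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, hidden=128, feat_dim=64, high_order_dim=8):
        super().__init__()
        # Audio branch: waveform -> convolutional front end (SincNet stand-in) -> LSTM -> fully connected layer.
        self.audio_frontend = nn.Conv1d(1, 32, kernel_size=251, stride=80, padding=125)
        self.audio_lstm = nn.LSTM(32, hidden, batch_first=True)
        self.audio_fc = nn.Linear(hidden, feat_dim)
        # Text branch: token ids -> word embedding -> LSTM -> fully connected layer.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.text_fc = nn.Linear(hidden, feat_dim)
        # Classifier over the spliced (fused) features plus the high-order features.
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim * 2 + high_order_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, waveform, token_ids, high_order):
        # waveform: (batch, samples); token_ids: (batch, seq_len); high_order: (batch, high_order_dim)
        a = self.audio_frontend(waveform.unsqueeze(1))        # (batch, 32, frames)
        _, (a_h, _) = self.audio_lstm(a.transpose(1, 2))      # last hidden state of the audio LSTM
        audio_feat = self.audio_fc(a_h[-1])
        _, (t_h, _) = self.text_lstm(self.embed(token_ids))   # last hidden state of the text LSTM
        text_feat = self.text_fc(t_h[-1])
        fused = torch.cat([audio_feat, text_feat, high_order], dim=1)  # feature fusion by splicing
        return torch.sigmoid(self.classifier(fused)).squeeze(1)       # completeness probability

model = IntegrityModel()
prob = model(torch.randn(2, 16000), torch.randint(0, 5000, (2, 12)), torch.randn(2, 8))
print(prob.shape)  # torch.Size([2])
```

Splicing the two branch outputs together with the high-order vector mirrors the fusion step of claims 10 and 11, and the sigmoid over the classifier output yields the completeness probability used for the integrity check.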
CN202111596502.1A 2021-12-24 2021-12-24 Method, device, equipment, storage medium and product for judging voice endpoint Pending CN114495981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111596502.1A CN114495981A (en) 2021-12-24 2021-12-24 Method, device, equipment, storage medium and product for judging voice endpoint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111596502.1A CN114495981A (en) 2021-12-24 2021-12-24 Method, device, equipment, storage medium and product for judging voice endpoint

Publications (1)

Publication Number Publication Date
CN114495981A true CN114495981A (en) 2022-05-13

Family

ID=81495448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111596502.1A Pending CN114495981A (en) 2021-12-24 2021-12-24 Method, device, equipment, storage medium and product for judging voice endpoint

Country Status (1)

Country Link
CN (1) CN114495981A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132178A (en) * 2022-07-15 2022-09-30 科讯嘉联信息技术有限公司 Semantic endpoint detection system based on deep learning
CN115132178B (en) * 2022-07-15 2023-01-10 科讯嘉联信息技术有限公司 Semantic endpoint detection system based on deep learning

Similar Documents

Publication Publication Date Title
US11810562B2 (en) Reducing the need for manual start/end-pointing and trigger phrases
US11455989B2 (en) Electronic apparatus for processing user utterance and controlling method thereof
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
JP7230806B2 (en) Information processing device and information processing method
CN112292724A (en) Dynamic and/or context-specific hotwords for invoking automated assistants
KR102628211B1 (en) Electronic apparatus and thereof control method
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
US20230176813A1 (en) Graphical interface for speech-enabled processing
WO2019242414A1 (en) Voice processing method and apparatus, storage medium, and electronic device
JP7365985B2 (en) Methods, devices, electronic devices, computer-readable storage media and computer programs for recognizing speech
CN112825248A (en) Voice processing method, model training method, interface display method and equipment
JP6915637B2 (en) Information processing equipment, information processing methods, and programs
WO2022222045A1 (en) Speech information processing method, and device
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN114495981A (en) Method, device, equipment, storage medium and product for judging voice endpoint
CN111145748B (en) Audio recognition confidence determining method, device, equipment and storage medium
CN110853669B (en) Audio identification method, device and equipment
JP6306447B2 (en) Terminal, program, and system for reproducing response sentence using a plurality of different dialogue control units simultaneously
CN114121022A (en) Voice wake-up method and device, electronic equipment and storage medium
CN114664303A (en) Continuous voice instruction rapid recognition control system
CN112581937A (en) Method and device for acquiring voice instruction
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
CN115410557A (en) Voice processing method and device, electronic equipment and storage medium
CN113921016A (en) Voice processing method, device, electronic equipment and storage medium
CN115438625A (en) Text error correction server, terminal device and text error correction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination