WO2023115588A1

WO2023115588A1 - Speech interaction method and apparatus, and storage medium

Info

Publication number: WO2023115588A1
Application number: PCT/CN2021/141405
Authority: WO
Inventors: 唐瑞雪; 高益; 聂为然
Original assignee: 华为技术有限公司
Priority date: 2021-12-25
Filing date: 2021-12-25
Publication date: 2023-06-29
Also published as: CN116670760A

Abstract

A speech interaction method and apparatus, and a storage medium. The method comprises: acquiring a first audio signal, wherein the first audio signal comprises a first speech command (S410); determining the duration of a first timer according to first text corresponding to the first speech command (S420); starting the first timer (S430); acquiring a second audio signal, wherein a start moment of the second audio signal is equal to or later than an end moment of the first audio signal (S440); when text corresponding to a speech command in the second audio signal is blank, determining an end moment of the first timer as a speech endpoint (S450); and after the speech endpoint is determined, responding to the first speech command (S460). By means of the method, a speech endpoint can be flexibly determined, such that the problem of a speech response delay being too long due to noise can be alleviated, and premature truncation of speech interaction due to the fact that a user pauses during speech is reduced.

Description

Method, device and storage medium for voice interaction

Technical field

The present application relates to the field of human-computer interaction, and more specifically, to a voice interaction method, device and storage medium.

Background art

Voice recognition is widely used in smart home devices, smart in-vehicle devices, and other equipment to enable natural human-computer voice interaction. To identify the valid speech segment in an audio signal, automatic speech recognition (ASR) relies on front endpoint detection and back endpoint detection, that is, detecting the start and the end of speech. Back endpoint detection often suffers from excessive delay or premature truncation because of background noise, differences in users' speech rates, and pauses in a user's speech.

Summary of the invention

Embodiments of the present application provide a voice interaction method, device, and storage medium, which can improve user experience with voice responses.

In a first aspect, a voice interaction method is provided, the method comprising: acquiring a first audio signal, the first audio signal including a first voice instruction; determining the duration of a first timer according to the first text corresponding to the first voice instruction; starting the first timer; acquiring a second audio signal, the start moment of the second audio signal being equal to or later than the end moment of the first audio signal; when the text corresponding to the voice instruction in the second audio signal is empty, determining the end moment of the first timer as the voice endpoint; and after determining the voice endpoint, responding to the first voice instruction.

In the embodiments of the present application, the timer duration can be determined according to the text corresponding to the voice instruction in the voice interaction, and the voice endpoint can be flexibly determined according to the timer and the second audio signal. This can alleviate the problem of excessively long voice response delay caused by noise and reduce cases where the voice interaction is cut off prematurely because the user pauses while speaking. Furthermore, shortening the system response delay improves the response speed to voice instructions and thus the user experience.
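The steps of the first aspect can be sketched as follows. This is a minimal illustration, not the implementation of the application: the names `detect_endpoint`, `EndpointResult`, and `toy_duration` are hypothetical, the step labels follow the abstract (S420 to S450), and the word-count rule used to pick a duration is an assumption made only so the example runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EndpointResult:
    is_endpoint: bool
    endpoint_time: Optional[float]  # seconds; None when no endpoint was found


def detect_endpoint(
    first_text: str,
    timer_start: float,
    second_text: str,
    duration_for_text: Callable[[str], float],
) -> EndpointResult:
    """Decide whether the first timer's expiry is the voice endpoint.

    second_text is the recognized text of the audio received after the
    first audio signal; an empty string means no voice instruction.
    """
    duration = duration_for_text(first_text)   # S420: text -> timer duration
    timer_end = timer_start + duration         # S430: timer runs until here
    if second_text.strip() == "":              # S450: nothing new was said
        return EndpointResult(True, timer_end)
    return EndpointResult(False, None)         # the user kept talking


# Hypothetical policy: wait longer after a short (likely unfinished) text.
def toy_duration(text: str) -> float:
    return 0.3 if len(text.split()) >= 3 else 1.0
```

With this sketch, a complete-looking text followed by silence yields an endpoint at the timer's expiry, while a short text followed by more speech yields no endpoint and the caller would continue listening.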

With reference to the first aspect, in some implementations of the first aspect, the method further includes: when the energy of the audio frame of the second audio signal is less than or equal to the first threshold, the end time of the first timer can be identified as a voice endpoint.

In this embodiment of the present application, whether the second audio signal includes a voice command can be determined through the energy of the audio frame of the second audio signal, which can reduce the misjudgment rate of the voice endpoint.

With reference to the first aspect, in some implementations of the first aspect, when the text corresponding to the voice instruction in the second audio signal is empty, determining the end moment of the first timer as the voice endpoint includes: when the text corresponding to the voice instruction in the second audio signal is empty and the energy of the audio frame of the second audio signal is less than or equal to the first threshold, determining the end moment of the first timer as the voice endpoint.

In the embodiments of the present application, combining the energy of the audio frames of the second audio signal with the text obtained from the audio signal makes it possible to determine more accurately whether the second audio signal includes a voice instruction, thereby improving the accuracy of the determined voice endpoint.
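The combined check on text and frame energy might look like the sketch below. The energy measure (mean squared amplitude), the rule that every frame must be at or below the threshold, and the function names are all assumptions for illustration; the application only states that frame energy is compared against a first threshold.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (an assumed energy measure)."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


def second_signal_is_silent(
    frames: Sequence[Sequence[float]],
    recognized_text: str,
    energy_threshold: float,
) -> bool:
    """True when the text is empty AND every frame is at or below the threshold."""
    text_empty = recognized_text.strip() == ""
    low_energy = all(frame_energy(f) <= energy_threshold for f in frames)
    return text_empty and low_energy
```

Requiring both conditions means that loud noise with no recognized text, or recognized text in a quiet signal, does not by itself produce an endpoint.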

With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring a second text, the second text being text displayed on a display screen; and when the first text corresponding to the first voice instruction matches the second text, performing the operation indicated by the second text.

In the embodiments of the present application, performing the operation indicated by the second text realizes a see-and-speak function, so that the user can interact with the user equipment by voice without touching it, which improves user experience. In addition, the matching between the first text corresponding to the voice instruction and the second text displayed on the display screen can be performed before voice endpoint detection, rather than after the complete voice instruction has been determined. In this way, the response time to the user's voice instruction can be significantly shortened. Moreover, when no match is found, the first text can still be used for voice endpoint detection, so the matching does not affect voice endpoint detection.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring the second text displayed on the display screen; and determining the duration of the first timer according to the first text corresponding to the first voice instruction includes: when the first text corresponding to the first voice instruction does not match the second text, determining the duration of the first timer according to the first text corresponding to the first voice instruction.

In the embodiments of the present application, when the voice instruction given by the user does not match the text displayed on the display screen, the duration of the first timer can be determined according to the voice instruction in the voice interaction, and the user's voice instruction can then be responded to. The response to the user's voice instruction is therefore not limited to the operations indicated by the text on the display screen, which gives the method a wider scope of application.
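The order of operations described above (match against on-screen text first, fall back to timer-based endpoint detection otherwise) can be sketched as follows. The function `handle_command`, the dictionary of on-screen labels, and the case-insensitive exact match are hypothetical choices; the application does not specify how matching is performed.

```python
from typing import Callable, Dict, Optional


def handle_command(
    first_text: str,
    on_screen: Dict[str, Callable[[], str]],
    fallback: Callable[[str], str],
) -> str:
    """Run the on-screen action if the spoken text matches a displayed label;
    otherwise hand the text to timer-based endpoint detection."""
    key = first_text.strip().lower()  # assumed normalization before matching
    action: Optional[Callable[[], str]] = on_screen.get(key)
    if action is not None:
        return action()          # matched: act without waiting for an endpoint
    return fallback(first_text)  # no match: determine the timer from the text
```

A caller would pass the labels currently shown on the display as `on_screen` and its endpoint-detection routine as `fallback`.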

With reference to the first aspect, in some implementations of the first aspect, the method further includes: acquiring a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and the start moment of the first preset time is equal to or later than the end moment of the first audio signal; and determining the duration of the first timer according to the first text corresponding to the first voice instruction includes: when the third audio signal does not include a voice instruction, determining the duration of the first timer according to the first text corresponding to the first voice instruction.

In the embodiments of the present application, the third audio signal received within the first preset time is acquired, and the duration of the first timer is determined according to the first text only when the third audio signal is determined not to include a voice instruction. This can reduce how often voice endpoint detection is performed and save the resources it occupies.

With reference to the first aspect, in some implementations of the first aspect, before acquiring the first audio signal, the method further includes: acquiring a fourth audio signal, where the fourth audio signal includes a third voice instruction; determining the duration of a second timer according to the third text corresponding to the third voice instruction; starting the second timer, and acquiring a fifth audio signal while the second timer is running, where the end moment of the second timer is earlier than or equal to the start moment of the first timer; and when the text corresponding to the voice instruction in the fifth audio signal is not empty, determining the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.

In a voice interaction, the user may pause several times, so voice endpoint detection may be performed several times while determining the voice endpoint. When one round of voice endpoint detection fails, detection can be performed again according to the audio signal in the voice interaction, until a voice endpoint is successfully detected and the voice instruction is responded to. To distinguish the audio signals and texts used in the different rounds, the text used in the current round of voice endpoint detection is defined as the first text and the corresponding audio signal as the first audio signal; the text used to determine the timer in the previous round is defined as the third text and the corresponding audio signal as the fourth audio signal. The fourth audio signal can be part of the first audio signal, and the third text can be part of the first text.

In the embodiments of the present application, the text used in the previous, failed round of voice endpoint detection is used as part of the first text in the current round. This makes full use of the voice instructions obtained in the voice interaction and their corresponding text, so that the duration of the first timer can be determined more accurately. The accuracy of voice endpoint detection is thereby improved, inappropriate truncation of the voice interaction is avoided, and the impact of noise and of pauses in the user's speech is alleviated.
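The repeated detection rounds can be illustrated with a small loop that merges the text of a failed round into the text used for the next round. This is a sketch under assumptions: `run_endpoint_rounds` is a hypothetical name, recognized text stands in for the audio signals, and joining texts with a space is an illustrative way of forming the first text from the third text and the newly recognized text.

```python
from typing import Callable, Iterable, List, Tuple


def run_endpoint_rounds(
    initial_text: str,
    later_chunks: Iterable[str],
    duration_for_text: Callable[[str], float],
) -> Tuple[str, List[float]]:
    """Repeat endpoint detection until a timer window contains no new text.

    later_chunks yields the text recognized during each successive timer
    window ("" means silence). Returns the accumulated text and the timer
    durations that were used, one per round.
    """
    text = initial_text
    durations: List[float] = []
    for chunk in later_chunks:
        durations.append(duration_for_text(text))  # timer for this round
        if chunk.strip() == "":
            return text, durations                 # endpoint found: respond
        text = f"{text} {chunk}".strip()           # failed round: merge, retry
    return text, durations                         # input ended, no endpoint
```

Because the timer for each round is computed from the accumulated text, a later round can use a shorter wait once the instruction looks complete.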

With reference to the first aspect, in some implementations of the first aspect, the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.

In the embodiments of the present application, the first audio signal may include all the voice instructions in the audio signal received between the start moment of the fourth audio signal and the end moment of the fifth audio signal. This can improve the accuracy of the determined first timer, and therefore the accuracy of voice endpoint detection and, in turn, the user experience.

With reference to the first aspect, in some implementations of the first aspect, determining the duration of the first timer according to the first text corresponding to the first voice instruction includes: inputting the first text corresponding to the first voice instruction into a prediction model to obtain the semantic completeness of the first text; and determining the duration of the first timer according to the semantic completeness of the first text.

Optionally, the semantic completeness may refer to the completeness of the semantics. Exemplarily, the semantic completeness of the first text may refer to the completeness of the semantics of the first text. Optionally, the first information may be used to characterize the semantic completeness.

In the embodiments of the present application, inputting the first text into the prediction model yields the semantic completeness of the first text, from which it can be determined whether the corresponding voice instruction is complete, so that the voice endpoint can be flexibly determined.
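One possible way to turn a semantic completeness score into a timer duration is sketched below. The assumption that completeness lies in [0, 1], the linear mapping, and the default bounds of 0.2 s and 1.5 s are illustrative only; this passage of the application does not give the mapping.

```python
def timer_duration(
    completeness: float, min_wait: float = 0.2, max_wait: float = 1.5
) -> float:
    """Map semantic completeness in [0, 1] to a wait time in seconds.

    Higher completeness gives a shorter wait. The linear form and the
    default bounds are illustrative assumptions, not values from the
    application.
    """
    c = min(1.0, max(0.0, completeness))  # clamp out-of-range model outputs
    return max_wait - c * (max_wait - min_wait)
```

Under this sketch a text judged fully complete is answered after the minimum wait, while an incomplete text keeps the device listening for longer, which matches the stated goal of avoiding premature truncation.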

In a second aspect, a device for voice interaction is provided, the device including: an acquisition module, configured to acquire a first audio signal, the first audio signal including a first voice instruction, and to acquire a second audio signal, the start moment of the second audio signal being equal to or later than the end moment of the first audio signal; and a processing module, configured to: determine the duration of a first timer according to the first text corresponding to the first voice instruction; start the first timer; when the text corresponding to the voice instruction in the second audio signal is empty, determine the end moment of the first timer as the voice endpoint; and after determining the voice endpoint, respond to the first voice instruction.

With reference to the second aspect, in some implementations of the second aspect, the processing module may also be configured to: when the energy of the audio frame of the second audio signal is less than or equal to the first threshold, determine the end moment of the first timer as the voice endpoint.

With reference to the second aspect, in some implementations of the second aspect, the processing module is specifically configured to: when the text corresponding to the voice instruction in the second audio signal is empty and the energy of the audio frame of the second audio signal is less than or equal to the first threshold, determine the end moment of the first timer as the voice endpoint.

With reference to the second aspect, in some implementations of the second aspect, the acquisition module is further configured to acquire the second text displayed on the display screen; and the processing module is specifically configured to: when the first text corresponding to the first voice instruction does not match the second text, determine the duration of the first timer according to the first text corresponding to the first voice instruction.

With reference to the second aspect, in some implementations of the second aspect, the acquisition module is further configured to acquire a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and the start moment of the first preset time is equal to or later than the end moment of the first audio signal; and the processing module is specifically configured to: when the third audio signal does not include a voice instruction, determine the duration of the first timer according to the first text corresponding to the first voice instruction.

With reference to the second aspect, in some implementations of the second aspect, the acquisition module is further configured to: acquire a fourth audio signal before acquiring the first audio signal, where the fourth audio signal includes a third voice instruction; and acquire a fifth audio signal while the second timer is running; and the processing module is further configured to: determine the duration of the second timer according to the third text corresponding to the third voice instruction; start the second timer, where the end moment of the second timer is earlier than or equal to the start moment of the first timer; and when the text corresponding to the voice instruction in the fifth audio signal is not empty, determine the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.

With reference to the second aspect, in some implementations of the second aspect, the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.

With reference to the second aspect, in some implementations of the second aspect, the processing module is specifically configured to: input the first text corresponding to the first voice instruction into the prediction model to obtain the semantic completeness of the first text; and determine the duration of the first timer according to the semantic completeness of the first text.

In a third aspect, a method for training a prediction model for voice interaction is provided, the method comprising: acquiring a text data set, where the text data set includes a plurality of fourth texts, the fourth texts are marked with first information, and the first information is used to represent the semantic completeness of the text; and performing model training according to the text data set to obtain a prediction model, where the prediction model is used to predict the semantic completeness of a voice instruction.

In the embodiments of the present application, model training can be performed according to the text data set to obtain a prediction model. Through training, the prediction model learns the relationship between a text and its semantic completeness from the fourth texts in the text data set, so that, in the prediction stage, the semantic completeness of a text to be analyzed can be predicted by the model. During voice interaction, determining the semantic completeness of the text corresponding to the voice instruction in the audio signal then makes it possible to determine whether the user intends to continue speaking.

With reference to the third aspect, in some implementation manners of the third aspect, the method further includes: acquiring a text corpus, where the text corpus includes multiple texts with complete semantics; and determining the text data set according to the text corpus.

In the embodiments of the present application, the text data set is determined according to the text corpus, so only texts with complete semantics need to be prepared. This reduces the number of texts that must be prepared to build the text data set and simplifies the process of building it.

With reference to the third aspect, in some implementations of the third aspect, determining the text data set according to the text corpus may include: determining one or more fourth texts according to each text with complete semantics in the text corpus; and determining the text data set according to the multiple fourth texts determined from the multiple texts with complete semantics in the text corpus.

In the embodiments of the present application, determining one or more fourth texts according to the texts with complete semantics in the text corpus can simplify the process of determining the semantic completeness of the one or more fourth texts, thereby simplifying the determination and labeling of the first information.

With reference to the third aspect, in some implementations of the third aspect, the method further includes: determining a dictionary tree according to the text corpus, the dictionary tree including a plurality of nodes; and determining the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree.

Exemplarily, according to the node in the dictionary tree, the fourth text corresponding to the node can be determined, for example, the fourth text can be the text ending with the node. Optionally, the semantic completeness of the fourth text corresponding to the node may be determined according to the number of child nodes of the node in the trie.

In the embodiment of the present application, by determining the dictionary tree, the number of child nodes of the node can be determined, thereby determining the semantic completeness of the fourth text corresponding to the node in the text data set, thereby improving the efficiency of determining the semantic completeness of the fourth text.

With reference to the third aspect, in some implementations of the third aspect, determining the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree includes: determining the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree and the tail node label corresponding to the text with complete semantics.

In the embodiments of the present application, the semantic completeness of the fourth text can be determined at a finer granularity from the number of child nodes in the dictionary tree and the tail node label, so that a more accurate prediction model can be obtained through training.
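The labeling idea (build a dictionary tree from complete texts, then score each prefix from its child count and tail-node label) can be sketched as follows. Tokenizing on whitespace and the scoring rule in `completeness` are hypothetical choices made for illustration; the application states only that the child-node count and the tail node label are used.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    is_tail: bool = False  # marks the end of a text with complete semantics


def build_trie(corpus: List[str]) -> TrieNode:
    """Insert every complete text of the corpus, one token per node."""
    root = TrieNode()
    for text in corpus:
        node = root
        for token in text.split():
            node = node.children.setdefault(token, TrieNode())
        node.is_tail = True
    return root


def completeness(node: TrieNode) -> float:
    """Hypothetical scoring rule, chosen only for illustration."""
    if not node.is_tail:
        return 0.0                          # a prefix that is never a full text
    return 1.0 / (1 + len(node.children))   # fewer continuations, more complete


def label_prefixes(root: TrieNode) -> List[Tuple[str, float]]:
    """Return (fourth text, first information) pairs, one per trie node."""
    out: List[Tuple[str, float]] = []
    stack: List[Tuple[TrieNode, List[str]]] = [(root, [])]
    while stack:
        node, tokens = stack.pop()
        if tokens:
            out.append((" ".join(tokens), completeness(node)))
        for tok, child in node.children.items():
            stack.append((child, tokens + [tok]))
    return out
```

Each returned pair is a candidate training example: the prefix text plays the role of a fourth text and the score plays the role of its first information.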

In a fourth aspect, a device for training a prediction model for voice interaction is provided, the device including an acquisition module and a training module, where the acquisition module can be used to acquire a text data set, the text data set includes a plurality of fourth texts, the fourth texts are marked with first information, and the first information can be used to represent the semantic completeness of the fourth text; and the training module can be used to perform model training according to the text data set to obtain a prediction model, the prediction model being used to predict the semantic completeness of a voice instruction.

With reference to the fourth aspect, in some implementations of the fourth aspect, the acquisition module can also be used to acquire a text corpus, which can include multiple texts with complete semantics; the device can also include a processing module, which can be used to determine the text data set according to the text corpus.

With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module is specifically configured to: determine one or more fourth texts according to each text with complete semantics in the text corpus; and determine the text data set according to the multiple fourth texts determined from the multiple texts with complete semantics in the text corpus.

With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module may also be used to: determine a dictionary tree according to the text corpus, the dictionary tree including a plurality of nodes; and determine the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree.

Exemplarily, according to the node in the dictionary tree, the fourth text corresponding to the node can be determined, for example, the fourth text can be the text ending with the node. Optionally, the semantic completeness of the fourth text corresponding to the node may be determined according to the number of child nodes of the node in the dictionary tree.

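With reference to the fourth aspect, in some implementations of the fourth aspect, the processing module may also be used to: determine the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree and the tail node label corresponding to the text with complete semantics.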

In a fifth aspect, a device is provided, the device including a processor and a memory, where the memory is used to store program instructions and the processor is used to call the program instructions to execute the method in the first aspect or in any possible implementation of the first aspect. The device can be provided in various devices or systems capable of voice endpoint detection, such as voice interaction systems, voice recognition systems, voice assistants, or smart speakers; for example, it can be provided in terminal devices such as mobile terminals, vehicle-mounted terminals, or wearable devices, or it can be a computer, a host, a server, or another device with computing capabilities. The device can also be a chip.

In a sixth aspect, a device is provided, the device including: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is used to execute the method in the third aspect or in any possible implementation of the third aspect. The device can be a computer, a host, a server, or another device with computing capabilities. The device can also be a chip.

In a seventh aspect, a terminal device is provided, and the terminal device may include the device in the second aspect or any possible implementation of the second aspect, or the device in the fifth aspect or any possible implementation of the fifth aspect.

Exemplarily, the terminal device may specifically include one or more of a computer, a smart phone, a tablet computer, a personal digital assistant (PDA), a wearable device, a smart speaker, a TV, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as a head unit or an on-board computer), a robot, or a similar device.

With reference to the seventh aspect, in some implementation manners of the seventh aspect, the terminal device may be a mobile phone or a vehicle.

The device in any one of the second to seventh aspects, or in any possible implementation of any of those aspects, may be a vehicle-mounted chip, a vehicle-mounted device (such as a head unit or an on-board computer), or a car. The car in the embodiments of the present application can be understood as a vehicle, and the solution proposed in the embodiments of the present application can also be applied to other vehicles or devices.

In an eighth aspect, an electronic device is provided, and the electronic device may include the device in the fourth aspect or any possible implementation of the fourth aspect, or the device in the sixth aspect or any possible implementation of the sixth aspect.

With reference to the eighth aspect, in some implementation manners of the eighth aspect, the electronic device may be a cloud service device.

In a ninth aspect, a computer-readable medium is provided, where the computer-readable medium stores program code for execution by a device, and the program code includes instructions for performing the method in the first aspect or any possible implementation of the first aspect, or the method in the third aspect or any possible implementation of the third aspect.

In a tenth aspect, a computer program product containing instructions is provided, and when the computer program product is run on a computer, it causes the computer to execute the method in the first aspect or any possible implementation of the first aspect, or the method in the third aspect or any possible implementation of the third aspect.

Description of drawings

FIG. 1 is an application scenario of voice interaction provided by an embodiment of the present application.

FIG. 2 is a schematic diagram of a method for detecting a voice endpoint by using a voice activity detection technology provided in the present application.

FIG. 3 is a schematic flowchart of a method for training a prediction model for speech interaction provided by an embodiment of the present application.

FIG. 4 is a schematic diagram of a trie determined according to an exemplary text corpus provided by an embodiment of the present application.

FIG. 5 is a schematic diagram of an input format of a prediction model provided by the present application.

FIG. 6 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.

FIG. 7 is a schematic diagram of an audio signal in a voice interaction provided by an embodiment of the present application.

FIG. 8 is a schematic diagram of audio signals in another voice interaction provided by an embodiment of the present application.

FIG. 9 is a schematic diagram of a method for confirming audio frame classification provided by an embodiment of the present application.

FIG. 10 is a schematic diagram of audio signals in another voice interaction provided by an embodiment of the present application.

FIG. 11 is another schematic flowchart of the voice interaction method provided by the embodiment of the present application.

FIG. 12 is another schematic flowchart of the voice interaction method provided by the embodiment of the present application.

FIG. 13 is an exemplary schematic diagram of a user interface of a display screen provided by an embodiment of the present application.

FIG. 14 is a schematic structural diagram of a device for voice interaction provided by an embodiment of the present application.

FIG. 15 is a schematic structural diagram of an apparatus for training a speech interaction prediction model provided by an embodiment of the present application.

FIG. 16 is a structural example diagram of a device provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.

FIG. 1 is an application scenario of voice interaction provided by an embodiment of the present application. As shown in FIG. 1 , in this application scenario, a user and a user equipment may be included, and the user and the user equipment may perform voice interaction. User equipment can be devices that support voice interaction, such as vehicle-mounted terminals, smart phones, smart robots, and vehicles, or other devices that support voice interaction, such as smart speakers, smart home devices, smart TVs, desktop calculators, etc. For the sake of brevity, no more examples are given. Exemplarily, the device can perform voice recognition. It should be understood that the embodiment of the present application does not limit the type of the user equipment.

Optionally, one user may perform voice interaction with the user equipment, or multiple users may perform voice interaction with the user equipment, or use other user equipment to perform voice interaction with the user equipment, or multiple users simultaneously Voice interaction with multiple user devices. For example: the user can perform voice interaction with the user equipment with the voice recognition function through the microphone; the recorded audio can be played through the recorder, and the user equipment with the voice interaction function can collect, recognize and respond to the audio. It should be understood that the above manner for voice interaction with the user equipment is only an example for illustration, and is not limited in this embodiment of the present application.

Exemplarily, the user equipment may run an application program supporting voice interaction. For example, the application program may be a navigation application, a voice assistant, an intelligent question answering application, and the like. This embodiment of the present application does not limit it. Exemplarily, the user equipment may be a computer, a smart phone, a tablet computer, a personal digital assistant, a wearable device, a smart speaker, a TV, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as a vehicle machine, a vehicle computer) or One or more terminal devices in devices such as robots.

Exemplarily, the application scenario may further include a voice detection platform, which may provide background services for applications supporting voice interaction. For example, by training the model, the voice detection platform can obtain a prediction model, and the user equipment can obtain the prediction model trained by the voice detection platform, and the user equipment can use the prediction model to perform voice recognition and detect voice endpoints during voice interaction. etc., for the sake of brevity, no more examples are given here. It should be understood that the terminal device may also have the above functions, that is, voice interaction may be implemented without the background service provided by the voice detection platform, which is not limited in this embodiment of the present application.

In the process of voice interaction, voice endpoint detection can be used to determine when to respond to a voice command. When the user speaks, the voice start point and the voice end point can be determined by detecting audio endpoints, and the audio between the voice start point and the voice end point can be extracted as the voice command. Exemplarily, the voice interaction can be initiated by the user. For example, the voice interaction may be triggered in a push-to-talk manner, in which the user starts the voice interaction through a button, which can be physical or virtual. For another example, the voice interaction may be triggered by voice wake-up, in which the user starts the voice interaction by speaking a wake-up word. Therefore, the voice start point (also called the voice front endpoint) is relatively easy to detect accurately. Exemplarily, the voice interaction can also be initiated by the user equipment. For example, the user equipment asks the user for a decision after broadcasting information by voice (for example, "Warning: the left camera may be stained. Should it be cleaned automatically?"). The voice end point (also called the voice back endpoint) can be determined through automatic machine detection. For example, the voice end point can be detected based on voice activity detection technology.

Voice activity detection (VAD) technology can be used to detect whether a signal within a certain time window is a voice signal. Exemplarily, FIG. 2 is a schematic diagram of a method for detecting voice endpoints using the VAD technology provided in the present application. The audio signal is shown in (a) in FIG. 2, and (b) in FIG. 2 is the VAD output corresponding to the audio signal. For example, the audio signal in a time period where the VAD output is low can be determined as a non-speech signal (also called non-speech); conversely, audio signals in other time periods can be determined as speech signals.

Exemplarily, after a period of non-speech is detected based on voice activity detection, the voice endpoint can be determined, so that the acquired voice command can be responded to, for example, by performing the operation indicated by the voice command and/or ending the voice interaction. This period may be called the silence duration at the end of speech, and may be set to a fixed value. The silence duration at the end of speech is an important parameter of this detection method. For example, the silence duration at the end of speech may be 800 milliseconds (ms): when more than 800 ms of non-speech is detected by VAD, it may be determined that the speech has ended, and the voice endpoint is triggered. However, it is difficult to choose a fixed duration that suits all scenarios and environments. If the silence duration is set too long, the user experiences a longer delay; if it is set too short, the user's voice command is easily truncated. Even if different durations are set for different service types, a pause in the user's speech can still easily cause the voice command to be truncated.
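The fixed-duration scheme described above can be illustrated with a minimal Python sketch. It assumes the VAD output is available as one boolean per frame and that each frame covers 10 ms; the function name `find_endpoint` and both parameters are hypothetical and are not taken from the application. The sketch only illustrates the trade-off; it is not the method of the embodiments.

```python
from typing import Optional, Sequence


def find_endpoint(vad_flags: Sequence[bool],
                  frame_ms: int = 10,
                  silence_ms: int = 800) -> Optional[int]:
    """Return the time (ms) at which the endpoint fires, or None.

    vad_flags holds one VAD decision per frame (True = speech).
    frame_ms is an assumed frame length; silence_ms is the fixed
    silence duration at the end of speech.
    """
    needed = silence_ms // frame_ms  # consecutive non-speech frames required
    run = 0                          # length of the current non-speech run
    seen_speech = False              # ignore silence before any speech
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            seen_speech = True
            run = 0
        elif seen_speech:
            run += 1
            if run >= needed:
                return (i + 1) * frame_ms
    return None


# 500 ms speech, a 500 ms pause, 300 ms speech, then 1000 ms of silence.
flags = [True] * 50 + [False] * 50 + [True] * 30 + [False] * 100
print(find_endpoint(flags))                  # 2100: fires 800 ms after the last speech
print(find_endpoint(flags, silence_ms=400))  # 900: fires inside the pause
```

With the 800 ms setting the response is delayed until 800 ms after the user stops speaking, while the 400 ms setting cuts the command off during the pause, which is the trade-off a fixed silence duration cannot resolve.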

In the embodiment of the present application, the timer duration can be determined according to the text corresponding to the user's voice command, and the voice endpoint can be flexibly determined according to the timer and the second audio signal. This can avoid improper truncation of the voice command and the execution errors that such truncation causes, so that the method can be adapted to a wider range of scenarios and environments. Further, it can shorten the system response delay, improving both the response speed for voice commands and the user experience.

The application scenarios of voice interaction are exemplarily introduced above. The following exemplarily introduces a schematic flow of a method for detecting a voice endpoint in this application scenario.

Exemplarily, a prediction model may be used to detect a speech endpoint in a speech interaction, and the speech endpoint in this embodiment of the present application may refer to a speech end point.

The use of the prediction model can include the model training stage and the model prediction stage. In the model training stage, the relationship between the text and its semantic integrity can be obtained by training the prediction model, and a prediction model with high prediction accuracy can be obtained.

Exemplarily, FIG. 3 is a schematic flow chart of a method for training a prediction model for speech interaction provided by an embodiment of the present application. The method 200 may include:

S210. Acquire a text data set, where the text data set includes a plurality of texts, and the plurality of texts are marked with first information, and the first information may be used to indicate the semantic completeness of the text.

For convenience of description, a text marked with the first information is referred to as a fourth text; that is, the text data set includes multiple fourth texts and may be a collection of fourth texts. Semantic completeness indicates how complete the semantics of a text are. For example, text 1 "I want to play a game" has complete semantics and may be the text corresponding to a voice command issued by the user during voice interaction. Truncating text 1 at different positions yields text 2 "I want to play" and text 3 "I want". Although neither text 2 nor text 3 has complete semantics, the semantic completeness of text 2 is greater than that of text 3. For the sake of brevity, further examples are not given.

It should be understood that "the text data set includes a plurality of fourth texts" may mean that the text data set includes only a plurality of fourth texts, or that the text data set also includes texts other than fourth texts. This is not limited in the embodiment of the present application.

Exemplarily, when the user equipment runs an application program for voice interaction and the voice detection platform provides background services for that application program, the voice detection platform can obtain the text data set and use it to train the prediction model for voice interaction. The voice detection platform can train the prediction model through online training or offline training, which is not limited in this embodiment of the present application. For example, the voice detection platform may include an acquisition module and a training module. The acquisition module may be used to acquire the text data set, and the training module may be used to perform offline model training according to the text data set to obtain a prediction model; the user equipment can subsequently obtain the prediction model and determine the voice endpoint according to it. For another example, the user can participate in improving the prediction model: the voice detection platform can obtain, online, multiple voice instructions uploaded by the user through the user equipment, so that the text data set used for model training is continuously updated according to the user's voice commands and the resulting prediction model better matches the user's expression habits. For another example, the voice detection platform can include a chip, which can obtain the text data set and perform model training to obtain the prediction model.

Exemplarily, the user equipment may itself acquire the text data set and use it for model training to obtain a prediction model, so that the prediction model can be used for voice interaction; details are not described here for brevity. It should be understood that the foregoing is only an example for description, and this embodiment of the present application does not limit it.

For convenience of description, the following embodiments of the present application take as an example the case in which the voice detection platform acquires the text data set and performs model training. It should be understood that this application does not limit this.

Exemplarily, the text data set may be determined according to a text corpus, and the text corpus includes a plurality of texts with complete semantics. Optionally, the text corpus can be a collection of texts with complete semantics, that is, the texts in the text corpus can form complete sentences. For example, the acquisition module of the voice detection platform can obtain a text corpus that includes texts with complete semantics such as "turn on the music" and "I want to play a game". The voice detection platform can also include a processing module, which can determine the text data set according to the text corpus, so that the training module can perform model training according to the text data set. For another example, the voice detection platform can include a chip, and the chip can determine the text data set after obtaining the text corpus. For another example, the voice interaction scenario can also include a preprocessing system, which handles the preparation required before the voice detection platform trains the prediction model: the preprocessing system can obtain the text corpus and determine the text data set according to it, and the voice detection platform can train the prediction model after obtaining the text data set determined by the preprocessing system. It should be understood that the above description of the text corpus is just an example, and the method of obtaining the text corpus is not limited in this embodiment of the present application.

Exemplarily, determining the text data set according to the text corpus may mean determining one or more fourth texts of the text data set from each text with complete semantics in the text corpus. The multiple texts with complete semantics in the text corpus thus yield a plurality of fourth texts, from which the text data set can be determined.

Exemplarily, a text in the text corpus can be divided into one or more nodes, and one node can include one word or character. For example, the preprocessing system can divide the text "打开音乐" ("turn on the music") into the four nodes "打", "开", "音" and "乐". Further, the last node of a text may indicate the end of the text. For example, the last node "乐" of the text "打开音乐" may be used to indicate that this text, which has complete semantics, ends at this node; that is, the node may include a tail node tag of the text, which may be denoted T_tail. The tail node tag T_tail may indicate that the text ending with the node has complete semantics and belongs to the text corpus. It should be understood that the above description of the text corpus is just an example, which is not limited in the present application.

Exemplarily, one or more fourth texts belonging to the text data set can be determined from a text with complete semantics in the text corpus, based on the nodes into which it is divided. Since the text corpus includes multiple texts with complete semantics, a plurality of fourth texts can be determined in this way, and the text data set including them can thus be determined. Exemplarily, the first node of a text in the text corpus can be used as the starting node, and each node of the text in turn (the first node and every subsequent node) can be used as the last node, yielding one or more fourth texts. For example, the text "打开空调" ("turn on the air conditioner") in the text corpus can be divided into the nodes "打", "开", "空" and "调". With node "打" as the starting node and "打", "开", "空" and "调" respectively as the last node, four fourth texts belonging to the text data set can be determined, namely "打", "打开", "打开空" and "打开空调", and the semantic completeness of these fourth texts can then be determined. For the sake of brevity, details are not described here. It should be understood that the above method for determining one or more fourth texts of the text data set is just an example, which is not limited in the present application.
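As a minimal Python sketch of this step, the function below enumerates every prefix of a text that has already been divided into nodes, and records whether the prefix ends at the tail node. The function name `prefix_texts` is hypothetical, and the character-level segmentation of "打开空调" is our reading of the translated node names rather than something the application states verbatim.

```python
from typing import List, Sequence, Tuple


def prefix_texts(nodes: Sequence[str]) -> List[Tuple[str, bool]]:
    """Return (fourth text, ends_at_tail_node) for each prefix of `nodes`.

    The first node is always the starting node; each node in turn
    serves as the last node of one fourth text.
    """
    return [("".join(nodes[:k]), k == len(nodes))
            for k in range(1, len(nodes) + 1)]


# Assumed segmentation of "turn on the air conditioner" into four nodes.
for text, is_tail in prefix_texts(["打", "开", "空", "调"]):
    print(text, is_tail)
```

Only the last prefix ends at the tail node, so only it is known from this text alone to have complete semantics.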

Optionally, to make it easier to determine the text data set according to the text corpus, a plurality of fourth texts may be determined from a dictionary tree (trie) built from the text corpus, so as to determine the text data set. For example, after the texts in the text corpus are divided into one or more nodes, an entity such as the preprocessing system, the processing module of the voice detection platform, or a chip can determine the dictionary tree from the divided nodes and then determine the fourth texts and their semantic completeness according to the dictionary tree.

Exemplarily, the text corpus includes a plurality of texts with complete semantics, each of which can be divided into one or more nodes. Determining a dictionary tree according to the text corpus may therefore mean determining a plurality of nodes from the texts with complete semantics in the text corpus and determining the dictionary tree from those nodes. Exemplarily, FIG. 4 is a schematic diagram of a dictionary tree determined according to an exemplary text corpus provided by an embodiment of the present application, where the text corpus includes five texts with complete semantics: "I want to play a game", "make a phone call", "turn on the music", "turn on the air conditioner" and "turn on the air conditioner heating". In the dictionary tree shown in Figure 4, the root node 262 may not include any word or character, and each node other than the root node 262 may contain only one character; for example, node 263 may contain the character "打" and node 264 may contain the character "开". Further, by concatenating in order the nodes on the path from the root node to a given node, the text having that node as its last node (that is, the text ending with that node) can be obtained, and this text can be regarded as the text corresponding to the node. For convenience of explanation, the rest of this embodiment takes the text ending with a node as the text corresponding to that node; that is, "the text corresponding to the node" below can be replaced by "the text ending with the node".
For example, in the dictionary tree shown in Figure 4, following the path from the root node 262 to node 266 "调" ("tune"), it can be determined that the text ending with node 266 is "turn on the air conditioner"; following the path from the root node 262 to node 268 "热" ("hot"), it can be determined that the text corresponding to node 268 is "turn on the air conditioner heating". Optionally, a node may include a tail node tag T_tail. For example, as shown in Figure 4, the texts "turn on the music" and "turn on the air conditioner", corresponding to node "乐" and node 266 "调" respectively, have complete semantics and are included in the text corpus corresponding to Figure 4, so these nodes may include tail node tags. For the sake of brevity, further examples are not given here. It should be understood that the above method of determining the dictionary tree according to the text corpus is just an example, and this application does not limit it.

Further, based on the dictionary tree, a text data set can be determined. For example, the dictionary tree shown in Figure 4, determined according to the text corpus {"I want to play a game", "make a phone call", "turn on the music", "turn on the air conditioner", "turn on the air conditioner heating"}, contains 16 nodes: the root node 262 and 15 other nodes such as "我" ("I"), "要" ("want") and "打" ("play"). Taking the root node 262 as the start of the text and each of the other nodes as the last node of a text, 15 fourth texts such as "我", "我要" and "我要打" can be determined, and a text data set including these 15 fourth texts can be formed. For the sake of brevity, further examples are not given here.
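The construction can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the five corpus strings are our reconstruction of the Chinese originals behind the translated Figure 4 texts, one node is taken to be one character, and the names `TrieNode`, `build_trie` and `collect` are hypothetical. The sketch records, for each fourth text, how many nodes lie below its last node (the worked numbers in this description, such as 6 for node 264, correspond to counting all nodes below a node) and whether that node carries the tail node tag.

```python
class TrieNode:
    """One character of the dictionary tree; the root holds no character."""

    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_tail = False  # stands in for the tail node tag T_tail


def build_trie(corpus):
    root = TrieNode()
    for text in corpus:
        node = root
        for ch in text:  # one node per character
            node = node.children.setdefault(ch, TrieNode())
        node.is_tail = True  # a text with complete semantics ends here
    return root


def collect(node, prefix, out):
    """Record (nodes below, is_tail) for every text below `node`.

    Returns the number of nodes in the subtree below `node`.
    """
    total = 0
    for ch, child in node.children.items():
        text = prefix + ch
        below = collect(child, text, out)
        out[text] = (below, child.is_tail)
        total += 1 + below
    return total


# Assumed Chinese originals of the five Figure 4 texts.
CORPUS = ["我要打游戏", "打电话", "打开音乐", "打开空调", "打开空调制热"]

fourth_texts = {}
collect(build_trie(CORPUS), "", fourth_texts)
print(len(fourth_texts))        # 15 fourth texts, one per non-root node
print(fourth_texts["打开"])      # (6, False)
print(fourth_texts["打开空调"])  # (2, True)
```

Because shared prefixes such as "打开" are stored once, each fourth text appears exactly once in the result even when several corpus texts begin with it.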

Exemplarily, the trie determined according to the text corpus may include multiple nodes, and according to the number of child nodes of the nodes in the trie, the semantic completeness of the fourth text in the text data set may be determined.

Exemplarily, according to the number of child nodes of a node in the dictionary tree, the semantic completeness of the text corresponding to the node can be determined. Exemplarily, when the number of child nodes of a node is 0, the text corresponding to the node may have complete semantics. For example, as shown in Figure 4, node 264 "开" ("open") has 6 child nodes (counting every node below it), including nodes "音" and "乐" and node 265 "空" ("empty"), and node 265 "空" has 3 child nodes. The texts corresponding to nodes 264 and 265 do not have complete semantics and cannot form complete sentences. Moreover, the semantic completeness of the text corresponding to node 264 is less than that of the text corresponding to node 265. For another example, node 268 "热" ("hot") has 0 child nodes, and the text corresponding to this node, "turn on the air conditioner heating", has complete semantics and can form a complete sentence.

Further, according to the semantic completeness of the fourth text in the text data set, the fourth text may be marked with first information, and the first information may be used to indicate the semantic completeness of the text.

Exemplarily, the semantic completeness of the text corresponding to a node can be determined according to the number of child nodes of the node in the dictionary tree, so that the semantic completeness of each fourth text in the text data set can be determined. That is to say, the first information of the fourth text corresponding to a node can be determined according to the number of child nodes of that node, and the corresponding first information can be marked on the fourth text.

It should be understood that the above method of determining the text data set based on the dictionary tree according to the text corpus is just an example, which is not limited in the present application.

Optionally, the first information may represent the semantic completeness numerically. Exemplarily, for convenience of characterization and statistics, the number of child nodes of each node in the dictionary tree can be mapped to the interval [0,1] to generate first frequency information for the node. The first frequency information reflects the node's number of child nodes and can be used as the first information of the text corresponding to the node; that is, the first frequency information is a numerical representation of the first information and of the semantic completeness. For example, as shown in Figure 4 (counting every node below a node as its child node), the number of child nodes of nodes such as "戏" ("play") and "话" ("talk") is 0; that of nodes such as "游" ("you") and "电" ("electricity") is 1; that of node 269 "打" ("play") and node 266 "调" ("tune") is 2; that of node "要" ("want") and node 265 "空" ("empty") is 3; that of node "我" ("I") is 4; that of node 264 "开" ("open") is 6; and that of node 263 "打" is 9, for a total of 15 nodes. The number of child nodes of node "我" is not 0, and the corresponding text does not have complete semantics. According to cumulative distribution statistics, 13 nodes have a number of child nodes less than or equal to that of node "我". Therefore, the first frequency information of node "我" can be 13/15, or about 0.867, and can be used as the first information of the fourth text "我" corresponding to that node. As another example, 14 nodes have a number of child nodes less than or equal to that of node 264 "开", so the first information of the text "打开" corresponding to node 264 can be 14/15, or about 0.933. The preprocessing system, the processing module of the voice detection platform, or the like can then mark the corresponding first information on the fourth text. It should be understood that the above method for determining the first information of the fourth text is only an example, and other methods may also be used to determine the first information according to the number of child nodes, which is not limited in this embodiment of the present application.

Optionally, the first information may represent the semantic completeness in the form of a label.

Exemplarily, the first information may be a first label or a second label, where the first label may be used to indicate that the text has complete semantics and the second label may be used to indicate that the text does not have complete semantics. For example, as shown in Figure 4, the number of child nodes of node "乐" is 0, and the text "turn on the music" ending with this node has complete semantics, so its first information can be the first label; the text "打开空" ending with node 265 "空" ("empty") does not have complete semantics, so its first information can be the second label.

Exemplarily, the first information may also be a third label, which may be used to indicate that the text may have complete semantics in some contexts but not in others; that is, whether the text has complete semantics cannot be determined from the content of the text alone. For example, the text corpus corresponding to the dictionary tree shown in Figure 4 includes text A "turn on the air conditioner" and text B "turn on the air conditioner heating", both with complete semantics. The text "turn on the air conditioner" ending with node 266 "调" ("tune") can be the whole of text A, in which case it carries all the semantics of text A; it can also be a part of text B, in which case it represents only part of the semantics of text B and cannot represent the complete semantics of text B. Therefore, the first information of the fourth text "turn on the air conditioner" may be the third label. It should be understood that the above manner of determining the first information of the fourth text is only an example for illustration, and is not limited in this embodiment of the present application.

It should be understood that the first label, the second label, and the third label may be in any data format, such as numbers, letters, or character strings. For example, the first label can be "complete", the second label can be "other", and the third label can be "part". As shown in Figure 4, the text "turn on the air conditioner heating" ending with node 268 "热" ("hot") has complete semantics, so the first information of this text can be complete; for another example, the text "打开空" ending with node 265 "空" ("empty") does not have complete semantics, so its first information can be other. It should be understood that this is not limited in the embodiment of the present application.

For convenience of description, the rest of this embodiment takes complete as the first label, other as the second label, and part as the third label. That is to say, complete described later in this application can be replaced by the first label, other can be replaced by the second label, and part can be replaced by the third label.

Exemplarily, when the number of child nodes of a node is 0, that is, when the node has no child nodes, it may be determined that the first information of the text corresponding to the node is complete. For example, as shown in Figure 4, the number of child nodes of node 268 "热" ("hot") is 0, and the text "turn on the air conditioner heating" ending with this node has complete semantics, so the first information of this text can be complete. For the sake of brevity, details are not repeated here.

Optionally, the semantic completeness of the text corresponding to a node may be determined according to both the number of child nodes of the node and its tail node tag, so as to determine the first information to be marked.

Exemplarily, when the number of child nodes of a node is not 0 and the node includes a tail node tag, the first information of the text corresponding to the node may be determined as the third label. For example, as shown in Figure 4, the number of child nodes of node 266 "调" ("tune") is 2, and this node is also the tail node of the text "turn on the air conditioner", which has complete semantics; that is, the node includes a tail node tag, so the first information of the corresponding text can be determined to be part. For the sake of brevity, details are not described here. Optionally, when a node includes a tail node tag, the number of child nodes of the node can reflect the semantic completeness of the text ending with the node: the greater the number of child nodes, the lower the semantic completeness of the text.

Exemplarily, when the number of child nodes of a node is greater than 0 and the node does not include a tail node tag, the first information of the text corresponding to the node can be determined as the second label, which indicates that the text ending with the node does not have complete semantics. For example, as shown in Figure 4, the number of child nodes of node 265 "空" ("empty") is 3, its first frequency information is greater than 0, and the node has no tail node tag; the first information of the corresponding text is therefore determined to be other, indicating that the text "打开空" ending with node 265 does not have complete semantics. For the sake of brevity, further examples are not given here.
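The three labeling rules above can be summarized in a small Python sketch. The function name `first_label` and its parameters are hypothetical; the label strings follow the complete / part / other convention adopted in this description.

```python
def first_label(num_children: int, has_tail_tag: bool) -> str:
    """Return the label-form first information for the text ending at a node.

    num_children: number of child nodes of the node in the dictionary tree.
    has_tail_tag: whether the node carries the tail node tag T_tail.
    """
    if num_children == 0:
        return "complete"  # first label: no child nodes
    if has_tail_tag:
        return "part"      # third label: complete in some contexts only
    return "other"         # second label: not complete


print(first_label(0, True))   # node 268: "complete"
print(first_label(2, True))   # node 266: "part"
print(first_label(3, False))  # node 265: "other"
```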

Optionally, the first information may combine numbers and labels to characterize semantic completeness. Exemplarily, the first frequency information and the first, second, and third tags may be combined to characterize the semantic completeness of the text. For the sake of brevity, details are not repeated here. It should be understood that this is not limited in the embodiment of the present application.

Optionally, the first information of a fourth text may be adjusted according to the number of nodes included in the fourth text. Exemplarily, when a label combined with a number is used as the first information to characterize the semantic completeness of the text, the first information can be adjusted when the length of the text, that is, the number of nodes it includes, is less than or equal to a length threshold. For example, suppose the length threshold is 10 nodes and the text is "turn on the air conditioner" as shown in Figure 4. The first information of the text can be part and 8/15, where 8/15 is the first frequency information of node "调" ("tune"). Since the text includes 4 nodes, which is fewer than the length threshold, the number in the first information can be adjusted from 8/15 to 0.4. It should be understood that the above manner of adjusting the first information is only an example, and this embodiment of the present application does not limit it.

The frequency with which texts with complete semantics appear in the text corpus may differ from the way users actually speak; for example, users may use short sentences more often than long ones. In the embodiment of the present application, adjusting the first information of the fourth text can make the text data set better match the actual voice interaction process, so that a more accurate prediction model can be obtained after the model is trained.

In the embodiment of the present application, the text data set is determined according to the text corpus, so that only texts with complete semantics need to be prepared. In this way, the amount of text to be prepared when constructing the text data set can be reduced. Furthermore, one or more fourth texts can be determined from each text in the text corpus, and the process of determining the semantic completeness of the fourth texts can be simplified, thereby simplifying the process of determining and labeling the first information.

It should be understood that the text data set can also be obtained in other ways. For example, a corpus can be constructed directly that includes both texts with complete semantics and texts without complete semantics; these texts can be marked with the first information, from which the text data set can be determined. It should be understood that the present application does not limit the method for obtaining the text data set.

S220. Perform model training according to the text data set to obtain a prediction model, where the prediction model is used to predict the semantic integrity of the voice instruction.

Exemplarily, the predictive model may predict the semantic completeness of the text corresponding to the voice instruction to determine the completeness of the semantics of the voice instruction, thereby determining whether the user has the intention to continue speaking. Exemplarily, the prediction model can determine the first information of the text according to the input text, and determine whether the text has complete semantics according to the output first information, so as to determine whether the voice instruction corresponding to the text is a complete voice instruction , so as to determine whether the user intends to continue speaking.

The prediction model may be an artificial intelligence (AI) model. Specific types of the prediction model may include multiple types, for example, the prediction model may include at least one of a neural network, a support vector machine, a linear regression model, a logistic regression model, a decision tree, or a random forest. Exemplarily, the predictive model may be a neural network, for example, the predictive model may be a convolutional neural network or a recurrent neural network. It should be understood that the foregoing prediction model is only an example for illustration, and is not limited in this embodiment of the present application.

Alternatively, the prediction model can be a bidirectional encoder representations from transformers (BERT) model. The model input can be [CLS]+text+[SEP], where [CLS] is a special token indicating the start of a piece of text and [SEP] is a special token indicating the end of a piece of text. Exemplarily, Fig. 5 is a schematic diagram of the input format of a prediction model provided by the present application, where the prediction model can be a BERT model. At different points in time, the text to be predicted can be "open the sunroof", "open the sunroof to", "open the sunroof to 100 percent" or "open the sunroof to 60 percent", and each text can be divided into nodes such as "打", "开", "天" and "窗" (the characters of "open the sunroof"). For the sake of brevity, details are not repeated here.

Exemplarily, in the voice interaction process, as time goes by, the user instruction in the acquired audio signal gradually approaches completeness, so new streaming text results are continuously obtained. These streaming text results (also called streaming text) are input into the prediction model to determine the semantic completeness of the current text and thus whether the user intends to continue speaking. For example, as shown in Figure 5, if the voice interaction starts at the 0th second (s) and the text corresponding to the voice instruction in the audio signal acquired at the 2nd s is "open the sunroof to", the text can be input into the prediction model in the format shown in Figure 5, [CLS]+"open""open""day""window""to"+[SEP], to predict its semantic completeness, so that it can be determined that the user intends to continue speaking. For another example, if the text corresponding to the voice instruction in the audio signal acquired at the 5th s is "open the sunroof to 60 percent", the text can be input into the prediction model in the format shown in Figure 5 to predict its semantic completeness, so that it can be determined that the user does not intend to continue speaking and that the complete voice instruction issued by the user in the voice interaction is "open the sunroof to 60 percent". It should be understood that the above input to the BERT model is just an example, which is not limited in this embodiment of the present application.
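Exemplarily, the input format [CLS]+text+[SEP] described above can be sketched as follows. This is only a minimal illustration: the function name build_model_input is hypothetical, the text is assumed to be already split into tokens, and no particular tokenizer or BERT implementation is implied.

```python
from typing import List

CLS = "[CLS]"  # special token marking the start of a piece of text
SEP = "[SEP]"  # special token marking the end of a piece of text


def build_model_input(tokens: List[str]) -> List[str]:
    """Wrap already-split text tokens as [CLS] + text + [SEP] (hypothetical helper)."""
    return [CLS, *tokens, SEP]


# Tokens as listed in the example for the partial text "open the sunroof to".
partial_input = build_model_input(["open", "open", "day", "window", "to"])
print(partial_input)
```

Each new streaming text result would be wrapped in the same way before being passed to the prediction model.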

It should be understood that, for content related to the BERT model, reference may be made to related technologies; for the sake of brevity, it is not repeated in this application. In this embodiment of the present application, for example, the output of the model may be the first information of the text, which represents the semantic integrity of the text. For example, after the text is input to the BERT model in the format [CLS]+"open the sunroof to 60%"+[SEP] as shown in Figure 5, the first information of the text can be output. The first information may be 0, that is, the first information may represent the semantic completeness in the form of a number; or the first information may be complete, that is, the first information may also represent the semantic completeness in the form of a label, which is not limited in this application.

Exemplarily, the process of model training may include multiple implementation manners. In some embodiments, model training may include multiple iterations. One iteration can include the following steps:

S305. Input the fourth text in the text data set into the prediction model, process the text through the prediction model, and output a prediction result.

S310. Calculate a first loss value through a loss function according to the prediction result and the first information of the fourth text. The first loss value represents the deviation between the prediction result and the first information: the greater the deviation between the prediction result and the first information, the greater the first loss value.

S315. Adjust the parameters of the prediction model according to the first loss value.

The above shows one iteration of training. After each iteration, the voice detection platform can detect whether the training termination condition is currently met. When the training termination condition is not met, the next iteration is performed; when the training termination condition is met, the prediction model used in this iteration is output as the trained prediction model.

The training termination condition may be that the number of iterations reaches a target number, that the loss function satisfies a preset condition, or that the capability of the model, as verified on a verification data set, does not improve within a period of time. The target number can be a preset number of iterations that determines when training ends, so as to avoid wasting training resources. The preset condition can be that the value of the loss function remains unchanged or does not decrease for a period of time during training, which indicates that training has achieved its effect, that is, the prediction model is able to determine from the sentence text whether the user intends to continue speaking. The verification data set can be distinct from the text data set and can be used to evaluate the training effect.

It should be understood that the above manner of training the model is just an example, which is not limited in the present application.
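Exemplarily, the iteration S305 to S315 and two of the termination conditions (target number of iterations, loss no longer decreasing) can be sketched as follows. The sketch assumes a logistic regression model, one of the model types listed above, with a single toy feature; the cue-word list, the sample texts, the hyperparameters, and all function names are hypothetical and serve only to make the loop runnable. The first information is encoded as 0 for complete semantics and 1 for incomplete semantics.

```python
import math
from typing import List, Tuple

# Hypothetical cue words: a text ending in one of them is treated as unfinished.
DANGLING_WORDS = {"to", "the", "and", "at"}


def featurize(text: str) -> List[float]:
    """Bias term plus one toy feature: does the text end in a dangling word?"""
    words = text.split()
    return [1.0, 1.0 if words and words[-1] in DANGLING_WORDS else 0.0]


def predict(weights: List[float], text: str) -> float:
    """Predicted incompleteness score in (0, 1); values near 0 mean complete semantics."""
    z = sum(w * x for w, x in zip(weights, featurize(text)))
    return 1.0 / (1.0 + math.exp(-z))


def train(dataset: List[Tuple[str, float]], target_iterations: int = 2000,
          learning_rate: float = 0.5, patience: int = 20,
          min_delta: float = 1e-6) -> List[float]:
    weights = [0.0, 0.0]
    best_loss, stale = float("inf"), 0
    for _ in range(target_iterations):  # termination: target number of iterations
        loss, grads = 0.0, [0.0, 0.0]
        for text, first_information in dataset:
            p = predict(weights, text)  # S305: prediction result
            p_safe = min(max(p, 1e-12), 1.0 - 1e-12)
            # S310: cross-entropy loss grows with the deviation between the
            # prediction result and the first information.
            loss -= (first_information * math.log(p_safe)
                     + (1.0 - first_information) * math.log(1.0 - p_safe))
            for i, x in enumerate(featurize(text)):
                grads[i] += (p - first_information) * x
        loss /= len(dataset)
        # S315: adjust the parameters according to the loss (one gradient step).
        weights = [w - learning_rate * g / len(dataset)
                   for w, g in zip(weights, grads)]
        if best_loss - loss > min_delta:
            best_loss, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:  # termination: loss has stopped decreasing
                break
    return weights


# Hypothetical text data set: (text, first information).
TEXT_DATA_SET = [
    ("open the sunroof", 0.0), ("open the sunroof to", 1.0),
    ("turn on the radio", 0.0), ("set the temperature to", 1.0),
    ("close the window", 0.0), ("navigate to the", 1.0),
]
trained_weights = train(TEXT_DATA_SET)
```

A verification data set could be checked inside the same loop to implement the third termination condition; it is omitted here to keep the sketch short.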

The embodiment of the present application provides a method for training a predictive model for speech detection. Model training can be performed according to the acquired text data set, so as to obtain a predictive model. The prediction model can learn the relationship between the text and its semantic integrity from the semantic integrity of the text in the text dataset through the training process, so that in the model prediction stage, the semantic integrity of the text to be analyzed can be predicted based on the prediction model. Therefore, during the voice interaction process, by determining the semantic integrity of the text corresponding to the voice command in the audio signal, it can be determined whether the user has the intention to continue speaking, and the voice command for voice interaction can be accurately determined to respond.

Exemplarily, FIG. 6 is a schematic flow chart of a voice interaction method provided by an embodiment of the present application, and the method 400 may include steps S410 to S460.

S410. Acquire a first audio signal, where the first audio signal includes a first voice instruction.

Wherein, the first audio signal may be an audio signal used to determine a voice endpoint, and the first voice instruction may be a voice instruction included in the first audio signal.

Exemplarily, when the user equipment supports voice interaction, the user equipment may acquire audio signals in the voice interaction. For example, taking voice interaction between a human and a vehicle as an example, the vehicle may include a voice interaction device, the voice interaction device may include an acquisition module and a processing module, and the acquisition module may acquire the first audio signal. For another example, the vehicle may include one or more processors, which may be used to execute the method 400 and may acquire the first audio signal. For another example, the vehicle may include a chip for voice interaction, which may be used to execute the method 400 and may acquire the first audio signal. For the sake of brevity, no further examples are given here. It should be understood that the above description of the scenario is only an example, which is not limited in this embodiment of the present application.

Exemplarily, during the voice interaction process, the user equipment may continuously acquire audio signals, and when performing voice endpoint detection, part or all of the audio signals acquired during the voice interaction may be used as the first audio signal, wherein the acquired audio signal includes a voice instruction. For example, in a voice interaction starting from time 0, the audio signal can be acquired continuously from time 0 until the end of the voice interaction, and the audio signal includes voice instructions. For the audio signal acquired between time 0 and time 3, part or all of it can be used as the first audio signal to determine the voice endpoint; for example, the audio signal between time 1 and time 3, or the audio signal between time 0 and time 3, can be used as the first audio signal. If the voice endpoint can be determined according to the first audio signal, the voice endpoint can be used as the end time of the voice interaction. If the voice endpoint cannot be determined according to the first audio signal, that is, the voice interaction is not over, the audio signal continues to be acquired, and voice endpoint detection can be performed again, for example at time 5; in that case, part or all of the audio signal between time 0 and time 5 can be used as a new first audio signal to determine the voice endpoint again. It should be understood that time 0<time 1<time 3<time 5, that is, time 0 is the earliest and time 5 is the latest.

It should be understood that the above method for acquiring the first audio signal is only an example for ease of description, and is not limited in this embodiment of the present application.

S420. Determine the duration of the first timer according to the first text corresponding to the first voice instruction.

It should be understood that, before the duration of the first timer is determined, the first text corresponding to the first voice instruction may be acquired.

Exemplarily, the first text is the text corresponding to the first voice instruction, and may be obtained by performing speech recognition on the voice instruction in the first audio signal. For example, taking voice interaction between a human and a vehicle as an example, the vehicle may include a voice interaction device, and the voice interaction device may include an acquisition module and a processing module. In the voice interaction, the voice instruction issued by the user is "dǎkāitiānchuāng" (that is, "open the sunroof"). The acquisition module can acquire the first audio signal including the voice instruction, the processing module can determine through speech recognition of the first audio signal that the first text corresponding to the voice instruction is "open the sunroof", and the duration of the first timer can be determined according to the first text. For another example, the vehicle can also include a voice recognition device, which can perform speech recognition on the audio signal to obtain the first text corresponding to the voice instruction in the audio signal, and the processing module can determine the duration of the first timer according to the first text. The voice recognition device can also be located inside the voice interaction device, that is, it can also be embodied as a voice recognition module of the voice interaction device, which is not limited in this embodiment of the present application. For the sake of brevity, no further examples are given here.

Exemplarily, during the voice interaction process, the audio signal may be acquired continuously, a streaming text result may be obtained through automatic speech recognition, and the first text may be determined according to the streaming text result. For example, in a voice interaction starting from time 0, audio signals can be acquired continuously and automatic speech recognition can be performed on them. At time 1, the acquired audio signal does not contain any voice instruction, and the streaming text result is empty. At time 2, the streaming text result is "open". At time 3, the real-time streaming text result is "open the sunroof". When voice endpoint detection is triggered at time 3, the streaming text result at that time, "open the sunroof", can be used as the first text; correspondingly, the audio signal between time 0 and time 3, or between time 1 and time 3, can be used as the first audio signal. That is to say, the first text may first be determined according to the streaming text result, and the corresponding first audio signal may then be determined from the acquired audio signals based on the first text. For another example, the streaming text result may include a timestamp. After the first text is determined according to the streaming text result, the corresponding first audio signal may be determined from the acquired audio signals based on the timestamp, so as to avoid the effects of excessive speech recognition processing time and delay. It should be understood that time 0<time 1<time 2<time 3, that is, time 0 is the earliest and time 3 is the latest.
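Exemplarily, selecting the first text from timestamped streaming text results, and deriving the span of the first audio signal from those timestamps, can be sketched as follows. The data structure StreamingResult, the function names, and the use of seconds as the time unit are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StreamingResult:
    timestamp: float  # hypothetical unit: seconds since the interaction started
    text: str         # cumulative streaming text; empty if nothing was recognized


def first_text_at(results: List[StreamingResult],
                  trigger_time: float) -> Optional[StreamingResult]:
    """Latest non-empty streaming result at or before the detection trigger."""
    candidates = [r for r in results if r.timestamp <= trigger_time and r.text]
    return max(candidates, key=lambda r: r.timestamp) if candidates else None


def first_audio_span(results: List[StreamingResult], trigger_time: float,
                     interaction_start: float = 0.0) -> Optional[Tuple[float, float]]:
    """Span from the last still-empty result before speech to the first text."""
    chosen = first_text_at(results, trigger_time)
    if chosen is None:
        return None
    start = interaction_start
    for result in sorted(results, key=lambda r: r.timestamp):
        if result.text:
            break
        start = result.timestamp
    return (start, chosen.timestamp)


# The example above: empty at time 1, "open" at time 2, full text at time 3.
stream = [
    StreamingResult(1.0, ""),
    StreamingResult(2.0, "open"),
    StreamingResult(3.0, "open the sunroof"),
]
```

With this stream and a trigger at time 3, the first text is "open the sunroof" and the span corresponds to the audio signal between time 1 and time 3; passing the whole interval from time 0 instead would equally match the description.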

It should be understood that the above manner of obtaining the first text is only an example for illustration, and is not limited in this embodiment of the present application.

Exemplarily, due to the processing time of the speech recognition technology, the time for obtaining the first text may be equal to or later than the time for obtaining the first audio signal, which is not limited in this embodiment of the present application.

Exemplarily, the duration of the first timer may be determined according to the prediction model trained by the method 200. For example, the first text can be input into the prediction model, the first information of the first text can be obtained from the prediction model, and the duration of the first timer can be determined according to the first information. For another example, the first text can first be adjusted to the input format required by the prediction model and then input into the prediction model, so as to determine the duration of the first timer. The first timer can be used to determine the voice endpoint.

Because the prediction model is trained on the text data set in the model training stage and learns the mapping relationship between text and its semantic integrity, in step S420 the prediction model can recognize the first text based on the learned mapping relationship, determine the semantic integrity of the first text, and thereby determine the first information of the first text, so as to determine whether the user intends to continue speaking.

Optionally, the duration of the first timer may be determined according to the first information of the first text. For example, taking voice interaction between a human and a vehicle as an example, the vehicle may include a voice interaction device, and the voice interaction device may include an acquisition module and a processing module. After the first text is determined, the processing module may input the first text into the prediction model obtained by training in method 200 and obtain the first information of the first text, which represents the semantic integrity of the first text. Further, the processing module can determine the duration of the first timer according to the first information.

Exemplarily, when the first information represents the semantic integrity in the form of a label, the duration of the first timer may be determined according to the label. For example, suppose the first, second, and third labels are complete, other, and part respectively. When the first information is complete, the first text is determined according to the prediction model to have complete semantics, and it can be considered that the user does not intend to continue speaking; therefore, a shorter duration of the first timer (such as 400ms) can be set to reduce the delay of the voice interaction process and improve user experience. When the first information is other, the first text is determined according to the prediction model not to have complete semantics, and it can be considered that the user intends to continue speaking; therefore, a longer duration of the first timer (such as 1500ms) can be set to avoid cutting off the user's voice prematurely and the voice instruction execution errors this may cause, thereby accommodating the user's speaking habits, such as slow speech or frequent pauses. When the first information is part, the first text is determined according to the prediction model to have relatively complete semantics; therefore, a moderate duration of the first timer (for example, 800ms) can be set, balancing delay against premature truncation of the user's voice to provide a better user experience. For the sake of brevity, no further examples are given.
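Exemplarily, the label-based choice of the timer duration can be sketched as a simple lookup. The durations are the example values given above; the function name and the fallback to the longest duration for an unknown label are assumptions.

```python
# Example durations from the description, in milliseconds.
LABEL_TO_DURATION_MS = {
    "complete": 400,   # complete semantics: respond quickly
    "part": 800,       # relatively complete semantics: moderate wait
    "other": 1500,     # incomplete semantics: wait longer for more speech
}


def timer_duration_from_label(label: str) -> int:
    """Duration of the first timer for a label-form first information.

    Unknown labels fall back to the longest duration, which is an assumption
    chosen here to avoid cutting the user off prematurely.
    """
    return LABEL_TO_DURATION_MS.get(label, 1500)
```

For instance, timer_duration_from_label("complete") yields 400 and timer_duration_from_label("other") yields 1500.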

Exemplarily, when the first information represents the semantic completeness in the form of a number, the duration of the first timer may be determined according to the number. For example, if the first information determined according to the prediction model is 0, it can be considered that the first text has complete semantics, so a shorter duration of the first timer (such as 400ms) can be set to reduce the delay of the voice interaction process and improve user experience. If the first information is greater than or equal to the second threshold (for example, 0.4), for example when the first frequency information of the first text obtained according to the prediction model is 0.58, it can be considered that the first text does not have complete semantics, so a longer duration of the first timer (such as 1500ms) can be set. When the first information is greater than 0 and less than the second threshold, it can be considered that the first text has relatively complete semantics, so a moderate duration of the first timer can be set to provide a better user experience. It should be understood that the above method of determining the duration of the first timer according to the first information is just an example, and this application does not limit it.
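Exemplarily, the number-based choice can be sketched with the example second threshold of 0.4. The function name is hypothetical, the moderate duration of 800ms is borrowed from the label-based example as an assumption (no value is stated for the numeric case), and values at or below 0 are treated as complete semantics.

```python
def timer_duration_from_number(first_information: float,
                               second_threshold: float = 0.4) -> int:
    """Duration of the first timer (ms) for a number-form first information."""
    if first_information <= 0:
        return 400    # complete semantics: shorter timer
    if first_information >= second_threshold:
        return 1500   # incomplete semantics: longer timer
    return 800        # relatively complete semantics; 800 ms is an assumed value
```

With the example value 0.58, which is not less than the second threshold 0.4, the sketch returns 1500.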

Exemplarily, the first information may combine a label and a number to characterize the semantic integrity, and the duration of the first timer may be determined accordingly. Exemplarily, the duration of the first timer may be determined according to the first frequency information in combination with the label. For example, when the label in the first information is part, if the first frequency information is 0.3, the duration of the first timer can be set to 1200ms; if the first frequency information is 0.05, the duration of the first timer can be set to 500ms. In this way, the duration of the first timer can be set at a finer granularity, which better balances delay against premature truncation of the voice and can provide a better experience for the user.
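Exemplarily, one possible way of combining the label with the first frequency information is sketched here. The application only gives two example points for the label part (0.05 corresponds to 500ms and 0.3 corresponds to 1200ms); the linear interpolation between these points, the clamping outside them, and the function name are assumptions and are not stated in this application.

```python
def timer_duration_combined(label: str, first_frequency_information: float) -> int:
    """Duration of the first timer (ms) from a label plus a number (sketch)."""
    if label == "complete":
        return 400
    if label == "other":
        return 1500
    if label != "part":
        raise ValueError(f"unknown label: {label!r}")
    # Example points for the label "part": 0.05 -> 500 ms and 0.3 -> 1200 ms.
    low_score, low_ms = 0.05, 500
    high_score, high_ms = 0.3, 1200
    # Assumed behaviour: clamp to the example range, then interpolate linearly.
    clamped = min(max(first_frequency_information, low_score), high_score)
    fraction = (clamped - low_score) / (high_score - low_score)
    return round(low_ms + fraction * (high_ms - low_ms))
```

The sketch reproduces both example points and yields intermediate durations for values between them.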

It should be understood that the above method of determining the duration of the first timer according to the first information is only an example for ease of description, and this application does not limit it.

It should be understood that the above method of determining the duration of the first timer based on the prediction model is just an example, and other methods may also be used to determine the duration of the first timer.

Exemplarily, after the first text is acquired, the duration of the first timer may be determined by querying a database. The database may include multiple texts and the duration of the first timer corresponding to the multiple texts. For example, the database may include the text of common sentences in voice interaction, and after the first text is determined, the processing module may determine the duration of the first timer according to the matching between the first text and the text in the database. For the sake of brevity, examples are not given here.
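Exemplarily, the database query can be sketched as an in-memory mapping from common sentences to timer durations. The entries, the durations, the whitespace and case normalization, and the function name are all hypothetical; an actual database and matching rule may differ.

```python
from typing import Dict, Optional

# Hypothetical database: common voice-interaction sentences and the duration
# of the first timer (in ms) associated with each of them.
COMMON_SENTENCE_DURATIONS_MS: Dict[str, int] = {
    "open the sunroof": 400,
    "open the sunroof to": 1500,
}


def lookup_timer_duration(first_text: str,
                          database: Dict[str, int] = COMMON_SENTENCE_DURATIONS_MS,
                          default_ms: Optional[int] = None) -> Optional[int]:
    """Match the first text against the database after light normalization."""
    key = " ".join(first_text.lower().split())
    return database.get(key, default_ms)
```

When no entry matches, the sketch returns default_ms, so that the caller can fall back to another method such as the prediction model.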

Exemplarily, after the first text is obtained, punctuation marks can be added to the first text according to the structure of the first text, the parts of speech of its words, and so on. When no appropriate punctuation mark can be added at the end of the first text, it can be considered that the first text does not have complete semantics, and a longer duration of the first timer (for example, 1500ms) can be set. When a punctuation mark (for example, a period or a comma) can be added at the end of the first text, the duration of the first timer can be set according to the added punctuation mark. For example, when the punctuation mark is a period, it can indicate that the first text has complete semantics, and a shorter duration of the first timer (such as 500ms) can be set; when the punctuation mark is a comma, it can indicate that the first text has relatively complete semantics, and a moderate duration of the first timer (such as 800ms) can be set. For the sake of brevity, no further examples are given.
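Exemplarily, the punctuation-based choice can be sketched as follows. The sketch assumes that some punctuation-adding step (not shown, and not specified here) returns the mark it would place at the end of the first text, or None when no mark fits. The function name is hypothetical, and treating a comma, and any other mark, as relatively complete semantics with a moderate duration is an assumed reading.

```python
from typing import Optional


def timer_duration_from_punctuation(end_mark: Optional[str]) -> int:
    """Duration of the first timer (ms) from the mark added at the end of the text."""
    if end_mark is None:
        return 1500   # no suitable mark: semantics incomplete, longer timer
    if end_mark in {".", "。"}:
        return 500    # period: complete semantics, shorter timer
    return 800        # comma or other mark: moderate timer (assumption)
```

For instance, a first text that can be closed with a period gets 500ms, whereas a text such as "open the sunroof to", for which no mark fits, gets 1500ms.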

It should be understood that the above method of determining the duration of the first timer according to the first text is only an example for ease of description, and is not limited in this embodiment of the present application.

Optionally, a third audio signal may be acquired, and when the third audio signal does not include a voice instruction, the duration of the first timer may be determined according to the first text, wherein the third audio signal includes the audio signal received within a first preset time, and the start time of the first preset time is the end time of the first audio signal. For example, taking voice interaction between a human and a vehicle as an example, the vehicle may include a voice interaction device, the voice interaction device may include an acquisition module and a processing module, the acquisition module may acquire the third audio signal, and the processing module may determine whether the third audio signal includes a voice instruction. For another example, the voice interaction device may include a processor, and the processor may execute the method 400, acquire the third audio signal, and determine whether the third audio signal includes a voice instruction. For the sake of brevity, details are not repeated here.

Exemplarily, whether the third audio signal includes a voice instruction may be determined according to whether the text recognized from the third audio signal is empty. For example, in a voice interaction starting from time 0, the audio signal can be acquired continuously from time 0, and the streaming text result produced by automatic speech recognition can be obtained in real time. At time 3, the streaming text result is "open the sunroof". When voice endpoint detection is triggered at time 3, the audio signal between time 0 and time 3 can be used as the first audio signal, and the streaming text result "open the sunroof" can be used as the first text. If the first preset time starts at time 3 and ends at time 4, the audio signal between time 3 and time 4 can be used as the third audio signal, and speech recognition can be performed separately on the audio signal between time 3 and time 4; if the recognized text result is empty, it can be determined that the third audio signal does not include a voice instruction. For another example, to avoid the effects of excessive speech recognition processing time or delay, the first audio signal can also be determined based on the timestamp of the streaming text result at time 3; if the text result obtained by speech recognition of the audio signal received within the first preset time from the end of the first audio signal, that is, the third audio signal, is empty, it can be determined that the third audio signal does not include a voice instruction. For another example, the processing time and delay of speech recognition can be considered essentially constant; when the streaming text result at time 4 is "open the sunroof", the same as at time 3, that is, when the streaming text result is not updated within the first preset time after time 3, it can be considered that the third audio signal does not contain a voice instruction. For the sake of brevity, no further examples are given. It should be understood that time 0<time 3<time 4, that is, time 0 is the earliest and time 4 is the latest.

Exemplarily, it may be determined whether the third audio signal includes a voice instruction according to the energy of the audio frame of the third audio signal. For example, when the energy of the audio frame of the third audio signal is less than or equal to a preset threshold (such as the first threshold), it may be determined that the third audio signal does not include a voice instruction.
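Exemplarily, the energy-based check can be sketched as follows. The sketch assumes that the energy of an audio frame is its mean squared amplitude and that the signal is a sequence of floating-point samples; the frame length, the threshold value, and the function names are hypothetical.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_length: int) -> List[float]:
    """Mean squared amplitude of each full frame (trailing partial frame dropped)."""
    return [
        sum(s * s for s in samples[i:i + frame_length]) / frame_length
        for i in range(0, len(samples) - frame_length + 1, frame_length)
    ]


def may_include_voice_instruction(samples: Sequence[float], frame_length: int,
                                  first_threshold: float) -> bool:
    """False when every frame energy is less than or equal to the first threshold."""
    return any(energy > first_threshold
               for energy in frame_energies(samples, frame_length))


silence = [0.0] * 160   # hypothetical silent third audio signal
speech = [0.5] * 160    # hypothetical signal with constant amplitude 0.5
```

With a frame length of 80 samples and a first threshold of 0.01, the silent signal is judged not to include a voice instruction. A frame energy above the threshold only indicates that a voice instruction may be present, since noise can also exceed the threshold.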

It should be understood that the above method for determining whether the third audio signal includes a voice instruction is only an example for ease of description, and is not limited in this embodiment of the present application.

Exemplarily, when the third audio signal includes a voice instruction, it may be determined that a new voice instruction is still being received after the first audio signal, so a new first audio signal may be determined and the voice endpoint may be detected again.

In the embodiment of the present application, when it is determined that the third audio signal does not include a voice instruction, the duration of the first timer is determined according to the first text, which can reduce the frequency of detecting voice endpoints, thereby saving the resources occupied by detecting voice endpoints.

S430. Start the first timer.

Exemplarily, the first timer may be started after the duration of the first timer is determined, and the start time of the first timer may not be earlier than the end time of the first audio signal.

Exemplarily, during the voice interaction process, when the received audio signal is silent, voice endpoint detection may be performed. That is, part or all of the acquired audio signal may be used as the first audio signal to determine the first text corresponding to the first voice instruction, and after the duration of the first timer is determined, the first timer is started so as to determine the voice endpoint.

S440. Acquire a second audio signal, where the start time of the second audio signal is equal to or later than the end time of the first audio signal.

Exemplarily, the acquired start time of the second audio signal may not be earlier than the end time of the first audio signal, and the end time of the second audio signal may be the same as the end time of the first timer.

Exemplarily, the start time and end time of the second audio signal may be the same as those of the first timer, and the start time of the second audio signal may be equal to the end time of the first audio signal. Exemplarily, FIG. 7 is a schematic diagram of audio signals in a voice interaction provided by an embodiment of the present application. For example, in the process of voice interaction, the audio signal can be acquired continuously, the second audio signal can be acquired while the first timer is running, and the part of the continuously acquired audio signal whose start time and end time are the same as those of the first timer can be used as the second audio signal, for example, the second audio signal shown in FIG. 7. The second audio signal may or may not include a voice instruction, which is not limited in this application.

Exemplarily, the start time and end time of the second audio signal may be the same as those of the first timer, and the start time of the second audio signal may be later than the end time of the first audio signal. Exemplarily, acquiring the first text and determining the duration of the first timer may take a period of time. Therefore, when the start time of the second audio signal is the same as the start time of the first timer, there may be a period of time between the end time of the first audio signal and the start time of the first timer. Exemplarily, FIG. 8 is a schematic diagram of audio signals in another voice interaction provided by an embodiment of the present application; see, for example, the second audio signal shown in FIG. 8. For another example, the duration between the end time of the first audio signal and the start time of the first timer can be equal to the first preset time, and the end time of the third audio signal may be the start time of the second audio signal, so that repeated processing of part of the audio signal is avoided while the frequency of detecting the voice endpoint is reduced. For the sake of brevity, no further examples are given here.

Exemplarily, the second audio signal is acquired while the first timer is running, and the start time of the second audio signal may be earlier than the start time of the first timer. For example, when the audio signal in the voice interaction is acquired continuously, the first timer may be started later than the end time of the first audio signal. In this case, the previously determined end time of the first audio signal can be used as the start time of the second audio signal; as the first timer runs, the audio signal in the voice interaction continues to be acquired, and the end time of the first timer can be used as the end time of the second audio signal. When speech recognition responds slowly and determining the duration of the first timer takes a long time, this approach avoids voice endpoint detection errors caused by an inappropriate selection of the second audio signal.

Exemplarily, the second audio signal is acquired while the first timer is running. The start time of the second audio signal may be later than the end time of the first audio signal, and the end time of the second audio signal may be earlier than the end time of the first timer. For the sake of brevity, details are not repeated here.

It should be understood that the above method for acquiring the second audio signal is only an example for illustration, and is not limited in this embodiment of the present application.

S450. When the text corresponding to the voice instruction in the second audio signal is empty, determine the end time of the first timer as the voice endpoint.

Exemplarily, by determining the voice endpoint, the voice instruction to be executed can be determined. For example, in a voice interaction, if the user issues the voice instruction "dǎkāitiānchuāng" (that is, "open the sunroof"), the processing module may, after determining the voice endpoint based on the first audio signal including the voice instruction, determine that instruction as the voice instruction to be executed, so as to respond to it. For the sake of brevity, no further details are given here.

Exemplarily, the text corresponding to the voice instruction in the second audio signal may be acquired according to the voice recognition technology. It should be understood that for the method for acquiring text according to the audio signal, reference may be made to related technologies, which is not limited in this embodiment of the present application.

Exemplarily, the voice instruction in the audio signal can be obtained by performing speech recognition on the audio signal. Before performing speech recognition on the audio signal, the voice instruction in the audio signal may not be accurately known. Therefore, the voice instruction in the second audio signal corresponds to The text of the second audio signal is empty, it may be that the second audio signal does not include a voice command, or the voice command is not obtained after performing voice recognition on the second audio signal, that is, the text corresponding to the voice command in the second audio signal is empty , it may be that after performing speech recognition on the second audio signal, the corresponding text result is not obtained; correspondingly, the text corresponding to the speech command in the second audio signal is not empty, it may be that the second audio signal is speech After recognition, the corresponding text results are obtained.

Exemplarily, voice recognition can be performed on the acquired second audio signal. If no text result is obtained after voice recognition is performed on the second audio signal, it can be determined that the text corresponding to the voice command in the second audio signal is empty. For brevity I won't repeat them here.

Exemplarily, in the voice endpoint detection process, when the text corresponding to the voice command in the second audio signal is empty, the end time of the first timer can be determined as the voice endpoint, so that the voice in the voice interaction can be command to respond; and when the text corresponding to the voice command in the second audio signal is not empty, it may indicate that after the first audio signal used in this voice endpoint detection, a new voice command is still received, and the first The end time of the timer is used as the voice end point, which may cause the user's voice command to be cut off in advance, so that a new first audio signal can be determined to detect the voice end point again.

Exemplarily, speech recognition processing may be performed on the second audio signal to determine whether the text corresponding to the voice command in the second audio signal is empty, and details are not described here for brevity.

For example, during the voice interaction process, the audio signal can be acquired continuously. When a streaming text result is obtained by real-time automatic speech recognition, the text result corresponding to the second audio signal being empty may mean that the streaming text result is not updated while the first timer is running. For example, taking voice interaction between a person and a vehicle as an example, the vehicle may include a voice interaction device, and the voice interaction device may include an acquisition module and a processing module. In a voice interaction starting at time 0, the audio signal can be acquired continuously after time 0, and a real-time streaming text result can be obtained after the audio signal is processed by automatic speech recognition. At time 3, the streaming text result is "open the sunroof", which can be used as the first text. When the response time of speech recognition is short, the audio signal acquired between time 0 and time 3 can be regarded as the first audio signal, or the first audio signal can be determined according to timestamps. After the duration of the first timer is determined according to the first text, the first timer can be started at time 4. If, when the first timer ends (for example, the end time of the first timer is time 5), the streaming text result at time 5 is not updated compared with time 4, it can be considered that the text corresponding to the voice command in the second audio signal is empty, and the end time of the first timer can be determined as the voice endpoint, so that the voice command in the voice interaction can be responded to. For example, the processing module can send the voice command to a vehicle control module, such as an electronic control unit (ECU), and the ECU can control the sunroof motor to run until the sunroof is opened.

If the streaming text result at time 4 is updated compared with time 3, it can be considered that the voice command issued by the user before time 3 is incomplete, or that the user has issued a new voice command after the first audio signal; the first timer may therefore not be started, which reduces the number of voice endpoint detections. If the streaming text result is updated while the first timer is running, that is, the streaming text result at time 5 is updated compared with time 4, it can be determined that the user has issued a new voice command and that the text corresponding to the voice command in the second audio signal is not empty, so the first timer can be closed or paused, and this voice endpoint detection ends.

In the embodiment of the present application, since the audio signal can be acquired continuously during the voice interaction, the first audio signal and the second audio signal can be identified within the continuously acquired audio signal as the situation requires. Determining whether the second audio signal includes a voice instruction from whether the streaming text result is updated, and thereby determining the voice endpoint, makes it unnecessary to explicitly confirm the first audio signal and the second audio signal, which can reduce the complexity of system operation and save the resources consumed by the method. Moreover, since this method relies only on the streaming text result, the determination of the voice endpoint does not depend on the internal algorithm of speech recognition and can be applied to any ASR engine.
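As a minimal sketch of the timer-based check described above, the following Python function polls a streaming text result while the first timer runs and reports the timer end time as the voice endpoint only if the text never changes. The function name `detect_endpoint`, the callable `text_at`, and the polling step are hypothetical and are introduced only for illustration; a real ASR engine would more likely deliver updates through callbacks.

```python
from typing import Callable, Optional


def detect_endpoint(
    text_at: Callable[[float], str],
    timer_start: float,
    timer_duration: float,
    poll_step: float = 0.1,
) -> Optional[float]:
    """Return the timer end time if the streaming text stays unchanged
    while the timer runs; return None if the text is updated."""
    if poll_step <= 0:
        raise ValueError("poll_step must be positive")
    baseline = text_at(timer_start)        # streaming text when the timer starts
    timer_end = timer_start + timer_duration
    t = timer_start
    while t < timer_end:
        t = min(t + poll_step, timer_end)  # never poll past the timer end
        if text_at(t) != baseline:
            return None                    # new text arrived: abandon this detection
    return timer_end                       # text stayed the same: this is the endpoint


# Timer started at time 4 with a duration of 1, as in the example above.
unchanged = lambda t: "open the sunroof"
updated = lambda t: "open the sunroof" if t < 4.5 else "open the sunroof halfway"
print(detect_endpoint(unchanged, 4.0, 1.0))  # endpoint at the timer end
print(detect_endpoint(updated, 4.0, 1.0))    # None, detection is restarted later
```

Because the check only compares text snapshots, it does not need access to the internals of the speech recognizer, which is the property the paragraph above relies on.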

It should be understood that the above method for determining the speech endpoint of the second audio signal is only an example for ease of description, and is not limited in this embodiment of the present application.

Optionally, when the energy of the audio frames of the second audio signal is less than or equal to the first threshold, the end time of the first timer may be determined as the voice endpoint.

Exemplarily, the energy of the audio frames of the second audio signal may be determined based on short-time energy analysis. Further, the energy of the audio frames of the second audio signal being less than or equal to the first threshold may mean that the energy of a single audio frame is not greater than the first threshold, that the energy of multiple audio frames is not greater than the first threshold, that the energy of all audio frames of the second audio signal is not greater than the first threshold, or that the weighted average energy of multiple audio frames is not greater than the first threshold, which is not limited in this embodiment of the present application. For example, based on short-time energy analysis, after the second audio signal is divided into frames, the energy of one or more audio frames in the second audio signal can be obtained, and when the energy of an audio frame is greater than the first threshold, it can be considered that the audio frame includes a voice command from the user. Alternatively, the short-time average energy of multiple audio frames can be obtained by weighting the energy of the multiple audio frames, and when the short-time average energy is greater than the first threshold, it can be considered that these audio frames include a voice instruction. When the energy of all audio frames in the second audio signal is less than or equal to the first threshold, it can be considered that the second audio signal does not include a voice instruction. Alternatively, when the number or proportion of audio frames in the second audio signal whose energy is less than or equal to the first threshold exceeds a certain limit, it can be considered that the second audio signal does not include a voice instruction. For brevity, further examples are not given. It should be understood that the above method for determining the energy of an audio frame is only an example for illustration, and is not limited in this embodiment of the present application.
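The energy criteria listed above can be sketched as follows. This is only an illustration under stated assumptions: the frame energy is taken as the sum of squared samples over non-overlapping frames, and the function names (`frame_energies`, `all_frames_quiet`, `quiet_fraction`, `weighted_average_energy`) as well as the sample values are hypothetical.

```python
from typing import List, Sequence


def frame_energies(samples: Sequence[float], frame_len: int) -> List[float]:
    """Short-time energy: sum of squared samples of each non-overlapping frame.
    A trailing partial frame is dropped."""
    return [
        sum(x * x for x in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def all_frames_quiet(energies: Sequence[float], first_threshold: float) -> bool:
    """True when every frame energy is less than or equal to the threshold."""
    return all(e <= first_threshold for e in energies)


def quiet_fraction(energies: Sequence[float], first_threshold: float) -> float:
    """Proportion of frames whose energy is less than or equal to the threshold."""
    if not energies:
        raise ValueError("at least one frame is required")
    return sum(1 for e in energies if e <= first_threshold) / len(energies)


def weighted_average_energy(energies: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted short-time average energy over several frames."""
    total = sum(weights)
    if len(energies) != len(weights) or total <= 0:
        raise ValueError("weights must match the frames and sum to a positive value")
    return sum(e * w for e, w in zip(energies, weights)) / total


samples = [0, 1, 0, -1, 10, -10, 10, -10]   # a quiet frame followed by a loud frame
energies = frame_energies(samples, frame_len=4)
print(energies)
print(all_frames_quiet(energies, first_threshold=5))
print(quiet_fraction(energies, first_threshold=5))
```

Which criterion is appropriate depends on the noise conditions; the strict all-frames test is the most conservative, while the proportion-based and weighted-average tests tolerate isolated noise bursts.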

Exemplarily, the classification of an audio frame may be determined based on the energy of the audio frame in the second audio signal. For example, an audio frame whose energy is greater than the first threshold may be determined as the first type of audio frame, which may indicate that the audio frame includes a voice command, that is, the user clearly issued a voice command during the time period in which the audio frame was collected. For another example, an audio frame whose energy is less than or equal to the third threshold may be determined as the second type of audio frame, indicating that the audio frame clearly does not contain a voice instruction, where the third threshold may be less than or equal to the first threshold. For another example, when the third threshold is less than the first threshold, an audio frame whose energy is greater than the third threshold and less than or equal to the first threshold may be determined as the third type of audio frame, which may indicate that it is not clear whether the audio frame includes a voice instruction. It should be understood that the above method for classifying audio frames of an audio signal is only an example for illustration, and is not limited in this embodiment of the present application.

It should be understood that the classification of audio frames may be represented in any data format, such as numbers, letters, character strings, and the like. Exemplarily, for the classification of audio frames, "speech (SPE)" can be used to represent the first type of audio frame, "silence (SIL)" can be used to represent the second type of audio frame, and "neutral (NEU)" can be used to represent the third type of audio frame. For the sake of brevity, no more examples are given.

For convenience of description, in the following embodiments of the present application, SPE is used as the first type of audio frame, SIL is used as the second type of audio frame, and NEU is used as the third type of audio frame, as an example for description. That is, the SPE described later in this application can be replaced by the first type of audio frame, the SIL can be replaced by the second type of audio frame, and the NEU can be replaced by the third type of audio frame.

Exemplarily, FIG. 9 is a schematic diagram of a method for confirming the classification of audio frames provided by an embodiment of the present application, in which the audio signal can be divided into a part consisting of the first type of audio frames (that is, the SPE part), a part consisting of the second type of audio frames (that is, the SIL part), and a part consisting of the third type of audio frames (that is, the NEU part), also called the first type of audio signal part, the second type of audio signal part, and the third type of audio signal part. As shown in FIG. 9, based on the energy of the audio frames and different thresholds, such as the first threshold and the third threshold, the part of the audio signal whose energy is higher than the first threshold can be determined as the SPE part, the part whose energy is lower than the third threshold can be determined as the SIL part, and the part whose energy is between the first threshold and the third threshold can be determined as the NEU part. The first threshold and the third threshold may be fixed values, or may be determined according to an environmental energy value, where the environmental energy value may refer to the energy value of audio frames of ambient noise in the voice interaction environment. It should be understood that the audio signal can be classified in real time while it is being acquired. It should also be understood that, for the method of classifying audio signals according to energy, reference may be made to other methods in the related art, which is not limited in the present application.
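The three-way classification by the first threshold and the third threshold can be sketched as follows. The function names and the numeric thresholds are hypothetical; the boundary handling (energy equal to a threshold) follows the "less than or equal to" wording used in the classification description above.

```python
from typing import List, Sequence


def classify_frame(energy: float, first_threshold: float, third_threshold: float) -> str:
    """Label one audio frame as SPE (speech), SIL (silence) or NEU (neutral)."""
    if third_threshold > first_threshold:
        raise ValueError("the third threshold must not exceed the first threshold")
    if energy > first_threshold:
        return "SPE"   # clearly contains a voice instruction
    if energy <= third_threshold:
        return "SIL"   # clearly contains no voice instruction
    return "NEU"       # unclear whether a voice instruction is present


def classify_frames(energies: Sequence[float], first_threshold: float,
                    third_threshold: float) -> List[str]:
    """Label every frame of a signal."""
    return [classify_frame(e, first_threshold, third_threshold) for e in energies]


def contains_speech(labels: Sequence[str]) -> bool:
    """True when at least one frame is of the first type (SPE)."""
    return "SPE" in labels


labels = classify_frames([1.0, 3.0, 10.0], first_threshold=5.0, third_threshold=2.0)
print(labels)
print(contains_speech(labels))
```

When the third threshold equals the first threshold, no frame can fall into the NEU class, which matches the condition that the third type only exists when the third threshold is less than the first threshold.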

It should be understood that if the second audio signal includes the first type of audio frame, it can be considered that the second audio signal includes a voice instruction, so that this voice endpoint detection can be ended, and voice endpoint detection can be performed again by acquiring a new first audio signal.

Exemplarily, when the second audio signal does not include the first type of audio frame, the end time of the first timer may be determined as the voice endpoint, and details are not described here for brevity.

Exemplarily, when the text corresponding to the voice command in the second audio signal is empty and the energy of the audio frames in the second audio signal is less than or equal to the first threshold, the end moment of the first timer may be determined as the voice endpoint. For the sake of brevity, details are not repeated here.

It should be understood that the above method of determining the speech endpoint according to the energy of the audio frame of the second audio signal is just an example, which is not limited in this embodiment of the present application.

In the embodiment of the present application, the voice endpoint can be flexibly set according to the text information of the audio signal, thereby alleviating the influence of background noise and of the user's speaking habits on voice endpoint detection, and improving user experience. In addition, by determining the voice endpoint according to the text information corresponding to the audio signal in combination with the energy of the audio frames, the accuracy of the detected voice endpoint can be improved.

Optionally, before the duration of the first timer is determined according to the first text, the second text may be acquired, where the second text may be text displayed on a display screen. For example, when the method is applied to a vehicle, the second text may be text displayed on a vehicle display screen, such as a vehicle central control screen or a headrest display installed on a seat. For another example, when the method is applied to a terminal device that includes a display screen, such as a mobile phone or a tablet computer, the second text may be text displayed on the screen of the terminal device, or text on a display screen associated with the terminal device. For another example, when the method is applied to a chip, the chip can acquire the second text displayed on its associated display screen. For the sake of brevity, examples are not described one by one, and it should be understood that this embodiment of the present application does not make a limitation thereto.

Exemplarily, when the first text matches the second text, this voice endpoint detection can be ended, and the operation corresponding to the second text can be performed. For example, suppose the music being played is shown on the display screen, and the displayed text includes "next song". After the user enters the voice interaction through the wake-up word and says "play the next song", the first text can be obtained from the acquired audio signal. When the first text includes "next song", the first text matches the second text, and the operation corresponding to the text "next song" on the display screen can be performed directly, that is, the next song is played. This method can also be called a "visible and speakable" method. For the sake of brevity, no more examples are given. In the embodiment of the present application, by directly executing the operation corresponding to the second text, it is possible to respond to the user's voice command more quickly and improve the user experience.

It should be understood that the first text matching the second text may mean that the first text is the same as or similar to the second text, that the first text includes the second text, or that the first text and the second text include the same keywords, which is not limited in this embodiment of the application.
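The matching criteria above (same, similar, containment, shared keywords) can be sketched as a single predicate. This is an illustrative sketch only: the function name `texts_match`, the similarity threshold of 0.8, and the use of character-level similarity from Python's standard `difflib` module are assumptions, not something the embodiment specifies.

```python
from difflib import SequenceMatcher
from typing import Iterable


def texts_match(first_text: str, second_text: str,
                keywords: Iterable[str] = (), similarity: float = 0.8) -> bool:
    """Return True if the recognized text matches the displayed text."""
    a = first_text.strip().lower()
    b = second_text.strip().lower()
    if not a or not b:
        return False
    if a == b or b in a:
        return True    # identical, or the first text includes the second text
    if SequenceMatcher(None, a, b).ratio() >= similarity:
        return True    # similar (assumed character-level similarity measure)
    # Shared keyword: the same non-empty keyword appears in both texts.
    return any(k and k.lower() in a and k.lower() in b for k in keywords)


print(texts_match("play the next song", "next song"))
print(texts_match("open the car window", "next song"))
print(texts_match("skip song", "next song", keywords=("song",)))
```

A production system would more likely compare normalized tokens or intents than raw characters; the sketch only shows how the listed criteria can be combined with a logical OR.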

Exemplarily, when the first text does not match the second text, the duration of the first timer may be determined according to the first text, and details are not described here for brevity.

For example, in a voice interaction, the user may pause multiple times while giving a voice command, so multiple attempts may be needed to determine the voice endpoint in the voice interaction. When a voice endpoint detection fails, that is, when it is determined that the user has not yet issued a complete voice command, voice endpoint detection can be performed again according to the continuously acquired audio signal until the voice endpoint is successfully detected, and the voice command can then be responded to. It should be understood that each time voice endpoint detection is performed, the first audio signal and the first text used in that detection process can be confirmed. Across multiple voice endpoint detection processes, the confirmed first audio signals and first texts may be related; for example, the first audio signal used in this detection process may include the first audio signal used in the previous detection process. They may also be unrelated; for example, the first audio signal used in this detection process may not include the first audio signal used in the previous detection process, which is not limited in this embodiment of the present application.

To facilitate understanding and description, the embodiment of the present application distinguishes the audio signals and texts used in multiple voice endpoint detection processes. For example, the text used in this voice endpoint detection is defined as the first text, and the corresponding audio signal is defined as the first audio signal; the text used in one or more earlier voice endpoint detections is defined as the third text, and the corresponding audio signal is defined as the fourth audio signal.

Exemplarily, before the first audio signal is acquired, a fourth audio signal may be acquired, and the fourth audio signal may include a third voice instruction. The duration of a second timer may be determined according to the third text corresponding to the third voice instruction. After the duration of the second timer is determined, the second timer can be started, and a fifth audio signal can also be acquired. When the text corresponding to the voice command in the fifth audio signal is not empty, the first audio signal may be determined according to the fourth audio signal and the fifth audio signal, and the first audio signal may include the fourth audio signal and the fifth audio signal. Here, the fourth audio signal can be understood as the "first audio signal" used in an earlier voice endpoint detection process, such as the previous one; the third voice instruction can be understood as the "first voice instruction" contained in the "first audio signal" used in that earlier detection process; the third text can be understood as the "first text" used in that earlier detection process; the second timer can be understood as the "first timer" used in that earlier detection process, where the start time of the first timer is not earlier than the end time of the second timer; and the fifth audio signal can be understood as the "second audio signal" acquired during that earlier detection process. That is to say, after an earlier attempt to determine the voice endpoint has failed, the voice endpoint may be determined again according to the first audio signal confirmed in this detection.

Exemplarily, the first audio signal can be determined according to the fourth audio signal and the fifth audio signal. For example, when the audio signal of the voice interaction is acquired continuously, the first audio signal used for detecting the voice endpoint this time can be determined according to the "first audio signal" and the "second audio signal" used in the previous determination of the voice endpoint, that is, the old first audio signal and the old second audio signal, namely the fourth audio signal and the fifth audio signal.

Exemplarily, the determination of the first audio signal according to the fourth audio signal and the fifth audio signal may be described in conjunction with FIG. 10. Exemplarily, FIG. 10 is a schematic diagram of audio signals in another voice interaction provided by an embodiment of the present application. For example, the first audio signal may include only the fourth audio signal and the fifth audio signal, as shown in (a) in FIG. 10. For another example, the first audio signal may include the fourth audio signal, the fifth audio signal, and the audio signal between the fourth audio signal and the fifth audio signal, as shown in (b) in FIG. 10. For another example, the first audio signal may also include audio signals after the fourth audio signal and the fifth audio signal, as shown in (c) and (d) in FIG. 10. For another example, if the starting point of the previous "first audio signal" (that is, the fourth audio signal) is later than the starting point of the voice interaction, then in this voice endpoint detection the starting point of the first audio signal can also be earlier than the starting point of the fourth audio signal; for example, the starting point of the voice interaction can be taken as the starting moment of the first audio signal. For the sake of brevity, examples are not given one by one. It should be understood that the above method for obtaining the first audio signal is only an example for illustration, and this embodiment of the present application does not limit it.
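One way to picture these variants is to treat each audio signal as a time span within the continuously acquired audio and to derive the new first audio signal from the spans of the fourth and fifth audio signals. The sketch below is an assumption-laden illustration: the `Span` type, the function `next_first_signal`, and its optional parameters are hypothetical, and the sketch returns one contiguous span, so it covers the variants that include the gap, later audio, or an earlier starting point.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    """Start and end time of an audio signal within the voice interaction."""
    start: float
    end: float


def next_first_signal(fourth: Span, fifth: Span,
                      latest: Optional[float] = None,
                      interaction_start: Optional[float] = None) -> Span:
    """Derive the span of the new first audio signal from the previous
    first (fourth) and second (fifth) audio signals."""
    if fifth.start < fourth.end:
        raise ValueError("the fifth audio signal must not start before the fourth ends")
    start = fourth.start
    if interaction_start is not None:
        start = min(start, interaction_start)   # optionally reach back to the interaction start
    end = fifth.end
    if latest is not None:
        end = max(end, latest)                  # optionally include audio acquired afterwards
    return Span(start, end)


fourth, fifth = Span(1.0, 3.0), Span(4.0, 5.0)
print(next_first_signal(fourth, fifth))                          # includes the gap between them
print(next_first_signal(fourth, fifth, latest=6.0))              # also includes later audio
print(next_first_signal(fourth, fifth, interaction_start=0.0))   # starts at the interaction start
```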

S460. After the voice endpoint is determined, respond to the first voice instruction.

Exemplarily, responding to the first voice instruction may mean responding only to the first voice instruction, or the first voice instruction may be included in the voice instruction that is responded to; that is to say, the first voice instruction may be a part of the voice instruction that is responded to. For example, multiple voice endpoint detections may be performed during the voice interaction process. After the voice endpoint is determined, the voice instructions acquired since the start moment of the voice interaction can be responded to. Since the start moment of the first audio signal can be later than the start moment of the voice interaction, the first voice instruction in the first audio signal may be a part of the voice instruction to be responded to. It should be understood that this is not limited in the embodiment of the present application.

Exemplarily, in response to the first voice instruction, the operation indicated by the first voice instruction may be performed. For example, taking voice interaction between a person and a vehicle as an example, when the first voice command is "open the sunroof", the processing module may instruct the vehicle controller to perform the operation, and accordingly the sunroof motor of the vehicle may be started until the sunroof is opened. For another example, when the first voice command is "search for location A", the vehicle can display a map and highlight location A on the central control screen, can also display multiple navigation routes from the current location to location A, and can also play "Please choose a navigation route" through the speaker so that the user can issue a new voice command. For another example, when the first voice command is "next song", the vehicle can switch the music being played and keep the voice interaction silent; when no new voice command is issued within a period of time (for example, 10 s), the voice interaction can be terminated, and when the user issues a new voice command within this period, the voice command issued by the user can be acquired in time. It should be understood that the above manner of responding to the first voice instruction is only an example for ease of description, and is not limited in this embodiment of the present application.

The embodiment of the present application provides a method for voice interaction. The duration of the first timer can be determined from the text, and whether the user intends to continue speaking can be determined according to the text, so that the voice endpoint can be determined flexibly. This can avoid an excessively long system delay (for example, one caused by background noise) and can also avoid premature truncation of the voice interaction caused by the user's pauses in speech, so that the voice command in the voice interaction can be accurately obtained while the system delay is shortened.

Exemplarily, FIG. 11 is another schematic flowchart of the voice interaction method provided by the embodiment of the present application, and the method 500 may include part or all of steps S510 to S580.

S510, start speech recognition.

Exemplarily, after the voice interaction starts, the audio signal in the voice interaction may be acquired continuously until the voice interaction ends, and voice recognition is performed on the audio signal during the voice interaction. For example, the voice interaction can be started after the user speaks the wake-up word, and the voice recognition module can be invoked after the voice interaction is started, so that after the audio signal is acquired, it can be voice recognized, so that the processing result of the audio signal after voice recognition can be obtained . For the sake of brevity, details are not repeated here, and it should be understood that this embodiment of the present application does not limit it.

S520. Perform speech recognition according to the audio signal to obtain a streaming text result.

Exemplarily, by performing speech recognition on the continuously acquired audio signal, a streaming text result can be obtained. Optionally, the streaming text result may be used to determine the first text, and the continuously acquired audio signal may be used to determine the first audio signal. For example, the streaming text result at a given moment may be determined as the first text, and the audio signal acquired before that moment may be used as the first audio signal. For the sake of brevity, no more details are given here.

S530. Set a third timer according to the first preset time.

S535, if the streaming text result is not updated before the third timer ends, jump to S540; if the streaming text result is updated, reset the third timer, and jump to S520.

Exemplarily, if the streaming text result when the third timer ends is not updated compared with when the third timer was started, it can be considered that the third audio signal received within the first preset time does not include a voice instruction. Therefore, the streaming text result at this moment can be determined as the first text, and the first audio signal can be determined from the continuously acquired audio signals. Exemplarily, for the description of the third audio signal, reference may be made to step S420, and for the sake of brevity, details are not repeated here.

It should be understood that by setting the third timer, the frequency of detecting the voice endpoint can be reduced, and the resources used in the process of determining the voice endpoint can be saved.
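Steps S530 and S535 can be sketched as a simple stability check over a list of streaming text updates: the third timer is restarted on every update and the first text is taken once the timer expires without a further update. The function name `first_stable_text`, the event representation, and the treatment of an update that arrives exactly at the expiry moment (counted as not being before the end of the timer) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def first_stable_text(updates: List[Tuple[float, str]],
                      first_preset_time: float) -> Optional[Tuple[float, str]]:
    """updates: (time, streaming text) pairs sorted by time, one per text update.
    Returns (expiry time of the third timer, first text), or None if there are no updates."""
    for i, (t, text) in enumerate(updates):
        expiry = t + first_preset_time          # third timer (re)started at this update
        is_last = i + 1 == len(updates)
        if is_last or updates[i + 1][0] >= expiry:
            return expiry, text                 # no update before the timer ended (S540)
        # otherwise the text was updated in time: reset the timer and continue (S520)
    return None


updates = [(0.0, "open"), (0.5, "open the"), (1.0, "open the sunroof")]
print(first_stable_text(updates, first_preset_time=1.0))
```

In this example the third timer is reset twice and finally expires one preset period after the last update, at which point "open the sunroof" is taken as the first text.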

S540. Based on the prediction model, determine the duration of the first timer according to the first text.

Exemplarily, the first text may be input into the prediction model to obtain first information of the first text, and the first information may be used to characterize the semantic completeness of the first text. Further, according to the first information, the duration of the first timer can be determined.

Exemplarily, the prediction model may be a prediction model trained according to the method 200 . For the sake of brevity, details are not repeated here.

Exemplarily, for the description of the first text and the first timer, reference may be made to steps S410 to S420, and for the sake of brevity, details are not repeated here.
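How the first information is turned into a timer duration is not detailed in this step. As one hedged illustration, the sketch below assumes that the prediction model outputs a completeness score between 0 and 1 and that the duration decreases linearly as the score increases; the function name `first_timer_duration`, the linear mapping, and the bounds of 0.5 s and 1.5 s are all assumptions and are not taken from the embodiment.

```python
def first_timer_duration(completeness: float,
                         min_wait: float = 0.5,
                         max_wait: float = 1.5) -> float:
    """Map an assumed semantic completeness score in [0, 1] to a timer duration
    in seconds: the more complete the text, the shorter the wait."""
    if min_wait > max_wait:
        raise ValueError("min_wait must not exceed max_wait")
    score = min(1.0, max(0.0, completeness))   # clamp out-of-range scores
    return max_wait - score * (max_wait - min_wait)


print(first_timer_duration(1.0))   # complete text, shortest wait
print(first_timer_duration(0.0))   # incomplete text, longest wait
```

With such a mapping, a semantically complete first text such as "open the sunroof" would lead to a short wait, while an incomplete text would leave the user more time to finish speaking.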

S550, start the first timer.

Exemplarily, the first timer may be started after the duration of the first timer is determined, and details are not described here for brevity.

S560, before the end of the first timer, if the streaming text result is updated, skip to S520; if the streaming text result is not updated, skip to S570.

Exemplarily, when the streaming text result is not updated before the end of the first timer, it can be considered that the text corresponding to the voice command in the acquired second audio signal is empty, so that the end time of the first timer can be used as the voice endpoint. When the streaming text result is updated before the end of the first timer, it can be considered that the text corresponding to the voice command in the second audio signal is not empty, so that this voice endpoint detection can be ended, and voice endpoint detection can subsequently be performed again according to the updated streaming text result.

It should be understood that if the streaming text result is updated before the first timer expires, the first timer may be suspended, closed or reset, which is not limited in this embodiment of the present application.

Exemplarily, regarding the acquisition of the second audio signal and whether the text corresponding to the voice command in the second audio signal is empty, reference may be made to steps S430 to S440 , and details are not repeated here for brevity.

S570. Determine whether to stop speech recognition according to the audio signal classification. If the current audio signal includes audio frames of the first type, go to S520, otherwise, go to S580.

Exemplarily, while the first timer is running, the audio signal continuously received since the end of the first audio signal can be acquired. If, when the first timer ends, this audio signal includes an audio frame classified as SPE, it can be considered that the audio signal after the first audio signal includes a voice command, and the method jumps to S520 so that the text corresponding to the voice command in the audio signal can be obtained; otherwise, the end moment of the first timer can be determined as the voice endpoint.

Exemplarily, for the description of the classification of the audio signal, reference may be made to step S450, and for the sake of brevity, details are not repeated here.

In the embodiment of the present application, when the speech recognition delay is relatively large, confirming the classification of the audio frames can avoid misjudging the voice endpoint because of the delay, and can improve the accuracy of detection. Moreover, combining the text with the classification of the audio frames can improve the accuracy of the identified voice endpoint.
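The combined decision of steps S560 and S570 can be sketched as a small function that first checks whether the streaming text was updated and then checks the audio frame classification. The function name `step_after_first_timer` and its parameters are hypothetical; the frame labels are assumed to be the SPE, SIL and NEU strings used above.

```python
from typing import Sequence


def step_after_first_timer(text_at_start: str, text_at_end: str,
                           frame_labels: Sequence[str]) -> str:
    """Decide the next step when the first timer ends.

    text_at_start / text_at_end: streaming text when the timer started and ended.
    frame_labels: classification of the audio frames received while the timer ran.
    """
    if text_at_end != text_at_start:
        return "S520"   # S560: text was updated, so recognition continues
    if "SPE" in frame_labels:
        return "S520"   # S570: speech frames present, the text may just be delayed
    return "S580"       # voice endpoint reached: respond to the voice command


print(step_after_first_timer("open the sunroof", "open the sunroof", ["SIL", "NEU"]))
print(step_after_first_timer("open the sunroof", "open the sunroof", ["SIL", "SPE"]))
print(step_after_first_timer("open the", "open the sunroof", ["SIL"]))
```

The second call shows the case that the audio classification is meant to catch: the text has not changed yet, but speech energy was observed, so the endpoint is not declared.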

S580, responding to voice commands.

Exemplarily, after the voice endpoint is determined, the first text may be sent to the semantic understanding module for analyzing and executing the instruction indicated by the user in the voice interaction. This embodiment of the present application does not limit it.

In the embodiment of the present application, the first duration can be flexibly set according to the semantic completeness of the text corresponding to the voice instruction in the audio signal, and the end point of the voice interaction can be determined on this basis, so that both the delay caused by determining the end point too late and the user's pauses during the voice interaction are taken into account, giving the user a better experience. At the same time, since the classification of the current audio is taken into account when the voice endpoint is determined, the influence of background noise on the determination of the voice endpoint can be alleviated, and the accuracy of voice endpoint detection can be improved. Moreover, in the embodiment of the present application, only the streaming text result output by the ASR engine is used for endpoint detection, and the instruction to stop recognition is sent from outside the ASR engine without relying on its internal algorithm, so the method can be applied to any ASR engine and adapts well to different ASR engines.

Exemplarily, FIG. 12 is another schematic flowchart of the voice interaction method provided by the embodiment of the present application, and the method 600 may include part or all of steps S610 to S660.

S610, start speech recognition.

It should be understood that step S610 may correspond to step S510, and for the sake of brevity, details are not repeated here.

S615. Obtain interface hot words.

Exemplarily, taking voice interaction between a user and a vehicle as an example, the vehicle includes a display screen, and the text on the display screen of the vehicle can be obtained. An interface hot word can be the text corresponding to a control displayed on the display screen. It should be understood that the interface hot words can be used as the second text.

S620. Perform speech recognition according to the audio signal to obtain a streaming text result.

It should be understood that step S620 may correspond to step S520, and for the sake of brevity, details are not repeated here.

Optionally, in S630, if the streaming text result is not empty, skip to S634; if the streaming text result is empty, skip to S620.

Exemplarily, the interface hot words can be obtained at the same time as the streaming text result, or the interface hot words can be obtained first, or the streaming text result can be obtained first. That is to say, step S615 and part or all of steps S620 to S630 can be performed at the same time, or step S615 can be performed first, or part or all of steps S620 to S630 can be performed first, which is not limited in this application.

S634, match the obtained interface hot words with the obtained streaming text results. If the interface hot words match the streaming text results, skip to S636; if the interface hot words do not match the streaming text results, skip to S635.

To briefly describe the method for matching interface hot words against streaming text results, FIG. 13 gives an exemplary schematic diagram of a user interface of a display screen provided in an embodiment of the present application. The display screen can be applied to a vehicle, and its user interface can display different information such as maps, music, radio and driving settings. It should be understood that the user interface is only an example, which is not limited in the embodiments of the present application; it may also include, for example, lights, vehicle driving parameters and other information. The user can click a control on the user interface, and the vehicle can perform the operation corresponding to the control. For example, as shown in FIG. 13, after the user clicks "Song 1" in the control "Music", the vehicle can play "Song 1". For the sake of brevity, further examples are not given here.

Exemplarily, after the user turns on the speech recognition function, the streaming text result and the interface hot words can be obtained and matched against each other. For example, after the user activates voice recognition by speaking the wake-up word, the hot words "map", "frequently used place 1", "music", "song 1", etc. on the interface shown in FIG. 13 can be obtained. When the streaming text result (such as "playing song 1") includes "song 1", the streaming text result matches the obtained interface hot words, and the method can skip to S636. When the streaming text result obtained from the audio signal (such as "open the car window") does not include any hot word on the interface shown in FIG. 13, the streaming text result can be used as the first text to determine the duration of the first timer; that is to say, when the first text does not match the second text, the method skips to S640.

It should be understood that the above method for obtaining interface hot words is only an example for illustration, and this embodiment of the present application does not limit it. Regarding the description of matching the first text and the second text, reference may be made to step S450, which will not be repeated in this embodiment of the present application.
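The matching in step S634 can be illustrated with a minimal sketch. It assumes that "matching" means the streaming text result contains an interface hot word after case and whitespace normalization, which is only one of the matching modes the application allows. The function names `match_hotword` and `next_step_after_s634` are hypothetical and do not appear in the application.

```python
from typing import Iterable, Optional


def match_hotword(stream_text: str, hotwords: Iterable[str]) -> Optional[str]:
    """Return the longest interface hot word contained in the streaming text, or None.

    Assumption: containment after lower-casing counts as a match.
    """
    normalized = stream_text.strip().lower()
    best: Optional[str] = None
    for word in hotwords:
        candidate = word.strip().lower()
        if candidate and candidate in normalized:
            # Prefer the longest hot word so the most specific control wins.
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


def next_step_after_s634(stream_text: str, hotwords: Iterable[str]) -> str:
    """Map the match result to the step labels used in method 600."""
    return "S636" if match_hotword(stream_text, hotwords) is not None else "S635"
```

With the hot words of FIG. 13, a streaming text result of "playing song 1" would lead to S636, while "open the car window" would lead to S635.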

Optionally, S636. Determine the classification of the audio signal. If the current audio signal is classified as the first type of audio signal, go to S620; otherwise, go to S638.

Exemplarily, since the audio signal can be acquired continuously during voice interaction, an audio signal is also received while the interface hot words are acquired, the streaming text results are acquired, and the two are matched. This audio signal, which may be called an updated audio signal, may include a voice instruction. If, after the matching between the interface hot words and the streaming text results is completed, the updated audio signal includes audio frames classified as SPE, that is, a voice instruction of the user, skip to S620 to obtain the streaming text result corresponding to the updated audio signal; otherwise, skip to S638.

Exemplarily, when it is determined that the interface hot words match the streaming text results, the category of the audio frame at the current moment may be determined according to the continuously acquired audio signal, including the energy of that audio frame. If the category of the audio frame is SPE, skip to S620; otherwise, skip to S638.

It should be understood that by determining the category of the audio signal, it is possible to avoid ignoring the user's new instruction in the process of matching interface hot words and streaming text results, and to avoid obvious deviations between the executed operation and the user's actual intention.

For example, regarding the method for classifying audio signals, reference may be made to step S450, and for the sake of brevity, details are not repeated here.
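As a rough illustration of the energy-based check mentioned above, the following sketch labels a frame as SPE when its energy exceeds a threshold. It is a sketch under stated assumptions, not the classifier of step S450: the energy measure (mean squared amplitude), the label `NON_SPE`, the threshold values and all function names are hypothetical.

```python
from typing import Iterable, Sequence

SPE = "SPE"          # frame judged to contain speech (label used in the text)
NON_SPE = "NON_SPE"  # hypothetical label for every other frame


def frame_energy(samples: Sequence[float]) -> float:
    """Mean squared amplitude of one audio frame (assumed energy measure)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def classify_frame(samples: Sequence[float], threshold: float) -> str:
    """Label a frame SPE only when its energy is strictly above the threshold."""
    return SPE if frame_energy(samples) > threshold else NON_SPE


def contains_speech(frames: Iterable[Sequence[float]], threshold: float) -> bool:
    """True if any frame of the updated audio signal is classified as SPE."""
    return any(classify_frame(frame, threshold) == SPE for frame in frames)
```

In step S636, `contains_speech` returning `True` would correspond to skipping to S620, and `False` to skipping to S638.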

S638, the visible and speakable module can perform the operation indicated by the interface hot words.

Exemplarily, a first message may be sent to the visible and speakable module, and the first message may be used to indicate the operation indicated by the successfully matched interface hot word. Accordingly, the visible and speakable module may execute the operation indicated by the interface hot word, or may instruct an executing device to execute that operation, which is not limited in this embodiment of the present application.

In the embodiments of the present application, performing the operation indicated by the interface hot words realizes the visible and speakable function, so that the user can interact with the vehicle-mounted terminal through voice alone, thereby avoiding contact with the vehicle-mounted terminal and improving user experience. In addition, in the embodiments of the present application, the matching of interface hot words against voice commands can be performed before voice endpoint detection; that is, the matching is performed during the voice interaction rather than after it is over, which can significantly shorten the response time of the visible and speakable function and improve the user experience.

It should be understood that when the acquired interface hot words match the streaming text results, the operation indicated by the interface hot words can be directly performed, that is, after step S634 completes the matching of the interface hot words and the streaming text results, you can also directly jump to S638.

Optionally, in S635, if the hot words on the interface do not match the streaming text results, a third timer may be set according to the first preset time.

Exemplarily, the description of step S635 may refer to step S530, and for the sake of brevity, details are not repeated here.

S637, if the streaming text result is not updated before the third timer ends, skip to S640; if the streaming text result is updated, skip to S620.

Exemplarily, the description of step S637 may refer to step S535, and for the sake of brevity, details are not repeated here.

S640. Based on the prediction model, determine the duration of the first timer according to the first text.

Exemplarily, the description of step S640 may refer to step S540, and for the sake of brevity, details are not repeated here.

S645, start the first timer.

S650, if the streaming text result is updated before the first timer ends, skip to S620; if the streaming text result is not updated, skip to S655.

Exemplarily, the description of step S650 may refer to step S560, which will not be repeated for brevity.

S655. Determine whether to stop speech recognition according to the classification of the audio signal. If the current audio signal includes the first type of audio frame, go to S620, otherwise, go to S660.

Exemplarily, the description of step S655 may refer to step S570, and for the sake of brevity, details are not repeated here.

S660, responding to voice commands.

Exemplarily, the description of step S660 may refer to step S580, and for the sake of brevity, details are not repeated here.
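The control flow of steps S610 to S660 can be summarized as a small transition table. The sketch below is only an illustration of the jumps described above: the outcome names (such as `"text_nonempty"` or `"timeout"`) and the helper names are hypothetical, step S615 is omitted because it may run in parallel with S620 to S630, and S638 and S660 are treated as terminal steps.

```python
from typing import Dict, Iterable, List, Optional

# Steps that always continue to a single next step.
SEQUENTIAL: Dict[str, str] = {
    "S610": "S620", "S620": "S630", "S635": "S637",
    "S640": "S645", "S645": "S650",
}

# Steps whose next step depends on an outcome (outcome names are hypothetical).
BRANCHES: Dict[str, Dict[str, str]] = {
    "S630": {"text_nonempty": "S634", "text_empty": "S620"},
    "S634": {"hotword_match": "S636", "no_match": "S635"},
    "S636": {"speech": "S620", "no_speech": "S638"},
    "S637": {"text_updated": "S620", "timeout": "S640"},
    "S650": {"text_updated": "S620", "timeout": "S655"},
    "S655": {"speech": "S620", "no_speech": "S660"},
}

TERMINAL = {"S638", "S660"}


def next_step(step: str, outcome: Optional[str] = None) -> str:
    """Return the step that follows `step`, given the outcome of its check."""
    if step in SEQUENTIAL:
        return SEQUENTIAL[step]
    return BRANCHES[step][outcome]


def walk(outcomes: Iterable[str]) -> List[str]:
    """Follow the flow from S610, consuming one outcome per branching step.

    Raises StopIteration if the outcomes run out before a terminal step.
    """
    pending = iter(outcomes)
    path = ["S610"]
    while path[-1] not in TERMINAL:
        step = path[-1]
        outcome = next(pending) if step in BRANCHES else None
        path.append(next_step(step, outcome))
    return path
```

For example, `walk(["text_nonempty", "hotword_match", "no_speech"])` follows the visible and speakable branch and ends at S638, while a run in which no hot word matches and both timers expire ends at S660.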

It should be understood that the foregoing method 400 may be combined with the method 500 and the method 600, which is not limited in this embodiment of the present application.

The embodiment of the present application also provides an apparatus for implementing any one of the above methods, for example, an apparatus including a unit for implementing the steps performed by the user equipment, vehicle, voice interaction device, etc. in any of the above methods. For example, please refer to FIG. 14, which is a schematic structural diagram of a voice interaction device provided by an embodiment of the present application. The apparatus 700 may include an acquisition module 710 and a processing module 720.

The acquisition module 710 can be used to acquire a first audio signal, which can include a first voice instruction; it can also be used to acquire a second audio signal, where the start moment of the second audio signal is equal to or later than the end moment of the first audio signal. The processing module 720 may be configured to: determine the duration of a first timer according to the first text corresponding to the first voice instruction; start the first timer; when the text corresponding to the voice instruction in the second audio signal is empty, determine the end moment of the first timer as the voice endpoint; and after the voice endpoint is determined, respond to the first voice instruction.

Exemplarily, for the description about responding to the first voice instruction, reference may be made to step S460, and details are omitted here for brevity.

Exemplarily, when the processing module 720 determines that the text corresponding to the voice command in the second audio signal is non-empty, it can be determined that a new voice command was still obtained after the end of the first audio signal, that the current voice endpoint detection has failed, and that the voice endpoint cannot be determined according to the first timer. The acquisition module 710 can then acquire a new first audio signal, and the processing module 720 can perform voice endpoint detection again according to the new first audio signal until the voice endpoint is determined.

Optionally, the processing module 720 may be configured to: when the energy of the audio frame of the second audio signal is less than or equal to the first threshold, determine the end time of the first timer as the voice endpoint.

Further, the processing module 720 is specifically configured to: when the text corresponding to the voice instruction in the second audio signal is empty, and the energy of the audio frame of the second audio signal is less than or equal to the first threshold, determine the end moment of the first timer as the voice endpoint.

Exemplarily, when the second audio signal does not include the audio frame of the first type, the end time of the first timer may be determined as the voice endpoint.
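The combined condition applied by the processing module 720 can be sketched as a single decision function. This is a minimal sketch assuming that the second audio signal is summarized by its recognized text and a list of per-frame energies. The function name `decide_endpoint` and its parameters are hypothetical.

```python
from typing import Optional, Sequence


def decide_endpoint(
    timer_end: float,
    second_text: str,
    frame_energies: Sequence[float],
    first_threshold: float,
) -> Optional[float]:
    """Return the voice endpoint, or None if endpoint detection must be repeated.

    timer_end       -- end moment of the first timer, in seconds
    second_text     -- text recognized from the second audio signal
    frame_energies  -- energy of each audio frame of the second audio signal
    first_threshold -- energy at or below which a frame is not treated as speech
    """
    if second_text.strip():
        # A new voice instruction arrived: this round of detection fails.
        return None
    if any(energy > first_threshold for energy in frame_energies):
        # Some frame is still loud enough to be speech.
        return None
    return timer_end
```

When the function returns `None`, a new first audio signal would be acquired and detection repeated; otherwise the returned moment is taken as the voice endpoint and the first voice instruction can be responded to.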

Exemplarily, for the description about the second audio signal, reference may be made to step S450, which will not be repeated in this embodiment of the present application for the sake of brevity.

Optionally, the acquisition module 710 is further configured to: acquire the second text, which can be displayed on a display screen; the processing module 720 is specifically configured to: when the first text corresponding to the first voice instruction does not match the second text, determine the duration of the first timer according to the first text corresponding to the first voice instruction.

Optionally, the processing module 720 may also be configured to, when the first text corresponding to the first voice instruction matches the second text, execute the operation indicated by the second text.

Exemplarily, the operation indicated by the second text may be sent to the control device or the execution device, so that it can execute the operation indicated by the second text.

For example, for the description of the second text, reference may be made to step S450, and for the sake of brevity, details are not repeated here. Wherein, the matching of the first text and the second text may be that the first text is the same or similar to the second text, may be that the first text includes the second text, or that the first text and the second text include the same keyword, This embodiment of the present application does not limit it.

Optionally, the acquisition module 710 is further configured to: acquire a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and the start moment of the first preset time is equal to or later than the end moment of the first audio signal; the processing module 720 is specifically configured to: when the third audio signal does not include a voice command, determine the duration of the first timer according to the first text corresponding to the first voice command.

Exemplarily, when the third audio signal includes a voice command, it can be determined that a new voice command was still received after the first audio signal, so the current voice endpoint detection can be ended and a new first audio signal can be determined in order to detect the voice endpoint again.

Exemplarily, for the description of the third audio signal, reference may be made to step S420, and for the sake of brevity, details are not repeated here.

Exemplarily, the user may pause several times during a voice interaction, so determining the voice endpoint may take multiple rounds of voice endpoint detection, repeated until the voice endpoint is successfully detected and the first voice command is responded to. The audio signals used in these multiple rounds of voice endpoint detection may or may not be associated with one another.

In order to distinguish the audio signals and texts used in multiple rounds of voice endpoint detection, the text used in the current voice endpoint detection can be defined as the first text and the corresponding audio signal as the first audio signal, while the text used in one or more previous voice endpoint detections is defined as the third text and the corresponding audio signal as the fourth audio signal.

Optionally, the acquisition module 710 is further configured to: acquire a fourth audio signal before acquiring the first audio signal, where the fourth audio signal includes a third voice instruction; and acquire a fifth audio signal while a second timer is running. The processing module 720 can also be used to: determine the duration of the second timer according to the third text corresponding to the third voice instruction; start the second timer, where the end moment of the second timer is earlier than or equal to the start moment of the first timer; and when the text corresponding to the voice instruction in the fifth audio signal is not empty, determine the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.

Optionally, the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.
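The way a failed round of detection feeds the next one can be sketched as a merge of two segments. The sketch assumes each audio signal is represented only by its start moment, end moment and recognized text, and that the texts are joined with a space. The `Segment` type and `merge_into_first` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A stretch of audio with its recognized text (times in seconds)."""
    start: float
    end: float
    text: str


def merge_into_first(fourth: Segment, fifth: Segment) -> Segment:
    """Build the first audio signal from the fourth and fifth audio signals.

    Only valid when the fifth audio signal carries a non-empty voice
    instruction, i.e. the user kept speaking while the second timer ran.
    """
    if not fifth.text.strip():
        raise ValueError("fifth audio signal has no voice instruction to merge")
    return Segment(
        start=min(fourth.start, fifth.start),  # not later than the fourth signal's start
        end=max(fourth.end, fifth.end),        # not earlier than the fifth signal's end
        text=f"{fourth.text} {fifth.text}".strip(),
    )
```

For instance, a fourth audio signal with the text "navigate to" followed, after a pause, by a fifth audio signal with the text "the airport" would yield a first audio signal spanning both, whose first text is "navigate to the airport".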

Exemplarily, for the description of the third voice instruction, the second timer, the third text, the fifth audio signal, etc., reference may be made to step S450, and for the sake of brevity, details are not repeated here.

Optionally, the processing module 720 is specifically configured to: input the first text corresponding to the first voice command into the prediction model to obtain the semantic completeness of the first text; and determine the duration of the first timer according to the semantic completeness of the first text.

Exemplarily, the predictive model may be a predictive model obtained through training in method 200. For descriptions of the method for training the predictive model, reference may be made to steps S210 to S220. For the sake of brevity, details are not repeated here.
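One possible way to turn semantic completeness into a timer duration is sketched below. The application does not fix the mapping in this passage, so the linear interpolation, the score range of 0 to 1, the bounds of 0.3 s and 1.5 s and the names `timer_duration` and `duration_for_text` are all assumptions made for illustration. The prediction model is passed in as a plain callable.

```python
from typing import Callable


def timer_duration(completeness: float, min_s: float = 0.3, max_s: float = 1.5) -> float:
    """Map a completeness score in [0, 1] to a waiting time in seconds.

    Assumed rule: the more complete the text, the shorter the wait.
    """
    if not 0.0 <= completeness <= 1.0:
        raise ValueError("completeness must lie in [0, 1]")
    return min_s + (1.0 - completeness) * (max_s - min_s)


def duration_for_text(first_text: str, predict: Callable[[str], float]) -> float:
    """Run the (externally supplied) prediction model, then size the first timer."""
    return timer_duration(predict(first_text))
```

Under these assumptions, a text judged fully complete waits only the minimum time before the endpoint is declared, whereas an incomplete text such as "navigate to" leaves the user the maximum time to finish speaking.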

Exemplarily, the apparatus can be applied to a terminal device, and the terminal device can perform voice interaction with the user. Exemplarily, the terminal device may specifically include one or more of a computer, a smart phone, a tablet computer, a personal digital assistant, a wearable device, a smart speaker, a TV, a drone, a vehicle, a vehicle-mounted chip, a vehicle-mounted device (such as an in-vehicle head unit or an on-board computer), a robot and similar devices. For example, the terminal device may be a mobile phone, a vehicle, etc., or another electronic device; these are not listed one by one for the sake of brevity. It should be understood that the above terminal devices are only examples for description, and are not limited in this embodiment of the present application.

It should be understood that the voice interaction device shown in FIG. 14 can be used to implement the above voice interaction method 400, and can also be used to implement the voice interaction methods described in method 500 and method 600. For specific steps, reference may be made to the foregoing descriptions of FIG. 6 to FIG. 13, and for the sake of brevity, details are not repeated in this embodiment of the present application.

Exemplarily, the embodiment of the present application further provides an apparatus for implementing the method 200, for example, an apparatus including units for implementing steps performed by the user equipment, the voice detection platform, etc. in any of the above methods. For example, please refer to FIG. 15, which is a schematic structural diagram of an apparatus for training a speech interaction prediction model provided by an embodiment of the present application. As shown in FIG. 15, the apparatus 800 may include an acquisition module 810 and a training module 820.

Wherein, the acquisition module 810 can be used to: acquire a text data set, the text data set includes a plurality of fourth texts, the fourth texts are marked with first information, and the first information can be used to represent the semantic completeness of the text; The training module 820 may be used to: perform model training according to the text data set to obtain a prediction model, and the prediction model is used to predict the semantic completeness of the voice instruction.

Exemplarily, for the description of the text data set and the first information, reference may be made to step S210, and details are omitted here for the sake of brevity.

Optionally, the acquisition module 810 can also be used to acquire a text corpus, which can include multiple texts with complete semantics. The apparatus 800 can also include a processing module 830 (not shown in FIG. 15), which can be used to determine the text data set from the text corpus.

Optionally, the processing module 830 may specifically be configured to: determine one or more fourth texts according to the texts with complete semantics in the text corpus; and determine the text data set according to the one or more fourth texts.

Optionally, the processing module 830 can also be used to: determine a dictionary tree according to the text corpus, where the dictionary tree includes a plurality of nodes; and determine the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree.

Exemplarily, one or more nodes may be determined according to texts with complete semantics in the text corpus, and multiple nodes of the dictionary tree may be determined according to multiple texts with complete semantics in the text corpus.

Exemplarily, the above description about the text corpus and dictionary tree can refer to step S210, and for the sake of brevity, details are not repeated here.

Optionally, the processing module 830 may also be configured to: determine the semantic completeness of the fourth text according to the number of child nodes of the nodes in the dictionary tree and the tail node mark determined by the text with complete semantics.

Exemplarily, for the description of the tail node label, reference may be made to step S210, and for the sake of brevity, details are not repeated here.
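The dictionary tree described above can be sketched as follows. This is a minimal sketch, not the labeling procedure of step S210: it assumes whitespace tokenization, treats every prefix of a complete text as a candidate fourth text, and uses a hypothetical scoring rule in which a prefix without a tail node mark scores 0 and a prefix with one scores 1 / (1 + number of child nodes). All names in the block are introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Node:
    children: Dict[str, "Node"] = field(default_factory=dict)
    is_tail: bool = False  # tail node mark: a complete text ends at this node


def build_trie(corpus: Iterable[str]) -> Node:
    """Insert every semantically complete text into a token-level dictionary tree."""
    root = Node()
    for text in corpus:
        node = root
        for token in text.split():
            node = node.children.setdefault(token, Node())
        node.is_tail = True
    return root


def completeness(node: Node) -> float:
    """Hypothetical score: 0 without a tail mark, otherwise 1 / (1 + child count)."""
    if not node.is_tail:
        return 0.0
    return 1.0 / (1 + len(node.children))


def label_prefixes(root: Node) -> List[Tuple[str, float]]:
    """Return (fourth text, first information) pairs for every prefix in the tree."""
    labeled: List[Tuple[str, float]] = []
    stack = [((), root)]
    while stack:
        tokens, node = stack.pop()
        if tokens:
            labeled.append((" ".join(tokens), completeness(node)))
        for token, child in node.children.items():
            stack.append((tokens + (token,), child))
    return sorted(labeled)
```

With a corpus containing "play music" and "play music by artist", the prefix "play music" carries a tail node mark but still has one child node, so under this assumed rule it is labeled as only partly complete, reflecting that the user may still continue the instruction.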

Exemplarily, the apparatus 800 can be used in the voice detection platform described in the embodiment of FIG. 1 , and the voice detection platform can be used to provide background services for the voice interaction process between the user and the terminal device. This embodiment of the present application does not limit it.

It should be understood that the device for training the predictive model used in voice interaction shown in FIG. 15 can be used to implement method 200, and for the specific steps reference may be made to the descriptions of FIG. 3 to FIG. 5 above. For the sake of brevity, details are not repeated in this embodiment of the present application.

It should be understood that the division of units or modules in the above device is only a division of logical functions; in actual implementation they may be fully or partially integrated into one physical entity or physically separated. In addition, the units or modules in the device can be implemented in the form of a processor calling software. For example, the device includes a processor connected to a memory in which instructions are stored, and the processor calls the instructions stored in the memory to implement any of the above methods or to realize the function of each unit of the device. The processor is, for example, a general-purpose processor such as a central processing unit (Central Processing Unit, CPU) or a microprocessor, and the memory is a memory inside the device or a memory outside the device. Alternatively, the units in the device may be implemented in the form of hardware circuits, and the functions of some or all of the units may be realized through the design of the hardware circuits, which may be understood as one or more processors. For example, in one implementation, the hardware circuit is an application-specific integrated circuit (ASIC), and the functions of some or all of the above units are realized through the design of the logical relationships between the components in the circuit. For another example, in another implementation, the hardware circuit can be realized by a programmable logic device (programmable logic device, PLD). Taking a field programmable gate array (Field Programmable Gate Array, FPGA) as an example, it can include a large number of logic gate circuits, and the connection relationships between the logic gate circuits are configured through configuration files, so as to realize the functions of some or all of the above units. All the units of the above device can be realized in the form of a processor calling software, or all in the form of hardware circuits, or some in the form of a processor calling software and the rest in the form of hardware circuits.

In the embodiments of the present application, the processor is a circuit with signal processing capabilities. In one implementation, the processor may be a circuit capable of reading and executing instructions, such as a CPU, a microprocessor, a graphics processing unit (graphics processing unit, GPU) (which can be understood as a kind of microprocessor), or a digital signal processor (digital signal processor, DSP). In another implementation, the processor can realize a certain function through the logical relationships of a hardware circuit, where those logical relationships are fixed or reconfigurable; for example, the processor is a hardware circuit implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), such as an FPGA. For a reconfigurable hardware circuit, the process in which the processor loads a configuration file to configure the hardware circuit can be understood as the process in which the processor loads instructions to realize the functions of some or all of the above units. In addition, the processor can also be a hardware circuit designed for artificial intelligence, which can be understood as a kind of ASIC, such as a neural network processing unit (Neural Network Processing Unit, NPU), a tensor processing unit (Tensor Processing Unit, TPU), or a deep learning processing unit (Deep learning Processing Unit, DPU).

It can be seen that each unit in the above device can be one or more processors (or processing circuits) configured to implement the above method, for example: CPU, GPU, NPU, TPU, DPU, microprocessor, DSP, ASIC, FPGA, or a combination of at least two of these processor forms.

In addition, all or part of the units in the above devices can be integrated together, or can be implemented independently. In one implementation, these units are integrated together and implemented in the form of a system-on-a-chip (SOC). The SOC can include at least one processor for implementing any of the above methods or realizing the functions of each unit of the device. The at least one processor can be of different types, for example a CPU and an FPGA, a CPU and an artificial intelligence processor, or a CPU and a GPU.

Exemplarily, FIG. 16 is a structural example diagram of an apparatus 1300 provided in an embodiment of the present application. The apparatus 1300 includes a processor 1302, a communication interface 1303 and a memory 1304. One example of the apparatus 1300 is a chip. Another example of the apparatus 1300 is a computing device.

The processor 1302, the memory 1304, and the communication interface 1303 may communicate through a bus. Executable codes are stored in the memory 1304, and the processor 1302 reads the executable codes in the memory 1304 to execute a corresponding method. The memory 1304 may also include an operating system and other software modules required for running processes.

For example, the executable code in the memory 1304 is used to implement the methods shown in FIG. 3 to FIG. 13, and the processor 1302 reads the executable code in the memory 1304 to execute the methods shown in FIG. 3 to FIG. 13.

The processor 1302 may be a CPU. The memory 1304 may include a volatile memory (volatile memory, VM), such as a random access memory (random access memory, RAM). The memory 1304 can also include a non-volatile memory (non-volatile memory, NVM), such as a read-only memory (read-only memory, ROM), a flash memory, a hard disk drive (hard disk drive, HDD) or a solid-state drive (solid state drive, SSD).

The meaning of the term "at least one" in this application refers to one or more, and the meaning of the term "multiple" in this application refers to two or more.

In this application, the terms "first" and "second" are used to distinguish between identical or similar items that have basically the same function and effect. It should be understood that there are no logical or timing dependencies among "first", "second" and "nth", nor are there restrictions on quantity or order of execution. For example, "first text" and "second text" are only used for distinction, and do not mean that the priorities of the "first text" and the "second text" are different.

It should be understood that in each embodiment of the present application, the magnitude of the sequence numbers of the various processes does not imply their order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.

It should be understood that determining B according to A does not mean determining B only according to A, and B may also be determined according to A and/or other information.

It should be understood that the term "and/or" in this document only describes an association relationship between associated objects and indicates that three relationships are possible. For example, A and/or B may mean three cases: A exists alone, A and B exist at the same time, or B exists alone. In addition, the character "/" in this document generally indicates an "or" relationship between the objects before and after it.

The terms "component", "module", "system" and the like are used in this specification to refer to a computer-related entity, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be components. One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored thereon. A component may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems by way of the signal).

Those skilled in the art can appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. A skilled artisan may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.

Those skilled in the art can clearly understand that for the convenience and brevity of the description, the specific working process of the above-described system, device and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and in actual implementation there may be other division methods; for example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be realized through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual conditions to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.

If the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to make a computer device (which may be a personal computer, a server, a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, a magnetic disk or an optical disk.

The above is only a specific implementation of the application, but the scope of protection of the application is not limited thereto. Any change or substitution that a person familiar with the technical field can readily conceive of within the technical scope disclosed in the application shall be covered by the protection scope of this application. Therefore, the protection scope of the present application shall be determined by the protection scope of the claims.

Claims

1. A method for voice interaction, characterized by comprising:

acquiring a first audio signal, where the first audio signal includes a first voice instruction;

determining the duration of a first timer according to first text corresponding to the first voice instruction;

starting the first timer;

acquiring a second audio signal, where the start moment of the second audio signal is equal to or later than the end moment of the first audio signal;

when the text corresponding to a voice instruction in the second audio signal is empty, determining the end moment of the first timer as a voice endpoint; and

after the voice endpoint is determined, responding to the first voice instruction.
2. The method according to claim 1, wherein the determining the end moment of the first timer as the voice endpoint when the text corresponding to the voice instruction in the second audio signal is empty comprises:

when the text corresponding to the voice instruction in the second audio signal is empty, and the energy of an audio frame of the second audio signal is less than or equal to a first threshold, determining the end moment of the first timer as the voice endpoint.
3. The method according to claim 1 or 2, further comprising:

obtaining second text displayed on a display screen;

wherein the determining the duration of the first timer according to the first text corresponding to the first voice instruction comprises:

when the first text corresponding to the first voice instruction does not match the second text, determining the duration of the first timer according to the first text corresponding to the first voice instruction.
4. The method according to any one of claims 1 to 3, further comprising:

acquiring a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and the start moment of the first preset time is equal to or later than the end moment of the first audio signal;

wherein the determining the duration of the first timer according to the first text corresponding to the first voice instruction comprises:

when the third audio signal does not include a voice instruction, determining the duration of the first timer according to the first text corresponding to the first voice instruction.
5. The method according to any one of claims 1 to 4, wherein, before the acquiring the first audio signal, the method further comprises:

acquiring a fourth audio signal, where the fourth audio signal includes a third voice instruction;

determining the duration of a second timer according to third text corresponding to the third voice instruction;

starting the second timer and acquiring a fifth audio signal while the second timer is running, where the end moment of the second timer is earlier than or equal to the start moment of the first timer; and

when the text corresponding to a voice instruction in the fifth audio signal is not empty, determining the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
6. The method according to claim 5, wherein the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.
7. The method according to any one of claims 1 to 6, wherein the determining the duration of the first timer according to the first text corresponding to the first voice instruction comprises:

inputting the first text corresponding to the first voice instruction into a prediction model to obtain the semantic completeness of the first text; and

determining the duration of the first timer according to the semantic completeness of the first text.
8. A device for voice interaction, characterized in that the device comprises:

an acquisition module, configured to acquire a first audio signal, where the first audio signal includes a first voice instruction, and further configured to acquire a second audio signal, where the start moment of the second audio signal is equal to or later than the end moment of the first audio signal; and

a processing module, configured to: determine the duration of a first timer according to first text corresponding to the first voice instruction; start the first timer; when the text corresponding to a voice instruction in the second audio signal is empty, determine the end moment of the first timer as a voice endpoint; and after the voice endpoint is determined, respond to the first voice instruction.
9. The device according to claim 8, wherein the processing module is specifically configured to:

when the text corresponding to the voice instruction in the second audio signal is empty, and the energy of an audio frame of the second audio signal is less than or equal to a first threshold, determine the end moment of the first timer as the voice endpoint.
10. The device according to claim 8 or 9, wherein the acquisition module is further configured to:

obtain second text displayed on a display screen;

and the processing module is specifically configured to:

when the first text corresponding to the first voice instruction does not match the second text, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
11. The device according to any one of claims 8 to 10, wherein the acquisition module is further configured to:

acquire a third audio signal, where the third audio signal includes an audio signal received within a first preset time, and the start moment of the first preset time is equal to or later than the end moment of the first audio signal;

and the processing module is specifically configured to:

when the third audio signal does not include a voice instruction, determine the duration of the first timer according to the first text corresponding to the first voice instruction.
12. The device according to any one of claims 8 to 11, wherein the acquisition module is further configured to:

before acquiring the first audio signal, acquire a fourth audio signal, where the fourth audio signal includes a third voice instruction; and

acquire a fifth audio signal while a second timer is running;

and the processing module is further configured to:

determine the duration of the second timer according to third text corresponding to the third voice instruction;

start the second timer, where the end moment of the second timer is earlier than or equal to the start moment of the first timer; and

when the text corresponding to a voice instruction in the fifth audio signal is not empty, determine the first audio signal according to the fourth audio signal and the fifth audio signal, where the first audio signal includes the fourth audio signal and the fifth audio signal.
13. The device according to claim 12, wherein the start moment of the first audio signal is earlier than or equal to the start moment of the fourth audio signal, and the end moment of the first audio signal is equal to or later than the end moment of the fifth audio signal.
14. The device according to any one of claims 8 to 13, wherein the processing module is specifically configured to:

input the first text corresponding to the first voice instruction into a prediction model to obtain the semantic completeness of the first text; and

determine the duration of the first timer according to the semantic completeness of the first text.
15. An apparatus, characterized by comprising a processor and a memory, where the memory is configured to store program instructions, and the processor is configured to invoke the program instructions to execute the method according to any one of claims 1 to 7.
16. A computer program product, characterized in that it includes computer program code, and when the computer program code runs on a computer, the computer is caused to execute the method according to any one of claims 1 to 7.
17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and when the program code is run on a computer, the computer is caused to execute the method according to any one of claims 1 to 7.
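
As a reading aid, the endpoint logic recited in the method claims (timer duration derived from the semantic completeness of the first text, and an endpoint declared only when the follow-up audio yields empty text and low energy) might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names, the linear mapping from completeness to duration, the 0.3 s and 1.5 s bounds, the energy threshold, the choice to test every frame against the threshold, and the toy completeness heuristic standing in for the prediction model are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def timer_duration(completeness: float, min_s: float = 0.3, max_s: float = 1.5) -> float:
    """Map a semantic-completeness score in [0, 1] to a wait time in seconds.

    Assumption: a linear mapping in which more complete text gets a shorter wait.
    """
    c = min(max(completeness, 0.0), 1.0)  # clamp the score to [0, 1]
    return max_s - c * (max_s - min_s)


def toy_completeness(text: str) -> float:
    """Hypothetical stand-in for the prediction model.

    Text that ends on a function word is treated as probably unfinished.
    """
    trailing = {"to", "the", "and", "a", "of"}
    words = text.strip().lower().split()
    if not words:
        return 0.0
    return 0.2 if words[-1] in trailing else 0.9


def is_endpoint(second_text: str, frame_energies: Sequence[float], threshold: float) -> bool:
    """True when the second audio signal has empty text and low energy.

    Assumption: every frame must be at or below the threshold.
    """
    return second_text.strip() == "" and all(e <= threshold for e in frame_energies)


def decide_endpoint(
    first_text: str,
    timer_start: float,
    second_text: str,
    frame_energies: Sequence[float],
    threshold: float = 0.01,
    model: Callable[[str], float] = toy_completeness,
) -> Optional[float]:
    """Return the voice endpoint (the end moment of the first timer), or None.

    None means the user kept speaking or the audio was too loud, so no
    endpoint is declared at the end of this timer.
    """
    duration = timer_duration(model(first_text))
    if is_endpoint(second_text, frame_energies, threshold):
        return timer_start + duration
    return None
```

For example, `decide_endpoint("navigate to", 10.0, "the airport", [0.2])` returns `None` because the follow-up audio contains text, whereas a complete instruction followed by silence yields an endpoint shortly after the timer starts.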