CN118366458A - Full duplex dialogue system and method, electronic equipment and storage medium - Google Patents
Full duplex dialogue system and method, electronic equipment and storage medium
- Publication number: CN118366458A
- Application number: CN202410786022.9A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present disclosure relates to the field of computer technology, and in particular to a full duplex dialogue system and method, an electronic device, and a storage medium. The system includes a voice-to-text module, an agent module, a large language model, and a text-to-voice module. The voice-to-text module receives a first input voice in real time, determines a first question text corresponding to the first input voice, and sends the first question text to the agent module; the agent module determines, from historical dialogue information, a first dialogue context that includes the first question text and sends it to the large language model; the large language model determines, from the first dialogue context, a first response text corresponding to the first question text and sends it to the text-to-speech module; and the text-to-speech module determines and outputs a first output speech corresponding to the first response text. Embodiments of the present disclosure effectively improve the fluency, accuracy, and naturalness of the dialogue, and thereby significantly improve the user experience.
Description
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a full duplex dialogue system and method, an electronic device, and a storage medium.
Background
In the field of artificial intelligence (Artificial Intelligence, AI), and particularly in the development of voice interactive systems, AI voice assistants have become an integral part of users' daily lives, providing a wide range of applications from simple query services to complex task execution. Full duplex dialogue systems provide an advanced human-machine interaction mode that allows the system to produce speech output while simultaneously capturing and understanding the user's speech input. However, existing full duplex dialogue systems still have limitations in both fluency and flexibility.
Disclosure of Invention
The present disclosure provides technical solutions for a full duplex dialogue system and method, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a full duplex dialogue system, including: a voice-to-text module, an agent module, a large language model, and a text-to-voice module. The voice-to-text module is configured to receive a first input voice in real time, determine a first question text corresponding to the first input voice, and send the first question text to the agent module. The agent module is configured to determine, from historical dialogue information, a first dialogue context including the first question text, and to send the first dialogue context to the large language model. The large language model is configured to determine, from the first dialogue context, a first response text corresponding to the first question text, and to send the first response text to the text-to-speech module. The text-to-speech module is configured to determine and output a first output speech corresponding to the first response text.
In one possible implementation, the voice-to-text module is configured to: add a first indication token at the end of the first question text when no other input voice is received within N consecutive preset time periods after the first input voice is received, where the first indication token indicates a pause in input and N is a positive integer greater than or equal to 1; and add a second indication token at the end of the first question text when no other input voice is received within M consecutive preset time periods after the first input voice is received, where the second indication token indicates termination of input and M is a positive integer greater than N.
In one possible implementation, the large language model is configured to: when the first indication token has been added at the end of the first question text, judge, based on the first dialogue context, whether a first target question formed by the first question text is complete; determine the first response text when the first target question is determined to be complete; and output a third indication token when the first target question is determined to be incomplete, where the third indication token indicates waiting for the first question text to be completed.
In one possible implementation, the large language model is configured to determine the first response text based on the first dialogue context when the second indication token has been added at the end of the first question text.
In one possible implementation, the agent module is configured to update the historical dialogue information according to the first question text and the first response text.
In one possible implementation, the text-to-speech module is configured to stop outputting the first output speech upon determining that the voice-to-text module has received a second input voice while the first output speech is being output; the voice-to-text module is configured to determine a second question text corresponding to the second input voice and send the second question text to the agent module; the agent module is configured to determine, from the historical dialogue information, a second dialogue context including the second question text and send it to the large language model; and the large language model is configured to determine, from the second dialogue context, a question interrupt type corresponding to the second input voice and to execute a corresponding processing operation according to the question interrupt type.
In one possible implementation, the large language model is configured to determine, from the second dialogue context, a second response text corresponding to the second question text and send it to the text-to-speech module when the question interrupt type is a critical interrupt; the text-to-speech module is configured to determine and output a second output speech corresponding to the second response text.
In one possible implementation, the large language model is configured to control the text-to-speech module to continue outputting the first output speech when the question interrupt type is a non-critical interrupt.
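The interrupt-handling flow described in the preceding implementations can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the function and method names (handle_barge_in, classify_interrupt, and the stub modules) are assumptions introduced for illustration.

```python
# Hypothetical sketch of the barge-in flow: stop playback, classify the
# interrupt via the LLM, then either answer the new question (critical
# interrupt) or resume the original answer (non-critical interrupt).

CRITICAL, NON_CRITICAL = "critical", "non_critical"

def handle_barge_in(llm, tts, agent, second_question_text):
    """Handle a second input voice received while the first output speech plays."""
    tts.stop()  # stop outputting the first output speech
    context = agent.build_context(second_question_text)  # second dialogue context
    interrupt_type = llm.classify_interrupt(context)
    if interrupt_type == CRITICAL:
        # critical interrupt: generate and speak a second response text
        response = llm.generate_response(context)
        tts.speak(response)
    else:
        # non-critical interrupt: continue outputting the first output speech
        tts.resume()
```

The modules are passed in as plain objects so the dispatch logic stays independent of any particular ASR, LLM, or TTS backend.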
In one possible implementation, the large language model is configured to: judge whether the first question text contains erroneous content, and, when it does, determine the first response text, where the first response text is used to correct the erroneous content of the first question text; the text-to-speech module is configured to interrupt any other input speech currently being received by the voice-to-text module, and to directly determine and output the first output speech.
According to an aspect of the present disclosure, there is provided a full duplex dialogue method, including: receiving a first input voice using a voice-to-text module, determining a first question text corresponding to the first input voice, and sending the first question text to an agent module; determining, using the agent module, a first dialogue context including the first question text according to historical dialogue information, and sending the first dialogue context to a large language model; determining, using the large language model, a first response text corresponding to the first question text according to the first dialogue context, and sending the first response text to a text-to-speech module; and outputting, using the text-to-speech module, a first output voice corresponding to the first response text.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In an embodiment of the present disclosure, a full duplex dialogue system includes a voice-to-text module, an agent module, a large language model, and a text-to-voice module. Using the advanced voice recognition technology of the voice-to-text module, the first input voice can be received in real time and converted into the first question text without waiting for the user to stop speaking; the voice-to-text module then sends the first question text, obtained by real-time conversion, to the agent module. The agent module maintains dynamic historical dialogue information and combines it with the first question text to form a complete first dialogue context, which it transmits to the large language model. Using the powerful reasoning and semantic understanding capabilities of the large language model, a corresponding first response text is accurately determined for the first question text according to the first dialogue context and sent to the text-to-speech module. The text-to-speech module determines and outputs the first output speech corresponding to the first response text. The system thus responds immediately to the first input speech received in real time, improving the fluency, accuracy, and naturalness of the dialogue and thereby significantly improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
Fig. 1 illustrates a block diagram of a full duplex dialog system, according to an embodiment of the present disclosure.
Fig. 2 shows a flow chart of a full duplex conversation method according to an embodiment of the present disclosure.
Fig. 3 illustrates a block diagram of an electronic device, according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B both exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Existing full duplex dialogue systems include the following functions.
1. Real-time prediction: based on a parallel language-prediction mechanism, the system can predict what the user will say from the information already available, before the user's utterance is fully expressed, and continuously revise the prediction as the utterance proceeds, so as to provide faster responses and smoother communication.
2. Speech enhancement: to improve the quality and intelligibility of the voice signal, the system applies noise reduction, sound enhancement, speaker separation, and similar technologies to reject and filter invalid speech, reducing the interference of environmental noise with speech recognition.
3. Intelligent interruption: like a human, the system can ask questions, make inquiries, or express opinions while the other party is talking, controlling the rhythm of the dialogue; it realizes real-time semantic interruption, avoids false interruptions, and does not respond to meaningless input.
4. Scene-based and full-domain continuous dialogue: scene-based continuous dialogue is limited to responding to instructions in a specific field, such as music playback control; full-domain continuous dialogue has no field limitation, allowing the user to communicate with the system more broadly.
5. Echo cancellation: a front-end acoustic algorithm eliminates the echo of the system's own sound, preventing the system from recording its own voice while it is speaking.
6. Rejection recognition: the system keeps the microphone on continuously and must filter out invalid background noise, such as the speech of surrounding people.
There is still room for improvement in existing full duplex dialogue systems.
1. High system-efficiency requirements: in full duplex scenarios, the system needs to process voice data continuously and in real time, which places high demands on system efficiency and requires efficient communication protocols and asynchronous processing capabilities.
2. Difficulty with interruptions and misunderstandings: although existing full duplex dialogue systems mention intelligent interruption, in practice the system may misjudge the user's intent, leading to unnecessary interruptions or missed important information. When a user interrupts while the system is answering, existing systems often cannot handle the situation flexibly, harming the user experience.
3. Complexity of dialogue strategies: in a continuous conversation, the system must be able to actively switch between different dialogue strategies according to the user's instructions or context, which is technically complex to implement.
4. Insufficient continuous-dialogue processing capability: existing full duplex dialogue systems often lose context in continuous, dynamically changing dialogue scenes and have difficulty maintaining a long-term dialogue flow.
5. Limited understanding of complex sentences: for sentences containing multiple intents or complex structures, existing full duplex dialogue systems often have difficulty accurately parsing and responding to the user's actual needs.
To address the above problems, the embodiments of the present disclosure provide a full duplex dialogue system that uses the powerful reasoning and semantic understanding capabilities of a large language model (Large Language Model, LLM) to effectively improve the fluency, accuracy, and naturalness of the dialogue and significantly improve the user experience. The full duplex dialogue system provided by the embodiments of the present disclosure is described in detail below.
Fig. 1 illustrates a block diagram of a full duplex dialogue system according to an embodiment of the present disclosure. As shown in Fig. 1, the full duplex dialogue system includes: a voice-to-text module, an agent module, a large language model, and a text-to-voice module. The voice-to-text module is configured to receive the first input voice in real time, determine a first question text corresponding to the first input voice, and send the first question text to the agent module. The agent module is configured to determine, from historical dialogue information, a first dialogue context including the first question text and send the first dialogue context to the large language model. The large language model is configured to determine, from the first dialogue context, a first response text corresponding to the first question text and send the first response text to the text-to-speech module. The text-to-speech module is configured to determine and output a first output speech corresponding to the first response text.
In an embodiment of the present disclosure, a full duplex dialogue system includes a voice-to-text module, an agent module, a large language model, and a text-to-voice module. Using the advanced voice recognition technology of the voice-to-text module, the first input voice can be received in real time and converted into the first question text without waiting for the user to stop speaking; the voice-to-text module then sends the first question text, obtained by real-time conversion, to the agent module. The agent module maintains dynamic historical dialogue information and combines it with the first question text to form a complete first dialogue context, which it transmits to the large language model. Using the powerful reasoning and semantic understanding capabilities of the large language model, a corresponding first response text is accurately determined for the first question text according to the first dialogue context and sent to the text-to-speech module. The text-to-speech module determines and outputs the first output speech corresponding to the first response text. The system thus responds immediately to the first input speech received in real time, improving the fluency, accuracy, and naturalness of the dialogue and thereby significantly improving the user experience.
In the full duplex dialogue system of the disclosed embodiments, the voice-to-text (Automatic Speech Recognition, ASR) module adopts advanced voice recognition technology, recognizing and converting the user's input voice through deep learning algorithms so that it can be converted into text form in real time. In practice, these texts are often referred to as text tokens, or vocabulary units. The ASR module is configured with high-performance acoustic and language models that can handle the diversity and complexity of natural language.
In an embodiment of the disclosure, the ASR module receives the user's first input speech in real time and converts it into a corresponding first question text, which may comprise a plurality of text tokens. For example, if the first question text is "how is the weather tomorrow", it includes the text tokens "tomorrow", "weather", and "how". For the specific process of converting speech into text in the ASR module, reference may be made to the related art; this disclosure does not limit it specifically.
The ASR module supports rapid recognition, completing recognition of a short sentence within a short preset time period (for example, 300 milliseconds) to ensure fluent interaction with the user. For interruptions and pauses in the user's speech, the ASR module can intelligently judge whether a pause marks the end of a sentence or mere hesitation. Compared with existing full duplex dialogue systems, which rely on the user's voice pauses to judge the end of a sentence, the ASR module can capture the continuity of the speech stream and provide more flexible voice input for subsequent processing, showing particularly high efficiency and accuracy when handling non-continuous or interrupted voice input.
In one possible implementation, the voice-to-text module is configured to: add a first indication token at the end of the first question text when no other input voice is received within N consecutive preset time periods after the first input voice is received, where the first indication token indicates a pause in input and N is a positive integer greater than or equal to 1; and add a second indication token at the end of the first question text when no other input voice is received within M consecutive preset time periods after the first input voice is received, where the second indication token indicates termination of input and M is a positive integer greater than N.
The ASR module adds the first indication token at the end of the first question text if no other input speech is received within N consecutive preset time periods after the first input speech is received. The first indication token indicates that the user's input has paused, i.e., it reflects a pause in the user's speech.
The ASR module adds the second indication token at the end of the first question text if no other input speech is received within a longer run of M consecutive preset time periods after the first input speech is received. The second indication token indicates that the user's input has terminated, i.e., it reflects that the user has finished speaking.
The specific values of N and M, the specific duration of each preset time period, and the specific forms of the first and second indication tokens can be set flexibly according to the actual situation; this disclosure does not limit them specifically.
For example, if the ASR module receives no other input speech within N=2 consecutive preset time periods after receiving the first input speech, it adds the first indication token "</s>" at the end of the first question text, indicating that the user has paused; if the ASR module receives no other input speech within M=3 consecutive preset time periods after receiving the first input speech, it adds the second indication token "</s></s>" at the end of the first question text, indicating that the user has finished speaking.
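The indicator-token logic above can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation: the function name and the concrete values of N and M are assumptions; the token strings mirror the example above ("</s>" for a pause, "</s></s>" for termination).

```python
# Append an indicator token to the question text based on how many
# consecutive silent preset time periods the ASR module has observed.

PAUSE_TOKEN = "</s>"      # first indication token: input paused
END_TOKEN = "</s></s>"    # second indication token: input terminated

def annotate_question(question_text, silent_periods, n=2, m=3):
    """Return the question text with an indicator token appended, if any."""
    if silent_periods >= m:        # M periods of silence: user finished speaking
        return question_text + END_TOKEN
    if silent_periods >= n:        # N periods of silence: user paused
        return question_text + PAUSE_TOKEN
    return question_text           # speech still flowing; no token yet
```

Because M > N, the termination check must come first; a silence long enough to terminate the input would otherwise be misread as a mere pause.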
After converting the first input voice into the corresponding first question text in real time, the ASR module sends the first question text to the agent (Agent) module in real time.
In the full duplex dialogue system of the disclosed embodiments, the Agent module serves as the hub of the system and is responsible for storing and managing dynamic historical dialogue information, including each input from the user and each response from the system.
After receiving the first question text sent by the ASR module, the Agent module integrates it with the previous historical dialogue information to form a complete first dialogue context that includes the first question text, so that the subsequent LLM can understand the first question text based on that context. This design means every dialogue answer is based on complete historical information, greatly improving the consistency and contextual relevance of the dialogue.
In one example, the Agent module stores the text tokens of the historical dialogue information in chronological order, and after receiving the first question text (a plurality of text tokens) sent by the ASR module, generates the first dialogue context based on a token window of preset length. For example, the Agent module may generate the first dialogue context from the most recent 2000 tokens in the sequence.
The agent module sends the generated first dialogue context to the LLM, so that the LLM can respond in real time using its powerful reasoning and semantic understanding capabilities.
In the full duplex dialogue system of the disclosed embodiments, the LLM serves as the core of the system, processing the latest dialogue context (token sequence) transmitted by the Agent module with its powerful language understanding and generation capabilities. It not only generates fluent, natural dialogue answers, but also judges the completeness of questions through deep semantic understanding, giving appropriate responses even when faced with incoherent or incomplete sentences.
In one possible implementation, the large language model is configured to: when the first indication token has been added at the end of the first question text, judge, based on the first dialogue context, whether a first target question formed by the first question text is complete; determine the first response text when the first target question is determined to be complete; and output a third indication token when the first target question is determined to be incomplete, where the third indication token indicates waiting for the first question text to be completed.
When the first indication token has been added at the end of the first question text, the ASR module has received no new voice input within N consecutive preset time periods, meaning the user has paused. To cope flexibly with such pauses, the LLM judges whether the first target question formed by the current first question text is complete.
In an example, the LLM determines, based on its strong language understanding capability, whether the intent of the first target question can be understood from the first dialogue context; if so, the LLM has already obtained enough information to generate a response corresponding to the first target question, i.e., it may determine that the first target question is complete.
When the LLM determines that the first target question is complete, it can generate the first response text corresponding to the first question text even though the user has paused.
When the LLM determines that the first target question is incomplete, the user's pause has left the LLM still unable to understand the intent of the current first target question from the first dialogue context; in this case, the LLM outputs the third indication token, indicating that it needs to keep waiting for further input voice to complete the first question text.
In an example, the LLM can be trained during a training phase with training samples that include incomplete question texts paired with the corresponding third indication token, and complete question texts paired with corresponding response texts, so that the trained LLM can: judge whether the first target question formed by the input first question text is complete; generate the corresponding first response text when the first target question is determined to be complete; and output the third indication token when the first target question is determined to be incomplete.
In one possible implementation, the large language model is configured to determine the first response text based on the first dialogue context when the second indication token has been added at the end of the first question text.
When a second indication token is added at the end of the first question text, the ASR has received no new voice input for M consecutive preset time periods, i.e., the user has not spoken for a long time and is considered to have stopped speaking. At this time, regardless of whether the first target question composed of the first question text is complete, the LLM must determine the first answer text based on the first dialogue context in order to ensure that the full duplex dialogue system responds in time. The first answer text may be content prompting the user to continue voice input to perfect the question, preset content, or content related to the first dialogue context, which is not specifically limited in the present disclosure.
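The silence-driven tokenization described above can be sketched as follows; the token strings and the default values of N and M are illustrative assumptions (the text only requires N ≥ 1 and M > N).

```python
PAUSE_TOKEN = "<PAUSE>"  # first indication token: input interrupted
STOP_TOKEN = "<STOP>"    # second indication token: input terminated

def annotate_silence(question_text: str, silent_periods: int,
                     n: int = 2, m: int = 5) -> str:
    """Append the indication token matching the run of silent preset periods."""
    assert 1 <= n < m, "the text requires N >= 1 and M > N"
    if silent_periods >= m:
        return question_text + STOP_TOKEN   # long silence: answer regardless
    if silent_periods >= n:
        return question_text + PAUSE_TOKEN  # pause: trigger completeness check
    return question_text                    # user is still speaking
```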
Because the LLM can quickly understand the question intent of the user's first input voice based on the dialogue context, it can, even when facing an incoherent or incomplete first input voice, generate and output the corresponding first response text in time once the question intent is understood, thereby ensuring the response timeliness of the full duplex dialogue system.
In an example, the LLM can generate the first response text at a speed of a preset number of text tokens per second, where the preset number can be set according to the scene such that the generation speed exceeds the speech output speed; for example, the preset number may be 80 or 100 text tokens per second, ensuring the timeliness of the response. The specific value of the preset number can be changed flexibly according to the actual situation, for example depending on the hardware parameters of the device running the LLM, which is not specifically limited in the present disclosure.
Through the powerful reasoning and semantic understanding capabilities of the LLM, the full duplex dialogue system of the disclosed embodiments effectively processes incoherent or incomplete voice input from the user. The LLM is able to infer the user's intent based on the dialogue context and generate a consistent, accurate response even when the user's speech input is intermittent.
Facing high-frequency streaming ASR recognition results, the LLM can decide whether to output an answer by judging whether the user's question is complete. This means the LLM's answer does not need to wait for the ASR's end-of-speech detection result, which greatly shortens the response delay.
The LLM sends the generated first response text to a Text-To-Speech (TTS) module to realize speech output.
In the full duplex dialogue system of the disclosed embodiments, the TTS module is responsible for converting the response text generated by the LLM into speech and outputting it through a speaker, presenting it to the user in a natural and fluent manner. The module adopts high-quality speech synthesis technology to ensure natural, fluent speech output with clear timbre.
In one possible implementation, the agent module is configured to update the historical dialog information based on the first question text and the first answer text.
The Agent module dynamically updates the historical dialogue information according to the first question text and the first answer text, ensuring the timeliness and integrity of the historical dialogue information.
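The Agent module's bookkeeping can be sketched as below; the flat list-of-messages representation of the historical dialogue information is an assumption made for illustration.

```python
class Agent:
    """Toy agent: holds the historical dialogue information and builds contexts."""

    def __init__(self):
        self.history = []  # historical dialogue information

    def build_context(self, question_text: str) -> list:
        """Dialogue context sent to the LLM: history plus the latest question."""
        return self.history + [{"role": "user", "content": question_text}]

    def update(self, question_text: str, answer_text: str) -> None:
        """Dynamically update the history after each completed exchange."""
        self.history.append({"role": "user", "content": question_text})
        self.history.append({"role": "assistant", "content": answer_text})
```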
In one possible implementation, the text-to-speech module is used for stopping the output of the first output speech when the speech-to-text module receives a second input voice during the output of the first output speech; the speech-to-text module is used for determining a second question text corresponding to the second input voice and sending the second question text to the agent module; the agent module is used for determining a second dialogue context including the second question text according to the historical dialogue information and sending the second dialogue context to the large language model; and the large language model is used for determining a question interrupt type corresponding to the second input voice according to the second dialogue context and executing a corresponding processing operation according to the question interrupt type.
While the TTS outputs the first output speech in response to the first input voice, the user may interrupt it at any time. If the ASR module receives a second input voice during the output of the first output speech, the output has been interrupted by the user, and the TTS module pauses the output of the first output speech.
The ASR module determines the second question text corresponding to the second input voice and sends it to the Agent module; the Agent module determines a second dialogue context including the second question text based on the historical dialogue information and sends it to the LLM. This process is similar to the generation of the first question text and the first dialogue context, and will not be described again here.
Because the second input voice is input while interrupting the first output speech, the LLM judges the question interrupt type corresponding to the second input voice according to the second dialogue context and executes the corresponding operation based on that type, so as to ensure the fluency of the dialogue and the consistency of the user experience.
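The dispatch on the interrupt type can be sketched as follows; the label strings and the callable stand-ins for the LLM and TTS modules are assumptions made for illustration.

```python
def handle_interrupt(interrupt_type: str, second_context,
                     generate_answer, resume_tts):
    """Key interrupts get a fresh second response; non-key ones resume output."""
    if interrupt_type == "key":
        # Denial or follow-up: produce the second response text for the TTS.
        return generate_answer(second_context)
    # Acknowledgement, agreement, third-party noise, or an irrelevant sentence:
    # keep outputting the first output speech.
    return resume_tts()
```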
In one possible implementation, the large language model is used for, when the question interrupt type is a key interrupt, determining a second response text corresponding to the second question text according to the second dialogue context and sending the second response text to the text-to-speech module; and the text-to-speech module is used for determining and outputting a second output speech corresponding to the second response text.
When the LLM determines that the question interrupt type is a key interrupt, the second question text corresponding to the second input voice needs to be answered in time. The LLM therefore determines the second response text corresponding to the second question text according to the second dialogue context and sends it to the TTS module; the TTS module determines and outputs the second output speech corresponding to the second response text. This process is similar to the generation of the first response text and the first output speech, and will not be described again here.
In an example, the key interrupt may be a denial interrupt or a follow-up-question interrupt.
For example, while the TTS module is outputting the first output speech "the probability of rain tomorrow is 30%", the user suddenly interrupts with a second input voice "no, I want to know today's probability of rain". The LLM recognizes the question interrupt type as a denial interrupt, i.e., a key interrupt; it therefore regenerates the second response text based on the latest second input voice, and the TTS module outputs the second output speech "today's probability of rain is 40%".
For another example, while the TTS module is outputting the first output speech "tomorrow will be rainy, around 20 degrees", the user suddenly interrupts with a second input voice asking about the weekend. The LLM recognizes the question interrupt type as a follow-up-question interrupt, i.e., a key interrupt; it therefore regenerates the second response text based on the latest second input voice, and the TTS module outputs the second output speech "the weekend will be clear, with temperatures between 22 and 25 degrees".
Besides the denial or follow-up-question interrupts described above, the key interrupt may be any other interrupt that requires an immediate response according to the actual situation, which is not specifically limited in the present disclosure.
In one possible implementation, the large language model is used for controlling the text-to-speech module to continue outputting the first output speech when the question interrupt type is a non-key interrupt.
When the LLM determines that the question interrupt type is a non-key interrupt, the second question text corresponding to the second input voice does not need to be answered in time. The LLM therefore controls the TTS module to continue outputting the first output speech.
In an example, the non-key interrupt may be an acknowledgement or agreement interrupt.
For example, while the TTS module is outputting the first output speech "the library is open Monday to Friday", the user suddenly interrupts with a second input voice "good, I know". The LLM recognizes the question interrupt type as an acknowledgement interrupt, i.e., a non-key interrupt: the user has accepted the information and no further answer is needed. The TTS module is therefore controlled to continue outputting the first output speech; if the first output speech has already ended, the system simply continues waiting for the user's next voice input.
In an example, the non-key interrupt may be third-party noise or an interrupt consisting of an irrelevant sentence.
For example, while the TTS module is outputting the first output speech "tomorrow's wind will be force 3 to 4", noise is input by a third-party user different from the target user corresponding to the first input voice, or the target user suddenly interrupts with a second input voice "sorry, I just accidentally bumped the table". The LLM recognizes the question interrupt type as third-party noise or an irrelevant-sentence interrupt, i.e., a non-key interrupt that requires no answer. The TTS module is therefore controlled to continue outputting the first output speech; if the first output speech has already ended, the system simply continues waiting for the user's next voice input.
Besides the acknowledgement or agreement interrupts and the third-party noise or irrelevant-sentence interrupts described above, the non-key interrupt may be any other interrupt that requires no response according to the actual situation, which is not specifically limited in the present disclosure.
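As a rough stand-in for the LLM's semantic classification, the examples above could be reproduced by simple keyword rules. This heuristic is purely illustrative: the patent relies on the LLM's dialogue-context understanding, and the marker lists below are assumptions.

```python
KEY_MARKERS = ("no,", "what about", "how about", "why", "when", "where")
ACK_MARKERS = ("ok", "good", "got it", "i see", "thanks", "sorry")

def classify_interrupt(utterance: str) -> str:
    """Label a barge-in as 'key' (answer now) or 'non-key' (keep speaking)."""
    text = utterance.lower().strip()
    if text.startswith(ACK_MARKERS):
        return "non-key"  # acknowledgement / agreement / apology for noise
    if text.startswith(KEY_MARKERS) or text.endswith("?"):
        return "key"      # denial or follow-up question
    return "non-key"      # default: do not abandon the current output
```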
In one possible implementation, the large language model is used to: judge whether the first question text has error content, and determine the first answer text when the first question text has error content, the first answer text being used for correcting the error content of the first question text; and the text-to-speech module is used for interrupting any other input voice currently being received by the speech-to-text module, and directly determining and outputting the first output speech.
When the LLM judges that the first question text corresponding to the first input voice being input by the user contains error content (such as a common-sense error or a directional error), it generates a first response text for correcting the error, so that the TTS module interrupts the user's current input and outputs the first output speech corresponding to the first response text to correct the user's error.
For example, when the ASR module receives a first input voice "I heard that the sun will disappear this year……" input by the user, the LLM recognizes and understands the corresponding first question text and judges that it contains error content. The LLM module therefore generates a first answer text for correcting the error, so that the TTS module interrupts the user's current voice input and outputs the corresponding first output speech "sorry, there seems to be a problem with your description. The claim that the sun will disappear has no scientific basis. The sun is a star with a very long lifetime……" to correct the user's error.
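The error-correction branch can be sketched as below; the fact-check and correction callables stand in for the LLM's judgment and are assumptions, as is the barge-in callable standing in for the TTS module.

```python
def correct_if_wrong(question_text: str, has_error, make_correction,
                     interrupt_user_and_speak):
    """Speak a correction immediately when the question contains error content.

    Returns the spoken correction text, or None when nothing needs correcting.
    """
    if not has_error(question_text):
        return None
    correction = make_correction(question_text)  # first answer text
    interrupt_user_and_speak(correction)         # barge in with first output speech
    return correction
```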
The full duplex dialogue system of the disclosed embodiments adopts an innovative interrupt management strategy that allows the system to intelligently process any interrupt behavior of the user during speech output. The system can identify and classify the user's question interrupt type (e.g., follow-up question, denial, acknowledgement, and/or irrelevant noise) and accordingly adjust the TTS output or generate a new response.
When the user interrupts the TTS speech output, the full duplex dialogue system of the disclosed embodiments can flexibly adjust its dialogue strategy according to the question interrupt type to meet the user's needs. For example, when the user interrupts the TTS speech output, the LLM makes an intelligent decision, based on the question interrupt type and the dialogue context, on whether to immediately generate a new response or continue the original output.
By integrating the functions of the ASR module, the Agent module, the LLM, and the TTS module, an efficient full duplex dialogue system is realized by means of the LLM's reasoning capability and deep semantic understanding. The full duplex dialogue system of the disclosed embodiments not only can process incoherent voice input, but also allows the user to interrupt the output at any moment without relying on fixed-duration ASR endpoint judgments, providing a more natural, efficient, and humanized interactive experience.
Fig. 2 shows a flow chart of a full duplex dialogue method according to an embodiment of the present disclosure. As shown in Fig. 2, the method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like, and the method may be implemented by a processor invoking computer-readable instructions stored in a memory. Alternatively, the method may be performed by a server. The method comprises the following steps:
In step S21, the speech-to-text module is utilized to receive a first input voice, determine a first question text corresponding to the first input voice, and send the first question text to the agent module.
In step S22, the agent module is utilized to determine a first dialogue context including the first question text from the historical dialogue information and send the first dialogue context to the large language model.
In step S23, the large language model is utilized to determine, according to the first dialogue context, a first answer text corresponding to the first question text and send the first answer text to the text-to-speech module.
In step S24, the text-to-speech module is utilized to determine and output a first output speech corresponding to the first answer text.
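Steps S21 to S24 chain together as a simple pipeline. The sketch below mocks every module with a plain callable; only the data flow mirrors the method, and all module behaviours shown are assumptions.

```python
def run_turn(first_input_voice, asr, agent_context, llm, tts):
    """One dialogue turn: S21 ASR -> S22 context -> S23 answer -> S24 speech."""
    question_text = asr(first_input_voice)    # S21: first question text
    context = agent_context(question_text)    # S22: first dialogue context
    answer_text = llm(context)                # S23: first answer text
    return tts(answer_text)                   # S24: first output speech

# Mocked modules for illustration only.
output = run_turn(
    b"...pcm audio...",
    asr=lambda audio: "What's the weather tomorrow?",
    agent_context=lambda q: [{"role": "user", "content": q}],
    llm=lambda ctx: "Tomorrow will be sunny.",
    tts=lambda text: ("speech", text),
)
```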
In one possible implementation, the method further includes: when no other input voice is received within N consecutive preset time periods after the speech-to-text module receives the first input voice, adding a first indication token at the end of the first question text, where the first indication token is used for indicating an input interruption and N is a positive integer greater than or equal to 1; and when no other input voice is received within M consecutive preset time periods after the speech-to-text module receives the first input voice, adding a second indication token at the end of the first question text, where the second indication token is used for indicating an input termination and M is a positive integer greater than N.
In one possible implementation, determining, according to the first dialogue context, the first answer text corresponding to the first question text includes: judging, based on the first dialogue context, whether a first target question composed of the first question text is complete when a first indication token is added at the end of the first question text; determining the first answer text when the first target question is determined to be complete; and outputting a third indication token when the first target question is determined to be incomplete, where the third indication token is used for indicating waiting to perfect the first question text.
In one possible implementation, determining, according to the first dialogue context, the first answer text corresponding to the first question text includes: determining the first answer text based on the first dialogue context when a second indication token is added at the end of the first question text.
In one possible implementation, the method further includes: updating the historical dialogue information according to the first question text and the first answer text by using the agent module.
In one possible implementation, the method further includes: controlling the text-to-speech module to stop outputting the first output speech when the speech-to-text module receives a second input voice during the output of the first output speech; determining a second question text corresponding to the second input voice by using the speech-to-text module, and sending the second question text to the agent module; determining a second dialogue context including the second question text according to the historical dialogue information by using the agent module, and sending the second dialogue context to the large language model; and determining a question interrupt type corresponding to the second input voice according to the second dialogue context by using the large language model, and executing a corresponding processing operation according to the question interrupt type.
In one possible implementation, determining, by using the large language model, the question interrupt type corresponding to the second input voice according to the second dialogue context, and executing the corresponding processing operation according to the question interrupt type, includes: when the large language model determines that the question interrupt type is a key interrupt, determining a second response text corresponding to the second question text according to the second dialogue context by using the large language model, and sending the second response text to the text-to-speech module; and determining and outputting a second output speech corresponding to the second response text by using the text-to-speech module.
In one possible implementation, determining, by using the large language model, the question interrupt type corresponding to the second input voice according to the second dialogue context, and executing the corresponding processing operation according to the question interrupt type, includes: controlling the text-to-speech module to continue outputting the first output speech when the large language model determines that the question interrupt type is a non-key interrupt.
In one possible implementation, the method further includes: judging whether the first question text has error content by using the large language model, and determining the first answer text when the first question text has error content, where the first answer text is used for correcting the error content of the first question text; and interrupting any other input voice currently being received by the speech-to-text module by using the text-to-speech module, and directly determining and outputting the first output speech.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principle logic, which, for brevity, are not described again in the present disclosure. It will also be appreciated by those skilled in the art that, in the above methods of the embodiments, the particular order of execution of the steps should be determined by their function and possible inherent logic.
In addition, the disclosure further provides an electronic device, a computer-readable storage medium, and a program, all of which may be used to implement any one of the full duplex dialogue methods/systems provided by the disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions of the method parts, which are not repeated here.
The method has a specific technical association with the internal structure of the computer system and can solve technical problems of improving hardware operation efficiency or execution effect (including reducing the amount of data storage, reducing the amount of data transmission, increasing hardware processing speed, and the like), thereby obtaining the technical effect of improving the internal performance of the computer system in conformity with the laws of nature.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a volatile or nonvolatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
Embodiments of the present disclosure also provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, performs the above method.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 3 illustrates a block diagram of an electronic device, according to an embodiment of the present disclosure. Referring to fig. 3, an electronic device 1900 may be provided as a server or terminal device. Referring to FIG. 3, electronic device 1900 includes a processing component 1922 that further includes one or more processors and memory resources represented by memory 1932 for storing instructions, such as application programs, that can be executed by processing component 1922. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, processing component 1922 is configured to execute instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system promoted by Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or an in-groove raised structure having instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., an optical pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
The computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer-readable program instructions, the electronic circuitry being able to execute the computer-readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be realized in particular by means of hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), or the like.
The foregoing description of various embodiments is intended to highlight differences between the various embodiments, which may be the same or similar to each other by reference, and is not repeated herein for the sake of brevity.
It will be appreciated by those skilled in the art that, in the above methods of the specific embodiments, the written order of the steps does not imply a strict order of execution; rather, the order should be determined by the function and possible inherent logic of the steps.
If the technical solution of the application involves personal information, a product applying the technical solution of the application clearly informs the user of the personal-information processing rules and obtains the individual's voluntary consent before processing the personal information. If the technical solution of the application involves sensitive personal information, a product applying the technical solution of the application obtains the individual's consent before processing the sensitive personal information and, at the same time, satisfies the requirement of "explicit consent". For example, a clear and prominent sign is set at a personal-information collection device, such as a camera, to inform that the personal-information collection range has been entered and that personal information will be collected; if the individual voluntarily enters the collection range, it is deemed that the individual consents to the collection of his or her personal information. Alternatively, on a device that processes personal information, with obvious identification/information used to inform the user of the personal-information processing rules, personal authorization is obtained by means of pop-up information or by asking the individual to upload his or her personal information. The personal-information processing rules may include information such as the personal-information processor, the purpose of personal-information processing, the processing method, and the types of personal information to be processed.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description and is not intended to be exhaustive or to limit the disclosure to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or improvements over technologies available in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (12)
1. A full duplex dialogue system, comprising: a speech-to-text module, an agent module, a large language model, and a text-to-speech module;
the speech-to-text module is configured to receive a first input speech in real time, determine a first question text corresponding to the first input speech, and send the first question text to the agent module;
the agent module is configured to determine, according to historical dialogue information, a first dialogue context comprising the first question text, and send the first dialogue context to the large language model;
the large language model is configured to determine, according to the first dialogue context, a first response text corresponding to the first question text, and send the first response text to the text-to-speech module;
the text-to-speech module is configured to determine and output a first output speech corresponding to the first response text.
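The four-module pipeline of claim 1 can be illustrated with a minimal Python sketch. All class and function names here are hypothetical stubs standing in for real ASR, LLM, and TTS backends; this is an illustration of the data flow, not the claimed implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """Maintains dialogue history and builds the context sent to the LLM."""
    history: list = field(default_factory=list)

    def build_context(self, question_text: str) -> list:
        # The dialogue context is the history plus the new question.
        return self.history + [{"role": "user", "content": question_text}]

    def update(self, question_text: str, response_text: str) -> None:
        # Claim 5: update the historical dialogue information.
        self.history.append({"role": "user", "content": question_text})
        self.history.append({"role": "assistant", "content": response_text})

def speech_to_text(audio: bytes) -> str:
    """Stub for a real-time ASR backend (placeholder)."""
    return audio.decode("utf-8")  # pretend the audio already carries text

def large_language_model(context: list) -> str:
    """Stub for an LLM call; a real system would invoke a model here."""
    return f"Answer to: {context[-1]['content']}"

def text_to_speech(text: str) -> bytes:
    """Stub for a TTS backend (placeholder)."""
    return text.encode("utf-8")

def handle_turn(agent: Agent, audio_in: bytes) -> bytes:
    question = speech_to_text(audio_in)       # speech-to-text module
    context = agent.build_context(question)   # agent module
    response = large_language_model(context)  # large language model
    agent.update(question, response)          # history update
    return text_to_speech(response)           # text-to-speech module
```

A single turn then flows through all four modules: `handle_turn(Agent(), b"hello")` returns the synthesized response and leaves one user/assistant pair in the history.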
2. The system of claim 1, wherein the speech-to-text module is configured to:
in a case where no other input speech is received within N consecutive preset time periods after the first input speech is received, add a first indication token at the end of the first question text, wherein the first indication token is used for indicating an interruption of the input, and N is a positive integer greater than or equal to 1;
and in a case where no other input speech is received within M consecutive preset time periods after the first input speech is received, add a second indication token at the end of the first question text, wherein the second indication token is used for indicating the termination of the input, and M is a positive integer greater than N.
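The silence-threshold logic of claim 2 can be sketched as a small function. The token strings and the thresholds `n = 2` and `m = 5` are hypothetical choices for illustration; the claim only requires M > N ≥ 1:

```python
PAUSE_TOKEN = "<pause>"  # first indication token: input has been interrupted
END_TOKEN = "<end>"      # second indication token: input has terminated

def annotate_silence(question_text: str, silent_periods: int,
                     n: int = 2, m: int = 5) -> str:
    """Append an indication token based on consecutive silent periods.

    n and m (with m > n >= 1) are hypothetical thresholds: n silent
    periods mark an interruption (pause), m mark the end of the input.
    """
    if silent_periods >= m:
        return question_text + END_TOKEN
    if silent_periods >= n:
        return question_text + PAUSE_TOKEN
    return question_text
```

With these thresholds, one silent period leaves the text unchanged, two append the pause token, and five append the termination token.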
3. The system of claim 2, wherein the large language model is configured to:
determine, based on the first dialogue context, whether a first target question formed by the first question text is complete, in a case where the first indication token is added at the end of the first question text;
determine the first response text in a case where the first target question is determined to be complete;
and output a third indication token in a case where the first target question is determined to be incomplete, wherein the third indication token is used for indicating to wait for the first question text to be completed.
4. The system of claim 2, wherein the large language model is configured to:
determine the first response text based on the first dialogue context in a case where the second indication token is added at the end of the first question text.
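Claims 3 and 4 together describe how the large language model dispatches on the trailing indication token. A hedged sketch, where `is_complete` is a hypothetical stand-in for the completeness judgment that the claims assign to the LLM itself (requires Python 3.9+ for `str.removesuffix`):

```python
WAIT_TOKEN = "<wait>"  # third indication token: wait for the question to be completed

def is_complete(text: str, context: list) -> bool:
    """Hypothetical completeness check; in the claims, the LLM judges this
    from the first dialogue context. Here a question mark stands in."""
    return text.rstrip().endswith("?")

def answer(text: str, context: list) -> str:
    """Stub for the LLM producing the first response text."""
    return f"Answer to: {text}"

def respond(question_text: str, context: list) -> str:
    """Dispatch on the trailing indication token."""
    if question_text.endswith("<end>"):
        # Claim 4: input has terminated, so always answer.
        return answer(question_text.removesuffix("<end>"), context)
    if question_text.endswith("<pause>"):
        text = question_text.removesuffix("<pause>")
        if is_complete(text, context):      # claim 3: completeness check
            return answer(text, context)
        return WAIT_TOKEN                   # incomplete: wait for more input
    return WAIT_TOKEN
```

So a paused but complete question is answered immediately, while a paused fragment yields the wait token until more speech arrives.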
5. The system of any one of claims 1 to 4, wherein the agent module is configured to update the historical dialogue information according to the first question text and the first response text.
6. The system of claim 5, wherein the text-to-speech module is configured to stop outputting the first output speech in a case where it is determined that the speech-to-text module receives a second input speech while the first output speech is being output;
the speech-to-text module is configured to determine a second question text corresponding to the second input speech and send the second question text to the agent module;
the agent module is configured to determine, according to the historical dialogue information, a second dialogue context comprising the second question text, and send the second dialogue context to the large language model;
the large language model is configured to determine, according to the second dialogue context, an interrupt type corresponding to the second input speech, and perform a corresponding processing operation according to the interrupt type.
7. The system of claim 6, wherein the large language model is configured to, in a case where the interrupt type is a critical interrupt, determine a second response text corresponding to the second question text according to the second dialogue context, and send the second response text to the text-to-speech module;
the text-to-speech module is configured to determine and output a second output speech corresponding to the second response text.
8. The system of claim 6, wherein the large language model is configured to control the text-to-speech module to continue outputting the first output speech in a case where the interrupt type is a non-critical interrupt.
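The barge-in handling of claims 6 to 8 can be sketched as a classifier plus a dispatcher. The backchannel word list and the keyword-based classifier are hypothetical simplifications; in the claims, the large language model makes this judgment from the second dialogue context:

```python
from enum import Enum

class InterruptType(Enum):
    CRITICAL = "critical"          # e.g. a new question or a correction
    NON_CRITICAL = "non_critical"  # e.g. a backchannel ("okay", "mm-hm")

def classify_interrupt(second_question: str) -> InterruptType:
    """Hypothetical classifier; a real system would ask the LLM."""
    backchannels = {"okay", "mm-hm", "uh-huh", "right"}
    if second_question.strip().lower() in backchannels:
        return InterruptType.NON_CRITICAL
    return InterruptType.CRITICAL

def handle_interrupt(second_question: str, resume_first_output) -> str:
    """Dispatch on the interrupt type (claims 7 and 8)."""
    kind = classify_interrupt(second_question)
    if kind is InterruptType.CRITICAL:
        # Claim 7: answer the interrupting question instead.
        return f"Answer to: {second_question}"
    # Claim 8: non-critical, so keep playing the first response.
    return resume_first_output()
```

A backchannel like "okay" lets the first output speech resume, while a substantive interruption produces a second response.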
9. The system of claim 1, wherein the large language model is configured to: judge whether the first question text contains erroneous content, and, in a case where the first question text contains erroneous content, determine the first response text, wherein the first response text is used for correcting the erroneous content of the first question text;
the text-to-speech module is configured to interrupt other input speech currently being received by the speech-to-text module, and directly determine and output the first output speech.
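The proactive correction of claim 9 can be sketched as follows. The fact table and the substring check are hypothetical stand-ins for the error judgment that the claim assigns to the large language model; the queue models the output that the correction preempts:

```python
from typing import Optional

def check_and_correct(question_text: str, known_facts: dict) -> Optional[str]:
    """Hypothetical error check: if the question contradicts a known fact,
    return a correction text; otherwise return None. In claim 9 the large
    language model performs this judgment."""
    for wrong, right in known_facts.items():
        if wrong in question_text:
            return f"Note: {wrong} should be {right}."
    return None

def maybe_preempt(question_text: str, known_facts: dict,
                  tts_queue: list) -> bool:
    """If the question contains erroneous content, interrupt the current
    input handling and enqueue the correction for immediate output."""
    correction = check_and_correct(question_text, known_facts)
    if correction is not None:
        tts_queue.clear()            # drop whatever output was pending
        tts_queue.append(correction)  # output the correction directly
        return True
    return False
```

When the user's speech contains a known error, the pending output is replaced by the correction; otherwise the normal pipeline continues untouched.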
10. A full duplex dialogue method, comprising:
receiving a first input speech by using a speech-to-text module, determining a first question text corresponding to the first input speech, and sending the first question text to an agent module;
determining, by using the agent module, a first dialogue context comprising the first question text according to historical dialogue information, and sending the first dialogue context to a large language model;
determining, by using the large language model, a first response text corresponding to the first question text according to the first dialogue context, and sending the first response text to a text-to-speech module;
and outputting, by using the text-to-speech module, a first output speech corresponding to the first response text.
11. An electronic device, comprising:
A processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of claim 10.
12. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of claim 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410786022.9A CN118366458A (en) | 2024-06-18 | 2024-06-18 | Full duplex dialogue system and method, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118366458A true CN118366458A (en) | 2024-07-19 |
Family
ID=91875260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410786022.9A Pending CN118366458A (en) | 2024-06-18 | 2024-06-18 | Full duplex dialogue system and method, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118366458A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118689998A (en) * | 2024-08-22 | 2024-09-24 | Beijing RealAI Technology Co., Ltd. | Response processing method, device, equipment and storage medium based on large language model |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2017107078A (en) * | 2015-12-10 | 2017-06-15 | Panasonic Intellectual Property Corporation of America | Voice interactive method, voice interactive device, and voice interactive program
CN108604455A (en) * | 2016-05-02 | 2018-09-28 | Google LLC | Automatically determine the timing window of voice subtitle in audio stream
US10331402B1 (en) * | 2017-05-30 | 2019-06-25 | Amazon Technologies, Inc. | Search and knowledge base question answering for a voice user interface
CN111723569A (en) * | 2020-05-21 | 2020-09-29 | Shanghai Mininglamp Artificial Intelligence (Group) Co., Ltd. | Event extraction method and device and computer readable storage medium
CN117275476A (en) * | 2023-09-18 | 2023-12-22 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Digital person interaction method and device, electronic equipment and storage medium
CN117521675A (en) * | 2023-11-06 | 2024-02-06 | Tencent Technology (Shenzhen) Co., Ltd. | Information processing method, device, equipment and storage medium based on large language model
CN117763119A (en) * | 2023-12-26 | 2024-03-26 | Beijing SoundAI Technology Co., Ltd. | Intelligent voice customer service dialogue method and device, electronic equipment and storage medium
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11887604B1 (en) | Speech interface device with caching component | |
CN113362828B (en) | Method and apparatus for recognizing speech | |
US8600013B2 (en) | Real time automatic caller speech profiling | |
CN118366458A (en) | Full duplex dialogue system and method, electronic equipment and storage medium | |
EP3084633A1 (en) | Attribute-based audio channel arbitration | |
US11763819B1 (en) | Audio encryption | |
JPH10274991A (en) | Method and device for detecting voice action | |
CN111354363A (en) | Vehicle-mounted voice recognition method and device, readable storage medium and electronic equipment | |
CN113327609A (en) | Method and apparatus for speech recognition | |
US11682416B2 (en) | Voice interactions in noisy environments | |
JP2014191029A (en) | Voice recognition system and method for controlling voice recognition system | |
CN116417003A (en) | Voice interaction system, method, electronic device and storage medium | |
CN110867197A (en) | Method and equipment for interrupting voice robot in real time in voice interaction process | |
JP2018049132A (en) | Voice dialogue system and method for voice dialogue | |
KR20190001435A (en) | Electronic device for performing operation corresponding to voice input | |
CN113611316A (en) | Man-machine interaction method, device, equipment and storage medium | |
US10923122B1 (en) | Pausing automatic speech recognition | |
CN110659361B (en) | Conversation method, device, equipment and medium | |
EP3113175A1 (en) | Method for converting text to individual speech, and apparatus for converting text to individual speech | |
US11699444B1 (en) | Speech recognition using multiple voice-enabled devices | |
CN115019781A (en) | Conversation service execution method, device, storage medium and electronic equipment | |
CN115132192A (en) | Intelligent voice interaction method and device, electronic equipment and storage medium | |
CN111968630A (en) | Information processing method and device and electronic equipment | |
CN117935787B (en) | Data screening and labeling method and device, electronic equipment and storage medium | |
Gorli | VOICE OVER CONTROL TO INTERNET OF THINGS WITH DEEP LEARNING
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||