CN112365892A

CN112365892A - Man-machine interaction method, device, electronic device and storage medium

Info

Publication number: CN112365892A
Application number: CN202011245627.5A
Authority: CN
Inventors: 陈粮阳; 谢恩宁; 曹宇慧
Original assignee: Hangzhou Dasouche Auto Service Co ltd
Current assignee: Hangzhou Dasouche Auto Service Co ltd
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2021-02-12

Abstract

The application relates to a man-machine conversation method, a man-machine conversation device, an electronic device and a storage medium. The man-machine conversation method comprises the following steps: receiving the current turn of dialogue voice of a user, and preprocessing the dialogue voice to obtain text information; processing the text information through a preset semantic analysis model to obtain intention information; acquiring historical response information, and determining the conversation state of the current turn according to the historical response information and the intention information; and configuring response information corresponding to the conversation state according to a preset response configuration model, and generating response voice corresponding to the response information. Through the application, the problems of low conversation efficiency and poor conversation effect of a conversation system in the correlation technique are solved, the quick and effective outward-calling function of the AI robot in each scene is realized, the labor cost is reduced, and the beneficial effects of conversation efficiency and conversation effect are improved.

Description

Man-machine interaction method, device, electronic device and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a human-machine interaction method, device, electronic device, and storage medium.

Background

In recent years, artificial intelligence technology has been rapidly developed, and products related to intelligent voice technology have entered thousands of households. People are increasingly accustomed to talking to machines and have a higher expectation of understanding and answering capabilities of machines. The Speech-based dialog system framework adopts an Automatic Speech Recognition (ASR) model and a Natural Language Understanding (NLU) model, and the work flow comprises the following steps: firstly, the voice of a user is converted into characters through an ASR model, then, the NLU model is used for semantic analysis, and finally, the intention of the user is obtained.

The dialogue system in the related technology needs a large amount of dialogue labeling corpora to carry out model training, and can achieve good dialogue effect after long data accumulation, but as the applied dialogue scenes of the dialogue system increase, the dialogue system is updated iteratively at high frequency, and the long-period dialogue system does not meet the requirements of the dialogue system.

At present, no effective solution is provided for the problems of low conversation efficiency and poor conversation effect of a conversation system in the related art.

Disclosure of Invention

The embodiment of the application provides a man-machine conversation method, a man-machine conversation device, an electronic device and a storage medium, and aims to at least solve the problems of low conversation efficiency and poor conversation effect of a conversation system in the related art.

In a first aspect, an embodiment of the present application provides a man-machine interaction method, including: receiving conversation voice of a current turn of a user, and preprocessing the conversation voice to obtain text information, wherein the preprocessing comprises text conversion and text error correction; processing the text information through a preset semantic analysis model to obtain intention information, wherein the intention information at least comprises an intention corresponding to the current turn of the user; obtaining historical response information, and determining the dialog state of the current turn according to the historical response information and the intention information, wherein the historical response information comprises response information generated according to the dialog state of the dialog of the previous turn; configuring the response information corresponding to the dialogue state according to a preset response configuration model, and generating response voice corresponding to the response information, wherein the preset response configuration model at least comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

In some embodiments, preprocessing the conversational speech to obtain text information includes:

performing text conversion processing on the dialogue voice through an automatic voice recognition technology to obtain a text to be processed;

and inputting the text to be processed into a text error correction model for text error correction to obtain the text information, wherein the text error correction model is generated by training according to a first sample text of preset semantic information, a second sample text without text errors and a third sample text with text errors.

In some embodiments, the intention information further includes slot position information, and processing the text information through a preset semantic analysis model to obtain the intention information includes:

performing natural language understanding processing on the text information to obtain candidate intention data, wherein the candidate intention data comprises candidate intentions and candidate slot position information;

detecting first intention data in the candidate intention data according to a preset intention recognition model, wherein the preset intention recognition model at least comprises one of the following items: the method comprises the following steps of (1) a regular matching model, a pre-training semantic matching model and an intention slot position joint model;

in an instance in which the first intent data is detected, determining that the intent information includes the first intent data, wherein the first intent data includes an intent corresponding to the user's current turn and the slot information.

In some of these embodiments, determining the dialog state for the current turn based on the historical answer information and the intent information includes:

inputting the historical response information and the intention information into a dialogue state tracking model, and acquiring a first characteristic value, wherein the first characteristic value comprises a semantic characteristic value associated with the historical response information and the intention information;

and detecting a preset state characteristic value in the first characteristic value, and determining the corresponding state of the current turn according to the preset state characteristic value.

In some embodiments, configuring the response information corresponding to the dialog state according to a preset response configuration model includes:

extracting first state semantic information of the dialog state, wherein the first state semantic information at least comprises state semantics corresponding to the intention information;

inputting the first state semantic information into the preset response configuration model to obtain the response information, wherein the preset response configuration model comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

In some embodiments, the preset response configuration model includes a conversation strategy learning model and a knowledge base question-and-answer model, and configuring the response information corresponding to the conversation state according to the preset response configuration model includes:

extracting second state semantic information of the dialog state, wherein the second state semantic information at least comprises state semantics corresponding to the intention information;

inputting the second state semantic information into the dialogue strategy learning model, and inquiring robot speech information corresponding to the second state semantic information, wherein the dialogue strategy learning model is generated by training according to first preset state semantic information and the robot speech information corresponding to the first preset state semantic information;

and under the condition that the robot talk information corresponding to the second state semantic information is not inquired, inputting the second state semantic information into the knowledge base question-answer model, acquiring response text information corresponding to the second state semantic information, and determining that the response text information comprises the response text information, wherein the knowledge base question-answer model comprises second preset state semantic information and response text information corresponding to the second preset state semantic information.

In some embodiments, in a case where the robot speech information corresponding to the second state semantic information is queried, it is determined that the response information includes the robot speech information corresponding to the second state semantic information.

In some of these embodiments, generating the response voice corresponding to the response information includes: and carrying out voice conversion on the response information to generate the response voice.

In a second aspect, an embodiment of the present application provides a human-machine interaction device, including:

the conversion module is used for receiving the conversation voice of the current turn of the user and preprocessing the conversation voice to obtain text information, wherein the preprocessing comprises text conversion and text error correction;

the generating module is used for processing the text information through a preset semantic analysis model to obtain intention information, wherein the intention information at least comprises an intention corresponding to the current turn of the user;

the processing module is used for acquiring historical response information and determining the conversation state of the current turn according to the historical response information and the intention information, wherein the historical response information comprises response information generated according to the conversation state of the conversation of the previous turn;

a response module, configured to configure the response information corresponding to the dialog state according to a preset response configuration model, and generate a response voice corresponding to the response information, where the preset response configuration model at least includes one of: a dialogue strategy learning model and a knowledge base question-and-answer model.

In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the human-machine interaction method according to the first aspect.

In a fourth aspect, the present application provides a storage medium, in which a computer program is stored, where the computer program is configured to execute the human-computer interaction method according to the first aspect when the computer program runs.

Compared with the related art, the man-machine conversation method, the man-machine conversation device, the electronic device and the storage medium provided by the embodiment of the application receive the conversation voice of the current turn of the user and preprocess the speaking voice to obtain the text information; processing the text information through a preset semantic analysis model to obtain intention information; acquiring historical response information, and determining the conversation state of the current turn according to the historical response information and the intention information; the method and the system have the advantages that the response information corresponding to the conversation state is configured according to the preset response configuration model, the response voice corresponding to the response information is generated, the problems that conversation systems in the related art are low in conversation efficiency and poor in conversation effect are solved, the AI robot outbound function of each scene is achieved quickly and effectively, labor cost is reduced, and conversation efficiency and conversation effect are improved.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the application.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a block diagram of a hardware configuration of a terminal of a man-machine conversation method of an embodiment of the present invention;

FIG. 2 is a flow diagram of a human-machine dialog method according to an embodiment of the application;

fig. 3 is a block diagram of a human-machine interaction device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present application without any inventive step are within the scope of protection of the present application.

It is obvious that the drawings in the following description are only examples or embodiments of the present application, and that it is also possible for a person skilled in the art to apply the present application to other similar contexts on the basis of these drawings without inventive effort. Moreover, it should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another.

Reference in the specification to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the specification. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of ordinary skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.

Unless defined otherwise, technical or scientific terms referred to herein shall have the ordinary meaning as understood by those of ordinary skill in the art to which this application belongs. Reference to "a," "an," "the," and similar words throughout this application are not to be construed as limiting in number, and may refer to the singular or the plural. The present application is directed to the use of the terms "including," "comprising," "having," and any variations thereof, which are intended to cover non-exclusive inclusions; for example, a process, method, system, article, or apparatus that comprises a list of steps or modules (elements) is not limited to the listed steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Reference to "connected," "coupled," and the like in this application is not intended to be limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as referred to herein means two or more. "and/or" describes an association relationship of associated objects, meaning that three relationships may exist, for example, "A and/or B" may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Reference herein to the terms "first," "second," "third," and the like, are merely to distinguish similar objects and do not denote a particular ordering for the objects.

Various techniques described herein may be used for intent recognition, slot information acquisition, dialog state confirmation in a dialog system.

Before describing and explaining embodiments of the present application, a description will be given of the related art used in the present application as follows:

automatic Speech Recognition (ASR) is a technology for converting human Speech into text.

Natural Language Understanding (NLU), which processes a sentence input by a user or a result of speech recognition, extracts a dialog intention of the user and information transferred by the user.

Dialog State Tracking (DST), which infers the current Dialog State and user goals from all Dialog history information.

Dialog Policy Learning (DPL) selects the next appropriate action based on the current Dialog state.

The method comprises the steps of (KBQA) giving natural language questions, carrying out semantic understanding and analysis on the questions, and then utilizing a Knowledge Base to carry out inquiry and reasoning to obtain answers.

Text-To-Speech (TTS) is a technique for converting Text To human Speech.

bert denotes the pre-trained speech characterization model/pre-trained model, and jointbort denotes the intent slot combination model.

The embodiment of the man-machine conversation method provided by the embodiment can be executed in a terminal, a computer or a similar test platform. Taking the operation on the terminal as an example, fig. 1 is a hardware structure block diagram of the man-machine interaction method operation terminal according to the embodiment of the present invention. As shown in fig. 1, the terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally, a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used for storing computer programs, for example, software programs and modules of application software, such as computer programs corresponding to the man-machine interaction method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the above-mentioned method. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.

The embodiment provides a man-machine conversation method, and fig. 2 is a flowchart of the man-machine conversation method according to the embodiment of the present application, and as shown in fig. 2, the flowchart includes the following steps:

step S201, receiving the current turn of dialogue voice of the user, and preprocessing the dialogue voice to obtain text information, wherein the preprocessing comprises text conversion and text error correction.

In this embodiment, the dialog voices of the current turn include the dialog voices of the robot and the dialog voices of the user, and what the dialog system desires to complete in the dialog is to acquire a corresponding intention according to the dialog voices of the user so as to make an action corresponding to the intention, where the corresponding action includes a reply according to the dialog voices of the user.

In this embodiment, after the current turn of the dialog speech of the user is obtained, the dialog system performs ASR recognition on the dialog speech into a text, and performs text error correction; in the ASR recognition process, there will be recognition errors, which further cause a great difference between the generated text and the original semantics of the user, and further require error correction of the text recognized by ASR, for example: aiming at the inquiry of the robot, the answer of the user is 'buy and buy', but in the process of setting ASR, the user can recognize a 'good and good' text, and at the moment, the text information obtained by text error correction is changed into 'buy and good'.

Step S202, processing the text information through a preset semantic analysis model to obtain intention information, wherein the intention information at least comprises the intention corresponding to the current turn of the user.

In this embodiment, natural language understanding is performed on the text information after the text error correction, and a corresponding intention and slot are generated.

In specific embodiments, such as: natural speech understanding (NLU recognition) is carried out on the 'bought and bought' to obtain the intention of 'bought car', and certainly, the intention signal corresponding to the 'bought and bought' conversation speech does not comprise a slot position; for another example: when the inquiry is made to the robot, the user answers "XX 320, how much is you XX 320? ", for" XX320, how much money is your XX 320? "natural speech understanding (NLU recognition) is performed, and the intention of" asking for a car price "and" vehicle type: the slot of XX 320'.

In this embodiment, the generated intent is determined by intent recognition, the intent recognition including at least one of: regular matching, bert semantic matching, jointbert model; wherein the content of the first and second substances,

regular matching and bert semantic matching can achieve the effect of intention identification only by configuring a regular expression and a key sentence, and can be applied to the early cold start stage and the scene of newly added intention.

After the data are accumulated to a certain degree, multi-round intention recognition is carried out by using a pre-trained jointbert model, so that the intention recognition accuracy is improved.

In this embodiment, the generated slot is obtained by slot extraction, where the slot extraction at least includes one of the following: regular matching, a bert entity labeling model and a jointbert model; wherein the content of the first and second substances,

the regular matching and bert entity labeling model can be suitable for the early cold start stage and the scene of a newly added slot, wherein the training data of the bert entity labeling model can adopt an open-source universal data set.

And when the data are accumulated to a certain degree, using a pre-trained jointbert model to perform multi-round slot extraction and improve the accuracy of slot extraction.

Step S203, obtaining historical response information, and determining the dialog state of the current turn according to the historical response information and the intention information, wherein the historical response information comprises response information generated according to the dialog state of the dialog of the previous turn.

In the present embodiment, the confirmation of the dialog state of the current round includes two ways of configuring the dialog state by DST configuration and generating the dialog state by jointbert model, wherein,

DST configuration comprises intention mapping, AI tracking, label reasoning, secondary label generation and the like, and can be applied to scenes of early cold start and newly-added states.

And when the data are accumulated to a certain degree, using the pre-trained jointbert model to perform multi-round state updating and improve the accuracy of state updating.

In this embodiment, determining the dialog state of the current round is based on the robot expression (corresponding to the query before the current round) of the historical round (the dialog before the dialog of the current round) and the answer confirmation corresponding to the dialog voice of the user of the current round, for example: the machine expression for the historical round is: "ask you to have the intention of buying car recently", the robot expression has the semantic of asking about the intention of buying car, and the dialogue voice of the user in the current turn is: under the condition of buying, determining the conversation state of the current turn as 'purchased vehicle' based on the semantics of the intention of inquiring the purchased vehicle and the answer corresponding to the conversation voice of the user; another example is: the machine expression for the historical round is: "what car does you buy? "the expression of the robot has the semantics of inquiring the vehicle type, and the dialogue voice of the user in the current turn is: "XX 320, how much you XX 320? "in the case of the present invention, the dialog state of the current turn is determined to be" car purchased, car price asked, car purchased type: XX320 ″.

Step S204, configuring response information corresponding to the conversation state according to a preset response configuration model, and generating response voice corresponding to the response information, wherein the preset response configuration model at least comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

In the present embodiment, configuring the response information corresponding to the dialog state includes configuring through a Dialog Policy (DPL) model and configuring through a knowledge base question and answer (KBAQ) model, wherein,

the DPL configuration includes a global match, branch match, positive negative match, unhealthy, unmatched, and allopathic flow functions for dealing with most of the dialog content dominated by robots, such as: and configuring response information in a robot guiding type inquiry mode.

The KBQA configuration is to adopt NLP2SQL to carry out database query to generate a corresponding robot reply for the query content which is dominated by the user.

In this embodiment, after the response information is configured, the response information of the robot is generated into response voice by TTS technology and is transmitted back to the user terminal.

Through the steps S201 to S204, receiving the conversation voice of the current turn of the user, and preprocessing the conversation voice to obtain text information; processing the text information through a preset semantic analysis model to obtain intention information; acquiring historical response information, and determining the conversation state of the current turn according to the historical response information and the intention information; the method and the system have the advantages that the response information corresponding to the conversation state is configured according to the preset response configuration model, the response voice corresponding to the response information is generated, the problems that conversation systems in the related art are low in conversation efficiency and poor in conversation effect are solved, the AI robot outbound function of each scene is achieved quickly and effectively, labor cost is reduced, and conversation efficiency and conversation effect are improved.

It should be noted that, in the embodiment of the present application, the NLU configuration and the NLU model are combined, so as to improve the implementation efficiency and effect of the intent recognition and the slot extraction; by combining DST configuration and a DST model, the realization efficiency and effect of state updating are improved; DPL configuration and KBQA are combined, so that the realization efficiency and effect of the speech transfer are improved; under the condition that the dialogue data volume is small, both the NLU model and the DST model adopt a model based on bert pre-training, and under the condition that the dialogue data volume is accumulated to a preset data threshold value, a jointbert model is adopted, so that the problems of intention identification, slot position extraction and state updating are solved, and the accuracy is improved.

In some embodiments, the pre-processing of the conversational speech to obtain the textual information comprises the steps of:

step 1, performing text conversion processing on the dialogue voice through an automatic voice recognition technology to obtain a text to be processed.

In this embodiment, after the current turn of the dialog speech of the user is acquired, the dialog system performs ASR recognition on the spoken speech into a text, where the text recognized by the ASR is a text to be processed.

And 2, inputting the text to be processed into a text error correction model for text error correction to obtain text information, wherein the text error correction model is generated by training according to a first sample text of preset semantic information, a second sample text without text errors and a third sample text with text errors.

In this embodiment, in the ASR recognition process, there may be a recognition error, which may cause a great difference between the generated text and the original semantics of the user, and further require error correction on the text recognized by ASR.

Specifically, for the inquiry of the robot, the original semantic of the user is "buy", but in the ASR identification process, the user can recognize the text as "good", and at this time, the text information obtained through text error correction becomes "buy"; and for the text correction model, where the first sample text corresponds to "good", the second sample text corresponds to "good", and the third sample text corresponds to some erroneous text associated with "good".

Performing text conversion processing on the dialogue voice through an automatic voice recognition technology in the steps to obtain a text to be processed; the text to be processed is input into the text error correction model for text error correction to obtain text information, so that the text information of the user dialogue voice can be accurately obtained.

In some embodiments, the intention information further includes slot position information, and processing the text information through a preset semantic analysis model to obtain the intention information includes the following steps:

step 1, natural language understanding processing is carried out on the text information to obtain candidate intention data, wherein the candidate intention data comprises candidate intentions and candidate slot position information.

In this embodiment, natural language understanding is performed on the text information after text error correction, and corresponding candidate intentions and candidate slot position information are generated.

Step 2, detecting first intention data in the candidate intention data according to a preset intention recognition model, wherein the preset intention recognition model at least comprises one of the following items: the system comprises a regular matching model, a pre-training semantic matching model and an intention slot position joint model.

And 3, under the condition that the first intention data is detected, determining that the intention information comprises the first intention data, wherein the first intention data comprises the intention corresponding to the current turn of the user and the slot position information.

Performing natural language understanding processing on the text information in the steps to obtain candidate intention data; detecting first intention data in the candidate intention data according to a preset intention recognition model; under the condition that the first intention data is detected, the intention information is determined to comprise the first intention data, wherein the first intention data comprises the intention and the slot position information corresponding to the current turn of the user, intention identification and slot position extraction are achieved, and the achievement efficiency and effect of intention identification and slot position extraction are improved.

In some embodiments, determining the dialog state for the current turn based on the historical response information and the intent information includes the steps of:

step 1, inputting historical response information and intention information into a dialogue state tracking model, and obtaining a first characteristic value, wherein the first characteristic value comprises semantic characteristic values related to the historical response information and the intention information.

In this embodiment, the first feature value is determined according to the historical response information and the intention information of the current round, and at the same time, the first feature value includes a plurality of semantic feature values associated with the historical response information and the intention information.

And 2, detecting a preset state characteristic value in the first characteristic value, and determining the corresponding state of the current turn according to the preset state characteristic value.

In this embodiment, the preset state feature value is a target state feature value, and the target state feature value is strongly related to the state of the current round, for example: the target status feature value is "buy, how much money the AA320 has", then it can be determined that the status of the current round includes at least the status of the purchase, the ask price, the product model.

Inputting historical response information and intention information into a dialogue state tracking model in the steps to obtain a first characteristic value; the preset state characteristic value is detected in the first characteristic value, the corresponding state of the current turn is determined according to the preset state characteristic value, the conversation state of the current turn is determined according to the historical response information and the intention information, and the realization efficiency and the effect of state updating are improved through the conversation state tracking model.

In some embodiments, configuring the response information corresponding to the dialog state according to the preset response configuration model includes the following steps:

step 1, extracting first state semantic information of a conversation state, wherein the first state semantic information at least comprises state semantics corresponding to intention information.

In this embodiment, the first state semantic information is information for describing a current turn of dialog state, and the first state semantic information includes intention information and slot position information of the user.

Step 2, inputting the first state semantic information into a preset response configuration model to acquire response information, wherein the preset response configuration model comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

In the embodiment, the configuration of the response information corresponding to the dialog state is to input the semantic information of the first state as a data source, and configure the corresponding response information according to the input data source through a dialog strategy (DPL) model and/or a knowledge base question answering (KBAQ) model.

In this embodiment, a dialog strategy (DPL) model is adopted, the semantic information of the first state is used as a data source, and an action and a response suitable for a next dialog turn, that is, response information corresponding to a current dialog turn state, are selected.

In this embodiment, a knowledge base question answering (KBAQ) model is adopted, the semantic information of the first state is used as a data basis for performing knowledge base query and reasoning out response information, and the database query is adopted to obtain the response information corresponding to the current round of conversation state.

Extracting first state semantic information of the dialog state in the steps; and inputting the semantic information of the first state into a preset response configuration model, acquiring response information, and configuring response information corresponding to the current turn of conversation state.

In some embodiments, the preset response configuration model includes a conversation strategy learning model and a knowledge base question-and-answer model, and configuring the response information corresponding to the conversation state according to the preset response configuration model includes the following steps:

step 1, extracting second state semantic information of the dialog state, wherein the second state semantic information at least comprises state semantics corresponding to the intention information.

In this embodiment, the second state semantic information is information for describing a current turn of dialog state, and the first state semantic information includes intention information and slot position information of the user.

And 2, inputting the second state semantic information into a dialogue strategy learning model, and inquiring robot dialect information corresponding to the second state semantic information, wherein the dialogue strategy learning model is generated by training according to the first preset state semantic information and the robot dialect information corresponding to the first preset state semantic information.

In this embodiment, a dialogue strategy model is adopted, the semantic information of the second state is used as a data source, and the robot tactical information, that is, the response information corresponding to the current round of dialogue state, is selected.

And 3, under the condition that the robot dialect information corresponding to the second state semantic information is not inquired, inputting the second state semantic information into a knowledge base question-answer model, acquiring response text information corresponding to the second state semantic information, and determining that the response information comprises the response text information, wherein the knowledge base question-answer model comprises second preset state semantic information and response text information corresponding to the second preset state semantic information.

In this embodiment, when the dialog strategy learning (DPL) model is not located and configured to the corresponding dialect of the semantic information of the second state, a database query is performed according to a knowledge base question answering (KBAQ) model, and a response information corresponding to the dialog state of the current turn is obtained by using the database query.

Extracting second state semantic information of the dialog state in the steps; inputting the semantic information of the second state into a dialogue strategy learning model, and inquiring robot dialogue information corresponding to the semantic information of the second state; and under the condition that the robot dialect information corresponding to the second state semantic information is not inquired, inputting the second state semantic information into a knowledge base question-answer model, acquiring response text information corresponding to the second state semantic information, determining that the response text information comprises the response text information, realizing configuration of the response information corresponding to the current turn of dialog state, and combining a dialog strategy learning (DPL) model with a knowledge base question-answer (KBAQ) model to improve the realization efficiency and effect of the dialect circulation.

In some embodiments, configuring the response information corresponding to the dialog state according to the preset response configuration model includes the following steps: and determining that the response information includes the robot speech information corresponding to the second state semantic information when the robot speech information corresponding to the second state semantic information is queried.

In some embodiments, generating the response voice corresponding to the response information comprises the following steps: and carrying out voice conversion on the response information to generate the response voice.

Following human-machine conversation process analysis with the conversation of the specific embodiment

Examples of dialogs are as follows:

robotic surgery: "ask you an intention to have a car recently? "

User conversational speech: "good and good". "

Robotic surgery: "what car does you buy? "

User conversational speech: "AA 320, how much money is you AA 320? "

Robotic surgery: the official guide price of "AA 320 is about 38-40 ten thousand. "

The human-machine conversation process is analyzed as follows:

step 1, the robot asks "ask you for an intention to buy a car recently? "the dialogue voice replied by the user is recognized as 'good' and 'good' by ASR. "changed to" buy, buy by text error correction. ".

And 2, purchasing. "NLU recognition is performed to obtain the intention of" purchased car ".

And 3, automatically updating the conversation state of the current turn into 'purchased car' by the DST.

Step 4, according to the branch of the DPL, locating to the purchased car, configuring the response message (bots) automatically replied to? ".

Step 5, TTS will "what car did you buy? "to a voice reply to the user.

Step 6, the dialogue voice replied by the user is recognized by ASR and corrected with text to become' AA320, how much money is in your AA 320? ".

Step 7, for "AA 320, how much money is you AA 320? "NLU recognition is performed, and the intention of" asking for car price "and" vehicle type: the slot of AA 320'.

Step 8, the DST automatically updates the conversation state of the current turn to be' purchased car, inquiry car price and purchased car type: AA320 ".

And 9, firstly, the configuration of the DPL does not locate the configuration to the ' inquiry vehicle price ', then database inquiry is carried out according to the KBQA, and the ' official guide price of the AA320 corresponding to the current turn conversation state is about 38-40 ten thousand. "is used.

Step 10, the TTS will "the official guide price of AA320 is about 38-40 ten thousand. "to a voice reply to the user.

It should be noted that the steps illustrated in the above-described flow diagrams or in the flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order different than here.

The embodiment also provides a man-machine interaction device, which is used for implementing the above embodiments and preferred embodiments, and the description of the device is omitted. As used hereinafter, the terms "module," "unit," "subunit," and the like may implement a combination of software and/or hardware for a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.

Fig. 3 is a block diagram of a human-machine interaction device according to an embodiment of the present application, and as shown in fig. 3, the device includes:

the conversion module 31 is configured to receive a current turn of dialogue speech of a user, and preprocess the dialogue speech to obtain text information, where the preprocessing includes text conversion and text error correction;

the generating module 32 is coupled to the converting module 31 and configured to process the text information through a preset semantic analysis model to obtain intention information, where the intention information at least includes an intention corresponding to the current turn of the user;

the processing module 33 is coupled to the generating module 32, and is configured to acquire historical response information, and determine a dialog state of a current turn according to the historical response information and the intention information, where the historical response information includes response information generated according to a dialog state of a dialog of a previous turn;

the response module 34 is coupled to the processing module 33, and configured to configure response information corresponding to the dialog state according to a preset response configuration model, and generate a response voice corresponding to the response information, where the preset response configuration model at least includes one of: a dialogue strategy learning model and a knowledge base question-and-answer model.

In some embodiments, the conversion module 31 is configured to perform text conversion processing on the dialog speech through an automatic speech recognition technology to obtain a text to be processed; inputting the text to be processed into a text error correction model for text error correction to obtain text information, wherein the text error correction model is generated by training according to a first sample text of preset semantic information, a second sample text without text errors and a third sample text with text errors.

In some embodiments, the intention information further includes slot position information, and the generating module 32 is configured to perform natural language understanding processing on the text information to obtain candidate intention data, where the candidate intention data includes a candidate intention and candidate slot position information; detecting first intention data in the candidate intention data according to a preset intention recognition model, wherein the preset intention recognition model at least comprises one of the following items: the method comprises the following steps of (1) a regular matching model, a pre-training semantic matching model and an intention slot position joint model; in a case where the first intention data is detected, determining that the intention information includes the first intention data, wherein the first intention data includes an intention and slot position information corresponding to a current turn of the user.

In some embodiments, the processing module 33 is configured to input the historical response information and the intention information into the dialog state tracking model, and obtain a first feature value, where the first feature value includes a semantic feature value associated with the historical response information and the intention information; and detecting a preset state characteristic value in the first characteristic value, and determining the corresponding state of the current turn according to the preset state characteristic value.

In some embodiments, the response module 34 is configured to extract first state semantic information of the dialog state, where the first state semantic information includes at least a state semantic corresponding to the intention information; inputting the first state semantic information into a preset response configuration model to acquire response information, wherein the preset response configuration model comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

In some embodiments, the preset response configuration model includes a dialogue strategy learning model and a knowledge base question-and-answer model, and the response module 3 is configured to extract second state semantic information of the dialogue state, where the second state semantic information at least includes state semantics corresponding to the intention information; inputting the second state semantic information into a dialogue strategy learning model, and inquiring robot dialect information corresponding to the second state semantic information, wherein the dialogue strategy learning model is generated by training according to the first preset state semantic information and the robot dialect information corresponding to the first preset state semantic information; and under the condition that the robot dialect information corresponding to the second state semantic information is not inquired, inputting the second state semantic information into a knowledge base question-answer model, acquiring response text information corresponding to the second state semantic information, and determining that the response information comprises the response text information, wherein the knowledge base question-answer model comprises second preset state semantic information and response text information corresponding to the second preset state semantic information.

In some embodiments, the response module 3 is configured to determine that the response information includes the robot speech information corresponding to the second state semantic information when the robot speech information corresponding to the second state semantic information is queried.

In some embodiments, the response module 3 is configured to perform voice conversion on the response message to generate a response voice.

The above modules may be functional modules or program modules, and may be implemented by software or hardware. For a module implemented by hardware, the modules may be located in the same processor; or the modules can be respectively positioned in different processors in any combination.

The present embodiment also provides an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

and S1, receiving the current turn of dialogue voice of the user, and preprocessing the dialogue voice to obtain text information, wherein the preprocessing comprises text conversion and text error correction.

And S2, processing the text information through a preset semantic analysis model to obtain intention information, wherein the intention information at least comprises the intention corresponding to the current turn of the user.

And S3, acquiring historical response information, and determining the dialog state of the current turn according to the historical response information and the intention information, wherein the historical response information comprises response information generated according to the dialog state of the dialog of the previous turn.

S4, configuring response information corresponding to the dialogue state according to a preset response configuration model, and generating response voice corresponding to the response information, wherein the preset response configuration model at least comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

It should be noted that, for specific examples in this embodiment, reference may be made to examples described in the foregoing embodiments and optional implementations, and details of this embodiment are not described herein again.

In addition, in combination with the man-machine interaction method in the foregoing embodiments, the embodiments of the present application may provide a storage medium to implement. The storage medium having stored thereon a computer program; the computer program, when executed by a processor, implements any of the human-machine interaction methods of the above embodiments.

It should be understood by those skilled in the art that various features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the features in the above embodiments are not described, but should be considered as within the scope of the present disclosure as long as there is no contradiction between the combinations of the features.

The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for human-computer interaction, comprising:

receiving conversation voice of a current turn of a user, and preprocessing the conversation voice to obtain text information, wherein the preprocessing comprises text conversion and text error correction;

processing the text information through a preset semantic analysis model to obtain intention information, wherein the intention information at least comprises an intention corresponding to the current turn of the user;

obtaining historical response information, and determining the dialog state of the current turn according to the historical response information and the intention information, wherein the historical response information comprises response information generated according to the dialog state of the dialog of the previous turn;

configuring the response information corresponding to the dialogue state according to a preset response configuration model, and generating response voice corresponding to the response information, wherein the preset response configuration model at least comprises one of the following: a dialogue strategy learning model and a knowledge base question-and-answer model.

2. The human-computer conversation method of claim 1, wherein preprocessing the conversation speech to obtain text information comprises:

3. The human-computer interaction method according to claim 1, wherein the intention information further includes slot position information, and the processing the text information through a preset semantic analysis model to obtain the intention information includes:

4. The human-computer conversation method of claim 1, wherein determining a conversation state of a current turn based on the historical response information and the intention information comprises:

5. The human-computer interaction method of claim 1, wherein configuring the response information corresponding to the interaction state according to a preset response configuration model comprises:

6. The human-computer conversation method according to claim 1, wherein the preset response configuration model comprises a conversation strategy learning model and a knowledge base question-and-answer model, and configuring the response information corresponding to the conversation state according to the preset response configuration model comprises:

7. The human-computer interaction method according to claim 6, wherein in a case where the robot speech information corresponding to the second state semantic information is queried, it is determined that the response information includes the robot speech information corresponding to the second state semantic information.

8. The human-computer conversation method according to claim 1, wherein generating a response voice corresponding to the response information comprises: and carrying out voice conversion on the response information to generate the response voice.

9. A human-computer interaction device, comprising:

10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the human-machine interaction method of any one of claims 1 to 7.

11. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the human-computer interaction method of any one of claims 1 to 10 when executed.