CN111309888A - Man-machine conversation method, device, electronic equipment and storage medium

Info

Publication number: CN111309888A
Application number: CN202010116245.6A
Authority: CN
Inventors: 刘占一; 王海峰; 吴华; 赵亮; 徐新超; 刘智彬; 郭振
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2020-06-19
Anticipated expiration: 2040-02-25
Also published as: CN111309888B

Abstract

The application discloses a man-machine conversation method and device, electronic equipment and a storage medium, relating to the field of artificial intelligence. The method comprises the following steps: during a conversation, when a reply needs to be generated for a user input, if the user input is determined to contain semantic content, N replies are respectively determined, wherein N is a positive integer greater than one and each reply corresponds to a different feedback intention; and the N replies are spliced in a predetermined order, and the splicing result is fed back to the user as the generated reply. Applying this scheme can improve reply quality.

Description

Man-machine conversation method, device, electronic equipment and storage medium

Technical Field

The present application relates to computer application technologies, and in particular, to a human-machine interaction method and apparatus, an electronic device, and a storage medium in the field of artificial intelligence.

Background

In a man-machine conversation, a reply to the user input is mainly generated according to a designated system intention and the semantic content; for example, the semantic content is filled into a corresponding slot script (a reply template with slots), and the slot-filling result is used as the generated reply. However, replies generated in this way are monotonous and stiff, and the reply quality is not high.

Disclosure of Invention

In view of the above, the present application provides a man-machine interaction method, an apparatus, an electronic device and a storage medium.

A human-machine dialog method, comprising:

during a conversation, when a reply needs to be generated for a user input, if the user input is determined to contain semantic content, respectively determining N replies, wherein N is a positive integer greater than one, and each reply corresponds to a different feedback intention;

and splicing the N replies in a predetermined order, and feeding back the splicing result to the user as the generated reply.

According to a preferred embodiment of the present application, the N replies include: a first reply, a second reply and a third reply; wherein the feedback intent corresponding to the first reply comprises: responding to a user intent; the feedback intent corresponding to the second reply comprises: explaining a first content to be expressed next; the feedback intent corresponding to the third reply comprises: expressing the first content;

the splicing the N replies according to a preset sequence comprises the following steps: and splicing the 3 replies according to the sequence of the first reply, the second reply and the third reply.

According to a preferred embodiment of the present application, the N replies further include: a fourth reply; the feedback intent corresponding to the fourth reply comprises: guiding second content to be expressed in a next round;

the splicing the N replies according to a preset sequence comprises the following steps: and splicing the 4 replies according to the sequence of the first reply, the second reply, the third reply and the fourth reply.

According to a preferred embodiment of the present application, determining that the user input contains semantic content includes: performing semantic extraction on the user input; if the extraction result is not null, determining that the user input contains semantic content; otherwise, determining that the user input does not contain semantic content.

According to a preferred embodiment of the present application, determining the first reply includes: determining a system intention corresponding to the user input, acquiring the slot scripts (reply templates with slots) corresponding to the system intention, adding the extracted semantic content into the slot of each slot script to obtain candidate replies, respectively determining a similarity score between each candidate reply and the user input, and taking the candidate reply with the highest score as the first reply.

According to a preferred embodiment of the present application, the method further comprises: if it is determined that the user input does not contain semantic content, determining a system intention corresponding to the user input, taking the high-frequency replies corresponding to the system intention as candidate replies, respectively determining a similarity score between each candidate reply and the user input, and feeding back the candidate reply with the highest score to the user as the generated reply.

According to a preferred embodiment of the present application, determining the second reply includes: performing semantic extraction on the first content, determining a system intention corresponding to the extracted semantic content, acquiring the slot scripts (reply templates with slots) corresponding to the system intention, adding the extracted semantic content into the slot of each slot script to obtain candidate replies, respectively determining a similarity score between each candidate reply and the first content, and taking the candidate reply with the highest score as the second reply.

According to a preferred embodiment of the present application, determining the third reply includes: if the first content is text-type data, converting the first content into a spoken expression form and taking the conversion result as the third reply; and if the first content is structured data, generating the third reply according to the structured data.

According to a preferred embodiment of the present application, determining the fourth reply includes: performing semantic extraction on the second content, determining a system intention corresponding to the extracted semantic content, inputting the extracted semantic content, the second content and the system intention into a generation model obtained by pre-training, and obtaining the output fourth reply.

A human-machine interaction device, comprising: a first reply unit and a second reply unit;

the first reply unit is used for, during a conversation, when a reply needs to be generated for a user input, respectively determining N replies if the user input is determined to contain semantic content, wherein N is a positive integer greater than one, and each reply corresponds to a different feedback intention;

and the second reply unit is used for splicing the N replies according to a preset sequence and feeding back a splicing result as a generated reply to the user.

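According to a preferred embodiment of the present application, the N replies include a first reply, a second reply and a third reply, and the second reply unit splices the 3 replies in the order of the first reply, the second reply and the third reply.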

According to a preferred embodiment of the present application, the N replies further include a fourth reply, and the second reply unit splices the 4 replies in the order of the first reply, the second reply, the third reply and the fourth reply.

According to a preferred embodiment of the present application, the first reply unit performs semantic extraction on the user input, and determines that the user input contains semantic content if the extraction result is not null, or determines that the user input does not contain semantic content if the extraction result is null.

According to a preferred embodiment of the present application, the first reply unit determines a system intention corresponding to the user input, obtains the slot scripts (reply templates with slots) corresponding to the system intention, adds the extracted semantic content to the slot of each slot script to obtain candidate replies, determines a similarity score between each candidate reply and the user input, and takes the candidate reply with the highest score as the first reply.

According to a preferred embodiment of the present application, the first reply unit is further configured to, if it is determined that the user input does not include semantic content, determine a system intent corresponding to the user input, use a high-frequency reply corresponding to the system intent as a candidate reply, respectively determine a similarity score between each candidate reply and the user input, and feed back a candidate reply with a highest score as a generated reply to the user.

According to a preferred embodiment of the present application, the first reply unit performs semantic extraction on the first content, determines a system intention corresponding to the extracted semantic content, obtains the slot scripts (reply templates with slots) corresponding to the system intention, adds the extracted semantic content to the slot of each slot script to obtain candidate replies, determines a similarity score between each candidate reply and the first content, and takes the candidate reply with the highest score as the second reply.

According to a preferred embodiment of the present application, when the first content is text-type data, the first reply unit converts the first content into a spoken expression form and takes the conversion result as the third reply; when the first content is structured data, the first reply unit generates the third reply according to the structured data.

According to a preferred embodiment of the present application, the first reply unit performs semantic extraction on the second content, determines a system intention corresponding to the extracted semantic content, inputs the extracted semantic content, the second content, and the system intention into a generation model obtained by pre-training, and obtains the output fourth reply.

An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.

A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.

One embodiment of the above application has the following advantages or benefits: multiple replies can be determined for the user input, each corresponding to a different feedback intention, and the replies can be spliced in a predetermined order and fed back to the user as the generated reply, so that the generated reply contains multiple feedback intentions of the machine, the feedback intentions are organically combined, the reply content is enriched, the expression is smoother, and the reply quality is improved; different replies can respectively correspond to feedback intentions such as responding to the user intent, explaining the first content to be expressed next, expressing the first content, and guiding the second content to be expressed in the next round, so that the generated reply simultaneously serves functions such as carrying forward, leading in, knowledge expression and stimulating guidance, which further improves the reply quality; each reply can be determined using the extracted semantic content, the system intention and the like, which ensures the accuracy of each reply and further improves the reply quality; other effects of the above alternatives will be described below with reference to specific embodiments.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a flow chart of an embodiment of a human-machine dialog method described herein;

FIG. 2 is a schematic diagram of the relationship between different feedback intents described herein;

FIG. 3 is a schematic diagram of a processing method for user input with semantic content and without semantic content according to the present application;

FIG. 4 is a schematic diagram illustrating a process for generating a second reply according to the present application;

FIG. 5 is a schematic diagram illustrating a process for generating a fourth reply according to the present application;

FIG. 6 is a schematic diagram illustrating a structure of a human-computer interaction device 600 according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In addition, it should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist; e.g., A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.

Fig. 1 is a flowchart of an embodiment of a human-machine conversation method according to the present application. As shown in fig. 1, the following detailed implementation is included.

In 101, in a dialog process, when a reply needs to be generated for a user input, if it is determined that the user input includes semantic content, N replies are respectively determined, where N is a positive integer greater than one, and each reply corresponds to a different feedback intention.

At 102, the N replies are spliced according to a predetermined sequence, and the splicing result is fed back to the user as the generated reply.

The specific value of N can be determined according to actual needs. Preferably, N can take the value 3 or 4.

When the value of N is 3, the 3 replies may include: a first reply, a second reply and a third reply. The feedback intention corresponding to the first reply may include responding to the user intent; the feedback intention corresponding to the second reply may include explaining the first content to be expressed next; and the feedback intention corresponding to the third reply may include expressing the first content. Accordingly, when the 3 replies are spliced in a predetermined order, they may be spliced in the order of the first reply, the second reply, and the third reply.

When the value of N is 4, a fourth reply may further be included, and the feedback intention corresponding to the fourth reply may include guiding the second content to be expressed in the next round. Accordingly, the 4 replies may be spliced in the order of the first reply, the second reply, the third reply, and the fourth reply.
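The splicing step can be illustrated with a minimal Python sketch. It only shows concatenation in a fixed order for N = 3 or N = 4; the key names in `REPLY_ORDER`, the function name `splice_replies` and the space separator are assumptions made for illustration and are not specified in the application.

```python
from typing import Dict, List

# Predetermined splicing order (hypothetical key names, one per feedback intention).
# The fourth entry is optional, which covers both N = 3 and N = 4.
REPLY_ORDER: List[str] = [
    "respond_to_user_intent",   # first reply
    "explain_first_content",    # second reply
    "express_first_content",    # third reply
    "guide_second_content",     # fourth reply (optional)
]


def splice_replies(replies: Dict[str, str], sep: str = " ") -> str:
    """Concatenate the available replies in the predetermined order."""
    ordered = [replies[key] for key in REPLY_ORDER if key in replies]
    if len(ordered) < 2:
        # The method requires N to be a positive integer greater than one.
        raise ValueError("at least two replies are required")
    return sep.join(ordered)
```

Because the order is taken from `REPLY_ORDER` rather than from the dictionary, the replies can be produced in any order (or concurrently) before they are spliced.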

Fig. 2 is a schematic diagram of the relationship between different feedback intents described in the present application. As shown in fig. 2, the feedback intention "responding to the user intent" serves the "carry forward" function (linking back to what the user said), the feedback intention "explaining the first content to be expressed next" serves the "lead in" function (introducing what follows), the feedback intention "expressing the first content" serves the "knowledge expression" function, and the feedback intention "guiding the second content to be expressed in the next round" serves the "stimulating guidance" function. The first reply, the second reply, the third reply and the fourth reply can be spliced in sequence to obtain the final generated reply.

The following describes specific generation methods of the first reply, the second reply, the third reply and the fourth reply, respectively.

1) First reply

As previously described, the first reply serves the "carry forward" function: it responds to the user's intent, so that the user feels the machine understands that intent.

Semantic extraction may first be performed on the acquired user input. If the extraction result is not null, it can be determined that the user input contains semantic content; otherwise, it can be determined that the user input does not contain semantic content.

If it is determined that the user input does not contain semantic content, the system intention corresponding to the user input can be determined, the high-frequency replies corresponding to the system intention are used as candidate replies, a similarity score between each candidate reply and the user input is determined, and the candidate reply with the highest score is fed back to the user as the generated reply.

If it is determined that the user input contains semantic content, the system intention corresponding to the user input can be determined, the slot scripts (reply templates with slots) corresponding to the system intention are obtained, the extracted semantic content is added to the slot of each slot script to obtain candidate replies, a similarity score between each candidate reply and the user input is determined, and the candidate reply with the highest score is used as the first reply.

Fig. 3 is a schematic diagram of how user input without semantic content and user input with semantic content are processed according to the present application.

As shown in fig. 3, assume that the user input is "hi, hello". Semantic extraction is performed on the user input and it is determined that there is no semantic content. For example, an entity recognition model may be used to try to extract important semantic content from "hi, hello"; the entity recognition algorithm may be a Long Short-Term Memory (LSTM) network, an LSTM + Conditional Random Field (CRF) algorithm, or the like. Then the system intention corresponding to "hi, hello" may be determined, for example by classifying "hi, hello" with an intention classification model trained in advance or with predefined intention classification rules; assume the result is "greeting". High-frequency replies corresponding to different system intentions can be collected and sorted in advance from a large amount of dialogue data; accordingly, the high-frequency replies corresponding to the system intention "greeting", say "hello", "hi" and "hello there", can be used as candidate replies. Similarity scores between each of "hello", "hi" and "hello there" and "hi, hello" may then be determined, for example with a scoring model obtained through pre-training, such as a Bag of Words (BOW) model: the candidate reply and the user input are taken as the model input, and the similarity score between the two is output. The candidate reply with the highest score is fed back to the user as the generated reply; if the similarity scores between "hello", "hi" and "hello there" and "hi, hello" are 0.6, 0.55 and 0.65 respectively, the candidate reply "hello there", corresponding to 0.65, is fed back to the user as the generated reply.

As shown in fig. 3, assume that the user input is "hi, do you like watching movies"; semantic extraction yields the semantic content "movies". The system intention corresponding to "hi, do you like watching movies" can then be determined; assume it is "query". Slot scripts corresponding to different system intentions can be collected and sorted in advance from a large amount of dialogue data. For example, the slot scripts corresponding to "query" may include: "I like Tag very much", "I enjoy Tag very much" and "Tag is my favorite", where "Tag" represents a slot. "movies" may be added to the slot of each slot script, giving the candidate replies "I like movies very much", "I enjoy movies very much" and "movies are my favorite". The similarity scores between each of these candidate replies and "hi, do you like watching movies" can then be determined, and the candidate reply with the highest score may be used as the first reply: assuming the similarity scores are 0.6, 0.55 and 0.33 respectively, the candidate reply "I like movies very much", corresponding to 0.6, is used as the first reply.
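The two branches of fig. 3 can be sketched in Python as follows. This is a toy illustration under several assumptions: the keyword-based `extract_semantics` stands in for an entity recognition model, `classify_intent` stands in for an intention classifier, and `bow_cosine` (cosine similarity over word counts) stands in for a pre-trained scoring model. The intent labels, candidate strings and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical tables that would be collected in advance from dialogue data.
HIGH_FREQUENCY_REPLIES: Dict[str, List[str]] = {
    "greeting": ["hello", "hi", "hello there"],
}
SLOT_SCRIPTS: Dict[str, List[str]] = {
    "query": ["I like Tag very much", "I enjoy Tag very much", "Tag is my favorite"],
}


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between word-count vectors (stand-in scoring model)."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def extract_semantics(text: str) -> Optional[str]:
    """Toy extractor: keyword lookup in place of an entity recognition model."""
    for entity in ("movie",):
        if entity in text.lower():
            return entity
    return None  # null extraction result: no semantic content


def classify_intent(semantic: Optional[str]) -> str:
    """Toy intent rule; a trained classifier or rule set would go here."""
    return "query" if semantic is not None else "greeting"


def first_reply(user_input: str) -> str:
    semantic = extract_semantics(user_input)
    intent = classify_intent(semantic)
    if semantic is None:
        # No semantic content: rank the high-frequency replies for the intent.
        candidates = HIGH_FREQUENCY_REPLIES[intent]
    else:
        # Semantic content present: fill the slot of each slot script.
        candidates = [s.replace("Tag", semantic) for s in SLOT_SCRIPTS[intent]]
    # The candidate most similar to the user input is selected.
    return max(candidates, key=lambda c: bow_cosine(c, user_input))
```

The scores produced by this stand-in scorer will differ from the example values in the description; only the control flow (null check, candidate construction, highest-score selection) is meant to match.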

2) Second reply

The second reply serves the "lead in" function: it explains what the machine will express next, making it easier for the user to understand the machine's goal. For convenience of description, the content to be expressed next is referred to as the first content.

How to determine the content to be expressed next (i.e., the first content) is not limited; for example, it may be determined according to a preset rule.

Semantic extraction may be performed on the first content, the system intention corresponding to the extracted semantic content is determined, the slot scripts (reply templates with slots) corresponding to the system intention are obtained, the extracted semantic content is added to the slot of each slot script to obtain candidate replies, a similarity score between each candidate reply and the first content is determined, and the candidate reply with the highest score is used as the second reply.

Fig. 4 is a schematic diagram of a process for generating a second reply according to the present application. As shown in fig. 4, assume that the first content is "the role Wang Baoqiang played true to character in A World Without Thieves has become a classic"; the semantic content "Wang Baoqiang" can be extracted. Then the system intention corresponding to "Wang Baoqiang" can be determined; assume it is "key Tag". Slot scripts corresponding to different system intentions can be collected and sorted in advance from a large amount of dialogue data. For example, the slot scripts corresponding to "key Tag" may include: "Let's chat about Tag, shall we?", "How about we chat about Tag?", "Do you like Tag? Let's chat about Tag" and "Tag has been really popular recently; how about we chat about Tag?", where "Tag" represents a slot. "Wang Baoqiang" may be added to the slot of each slot script to obtain the candidate replies: "Let's chat about Wang Baoqiang, shall we?", "How about we chat about Wang Baoqiang?", "Do you like Wang Baoqiang? Let's chat about Wang Baoqiang" and "Wang Baoqiang has been really popular recently; how about we chat about Wang Baoqiang?". The similarity score between each candidate reply and the first content can then be determined, and the candidate reply with the highest score can be used as the second reply: assuming the similarity scores are 0.45, 0.66, 0.55 and 0.5 respectively, the candidate reply "How about we chat about Wang Baoqiang?", corresponding to 0.66, is used as the generated second reply.
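The selection step of fig. 4 can be reproduced with a short Python sketch that uses the example scores from the description (0.45, 0.66, 0.55, 0.5). The slot script strings below are approximate English renderings used only for illustration, and the variable names are hypothetical; in practice the scores would come from the scoring model rather than being hard-coded.

```python
from typing import List

# Approximate English renderings of the slot scripts; "Tag" marks the slot.
slot_scripts: List[str] = [
    "Let's chat about Tag, shall we?",
    "How about we chat about Tag?",
    "Do you like Tag? Let's chat about Tag",
    "Tag has been really popular recently; how about we chat about Tag?",
]

semantic_content = "Wang Baoqiang"  # extracted from the first content

# Fill the slot of every slot script to obtain the candidate replies.
candidates = [script.replace("Tag", semantic_content) for script in slot_scripts]

# Example similarity scores between each candidate and the first content,
# taken from the description (normally produced by the scoring model).
scores = [0.45, 0.66, 0.55, 0.5]

# Index of the highest score, then the corresponding candidate.
best_index = max(range(len(scores)), key=scores.__getitem__)
second_reply = candidates[best_index]
```

Note that the candidates are scored against the first content, not against the user input as in the first reply.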

3) Third reply

The third reply serves the "knowledge expression" function: it conveys the first content that currently needs to be communicated to the user.

The first content is typically text-type data or structured data. If the first content is text-type data, it can be converted into a spoken expression form, and the conversion result is used as the third reply. If the first content is structured data, the third reply can be generated according to the structured data.

How to convert text-type data into a spoken expression form is known in the prior art. For example, the text-type data may be: "Infernal Affairs stars Andy Lau and Tony Leung and tells the story of two men with confused identities, undercover agents for the police and for the triad respectively, who after a fierce contest are determined to reclaim their own identities; the two undercover agents are played by the two leads respectively". After conversion it may be: "Infernal Affairs is a movie starring Andy Lau and Tony Leung. In it the two play undercover agents for the police and for the triad, and the movie tells the story of these two men with confused identities finding themselves again after a fierce contest."

For structured data, a fluent sentence can be generated from the structured data using a model obtained by training in advance or a preset rule.
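The dispatch between text-type data and structured data can be sketched as below. This is a minimal sketch under stated assumptions: the field names `subject`, `attribute` and `value`, the sentence pattern in `render_structured`, and the pluggable `to_spoken` converter are all hypothetical; the application leaves the conversion technique and the generation rule open.

```python
from typing import Callable, Mapping, Union

FirstContent = Union[str, Mapping[str, str]]


def render_structured(record: Mapping[str, str]) -> str:
    """Preset-rule generation: one fixed sentence pattern (assumed field names)."""
    return f"The {record['attribute']} of {record['subject']} is {record['value']}."


def third_reply(
    first_content: FirstContent,
    to_spoken: Callable[[str], str] = lambda text: text,
) -> str:
    """Return the third reply for either kind of first content.

    `to_spoken` is a placeholder for a written-to-spoken converter; the
    default simply returns the text unchanged.
    """
    if isinstance(first_content, str):
        # Text-type data: convert to a spoken expression form.
        return to_spoken(first_content)
    # Structured data: generate a sentence from the fields.
    return render_structured(first_content)
```

For example, calling `third_reply({"subject": "Film X", "attribute": "director", "value": "Person Y"})` takes the structured branch, while passing a plain string takes the conversion branch.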

4) Fourth reply

The fourth reply serves the "stimulating guidance" function: it leads toward the content to be expressed in the next round, stimulates the user's interest in interacting, and lays the groundwork for that content, thereby better achieving immersive interaction. For convenience of description, the content to be expressed in the next round is referred to as the second content.

How to determine what to express in the next round is not limited, and may be determined according to a preset rule.

Semantic extraction may be performed on the second content, the system intention corresponding to the extracted semantic content is determined, and the extracted semantic content, the second content and the system intention are input into a generation model obtained by pre-training to obtain the output fourth reply.

Fig. 5 is a schematic diagram of a process of generating a fourth reply according to the present application. As shown in fig. 5, assume that the second content is "Zhang Zhaozhong revealed that he is a fan of the director, and that the profound film restores history and can be called both a military teaching film and a documentary"; the semantic content "Zhang Zhaozhong" can be extracted. Then the system intention corresponding to "Zhang Zhaozhong" can be determined; assume it is "Tag-centric guidance". The semantic content "Zhang Zhaozhong", the second content "Zhang Zhaozhong revealed ..." and the system intention may be input into the generation model to obtain the output fourth reply, say "Have you heard of Zhang Zhaozhong?". The generation model may be a knowledge-enhanced semantic representation model, ERNIE (Enhanced Representation through Knowledge Integration), fine-tuned on question-answer pairs extracted from a large amount of dialogue data; the representation capability of ERNIE is used to obtain the final stimulating-guidance output.
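The interface between the fourth-reply step and the generation model can be sketched as follows. The sketch only shows how the three inputs named in the description are bundled and passed on; `GuidanceInput`, `GenerationModel`, `TemplateStandIn` and `fourth_reply` are hypothetical names, and the template-based stand-in is not ERNIE or any trained model, only a placeholder so that the example runs.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class GuidanceInput:
    """The three inputs the description feeds to the generation model."""
    semantic_content: str
    second_content: str
    system_intent: str


class GenerationModel(Protocol):
    """Anything that maps the bundled inputs to a reply string."""
    def generate(self, inputs: GuidanceInput) -> str: ...


class TemplateStandIn:
    """Placeholder for a fine-tuned generation model (assumption, not ERNIE)."""
    def generate(self, inputs: GuidanceInput) -> str:
        return f"Have you heard of {inputs.semantic_content}?"


def fourth_reply(
    model: GenerationModel,
    semantic_content: str,
    second_content: str,
    system_intent: str,
) -> str:
    """Bundle the inputs and ask the model for the guiding reply."""
    return model.generate(
        GuidanceInput(semantic_content, second_content, system_intent)
    )
```

Keeping the model behind a small protocol means the placeholder can later be swapped for a real fine-tuned model without changing the calling code.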

When generating the second reply, the third reply and the fourth reply, when semantic extraction, system intention determination, similarity score determination and the like are required, the specific manner can refer to the relevant description in the first reply.

The generated first reply, second reply, third reply and fourth reply can be spliced in sequence, and the splicing result 'first reply + second reply + third reply + fourth reply' is used as the generated reply and is fed back to the user.

It should be noted that the foregoing method embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.

In summary, with the scheme of the above method embodiment, multiple replies can be determined for the user input, each corresponding to a different feedback intention, and the replies can be spliced in a predetermined order and fed back to the user as the generated reply, so that the generated reply contains multiple feedback intentions of the machine, the feedback intentions are organically combined, the reply content is enriched, the expression is smoother, and the reply quality is improved; in addition, the user can better feel that the machine understands the user and can better understand the reply generated by the machine, so that immersive human-computer interaction is achieved over multiple rounds of conversation; different replies can respectively correspond to feedback intentions such as responding to the user intent, explaining the first content to be expressed next, expressing the first content, and guiding the second content to be expressed in the next round, so that the generated reply simultaneously serves functions such as carrying forward, leading in, knowledge expression and stimulating guidance, which further improves the reply quality; and each reply can be determined using the extracted semantic content, the system intention and the like, which ensures the accuracy of each reply and further improves the reply quality.

The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.

Fig. 6 is a schematic structural diagram of a human-computer interaction device 600 according to an embodiment of the present disclosure. As shown in fig. 6, the device includes: a first reply unit 601 and a second reply unit 602.

A first reply unit 601, configured to, in a dialog process, when a reply needs to be generated for a user input, if it is determined that the user input includes semantic content, determine N replies respectively, where N is a positive integer greater than one, and each reply corresponds to a different feedback intention.

A second reply unit 602, configured to splice the N replies in a predetermined order and feed back the splicing result to the user as the generated reply.

The N replies may include: a first reply, a second reply, and a third reply. The feedback intention corresponding to the first reply may include responding to the user intent; the feedback intention corresponding to the second reply may include explaining the first content to be expressed next; and the feedback intention corresponding to the third reply may include expressing the first content. Accordingly, the second reply unit 602 may splice the 3 replies in the order of the first reply, the second reply, and the third reply.

The N replies may further include: a fourth reply. The feedback intention corresponding to the fourth reply may include guiding the second content to be expressed in the next round. Accordingly, the second reply unit 602 may splice the 4 replies in the order of the first reply, the second reply, the third reply, and the fourth reply.

In addition, the first reply unit 601 may perform semantic extraction on the user input. If the extraction result is not null, it may be determined that the user input contains semantic content; otherwise, it may be determined that the user input does not contain semantic content.
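The null check on the extraction result can be sketched as follows. The extractor here is a toy substring lookup against a hypothetical vocabulary (`ENTITY_VOCAB`); the application does not say how semantic extraction is implemented, so only the final emptiness test reflects the text.

```python
from typing import Dict

# Hypothetical entity vocabulary: surface string -> slot name.
ENTITY_VOCAB: Dict[str, str] = {
    "paris": "city",
    "jazz": "music_genre",
}


def extract_semantics(user_input: str) -> Dict[str, str]:
    """Toy extractor: return {slot: value} for every known entity mentioned."""
    text = user_input.lower()
    return {slot: surface for surface, slot in ENTITY_VOCAB.items() if surface in text}


def contains_semantic_content(user_input: str) -> bool:
    """The input contains semantic content iff the extraction result is non-empty."""
    return bool(extract_semantics(user_input))


print(contains_semantic_content("I love jazz"))
print(contains_semantic_content("Hmm, okay"))
```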

If it is determined that the user input contains semantic content, the first reply unit 601 may determine a system intent corresponding to the user input, obtain a slot template (a reply template with slots) corresponding to the system intent, fill the extracted semantic content into the slots of the slot template to obtain candidate replies, determine a similarity score between each candidate reply and the user input, and use the candidate reply with the highest score as the first reply.
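A minimal sketch of this slot-filling and ranking step is given below. The template table `SLOT_TEMPLATES`, the intent name, and the use of Jaccard word overlap as the similarity score are all assumptions; the application does not name a particular similarity measure, and a production system would likely use a learned scorer.

```python
import re
from typing import Dict, List, Set

# Hypothetical slot templates keyed by system intent.
SLOT_TEMPLATES: Dict[str, List[str]] = {
    "share_preference": [
        "I like {genre} too.",
        "{genre} is a great choice, good taste!",
    ],
}


def word_set(text: str) -> Set[str]:
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets (a stand-in for the unspecified score)."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def first_reply(user_input: str, intent: str, semantics: Dict[str, str]) -> str:
    """Fill each template with the extracted semantics and keep the best match."""
    candidates = [template.format(**semantics) for template in SLOT_TEMPLATES[intent]]
    return max(candidates, key=lambda candidate: similarity(candidate, user_input))


print(first_reply("I like jazz", "share_preference", {"genre": "jazz"}))
```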

If it is determined that the user input does not include semantic content, the first reply unit 601 may determine a system intent corresponding to the user input, use high-frequency replies corresponding to the system intent as candidate replies, respectively determine similarity scores between each candidate reply and the user input, and feed back the candidate reply with the highest score as a generated reply to the user.
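The fallback path for inputs without semantic content can be sketched in the same style. The table `HIGH_FREQUENCY_REPLIES`, the intent name, and the word-overlap score are hypothetical; only the overall flow (take the high-frequency replies for the intent as candidates, score them against the user input, return the best one) follows the text.

```python
import re
from typing import Dict, List, Set

# Hypothetical table of high-frequency replies per system intent.
HIGH_FREQUENCY_REPLIES: Dict[str, List[str]] = {
    "greeting": [
        "Hello, nice to meet you.",
        "Hi there, how are you today?",
    ],
}


def word_set(text: str) -> Set[str]:
    """Lower-cased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def overlap_score(a: str, b: str) -> float:
    """Jaccard word overlap, used here as an assumed similarity score."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def fallback_reply(user_input: str, intent: str) -> str:
    """Return the high-frequency reply that best matches the user input."""
    candidates = HIGH_FREQUENCY_REPLIES[intent]
    return max(candidates, key=lambda candidate: overlap_score(candidate, user_input))


print(fallback_reply("how are you", "greeting"))
```

In this branch the selected candidate is the whole generated reply; no splicing takes place.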

The first reply unit 601 may further perform semantic extraction on the first content, determine a system intent corresponding to the extracted semantic content, obtain a slot template corresponding to the system intent, fill the extracted semantic content into the slots of the slot template to obtain candidate replies, determine a similarity score between each candidate reply and the first content, and use the candidate reply with the highest score as the second reply.

When the first content is text-type data, the first reply unit 601 may further convert the first content into a spoken-language form and use the conversion result as the third reply; when the first content is structured data, it may generate a smoothly worded third reply from the structured data.
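The two branches for the third reply can be sketched as a simple type dispatch. Both converters are toy assumptions: `to_spoken_form` merely strips parenthetical asides and bracketed citation marks, and `verbalize_record` fills a fixed sentence pattern from hypothetical `subject`, `attribute`, and `value` fields. The application does not describe how either conversion is actually carried out.

```python
import re
from typing import Mapping, Union


def to_spoken_form(text: str) -> str:
    """Toy conversion of written text: drop parenthetical asides and
    bracketed citation marks, then collapse whitespace."""
    text = re.sub(r"\s*\([^)]*\)", "", text)
    text = re.sub(r"\s*\[\d+\]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def verbalize_record(record: Mapping[str, str]) -> str:
    """Toy generation from structured data using one fixed sentence pattern."""
    return f"{record['subject']}'s {record['attribute']} is {record['value']}."


def third_reply(first_content: Union[str, Mapping[str, str]]) -> str:
    """Dispatch on the type of the first content: text vs. structured data."""
    if isinstance(first_content, str):
        return to_spoken_form(first_content)
    return verbalize_record(first_content)


print(third_reply("The film (a drama) runs 120 minutes [2]."))
print(third_reply({"subject": "The film", "attribute": "director", "value": "Jane Doe"}))
```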

The first reply unit 601 may further perform semantic extraction on the second content, determine a system intention corresponding to the extracted semantic content, input the extracted semantic content, the second content, and the system intention into a generation model obtained by pre-training, and obtain an output fourth reply.
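The interface to the generation model can be sketched as below. The prompt format with `[INTENT]`, `[SEMANTICS]`, and `[CONTENT]` markers is hypothetical, and `stub_generate` is only a stand-in for the pre-trained generation model, whose architecture and input encoding are not described in the application.

```python
from typing import Callable, Dict


def build_model_input(semantics: Dict[str, str], second_content: str, intent: str) -> str:
    """Serialize the three inputs into one string (the format is an assumption)."""
    sem = "; ".join(f"{key}={value}" for key, value in sorted(semantics.items()))
    return f"[INTENT] {intent} [SEMANTICS] {sem} [CONTENT] {second_content}"


def fourth_reply(
    semantics: Dict[str, str],
    second_content: str,
    intent: str,
    generate: Callable[[str], str],
) -> str:
    """Feed the semantic content, second content, and system intent to the model."""
    return generate(build_model_input(semantics, second_content, intent))


def stub_generate(prompt: str) -> str:
    """Stand-in for a pre-trained generation model; it just echoes the content."""
    topic = prompt.split("[CONTENT] ", 1)[1]
    return f"Would you like to hear about {topic} next?"


print(fourth_reply({"genre": "jazz"}, "famous jazz albums", "guide_topic", stub_generate))
```

Passing the model as a callable keeps the sketch independent of any particular framework; a real model would be wrapped in a function with the same signature.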

For a specific work flow of the apparatus embodiment shown in fig. 6, reference is made to the related description in the foregoing method embodiment, and details are not repeated.

In summary, with the scheme of the above device embodiment, multiple replies can be determined for a user input, each corresponding to a different feedback intent, and the replies can be spliced in a predetermined order and fed back to the user as the generated reply. The generated reply therefore carries several feedback intents of the machine in an organic combination, which enriches the reply content, makes the expression smoother, and improves reply quality. In addition, the user can better sense that the machine understands the user and can better understand the reply generated by the machine, so that immersive human-machine interaction is achieved over multiple rounds of conversation. The different replies can correspond to feedback intents such as responding to the user intent, explaining first content to be expressed next, expressing the first content, and guiding second content to be expressed in the next round, so that the generated reply simultaneously serves functions such as linking what precedes with what follows, expressing knowledge, and prompting and guiding the user, which further improves reply quality. Each reply can also be determined using the extracted semantic content, the system intent, and the like, which helps ensure the accuracy of each reply and further improves reply quality.

According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.

Fig. 7 is a block diagram of an electronic device for the method according to an embodiment of the present application. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not meant to limit the implementations of the present application that are described and/or claimed herein.

As shown in Fig. 7, the electronic device includes one or more processors Y01, a memory Y02, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information for a graphical user interface on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In Fig. 7, one processor Y01 is taken as an example.

Memory Y02 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein.

As a non-transitory computer-readable storage medium, the memory Y02 can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods of the embodiments of the present application. By running the non-transitory software programs, instructions, and modules stored in the memory Y02, the processor Y01 executes the various functional applications and data processing of the server, that is, implements the method in the above method embodiments.

The memory Y02 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Additionally, the memory Y02 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory Y02 may optionally include memory located remotely from processor Y01, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, blockchain networks, local area networks, mobile communication networks, and combinations thereof.

The electronic device may further include: an input device Y03 and an output device Y04. The processor Y01, the memory Y02, the input device Y03, and the output device Y04 may be connected by a bus or other means, and the bus connection is exemplified in fig. 7.

The input device Y03 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device. Examples of the input device include a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, track ball, joystick, or other input device. The output device Y04 may include a display device, an auxiliary lighting device, a tactile feedback device (e.g., a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display, a light emitting diode display, and a plasma display. In some implementations, the display device can be a touch screen.

Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific integrated circuits, computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable logic devices) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube or a liquid crystal display monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, blockchain networks, and the internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
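As one illustration of steps being executed either in parallel or sequentially, the individual replies could be determined concurrently and then collected in the predetermined order. The sketch below uses Python's standard `concurrent.futures.ThreadPoolExecutor`; the function name `determine_replies` and the idea of passing one zero-argument builder per reply are assumptions, not part of the application.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def determine_replies(builders: List[Callable[[], str]], parallel: bool = True) -> List[str]:
    """Run one builder per reply, in parallel or sequentially.

    Results are always returned in the order of `builders`, so the later
    splicing step sees the same sequence either way.
    """
    if not parallel:
        return [build() for build in builders]
    with ThreadPoolExecutor(max_workers=max(1, len(builders))) as pool:
        futures = [pool.submit(build) for build in builders]
        # Collect in submission order, not completion order.
        return [future.result() for future in futures]


builders = [lambda: "A.", lambda: "B.", lambda: "C."]
print(" ".join(determine_replies(builders)))
```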

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for human-computer interaction, comprising:

2. The method of claim 1,

the N replies include: a first reply, a second reply and a third reply; wherein the feedback intent corresponding to the first reply comprises: responding to a user intent; the feedback intent corresponding to the second reply comprises: explaining a first content to be expressed next; the feedback intent corresponding to the third reply comprises: expressing the first content;

3. The method of claim 2,

the N replies further include: a fourth reply; the feedback intent corresponding to the fourth reply comprises: guiding second content to be expressed in a next round;

4. The method of claim 2,

the determining that semantic content is included in the user input comprises: performing semantic extraction on the user input; if the extraction result is not null, determining that the user input contains semantic content; otherwise, determining that the user input does not contain semantic content.

5. The method of claim 4,

determining the first reply comprises: determining a system intent corresponding to the user input, acquiring a slot template corresponding to the system intent, filling the extracted semantic content into a slot in the slot template to obtain candidate replies, determining a similarity score between each candidate reply and the user input, and taking the candidate reply with the highest score as the first reply.

6. The method of claim 4,

the method further comprises: if it is determined that the user input does not contain semantic content, determining a system intent corresponding to the user input, taking high-frequency replies corresponding to the system intent as candidate replies, determining a similarity score between each candidate reply and the user input, and feeding the candidate reply with the highest score back to the user as the generated reply.

7. The method of claim 2,

determining the second reply comprises: performing semantic extraction on the first content, determining a system intent corresponding to the extracted semantic content, acquiring a slot template corresponding to the system intent, filling the extracted semantic content into a slot in the slot template to obtain candidate replies, determining a similarity score between each candidate reply and the first content, and taking the candidate reply with the highest score as the second reply.

8. The method of claim 2,

determining the third reply comprises: if the first content is text-type data, converting the first content into a spoken-language form and taking the conversion result as the third reply; and if the first content is structured data, generating the third reply according to the structured data.

9. The method of claim 3,

determining the fourth reply comprises: performing semantic extraction on the second content, determining a system intent corresponding to the extracted semantic content, inputting the extracted semantic content, the second content, and the system intent into a generation model obtained by pre-training, and obtaining the output fourth reply.

10. A human-computer interaction device, comprising: a first reply unit and a second reply unit;

11. The apparatus of claim 10,

12. The apparatus of claim 11,

13. The apparatus of claim 11,

the first reply unit performs semantic extraction on the user input; if the extraction result is not null, the user input is determined to contain semantic content; otherwise, the user input is determined not to contain semantic content.

14. The apparatus of claim 13,

the first reply unit determines a system intent corresponding to the user input, acquires a slot template corresponding to the system intent, fills the extracted semantic content into a slot in the slot template to obtain candidate replies, determines a similarity score between each candidate reply and the user input, and takes the candidate reply with the highest score as the first reply.

15. The apparatus of claim 13,

the first reply unit is further configured to, if it is determined that the user input does not include semantic content, determine a system intent corresponding to the user input, use a high-frequency reply corresponding to the system intent as a candidate reply, respectively determine a similarity score between each candidate reply and the user input, and feed back a candidate reply with a highest score as a generated reply to the user.

16. The apparatus of claim 11,

the first reply unit performs semantic extraction on the first content, determines a system intent corresponding to the extracted semantic content, acquires a slot template corresponding to the system intent, fills the extracted semantic content into a slot in the slot template to obtain candidate replies, determines a similarity score between each candidate reply and the first content, and takes the candidate reply with the highest score as the second reply.

17. The apparatus of claim 11,

the first reply unit, when the first content is text-type data, converts the first content into a spoken-language form and takes the conversion result as the third reply, and, when the first content is structured data, generates the third reply according to the structured data.

18. The apparatus of claim 12,

the first reply unit performs semantic extraction on the second content, determines a system intent corresponding to the extracted semantic content, inputs the extracted semantic content, the second content, and the system intent into a generation model obtained by pre-training, and obtains the output fourth reply.

19. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.

20. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.