CN113282725A - Dialogue interaction method and device, electronic equipment and storage medium - Google Patents

Dialogue interaction method and device, electronic equipment and storage medium

Info

Publication number
CN113282725A
CN113282725A (application CN202110560009.8A)
Authority
CN
China
Prior art keywords
voice data
target
determining
information
question
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110560009.8A
Other languages
Chinese (zh)
Inventor
冷月 (Leng Yue)
刘磊 (Liu Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110560009.8A priority Critical patent/CN113282725A/en
Publication of CN113282725A publication Critical patent/CN113282725A/en
Pending legal-status Critical Current

Classifications

    • G06F16/332 Query formulation (information retrieval; querying of unstructured textual data)
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06V40/172 Human faces: classification, e.g. identification

Abstract

The disclosure provides a dialogue interaction method and apparatus, an electronic device, and a storage medium, wherein the method includes: acquiring voice data input by a target user; determining, based on the voice data, a target interaction scene matched with the voice data; determining at least one question type based on the voice data and the target interaction scene; and determining and feeding back interactive reply information for the voice data according to the determined question type.

Description

Dialogue interaction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for dialog interaction, an electronic device, and a storage medium.
Background
With the development of artificial intelligence, human-computer dialogue is used more and more widely and has become increasingly common in scenarios such as people's daily life and work. Human-computer dialogue is a technology that enables a machine to understand and use natural language to communicate with people. For example, through human-computer dialogue, a user can query weather information, chat with the machine, or obtain a specific service, such as a movie ticket booking service.
It is therefore particularly important to provide a method of dialogue interaction.
Disclosure of Invention
In view of the above, the present disclosure at least provides a method, an apparatus, an electronic device and a storage medium for dialog interaction.
In a first aspect, the present disclosure provides a method of dialogue interaction, including:
acquiring voice data input by a target user;
determining, based on the voice data, a target interaction scene matched with the voice data;
determining at least one question type based on the voice data and the target interaction scene;
and determining and feeding back interactive reply information for the voice data according to the determined question type.
In the above method, for the received voice data, the interactive reply information determined differs with the question type, which improves the flexibility of dialogue interaction. Meanwhile, a target interaction scene is determined for the voice data, and the interactive reply information for the voice data is determined under that target interaction scene, so that the determined interactive reply information matches the current interaction scene, improving the degree of matching between the interactive reply information and the voice data.
In a possible implementation, the determining at least one question type based on the voice data and the target interaction scene includes:
determining, from a plurality of preset candidate question types, at least one to-be-detected question type matched with the target interaction scene;
and recognizing the voice data based on the at least one to-be-detected question type to determine the at least one question type.
Here, at least one to-be-detected question type matched with the target interaction scene may be determined from the preset candidate question types, and different target interaction scenes may correspond to different to-be-detected question types. For example, the to-be-detected question types corresponding to a question-and-answer scene may include a sensitive word question type, a business question type, a chat question type, and the like, while the to-be-detected question types corresponding to a navigation interaction scene may include a sensitive word question type, a business question type, and the like. The voice data can then be recognized based on the at least one to-be-detected question type to determine the at least one question type, which improves the flexibility of determining the question type.
In a possible implementation, the determining and feeding back the interactive reply information for the voice data according to the determined question type includes:
in a case that the question type includes a question type within the dialogue scope, determining and feeding back the interactive reply information for the voice data from a plurality of acquired pieces of service question-and-answer information;
in a case that the question type includes a question type outside the dialogue scope, determining and feeding back interactive reply information for the voice data that includes at least one of the following: first warning information indicating that the voice data is rejected, second warning information indicating that the dialogue interaction is ended, and a display interface prompting the target user to input message information.
Question types are divided into question types within the dialogue scope and question types outside the dialogue scope, and different interactive reply information is fed back for different question types, improving the flexibility of dialogue interaction. Meanwhile, setting question types outside the dialogue scope ensures that any voice data input by the user has corresponding interactive reply information, avoiding situations in which no interactive reply information can be determined for the voice data, and thus improving the efficiency of dialogue interaction.
In a possible implementation, in a case that the question type within the dialogue scope includes a sensitive word question type, the determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information includes:
determining and feeding back, from the plurality of pieces of service question-and-answer information, interactive reply information for the voice data as third warning information indicating that sensitive words are contained.
In a possible implementation, in a case that the question types within the dialogue scope include a business question type, the determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information includes:
extracting keywords from the voice data and determining target keywords contained in the voice data; and determining and feeding back the interactive reply information for the voice data from the plurality of pieces of service question-and-answer information based on the target keywords and the keywords corresponding to each piece of service question-and-answer information; alternatively,
determining the similarity between the voice data and each acquired piece of service question-and-answer information; and determining and feeding back the interactive reply information for the voice data from the plurality of pieces of service question-and-answer information based on the similarity.
In a possible implementation, in a case that the question type within the dialogue scope includes a chat question type, the determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information includes:
determining the chat type to which the voice data belongs;
and invoking an interactive interface corresponding to the chat type, and determining and feeding back the interactive reply information for the voice data from a database corresponding to the chat type.
In the above manner, multiple question types are set within the dialogue scope, and different question types correspond to different dialogue strategies, which improves the diversity and flexibility of dialogue interaction.
In a possible implementation, in a case that a plurality of question types are determined, the determining and feeding back the interactive reply information for the voice data according to the determined question types includes:
determining, from the plurality of question types, a target question type corresponding to the voice data according to the priority corresponding to each question type;
and determining and feeding back the interactive reply information for the voice data according to the determined target question type.
Considering that a plurality of question types may be determined, a priority can be set for each question type, and the target question type corresponding to the voice data is determined from the plurality of question types according to those priorities; the interactive reply information for the voice data is then determined and fed back according to the determined target question type, improving the efficiency of dialogue interaction.
In a possible implementation, the target interaction scene includes a navigation interaction scene, and the question type includes a business question type;
the determining and feeding back the interactive reply information for the voice data according to the determined question type includes:
extracting destination information from the voice data, and querying a navigation route library for a navigation route map corresponding to the destination information;
and using the retrieved navigation route map as the interactive reply information for the voice data.
When the target interaction scene is determined to be a navigation interaction scene and the question types include a business question type, a navigation route map matched with the extracted destination information can be queried and used as the interactive reply information for the voice data, so that the target user can reach the destination according to the navigation route map, improving the efficiency of dialogue interaction.
In a possible implementation, the target interaction scene includes a person detail consultation scene, and the question type includes a business question type;
the determining and feeding back the interactive reply information for the voice data according to the determined question type includes:
extracting target person information from the voice data, and querying prestored detail information of a plurality of candidate persons for the detail information corresponding to the target person indicated by the target person information;
and using the queried detail information corresponding to the target person as the interactive reply information for the voice data.
Here, when the target interaction scene includes a person detail consultation scene and the question type includes a business question type, the detail information corresponding to the target person may be queried from the prestored detail information of the plurality of candidate persons and used as the interactive reply information corresponding to the voice data, so that the target user can learn the details of the target person, improving the effect of dialogue interaction.
In a possible implementation, the target interaction scene is an affidavit scene, and the question type includes a business question type;
the determining and feeding back the interactive reply information for the voice data according to the determined question type includes:
using stored video and/or images of a virtual character performing an oath-taking action, together with the oath content, as the interactive reply information for the voice data; the oath content includes oath content in text form and/or voice form.
In a possible implementation, before acquiring the voice data input by the target user, the method further includes:
acquiring a face image captured by a target device;
and recognizing the face image to obtain a recognition result.
In a possible implementation, the method further includes:
in a case that the recognition result indicates that the user to whom the face image belongs meets an interaction initiation condition, generating target welcome data for the target user based on identification information and/or position information of the target device and a set welcome phrase;
and controlling the target device to play the target welcome data.
Here, when the recognition result of the face image indicates that the user to whom the face image belongs meets the interaction initiation condition, target welcome data for the target user may be generated based on the identification information and/or position information of the target device and the set welcome phrase, and the target device is controlled to play the target welcome data, which adds interest and diversity to the dialogue interaction.
For the effects of the apparatus, the electronic device, and the like described below, reference is made to the description of the above method, which is not repeated here.
In a second aspect, the present disclosure provides an apparatus for dialogue interaction, including:
an acquisition module, configured to acquire voice data input by a target user;
a first determination module, configured to determine, based on the voice data, a target interaction scene matched with the voice data;
a second determination module, configured to determine at least one question type based on the voice data and the target interaction scene;
and a third determination module, configured to determine and feed back interactive reply information for the voice data according to the determined question type.
In a possible implementation, the second determination module, when determining at least one question type based on the voice data and the target interaction scene, is configured to:
determine, from a plurality of preset candidate question types, at least one to-be-detected question type matched with the target interaction scene;
and recognize the voice data based on the at least one to-be-detected question type to determine the at least one question type.
In a possible implementation, the third determination module, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
in a case that the question type includes a question type within the dialogue scope, determine and feed back the interactive reply information for the voice data from a plurality of acquired pieces of service question-and-answer information;
in a case that the question type includes a question type outside the dialogue scope, determine and feed back interactive reply information for the voice data that includes at least one of the following: first warning information indicating that the voice data is rejected, second warning information indicating that the dialogue interaction is ended, and a display interface prompting the target user to input message information.
In a possible implementation, in a case that the question types within the dialogue scope include a sensitive word question type, the third determination module, when determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information, is configured to:
determine and feed back, from the plurality of pieces of service question-and-answer information, interactive reply information for the voice data as third warning information indicating that sensitive words are contained.
In a possible implementation, in a case that the question types within the dialogue scope include a business question type, the third determination module, when determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information, is configured to:
extract keywords from the voice data and determine target keywords contained in the voice data; and determine and feed back the interactive reply information for the voice data from the plurality of pieces of service question-and-answer information based on the target keywords and the keywords corresponding to each piece of service question-and-answer information; alternatively,
determine the similarity between the voice data and each acquired piece of service question-and-answer information; and determine and feed back the interactive reply information for the voice data from the plurality of pieces of service question-and-answer information based on the similarity.
In a possible implementation, in a case that the question types within the dialogue scope include a chat question type, the third determination module, when determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information, is configured to:
determine the chat type to which the voice data belongs;
and invoke an interactive interface corresponding to the chat type, and determine and feed back the interactive reply information for the voice data from a database corresponding to the chat type.
In a possible implementation, in a case that a plurality of question types are determined, the third determination module, when determining and feeding back the interactive reply information for the voice data according to the determined question types, is configured to:
determine, from the plurality of question types, a target question type corresponding to the voice data according to the priority corresponding to each question type;
and determine and feed back the interactive reply information for the voice data according to the determined target question type.
In a possible implementation, the target interaction scene includes a navigation interaction scene, and the question type includes a business question type;
the third determination module, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
extract destination information from the voice data, and query a navigation route library for a navigation route map corresponding to the destination information;
and use the retrieved navigation route map as the interactive reply information for the voice data.
In a possible implementation, the target interaction scene includes a person detail consultation scene, and the question type includes a business question type;
the third determination module, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
extract target person information from the voice data, and query prestored detail information of a plurality of candidate persons for the detail information corresponding to the target person indicated by the target person information;
and use the queried detail information corresponding to the target person as the interactive reply information for the voice data.
In a possible implementation, the target interaction scene is an affidavit scene, and the question type includes a business question type;
the third determination module, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
use stored video and/or images of a virtual character performing an oath-taking action, together with the oath content, as the interactive reply information for the voice data; the oath content includes oath content in text form and/or voice form.
In a possible implementation, the apparatus further includes an image recognition module configured to, before the voice data input by the target user is acquired:
acquire a face image captured by a target device;
and recognize the face image to obtain a recognition result.
In a possible implementation, the apparatus further includes a control module configured to:
in a case that the recognition result indicates that the user to whom the face image belongs meets an interaction initiation condition, generate target welcome data for the target user based on identification information and/or position information of the target device and a set welcome phrase;
and control the target device to play the target welcome data.
In a third aspect, the present disclosure provides an electronic device, including: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the method of dialogue interaction described in the first aspect or any implementation above.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of dialogue interaction described in the first aspect or any implementation above.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those of ordinary skill in the art can derive further related drawings from them without inventive effort.
Fig. 1 shows a flow chart of a method of dialog interaction provided by an embodiment of the present disclosure;
fig. 2a is a schematic flow chart illustrating generating target welcome data in a method of dialogue interaction provided by an embodiment of the present disclosure;
FIG. 2b is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
FIG. 2c is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
fig. 3 is a flowchart illustrating a specific manner of determining and feeding back an interaction reply message for voice data according to a determined question type in a dialog interaction method provided by an embodiment of the present disclosure;
fig. 4a shows a schematic flowchart of a dialog interaction in a navigation interaction scenario in a method for dialog interaction provided by an embodiment of the present disclosure;
FIG. 4b is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
FIG. 4c is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
FIG. 4d is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
fig. 5 is a flowchart illustrating a specific manner of determining and feeding back an interaction reply message for voice data according to a determined question type in a dialog interaction method provided by an embodiment of the present disclosure;
fig. 6a is a schematic flow chart illustrating a dialog interaction in a character detail consultation scene in a dialog interaction method provided by an embodiment of the present disclosure;
FIG. 6b is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
FIG. 6c is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
fig. 7a is a schematic flow chart illustrating a conversation interaction in an affidavit scene in a method of conversation interaction provided by an embodiment of the present disclosure;
FIG. 7b is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
FIG. 7c is a schematic interface diagram of a target device provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram illustrating an architecture of an apparatus for conversational interaction provided by an embodiment of the disclosure;
fig. 9 shows a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. The components of the embodiments of the present disclosure, as generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of the disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort shall fall within the protection scope of the present disclosure.
With the development of artificial intelligence, human-computer dialogue is used more and more widely and has become increasingly common in scenarios such as people's daily life and work. Human-computer dialogue is a technology that enables a machine to understand and use natural language to communicate with people. For example, through human-computer dialogue, a user can query weather information, chat with the machine, or obtain a specific service, such as a movie ticket booking service. Accordingly, the embodiments of the present disclosure provide a method of dialogue interaction.
The drawbacks described above were identified by the inventors through practice and careful study; therefore, the discovery of these problems and the solutions to them proposed below should be regarded as the inventors' contribution to the present disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
For the understanding of the embodiments of the present disclosure, the method of dialogue interaction disclosed in the embodiments is first described in detail. The method is generally executed by a computer device with certain computing capability, which includes, for example, a terminal device or a server or other processing device; the terminal device may be User Equipment (UE), a mobile device, a user terminal, a Personal Digital Assistant (PDA), a computing device, or the like. In some possible implementations, the method of dialogue interaction may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a method for dialog interaction provided by an embodiment of the present disclosure is shown, where the method includes S101-S104, where:
s101, acquiring voice data input by a target user;
s102, determining a target interaction scene matched with the voice data based on the voice data;
S103, determining at least one question type based on the voice data and the target interaction scene;
and S104, determining and feeding back interactive reply information for the voice data according to the determined question type.
In the above method, for the received voice data, the interactive reply information determined differs with the question type, which improves the flexibility of dialogue interaction. Meanwhile, a target interaction scene is determined for the voice data, and the interactive reply information for the voice data is determined under that target interaction scene, so that the determined interactive reply information matches the current interaction scene, improving the degree of matching between the interactive reply information and the voice data.
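For orientation only, the following is a minimal, self-contained sketch of how steps S101-S104 could be wired together; every name in it (match_scene, detect_question_types, and the cue and keyword lists) is a hypothetical placeholder, and a real system would use ASR and trained models rather than substring checks.

```python
# Hypothetical sketch of the S101-S104 pipeline; not the disclosure's implementation.
SCENE_CUES = {  # placeholder cues standing in for intention recognition (S102)
    "navigation": ("where is", "how do i get to"),
    "person_detail": ("who is", "details of"),
    "affidavit": ("oath", "swear"),
}

def match_scene(text: str) -> str:
    """S102: match the utterance to a target interaction scene."""
    for scene, cues in SCENE_CUES.items():
        if any(cue in text for cue in cues):
            return scene
    return "question_answer"  # fallback scene, as described later

def detect_question_types(text: str) -> list[str]:
    """S103: determine at least one question type (placeholder word lists)."""
    types = []
    if any(w in text for w in ("badword",)):      # sensitive word question type
        types.append("sensitive_word")
    if any(w in text for w in ("where", "who", "details", "oath")):  # business
        types.append("business")
    return types or ["out_of_scope"]

def handle_utterance(text: str) -> str:
    """S101 (already-transcribed input) through S104 (reply feedback)."""
    scene = match_scene(text.lower())
    q_types = detect_question_types(text.lower())
    if "sensitive_word" in q_types:
        return "Warning: sensitive words detected."
    if q_types == ["out_of_scope"]:
        return "Sorry, I cannot answer that. Please leave a message."
    return f"[{scene}] business reply for: {text}"

print(handle_utterance("Where is the service hall?"))
```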
S101 to S104 will be specifically described below.
For S101:
In implementation, a target device may be deployed in a real-world setting, so that the target device can capture voice data uttered by the target user and treat it as the voice data input by the target user on the target device. The target device can be any intelligent device capable of capturing voice data.
Before acquiring the voice data input by the target user, the method may further include: acquiring a face image captured by the target device; and recognizing the face image to obtain a recognition result.
Illustratively, a camera can be mounted on the target device to capture face images in real time, and a trained first neural network is used to recognize each face image to obtain a recognition result. For example, the recognition result may include, but is not limited to: a first recognition result indicating whether the face image is complete; a second recognition result indicating whether the face image is a frontal face image; a third recognition result indicating whether a face image has been captured; a fourth recognition result indicating whether the currently captured face image is the same as the previously captured face image; and the like. Meanwhile, when the face image is a complete frontal image, the captured face image can be stored, so that each newly captured face image can be compared with at least one stored face image.
In an optional implementation, the method further includes: in a case that the recognition result indicates that the user to whom the face image belongs meets an interaction initiation condition, generating target welcome data for the target user based on identification information and/or position information of the target device and a set welcome phrase; and controlling the target device to play the target welcome data.
Illustratively, the interaction initiation condition may include one or more of the following: the face image is a complete face image; the user to whom the face image belongs is interacting with the target device for the first time within a target time period; the currently captured face image differs from the previously captured face image; and the like. A condition check along these lines is sketched below.
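As a non-authoritative illustration, the check below gates interaction start on a recognition result; the field names are assumptions modeled on the recognition results and conditions listed above, not an interface defined by the disclosure.

```python
# Hypothetical interaction-initiation check; field names are illustrative only.
from dataclasses import dataclass

@dataclass
class RecognitionResult:
    complete: bool           # is the face image complete?
    first_in_period: bool    # first interaction within the target time period?
    same_as_previous: bool   # same as the previously captured face image?

def meets_initiation_condition(r: RecognitionResult) -> bool:
    # All three example conditions from the text; a deployment might use any subset.
    return r.complete and r.first_in_period and not r.same_as_previous

print(meets_initiation_condition(RecognitionResult(True, True, False)))  # True
```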
When the recognition result indicates that the user to whom the face image belongs meets the interaction initiation condition, target welcome data for the target user can be generated from the identification information of the target device and the set welcome phrase. For example, the generated target welcome data may be "welcome to use virtual digital person No. 1", where "virtual digital person No. 1" is the identification information of the target device and "welcome to use" is the set welcome phrase.
Alternatively, the target welcome data may be generated from the position information of the target device and the set welcome phrase. For example, the generated target welcome data may be "welcome to the service hall", where "service hall" is the position information of the target device and "welcome to" is the set welcome phrase.
Still alternatively, the target welcome data may be generated from the position information, the identification information, and the set welcome phrase of the target device; for example, the generated target welcome data may be "virtual digital person No. 1 welcomes you to the service hall". The welcome phrase can be set as needed.
The target welcome data may be multimedia data; for example, it may be voice data, text data, image data, or video data. After the target welcome data is generated, the target device can be controlled to play it.
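A minimal sketch of assembling the target welcome data from the device's identification and/or position information plus a configured welcome phrase; the phrase templates are assumptions chosen to mirror the examples above, not wording fixed by the disclosure.

```python
# Hypothetical welcome-data builder; the phrase templates are illustrative.
def build_welcome(device_id: str | None = None, location: str | None = None) -> str:
    if device_id and location:      # identification + position information
        return f"{device_id} welcomes you to the {location}"
    if location:                    # position information only
        return f"Welcome to the {location}"
    if device_id:                   # identification information only
        return f"Welcome to use {device_id}"
    return "Welcome"

print(build_welcome(device_id="virtual digital person No. 1"))
print(build_welcome(location="service hall"))
print(build_welcome("virtual digital person No. 1", "service hall"))
```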
For example, referring to fig. 2a, which shows a schematic flow of generating target welcome data in the dialogue interaction method: after a user enters the real-world site, the target device captures the user's face image in real time and invokes a face recognition module to recognize the captured face image and obtain a recognition result. When the recognition result indicates that the user to whom the face image belongs meets the interaction initiation condition, a welcome-phrase invocation instruction is issued, and the identification information and/or position information of the target device is passed to a Natural Language Processing (NLP) module. The NLP module generates and returns the target welcome data from the received identification information and/or position information, and may also return capability introduction data corresponding to the target device; a Text To Speech (TTS) module on the target device then plays the target welcome data. After this flow ends, an interface waiting for user input can be entered. Referring to fig. 2b, which shows an interface of the target device, fig. 2b includes the displayed target welcome data and capability introduction data; for example, the target welcome data is a welcome greeting, and the capability introduction data is "1. How do I get to the bathroom? 2. What fun places are nearby?".
The face image can also be sent to a background system for storage. Meanwhile, the target device starts timing for the target user corresponding to the face image, and when the duration of the dialogue interaction between the target user and the target device reaches a set duration threshold, the dialogue interaction between the target device and the target user is ended.
When the recognition result indicates that the user to whom the face image belongs does not meet the interaction initiation condition, the initial interface of the target device can be displayed, and the virtual digital person can continue to invoke the face recognition module at regular intervals to detect whether a face image is present in the field of view of the target device. See fig. 2c for a schematic of the initial interface of the target device.
Here, when the recognition result of the face image indicates that the user to whom the face image belongs meets the interaction initiation condition, target welcome data for the target user may be generated based on the identification information and/or position information of the target device and the set welcome phrase, and the target device is controlled to play the target welcome data, which adds interest and diversity to the dialogue interaction.
For S102:
In implementation, intention recognition can be performed on the voice data to determine the target interaction scene matched with it. Alternatively, a speech-to-text technology such as Automatic Speech Recognition (ASR) may be used to determine the text content corresponding to the voice data, and intention recognition is then performed on that text content to determine the target interaction scene matched with the voice data.
For example, the target interaction scene may include: a question-and-answer scene, a navigation interaction scene, a person detail consultation scene, an affidavit scene, and the like. The scene types included among the target interaction scenes can be set according to the actual situation.
Intention recognition is performed on the voice data: if the voice data includes information such as "place xx" or "go to place xx", the target interaction scene matched with the voice data can be determined to be the navigation interaction scene. If the voice data includes information such as "person xx" or "details of person xx", the target interaction scene matched with the voice data can be determined to be the person detail consultation scene. If the voice data includes information such as an oath or oath words, the target interaction scene matched with the voice data can be determined to be the affidavit scene. If the voice data includes chat information such as "tell a joke", "tell a story", or "hello", the target interaction scene matched with the voice data can be determined to be the question-and-answer scene. Alternatively, the voice data may be assigned to the question-and-answer scene whenever it does not belong to the navigation interaction scene, the person detail consultation scene, or the affidavit scene.
Illustratively, a trained intention-recognition neural network may be used to recognize the target interaction scene corresponding to the voice data. For example, for each target interaction scene, a plurality of pieces of sample voice data (or text data) commonly used in that scene may be stored; the intention-recognition neural network determines the content similarity between the voice data (or its corresponding text content) and the sample voice data of each target interaction scene, and the target interaction scene matched with the voice data is determined according to the resulting similarities. Alternatively, the voice data can be converted into a vector, the cosine similarity between the vector corresponding to the user's voice data and the sample vectors corresponding to the sample voice data of each target interaction scene is determined, and the target interaction scene matched with the voice data is determined according to the resulting cosine similarities.
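The following sketch illustrates the sample-based variant just described, with a simple word-overlap score standing in for a trained intention-recognition network or cosine similarity over learned vectors; the scenes, samples, and threshold are placeholder assumptions.

```python
# Hypothetical scene matching by content similarity to stored sample utterances.
SCENE_SAMPLES = {
    "navigation": ["how do i get to the service hall", "where is the bathroom"],
    "person_detail": ["tell me the details of person xx", "who is person xx"],
}

def similarity(a: str, b: str) -> float:
    """Jaccard word overlap as a crude stand-in for content similarity."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def match_scene(text: str, threshold: float = 0.2) -> str:
    text = text.lower()
    scored = [(max(similarity(text, s) for s in samples), scene)
              for scene, samples in SCENE_SAMPLES.items()]
    best_score, best_scene = max(scored)
    # fall back to the question-and-answer scene when nothing matches well
    return best_scene if best_score >= threshold else "question_answer"

print(match_scene("where is the service hall"))  # -> navigation
```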
For S103:
in an alternative embodiment, determining at least one question type based on the voice data and the target interaction scenario may include:
S1031, determining, from a plurality of preset candidate question types, at least one to-be-detected question type matched with the target interaction scene;
S1032, recognizing the voice data based on the at least one to-be-detected question type to determine the at least one question type.
The preset candidate question types can include question types within the dialogue scope and question types outside the dialogue scope; the question types within the dialogue scope may include: a sensitive word question type, a business question type, a chat question type, and the like. The sensitive word question type is used to detect whether the voice data contains sensitive words; the business question type is used to detect whether the voice data is business consultation data; the chat question type is used to detect whether the voice data belongs to chat data of a preset chat type.
The to-be-detected question types matched with each target interaction scene can be set as needed. For example, if the target interaction scene is the question-and-answer scene, the at least one to-be-detected question type matched with it may include: the sensitive word question type, business question type, and chat question type within the dialogue scope, and the question type outside the dialogue scope. If the target interaction scene is the navigation interaction scene, the at least one to-be-detected question type matched with it may include: the sensitive word question type and business question type within the dialogue scope. If the target interaction scene is the affidavit scene, the at least one to-be-detected question type matched with it may include: the sensitive word question type.
Then, the voice data is recognized based on the at least one to-be-detected question type to determine the at least one question type, as sketched below. For example, suppose the at least one to-be-detected question type includes the sensitive word question type and the business question type. Whether the voice data contains sensitive words can be detected; if so, the at least one question type corresponding to the voice data includes the sensitive word question type, and if not, it does not. Likewise, whether the voice data is business consultation data can be detected, for example by checking whether the voice data includes a business keyword; if so, the voice data is determined to be business consultation data, i.e., the at least one question type corresponding to the voice data includes the business question type; if not, the voice data is determined not to be business consultation data, i.e., the at least one question type corresponding to the voice data does not include the business question type.
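A compact sketch of the two-stage determination just described: restrict to the to-be-detected question types configured for the scene, then test the utterance against each. The scene table and word lists are illustrative assumptions, not values from the disclosure.

```python
# Hypothetical S1031/S1032: scene-specific candidate types, then detection.
TYPES_PER_SCENE = {
    "question_answer": ["sensitive_word", "business", "chat", "out_of_scope"],
    "navigation": ["sensitive_word", "business"],
    "affidavit": ["sensitive_word"],
}
SENSITIVE_WORDS = {"badword"}                      # placeholder lexicon
BUSINESS_KEYWORDS = {"service", "route", "hours"}  # placeholder keywords

def detect_question_types(text: str, scene: str) -> list[str]:
    candidates = TYPES_PER_SCENE.get(scene, ["out_of_scope"])  # S1031
    words = set(text.lower().split())
    found = [t for t, hit in (
        ("sensitive_word", bool(words & SENSITIVE_WORDS)),
        ("business", bool(words & BUSINESS_KEYWORDS)),
    ) if t in candidates and hit]                              # S1032
    return found or ["out_of_scope"]

print(detect_question_types("what are your service hours", "navigation"))
```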
Here, at least one to-be-detected question type matched with the target interaction scene may be determined from the preset candidate question types, and different target interaction scenes may correspond to different to-be-detected question types. For example, the to-be-detected question types corresponding to the question-and-answer scene may include a sensitive word question type, a business question type, a chat question type, and the like, while the to-be-detected question types corresponding to the navigation interaction scene may include a sensitive word question type, a business question type, and the like. The voice data can then be recognized based on the at least one to-be-detected question type to determine the at least one question type, which improves the flexibility of determining the question type.
For S104:
the interactive reply information can be one or more of character information, image information, voice information and video information. In implementation, after the interactive reply information of the voice data is determined, the target device can be controlled to display or play the interactive reply information. For example, when the interactive reply message includes a message in a voice format, the interactive reply message may be controlled to be played by a voice playing device of the target device; or, when the interactive reply information includes information in an image format, the display screen of the target device may be controlled to display the interactive reply information.
In an optional implementation, determining and feeding back the interactive reply information for the voice data according to the determined question type may include:
S1041, in a case that the question type includes a question type within the dialogue scope, determining and feeding back the interactive reply information for the voice data from a plurality of acquired pieces of service question-and-answer information;
S1042, in a case that the question type includes a question type outside the dialogue scope, determining and feeding back interactive reply information for the voice data that includes at least one of the following: first warning information indicating that the voice data is rejected, second warning information indicating that the dialogue interaction is ended, and a display interface prompting the target user to input message information.
Question types are divided into question types within the dialogue scope and question types outside the dialogue scope, and different interactive reply information is fed back for different question types, improving the flexibility of dialogue interaction. Meanwhile, setting question types outside the dialogue scope ensures that any voice data input by the user has corresponding interactive reply information, avoiding situations in which no interactive reply information can be determined for the voice data, and thus improving the efficiency of dialogue interaction.
When it is determined that the question type corresponding to the voice data includes a question type within the dialogue scope, the interactive reply information for the voice data can be determined and fed back from the plurality of acquired pieces of service question-and-answer information. The service question-and-answer information may be prestored, or may be acquired from a dialogue server in real time. Each piece of service question-and-answer information may include set question information and the answer information corresponding to it, and/or common reply information for a particular mode.
When it is determined that the question type corresponding to the voice data includes a question type outside the dialogue scope, the interactive reply information determined and fed back for the voice data may include, but is not limited to: first warning information indicating that the voice data is rejected, such as "speech not recognized, please re-enter"; second warning information indicating that the dialogue interaction is ended, such as "speech cannot be recognized, this round of dialogue is over"; and a display interface prompting the target user to input message information, in which the target user can leave a message so that staff can later view or reply to it.
In an optional implementation, in a case that the question type within the dialogue scope includes the sensitive word question type, determining and feeding back the interactive reply information for the voice data from the plurality of acquired pieces of service question-and-answer information may include: determining and feeding back, from the plurality of pieces of service question-and-answer information, interactive reply information for the voice data as third warning information indicating that sensitive words are contained.
When the question type within the dialogue scope includes the sensitive word question type, that is, when the voice data is detected to contain sensitive words, the interactive reply information corresponding to the voice data is the third warning information indicating that the voice data contains sensitive words, so that the target device can display or play it. For example, the third warning information may be "warning: sensitive words present", or the like. The set of sensitive words can be configured according to the actual scene.
In an optional implementation, in a case that the question types within the dialogue scope include the business question type, the interactive reply information for the voice data may be determined and fed back from the plurality of acquired pieces of service question-and-answer information in either of the following two manners:
First, extract keywords from the voice data and determine the target keywords contained in it; then determine and feed back the interactive reply information for the voice data from the plurality of pieces of service question-and-answer information based on the target keywords and the keywords corresponding to each piece of service question-and-answer information.
Second, determine the similarity between the voice data and each acquired piece of service question-and-answer information; then determine and feed back the interactive reply information for the voice data from the plurality of pieces of service question-and-answer information based on the similarities.
In the first manner, a keyword extraction algorithm may be used in advance to determine the keywords corresponding to each piece of service question-and-answer information, and the keywords are stored in association with the corresponding service question information. The keyword extraction algorithm is then applied to the voice data, or to the text content corresponding to it, to extract the one or more target keywords it contains. The piece of service question-and-answer information matched with the voice data is determined based on the target keywords and the keywords corresponding to each piece of service question-and-answer information, and the answer information indicated by that piece is used as the interactive reply information corresponding to the voice data.
For example, suppose the voice data contains target keyword A and target keyword B, the first piece of service question-and-answer information contains keyword A, the second contains keyword C, and the third contains keywords A and B; then the answer information in the third piece can be used as the interactive reply information corresponding to the voice data. If several pieces of service question-and-answer information match the voice data, one of them can be selected at random, or the piece with the highest access count can be selected according to the access count of each piece. The interactive reply information for the voice data is then determined and fed back based on the selected piece of service question-and-answer information.
In the second manner, for example, a set encoding scheme or a vector space model may be used in advance to determine a first feature vector corresponding to each piece of service question-and-answer information, which is stored in association with that piece. A second feature vector corresponding to the voice data, or to its text content, is determined in the same way; then the similarity (for example, cosine similarity) between the second feature vector and the first feature vector of each piece of service question-and-answer information is determined, and the answer information in the piece with the highest similarity is used as the interactive reply information corresponding to the voice data. A sketch of both manners follows.
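The sketch below illustrates both manners under stated assumptions: the QA records, keywords, and access counts are placeholders, and bag-of-words counts stand in for the first and second feature vectors.

```python
# Hypothetical sketches of both manners; QA records and keywords are placeholders.
from collections import Counter
import math

QA = [
    {"q": "how do i get to the service hall", "keywords": {"service", "hall"},
     "answer": "Take the corridor on the left.", "hits": 3},
    {"q": "what are your opening hours", "keywords": {"opening", "hours"},
     "answer": "We open at 9 am.", "hits": 7},
]
for rec in QA:  # "first feature vectors", computed once and stored with each record
    rec["vec"] = Counter(rec["q"].split())

def reply_by_keywords(text: str) -> str:
    """Manner one: most keyword overlap wins; ties broken by access count."""
    words = set(text.lower().split())
    best = max(QA, key=lambda r: (len(r["keywords"] & words), r["hits"]))
    return best["answer"]

def reply_by_similarity(text: str) -> str:
    """Manner two: cosine similarity between the second and first feature vectors."""
    q = Counter(text.lower().split())
    def cos(a, b):
        dot = sum(a[t] * b[t] for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0
    return max(QA, key=lambda r: cos(q, r["vec"]))["answer"]

print(reply_by_keywords("service hall please"))        # -> corridor answer
print(reply_by_similarity("what are the opening hours"))  # -> hours answer
```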
In an alternative embodiment, in the case that the question types within the scope of the conversation include a chatting question type, determining and feeding back the interactive reply information for the voice data from the obtained multiple service question-answer information includes: determining the chatting type of the voice data; and calling an interactive interface corresponding to the chatting type, and determining and feeding back interactive reply information aiming at the voice data from a database corresponding to the chatting type.
The chatting type can be set according to actual needs, for example, the chatting type can include telling a story, checking weather, talking a joke, checking a calendar, virtual digital person-related chatting, and the like. Each chatting type corresponds to one database, and chatting data related to the chatting type can be stored in the database, for example, a plurality of stories can be stored in the database corresponding to the chatting type for telling stories; a plurality of jokes can be stored in a database corresponding to the chatting type of the jokes; the database corresponding to the chatting type of the virtual digital person-related chatting may store a plurality of question and answer information and the like related to the virtual digital person.
For example, if the voice data is "virtual digital person is at a dry chant", "virtual digital person is cheerful", or the like, it is determined that the chatting type to which the voice data belongs is virtual digital person-related chatting; if the voice data is 'I want to listen to a story', 'a virtual digital person speaks a story bar' and the like, determining that the chatting type of the voice data is story telling and the like.
After the chatting type to which the voice data belongs is determined, the interactive interface corresponding to that chatting type can be called, and the interactive reply information for the voice data is determined and fed back from the corresponding database. For example, if the chatting type is story telling, the story-telling interactive interface can be called, story content is selected from the story-telling database, and that story content is used as the interactive reply information corresponding to the voice data.
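The per-type dispatch could look like the following sketch, in which the chatting type names, the toy classifier, and the database contents are all illustrative assumptions:

```python
# Hypothetical sketch: dispatch a classified chatting type to its own
# interactive interface and database.
import random

CHAT_DATABASES = {
    "tell_story": ["Once upon a time ...", "Long ago ..."],
    "tell_joke":  ["Why did the robot cross the road? ..."],
    "about_virtual_person": ["I am a virtual digital person on duty here."],
}

def classify_chat_type(text: str) -> str:
    # Stand-in classifier; in practice this could be intent recognition.
    if "story" in text:
        return "tell_story"
    if "joke" in text:
        return "tell_joke"
    return "about_virtual_person"

def chat_reply(text: str) -> str:
    chat_type = classify_chat_type(text)
    # "Calling the interactive interface" is modeled here as selecting a
    # reply from the database bound to the chatting type.
    return random.choice(CHAT_DATABASES[chat_type])
```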
In the above manner, multiple question types are set within the dialogue scope, and different question types correspond to different dialogue strategies, which improves the diversity and flexibility of the dialogue interaction.
In an alternative embodiment, in a case that multiple question types are determined, determining and feeding back the interactive reply information for the voice data according to the determined question types may include: determining a target question type corresponding to the voice data from the multiple question types according to the priority corresponding to each question type; and determining and feeding back the interactive reply information for the voice data according to the determined target question type.
The priority corresponding to each question type may be predetermined; for example, question types within the dialogue scope may have a higher priority than question types outside the dialogue scope, and among the question types within the dialogue scope, the sensitive word question type may correspond to a first priority, the service question type to a second priority, and the chatting question type to a third priority, where the sensitive word question type has a higher priority than the service question type, and the service question type has a higher priority than the chatting question type.
When multiple question types are determined, the target question type corresponding to the voice data can be determined from them according to the priority corresponding to each question type. For example, if the question types corresponding to the voice data include the sensitive word question type, the service question type, and the chatting question type, the sensitive word question type has the highest priority and is therefore taken as the target question type, and the interactive reply information for the voice data is determined and fed back according to it, for example as third warning information indicating that the voice data contains a sensitive word.
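As a sketch, this priority-based selection reduces to a minimum over a priority table; the numeric values and type names below are illustrative assumptions mirroring the example above:

```python
# Hypothetical sketch: when several question types match, pick the one
# with the highest priority (lower number = higher priority).
PRIORITY = {
    "sensitive_word": 0,
    "service": 1,
    "chatting": 2,
}

def target_question_type(detected_types: list[str]) -> str:
    return min(detected_types, key=PRIORITY.__getitem__)

# e.g. target_question_type(["chatting", "sensitive_word", "service"])
# -> "sensitive_word", so the third warning information is fed back.
```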
Considering that multiple question types may be determined, a priority can be set for each question type; the target question type corresponding to the voice data is determined from the multiple question types according to these priorities, and the interactive reply information for the voice data is determined and fed back according to the target question type, which improves the efficiency of the dialogue interaction.
In an alternative embodiment, referring to fig. 3, when the target interaction scenario includes a navigation interaction scenario and the question type includes a service question type, determining and feeding back the interactive reply information for the voice data according to the determined question type includes:
S301, extracting destination information from the voice data, and querying a navigation route map corresponding to the destination information from a navigation route library;
and S302, taking the queried navigation route map as the interactive reply information for the voice data.
The navigation route library may store a navigation route map from a start position to each end position; for example, when the end positions include a first, second, third, and fourth end position, the library stores a navigation route map from the start position to each of them. The start position may be the position of the corresponding target device in the real-world place.
Meanwhile, for each navigation route map, corresponding navigation data can be determined, for example navigation voice or navigation text, and stored in association with the navigation route map. The navigation data may be description data of the navigation route map; for example, it may be "go straight north for 50 meters to reach destination A", or "go straight south for 50 meters, turn right, then go straight for 10 meters to reach destination B", and the like.
When the target interaction scenario includes a navigation interaction scenario and the question type includes a service question type, destination information may be extracted from the voice data, and the navigation route map corresponding to the destination information may be queried from the navigation route library; that is, the navigation route map from the start position corresponding to the target device to the destination is queried. The queried navigation route map is used as the interactive reply information corresponding to the voice data. And/or the navigation data corresponding to the navigation route map may be obtained, and the queried navigation data used as the interactive reply information corresponding to the voice data.
Alternatively, after the destination information is extracted from the voice data, a navigation route from the start position corresponding to the target device to the destination may be generated according to the start position and the destination information, and used as the interactive reply information corresponding to the voice data.
In an alternative embodiment, the navigation route map corresponding to the destination information may be queried from the navigation route library as follows: first query whether the destination information exists in a pre-stored address library; if it does, query the navigation route library for the navigation route map corresponding to the destination information.
If the destination information does not exist in the address library, first indication information indicating that the destination does not exist may be used as the interactive reply information for the voice data; alternatively, second indication information indicating that the destination should be re-input may be used as the interactive reply information for the voice data.
In implementation, if no destination information can be extracted from the voice data, first query data for asking about the destination can be obtained and used as the interactive reply information for the voice data; for example, the first query data may be "which destination would you like to go to".
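The destination-slot flow described in the last few paragraphs (extraction, address-library check, route lookup, and the two fallback prompts) could be sketched as follows; the regular-expression slot extraction and all data are illustrative assumptions:

```python
# Hypothetical sketch of the navigation flow: extract a destination slot,
# check it against the address library, then look up the pre-stored route
# map from this device's start position.
import re

NAV_ROUTE_LIBRARY = {
    # (start_position, destination) -> navigation route map / text
    ("hall_entrance", "toilet"): "Turn right and go straight to the end.",
    ("hall_entrance", "front_desk"): "Go straight ahead for 20 meters.",
}
ADDRESS_LIBRARY = {"toilet", "front_desk"}

def extract_destination(text: str) -> str | None:
    # Stand-in slot extraction; a real system would use NLU slot filling.
    m = re.search(r"to the (\w+)", text)
    return m.group(1) if m else None

def navigation_reply(text: str, device_start: str) -> str:
    dest = extract_destination(text)
    if dest is None:
        return "Which destination would you like to go to?"   # first query data
    if dest not in ADDRESS_LIBRARY:
        return "That destination does not exist. Please enter it again."
    return NAV_ROUTE_LIBRARY.get((device_start, dest),
                                 "No route map stored for this destination.")
```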
When the target interaction scenario includes a navigation interaction scenario, the question types may further include question types other than the service question type; for example, when the question types include the sensitive word question type, the interactive reply information corresponding to the voice data may be determined from the plurality of service question-answer information as third warning information indicating that a sensitive word is contained.
For example, fig. 4a shows a flow diagram of dialogue interaction in the navigation interaction scenario. When the target user has not input voice data, the target device may present the initial interface, i.e., the interface shown in fig. 2c. After the target user inputs voice data, the system recognizes whether the voice data carries a navigation intention. If a navigation intention is recognized, a destination slot is extracted from the voice data, and it is judged whether the extraction succeeds, that is, whether destination information is obtained and whether the destination exists in the address library. If the destination slot is extracted successfully, the position information and/or identification information of the target device can be retrieved, the navigation route map corresponding to the destination information is queried from the navigation route library according to that information, and the route map is displayed on the interface of the target device; the NLP module can retrieve the navigation text corresponding to the route map, and the TTS module can broadcast that text. Finally, after navigation ends, the target device returns to the initial interface to wait for user input.
After the target user inputs the voice data, the interface of the target device is shown in fig. 4b, where the input voice data is "how to get to the toilet". Fig. 4c shows the interactive reply information (i.e., the navigation text) corresponding to the voice data in fig. 4b, for example "turn right and go straight to the end to reach the toilet; you can also refer to the diagram". Alternatively, as shown in fig. 4d, the navigation route map and its corresponding navigation text are displayed on the interface of the target device.
If no destination information is extracted, or the destination does not exist in the address library, it is determined that the destination slot has not been extracted, and the NLP module may return indication information indicating that the destination does not exist; for example, the indication information may be "the destination does not exist, please enter it again".
When the target interaction scenario is determined to be the navigation interaction scenario and the question types include the service question type, a navigation route map matching the extracted destination information can be queried and used as the interactive reply information for the voice data, so that the target user can reach the destination according to the navigation route map, which improves the efficiency of the dialogue interaction.
In an alternative embodiment, referring to fig. 5, when the target interaction scenario includes a person detail consultation scenario and the question type includes a service question type, determining and feeding back the interactive reply information for the voice data according to the determined question type may include:
S501, extracting target person information from the voice data, and querying the detail information corresponding to the target person indicated by the target person information from the pre-stored detail information of a plurality of candidate persons;
and S502, taking the queried detail information corresponding to the target person as the interactive reply information for the voice data.
The candidate persons can be set as needed; for example, a candidate person may be a famous poet, ci poet, writer, scientist, or the like, or a worker in the real-world place, such as a teacher in a school or a doctor in a hospital. After the target person information is extracted from the voice data, the detail information corresponding to the target person indicated by it can be queried from the pre-stored detail information of the plurality of candidate persons. For example, when the target person is "Li Bai", the biography, poetry introduction, and the like corresponding to Li Bai can be queried. The queried detail information corresponding to the target person is used as the interactive reply information for the voice data.
When no detail information corresponding to the target person can be found among the pre-stored detail information of the plurality of candidate persons, preset third indication information indicating that the person does not exist can be used as the interactive reply information corresponding to the voice data; alternatively, the detail information of each of the multiple candidate persons may be used as the interactive reply information, so that, in response to a trigger operation for any candidate person, the target device can be controlled to display that candidate person's detail information and/or to play the corresponding voice data.
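A sketch of this person-detail lookup with both fallbacks (showing all candidate cards when no person slot is extracted, and a "does not exist" prompt when the person is unknown); the store layout and entries are illustrative assumptions:

```python
# Hypothetical sketch: look up a person's details from a pre-stored store.
PERSON_DETAILS = {
    "Li Bai": "Tang-dynasty poet; biography and selected poems ...",
    "Du Fu": "Tang-dynasty poet; biography and selected poems ...",
}

def person_reply(target_person: str | None) -> str | list[tuple[str, str]]:
    if target_person is None:
        # No person slot extracted: return cards for all candidates so the
        # user can pick one by voice or touch.
        return list(PERSON_DETAILS.items())
    details = PERSON_DETAILS.get(target_person)
    if details is None:
        return "Sorry, that person does not exist."  # third indication info
    return details
```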
For example, fig. 6a shows a flow diagram of dialogue interaction in the person detail consultation scenario. When the target user has not input voice data, the target device may present the initial interface, i.e., the interface shown in fig. 2c. After the target user inputs voice data, the system recognizes whether the voice data carries an intention to ask about a person. If it does, it is judged whether a person slot exists in the voice data; if not, person cards corresponding to the plurality of pre-stored candidate persons can be displayed, as shown in fig. 6b, which includes a person card for a first person, a person card for a second person, and a person card for a third person. After the target user triggers a display operation for any candidate person by voice input or a touch-screen operation, the detail information of that candidate person can be displayed; the interface is shown in fig. 6c.
If a person slot exists in the voice data, the target person information is extracted from it, and it is judged whether the extracted target person is stored in the person library. If so, the detail information of the target person is obtained from the person library and displayed, and the TTS module broadcasts it. If the extracted target person does not exist in the person library, prompt information indicating that the target person does not exist can be obtained and returned.
Here, when the target interaction scenario includes a person detail consultation scenario and the question type includes a service question type, the detail information corresponding to the target person can be queried from the pre-stored detail information of the plurality of candidate persons and used as the interactive reply information corresponding to the voice data, so that the target user learns the details of the target person, which improves the effect of the dialogue interaction.
In an alternative embodiment, when the target interaction scenario is an affidavit scenario and the question type includes a service question type, determining and feeding back the interactive reply information for the voice data according to the determined question type may include: taking the stored videos and/or images of the sworn action executed by the virtual character, together with the sworn content, as the interactive reply information for the voice data; the sworn content includes sworn content in a text form and/or a voice form.
In the affidavit scenario, question type detection can be performed on the voice data; if the question type includes a service question type, the stored videos and/or images of the virtual character's sworn action and the sworn content can be determined as the interactive reply information corresponding to the voice data, where the sworn content may be in a text form, a voice form, or the like, and can be set as needed.
In implementation, the target device may be controlled to display the videos and/or images of the sworn action executed by the virtual character, and/or to play the sworn content. The target device may be controlled to play the whole sworn content at once, so that the target user can read the oath along with it; or the target device may be controlled to play one sentence of the sworn content at a time, playing the next sentence only after detecting that the target user has finished reading the current one, until all the sworn content has been played.
In a specific implementation, the camera on the target device can also capture images of the target user's body in real time; when the body images indicate that the target user's body action is consistent with the sworn action, the target device can be controlled to play the sworn content, and when it is detected that the target user's body action is inconsistent with the sworn action for longer than a set duration threshold, the oath is determined to be interrupted.
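The pose-gated playback could be sketched as below; the pose representation, the comparison, and the 5-second threshold are illustrative assumptions, since the description only specifies a set duration threshold:

```python
# Hypothetical sketch: gate oath playback on the user's body pose and
# treat a sustained mismatch as an interrupted oath.
import time

MISMATCH_LIMIT_S = 5.0  # assumed duration threshold

def monitor_oath(get_user_pose, oath_pose, play_next_sentence, sentences):
    mismatch_since = None
    for sentence in sentences:
        # Wait until the user's pose matches the sworn action before playing.
        while get_user_pose() != oath_pose:
            if mismatch_since is None:
                mismatch_since = time.monotonic()
            elif time.monotonic() - mismatch_since > MISMATCH_LIMIT_S:
                return "oath interrupted"
            time.sleep(0.1)
        mismatch_since = None
        play_next_sentence(sentence)
    return "oath completed"
```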
While the target device is playing the sworn content, if the target user inputs updated voice data, no interactive reply information may be determined for the updated voice data; alternatively, service question type detection may be performed on the updated voice data, and the interactive reply information is determined only when the updated voice data belongs to the service question type; or it may be judged whether the updated voice data is related to the sworn content, and the corresponding interactive reply information is determined only if it is.
In the affidavit scenario, if the question type includes the sensitive word question type, third warning information indicating that a sensitive word is present can be used as the interactive reply information corresponding to the voice data; if the question type includes the chatting question type, the interactive reply information corresponding to the voice data can be determined according to the dialogue strategy corresponding to the chatting question type.
For example, fig. 7a shows a flow diagram of dialogue interaction in the affidavit scenario. After the target user triggers the affidavit flow, the target device can be controlled to play the sworn content and to display the sworn action, entering the affidavit flow; see the interface diagram shown in fig. 7b. During the affidavit flow, the virtual digital person of the target device may enter a locked state: it recognizes and replies to voice data containing target keywords related to the oath, and rejects voice data not containing such keywords without returning interactive reply information; an interface diagram of the affidavit flow is shown in fig. 7c. Further, the target user may end the affidavit flow with a specific ending word, for example "end the oath". After the oath ends, the virtual digital person exits the affidavit locked state and resumes normal voice interaction with the target user, and the interface of the target device may present the initial interface, i.e., the interface shown in fig. 2c.
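The locked state during the affidavit flow could be sketched as a simple keyword filter; the keyword and ending-word sets are illustrative assumptions:

```python
# Hypothetical sketch of the locked state: only voice data containing
# oath-related target keywords gets a reply; a specific ending word
# exits the locked state.
OATH_KEYWORDS = {"oath", "swear", "pledge"}
ENDING_WORDS = {"end the oath"}

class VirtualPerson:
    def __init__(self):
        self.locked = False  # True while the affidavit flow is running

    def handle(self, text: str) -> str | None:
        if self.locked:
            if text in ENDING_WORDS:
                self.locked = False          # exit the affidavit locked state
                return "Oath ended. Resuming normal interaction."
            if OATH_KEYWORDS & set(text.split()):
                return "Oath-related reply ..."
            return None                      # reject unrelated voice data
        return "Normal interactive reply ..."
```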
For example, timing may also be started when the target user first inputs voice data; when the duration of the dialogue interaction between the target user and the target device reaches a set target duration (for example, 3 minutes), prompt information indicating the end of the dialogue interaction is obtained and fed back, and the dialogue interaction between the target user and the target device is ended. This reduces useless dialogue between the target user and the target device, and improves both the utilization of the target device and the efficiency of the dialogue interaction.
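A sketch of this session timer, assuming the 3-minute example value and a monotonic clock; the class structure is an illustrative assumption:

```python
# Hypothetical sketch: end the session once the dialogue has lasted a set
# target duration.
import time

TARGET_DURATION_S = 3 * 60

class Session:
    def __init__(self):
        self.started_at = None

    def on_voice_data(self) -> str | None:
        if self.started_at is None:
            self.started_at = time.monotonic()  # first input starts the timer
        if time.monotonic() - self.started_at >= TARGET_DURATION_S:
            return "This dialogue interaction has ended."  # prompt and end
        return None  # continue the normal reply flow
```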
It will be understood by those skilled in the art that, in the above method, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation; the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same concept, an embodiment of the present disclosure further provides a dialogue interaction apparatus. Fig. 8 shows an architecture diagram of the apparatus provided by the embodiment of the present disclosure, which includes an obtaining module 801, a first determining module 802, a second determining module 803, and a third determining module 804. Specifically:
an obtaining module 801, configured to obtain voice data input by a target user;
a first determining module 802, configured to determine, based on the voice data, a target interaction scenario matched with the voice data;
a second determining module 803, configured to determine at least one question type based on the voice data and the target interaction scenario;
a third determining module 804, configured to determine and feed back the interactive reply information for the voice data according to the determined question type.
In a possible implementation, the second determining module 803, when determining at least one question type based on the voice data and the target interaction scenario, is configured to:
determining at least one to-be-detected question type matched with the target interaction scenario from a plurality of preset candidate question types;
and recognizing the voice data based on the at least one to-be-detected question type, and determining the at least one question type.
In a possible implementation manner, the third determining module 804, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
determining and feeding back interactive reply information for the voice data from the obtained multiple pieces of service question-answer information in the case that the question type includes a question type within the dialogue scope;
in the case that the question type includes a question type outside the dialogue scope, the interactive reply information determined and fed back for the voice data includes at least one of the following: first warning information indicating that the voice data is rejected, second warning information indicating that the dialogue interaction is ended, and a display interface for the target user to input message information.
In a possible implementation, in the case that the question types within the dialogue scope include a sensitive word question type, the third determining module 804, when determining and feeding back the interactive reply information for the voice data from the obtained multiple pieces of service question-answer information, is configured to:
determining and feeding back, from the plurality of service question-answer information, the interactive reply information for the voice data as third warning information indicating that a sensitive word is contained.
In a possible implementation, in the case that the question types within the dialogue scope include a service question type, the third determining module 804, when determining and feeding back the interactive reply information for the voice data from the obtained plurality of service question-answer information, is configured to:
extracting keywords from the voice data, and determining target keywords contained in the voice data; determining and feeding back interactive reply information for the voice data from the plurality of service question-answer information based on the target keywords and the keywords corresponding to each service question-answer information; or,
determining the similarity between the voice data and each piece of obtained service question-answer information; and determining and feeding back interactive reply information for the voice data from the plurality of service question-answer information based on the similarity.
In a possible implementation, in the case that the question types within the dialogue scope include a chatting question type, the third determining module 804, when determining and feeding back the interactive reply information for the voice data from the obtained multiple pieces of service question-answer information, is configured to:
determining a chatting type to which the voice data belongs;
and calling an interactive interface corresponding to the chatting type, and determining and feeding back interactive reply information for the voice data from a database corresponding to the chatting type.
In a possible implementation, in the case that the question types include a plurality of types, the third determining module 804, when determining and feeding back the interactive reply information for the voice data according to the determined question types, is configured to:
determining a target question type corresponding to the voice data from the plurality of question types according to the priority corresponding to each question type;
and determining and feeding back interactive reply information for the voice data according to the determined target question type.
In one possible implementation, the target interaction scenario includes a navigation interaction scenario, and the question type includes a service question type;
the third determining module 804, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
extracting destination information from the voice data, and inquiring a navigation route map corresponding to the destination information from a navigation route library;
and taking the queried navigation route map as interactive reply information for the voice data.
In one possible implementation, the target interaction scenario includes a person detail consultation scenario, and the question type includes a service question type;
the third determining module 804, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
extracting target person information from the voice data, and querying the detail information corresponding to the target person indicated by the target person information from the pre-stored detail information of a plurality of candidate persons;
and taking the queried detail information corresponding to the target person as interactive reply information for the voice data.
In one possible implementation, the target interaction scenario is an affidavit scenario, and the question type includes a service question type;
the third determining module 804, when determining and feeding back the interactive reply information for the voice data according to the determined question type, is configured to:
taking the stored videos and/or images of the sworn action executed by the virtual character and the sworn content as the interactive reply information for the voice data; the sworn content includes sworn content in a text form and/or a voice form.
In a possible implementation, the apparatus further includes an image recognition module 805 configured to, before the voice data input by the target user is acquired:
acquiring a face image captured by the target device;
and recognizing the face image to obtain a recognition result.
In a possible embodiment, the apparatus further comprises: a control module 806 configured to:
generating, in the case that the recognition result indicates that the user to whom the face image belongs meets an interaction starting condition, target welcome data for the target user based on the identification information and/or position information of the target device and the set welcome words;
and controlling the target equipment to play the target welcome data.
In some embodiments, the functions of the apparatus provided in the embodiments of the present disclosure, or the modules it contains, may be used to execute the methods described in the above method embodiments; for specific implementation, reference may be made to the description of those embodiments, which is not repeated here for brevity.
Based on the same technical concept, an embodiment of the present disclosure further provides an electronic device. Referring to fig. 9, a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure includes a processor 901, a memory 902, and a bus 903. The memory 902 is used for storing execution instructions and includes an internal memory 9021 and an external memory 9022; the internal memory 9021 temporarily stores operation data in the processor 901 and data exchanged with the external memory 9022, such as a hard disk, and the processor 901 exchanges data with the external memory 9022 through the internal memory 9021. When the electronic device 900 runs, the processor 901 communicates with the memory 902 through the bus 903, causing the processor 901 to execute the following instructions:
acquiring voice data input by a target user;
determining a target interaction scene matched with the voice data based on the voice data;
determining at least one question type based on the voice data and the target interaction scenario;
and determining and feeding back interactive reply information for the voice data according to the determined question type.
Furthermore, the embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the dialogue interaction method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product carrying program code; the instructions included in the program code may be used to execute the steps of the dialogue interaction method described in the above method embodiments, to which reference may be made for details not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a software development kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above are only specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure, and shall be covered by the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (15)

1. A dialogue interaction method, comprising:
acquiring voice data input by a target user;
determining a target interaction scene matched with the voice data based on the voice data;
determining at least one question type based on the voice data and the target interaction scenario;
and determining and feeding back interactive reply information for the voice data according to the determined question type.
2. The method of claim 1, wherein determining at least one question type based on the voice data and the target interaction scenario comprises:
determining at least one to-be-detected question type matched with the target interaction scenario from a plurality of preset candidate question types;
and recognizing the voice data based on the at least one to-be-detected question type, and determining the at least one question type.
3. The method according to claim 1 or 2, wherein the determining and feeding back interactive reply information for the voice data according to the determined question type comprises:
determining and feeding back interactive reply information for the voice data from the obtained multiple pieces of service question-answer information in the case that the question type comprises a question type within the dialogue scope;
in the case that the question type comprises a question type outside the dialogue scope, the interactive reply information determined and fed back for the voice data comprises at least one of the following: first warning information indicating that the voice data is rejected, second warning information indicating that the dialogue interaction is ended, and a display interface for the target user to input message information.
4. The method according to claim 3, wherein, in a case that the question types within the dialogue scope comprise a sensitive word question type, the determining and feeding back the interactive reply information for the voice data from the obtained multiple pieces of service question-answer information comprises:
determining, from the plurality of service question-answer information, and feeding back the interactive reply information for the voice data as third warning information indicating that a sensitive word is contained.
5. The method according to claim 3, wherein, in a case that the question types within the dialogue scope comprise a service question type, the determining and feeding back the interactive reply information for the voice data from the obtained plurality of service question-answer information comprises:
extracting keywords from the voice data, and determining target keywords contained in the voice data; determining and feeding back interactive reply information for the voice data from the plurality of service question-answer information based on the target keywords and the keywords corresponding to each service question-answer information; or,
determining the similarity between the voice data and each piece of obtained service question-answer information; and determining and feeding back interactive reply information for the voice data from the plurality of service question-answer information based on the similarity.
6. The method of claim 3, wherein, in a case that the question types within the dialogue scope comprise a chatting question type, the determining and feeding back the interactive reply information for the voice data from the obtained multiple pieces of service question-answer information comprises:
determining a chatting type to which the voice data belongs;
and calling an interactive interface corresponding to the chatting type, and determining and feeding back interactive reply information for the voice data from a database corresponding to the chatting type.
7. The method according to any one of claims 1 to 6, wherein, in a case that the question types comprise a plurality of types, the determining and feeding back the interactive reply information for the voice data according to the determined question types comprises:
determining a target question type corresponding to the voice data from the plurality of question types according to the priority corresponding to each question type;
and determining and feeding back interactive reply information for the voice data according to the determined target question type.
8. The method according to any one of claims 1 to 7, wherein the target interaction scenario comprises a navigation interaction scenario, and the question type comprises a service question type;
the determining and feeding back interactive reply information for the voice data according to the determined question type comprises:
extracting destination information from the voice data, and querying a navigation route map corresponding to the destination information from a navigation route library;
and taking the queried navigation route map as interactive reply information for the voice data.
9. The method according to any one of claims 1 to 7, wherein the target interaction scenario comprises a person detail consultation scenario, and the question type comprises a service question type;
the determining and feeding back interactive reply information for the voice data according to the determined question type comprises:
extracting target person information from the voice data, and querying the detail information corresponding to the target person indicated by the target person information from the pre-stored detail information of a plurality of candidate persons;
and taking the queried detail information corresponding to the target person as interactive reply information for the voice data.
10. The method according to any one of claims 1 to 7, wherein the target interaction scenario is an affidavit scenario, and the question type comprises a service question type;
the determining and feeding back interactive reply information for the voice data according to the determined question type comprises:
taking the stored videos and/or images of the sworn action executed by the virtual character and the sworn content as the interactive reply information for the voice data; the sworn content comprises sworn content in a text form and/or a voice form.
11. The method according to any one of claims 1 to 10, further comprising, before acquiring the voice data input by the target user:
acquiring a face image captured by the target device;
and recognizing the face image to obtain a recognition result.
12. The method of claim 11, further comprising:
generating, in a case that the recognition result indicates that the user to whom the face image belongs meets an interaction starting condition, target welcome data for the target user based on the identification information and/or position information of the target device and the set welcome words;
and controlling the target equipment to play the target welcome data.
13. A dialogue interaction apparatus, comprising:
the acquisition module is used for acquiring voice data input by a target user;
the first determining module is used for determining a target interaction scene matched with the voice data based on the voice data;
a second determination module for determining at least one question type based on the voice data and the target interaction scenario;
and the third determining module is used for determining and feeding back interactive reply information for the voice data according to the determined question type.
14. An electronic device, comprising: a processor, a memory, and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device runs, the machine-readable instructions, when executed by the processor, performing the steps of the dialogue interaction method of any one of claims 1 to 12.
15. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the dialogue interaction method of any one of claims 1 to 12.
CN202110560009.8A 2021-05-21 2021-05-21 Dialogue interaction method and device, electronic equipment and storage medium Pending CN113282725A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110560009.8A CN113282725A (en) 2021-05-21 2021-05-21 Dialogue interaction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113282725A true CN113282725A (en) 2021-08-20

Family

ID=77280740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110560009.8A Pending CN113282725A (en) 2021-05-21 2021-05-21 Dialogue interaction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113282725A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083412A (en) * 2022-08-11 2022-09-20 科大讯飞股份有限公司 Voice interaction method and related device, electronic equipment and storage medium
CN116737883A (en) * 2023-08-15 2023-09-12 科大讯飞股份有限公司 Man-machine interaction method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104598445A (en) * 2013-11-01 2015-05-06 腾讯科技(深圳)有限公司 Automatic question-answering system and method
CN107247726A (en) * 2017-04-28 2017-10-13 北京神州泰岳软件股份有限公司 Suitable for the implementation method and device of the intelligent robot of multi-service scene
CN109033223A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 For method, apparatus, equipment and computer readable storage medium across type session
CN110096191A (en) * 2019-04-24 2019-08-06 北京百度网讯科技有限公司 A kind of interactive method, device and electronic equipment
WO2020080976A1 (en) * 2018-10-15 2020-04-23 Игорь Александрович КАЛИНИН Automated voice question-answering system
CN111368043A (en) * 2020-02-19 2020-07-03 中国平安人寿保险股份有限公司 Event question-answering method, device, equipment and storage medium based on artificial intelligence
CN111639168A (en) * 2020-05-21 2020-09-08 北京百度网讯科技有限公司 Multi-turn conversation processing method and device, electronic equipment and storage medium
CN111813912A (en) * 2020-06-29 2020-10-23 北京百度网讯科技有限公司 Man-machine conversation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20210820)