CN110415688B - Information interaction method and robot - Google Patents


Publication number: CN110415688B
Authority: CN (China)
Prior art keywords: voice information, image information, information, response content, voice
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201810386235.7A
Other languages: Chinese (zh)
Other versions: CN110415688A (en)
Inventors: 苏辉 (Su Hui), 杜安强 (Du Anqiang), 栾国良 (Luan Guoliang), 金升阳 (Jin Shengyang), 蒋海青 (Jiang Haiqing)
Current Assignee: Hangzhou Ezviz Software Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Hangzhou Ezviz Software Co Ltd
Application filed by Hangzhou Ezviz Software Co Ltd
Priority to CN201810386235.7A
Publication of CN110415688A
Application granted
Publication of CN110415688B
Legal status: Active; anticipated expiration


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2015/225 Feedback of the input speech

Abstract

The embodiments of the present application provide an information interaction method and a robot. The method comprises the following steps: acquiring voice information to be responded to; judging whether response content for the voice information can be determined; if the response content for the voice information cannot be determined, sending a take-over request carrying the voice information to an associated terminal device; receiving response content for the voice information sent by the terminal device; and responding to the voice information according to the response content. The take-over request is used to instruct the terminal device to acquire response content for the voice information. By applying the scheme provided by the embodiments of the present application, the rate at which user questions are successfully answered can be improved, improving the user experience.

Description

Information interaction method and robot
Technical Field
The present application relates to the field of information interaction technologies, and in particular, to an information interaction method and a robot.
Background
With the development of smart device technology, smart devices are used ever more widely. A smart device may interact with the user by voice to answer the user's questions. For example, the smart device may be a robot.
When the smart device receives voice input by the user, it can recognize the voice to obtain a voice recognition result and determine response content according to that result. For example, the voice recognition result may be matched against each question in a pre-stored template library, and the answer corresponding to the matched question may be used as the response content. However, users' questions are many and varied; when the question asked by the user is not in the template library, the smart device cannot respond. The rate at which the smart device can successfully respond to users' questions is therefore not high enough, and the user experience suffers.
Disclosure of Invention
The embodiments of the present application aim to provide an information interaction method and a robot, so as to improve the rate at which user questions are answered and improve the user experience.
In order to achieve the above object, an embodiment of the present application provides an information interaction method, including:
acquiring voice information to be responded to;
judging whether response content for the voice information can be determined;
if not, sending a take-over request carrying the voice information to an associated terminal device, wherein the take-over request is used to instruct the terminal device to acquire response content for the voice information;
receiving response content for the voice information sent by the terminal device;
and responding to the voice information according to the response content.
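The steps above can be sketched as follows. This is a minimal illustration only; the template-library lookup and the callable standing in for the terminal device are assumptions for the sketch, not the patent's implementation.

```python
# Hypothetical sketch of steps S101-S105; `template_library` and the
# `send_takeover_request` callable are illustrative names, not the patent's API.

def respond(voice_info, template_library, send_takeover_request):
    """Answer from the local template library if possible; otherwise send a
    take-over request carrying the voice information to the terminal device
    and use the response content it returns."""
    local_answer = template_library.get(voice_info)
    if local_answer is not None:
        return local_answer  # response content determined locally
    # Response content cannot be determined: hand over to the terminal device.
    return send_takeover_request({"voice_info": voice_info})

library = {"how is the weather today": "sunny"}
terminal = lambda request: "a blue whale"  # stand-in for the associated terminal

print(respond("how is the weather today", library, terminal))     # sunny
print(respond("which animal is the largest", library, terminal))  # a blue whale
```

The take-over path only runs when the local lookup fails, matching the "if not" branch of the claim.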
Optionally, when the voice information is used to instruct to recognize the image information, the method further includes:
acquiring image information to be identified;
the step of judging whether the response content for the voice information can be determined includes:
judging whether response content aiming at the voice information and the image information to be recognized can be determined;
the step of sending a take-over request carrying the voice information to the associated terminal device includes:
sending a take-over request carrying the voice information and the image information to be recognized to the associated terminal device, wherein the take-over request is used to instruct the terminal device to acquire response content for the voice information and the image information to be recognized;
the step of receiving the response content for the voice information sent by the terminal device includes:
receiving, from the terminal device, the response content for the voice information and the image information to be recognized.
Optionally, the step of determining whether response content for the voice information and the image information to be recognized can be determined includes:
judging, according to a pre-stored response content template library, whether the identification content corresponding to the image information to be identified can be determined, and if not, judging that the response content for the voice information and the image information to be identified cannot be determined, wherein the response content template library is used to store the correspondence between image information and identification content; or,
when the voice information contains a category keyword characterizing the image information to be recognized, judging, according to the category keyword and a pre-stored expertise field library, whether the response content indicated by the voice information belongs to a field of expertise, and if not, judging that the response content for the voice information and the image information to be recognized cannot be determined, wherein the expertise field library is used to store the correspondence between fields of expertise and keywords.
Optionally, when response content for the voice information and the image information to be recognized sent by the terminal device is received, the method further includes:
and updating the response content template library according to the image information to be identified and the response content.
Optionally, the step of determining whether the identification content corresponding to the image information to be identified can be determined according to a pre-stored response content template library includes:
and matching the image information to be recognized with each image information in the response content template library, judging whether a target matching result with the matching degree larger than a preset matching degree threshold exists in the matching result, and if not, judging that the recognition content corresponding to the image information to be recognized cannot be determined.
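A minimal sketch of this matching step follows. The cosine-similarity matching degree, the feature-vector representation of image information, and all names are assumptions made for illustration, not the patent's method.

```python
import math

def matching_degree(a, b):
    """Cosine similarity between two feature vectors (one possible
    definition of the matching degree; an assumption of this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def identify(image_features, template_library, threshold=0.9):
    """Match the image to be recognized against each stored image and return
    the identification content of the best match above the preset
    matching-degree threshold, or None if it cannot be determined."""
    best = max(template_library,
               key=lambda e: matching_degree(image_features, e["features"]),
               default=None)
    if best and matching_degree(image_features, best["features"]) > threshold:
        return best["content"]
    return None

library = [{"features": [1.0, 0.0], "content": "apple"},
           {"features": [0.0, 1.0], "content": "pear"}]
print(identify([0.99, 0.05], library))  # apple
print(identify([0.7, 0.7], library))    # None
```

A `None` result corresponds to the case where the identification content cannot be determined and the take-over request is sent.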
Optionally, when the target matching result exists in the matching result, the method further includes:
and determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified, and responding the voice information according to the response content.
Optionally, the response content template library further includes confidence levels of the respective image information; before determining the response content for the voice information and the image information to be recognized, the method further includes:
and when the confidence degree of the image information corresponding to the target matching result is greater than a preset confidence degree threshold value, determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified.
Optionally, after responding to the voice message, the method further includes:
and if response information aiming at the response content is received, updating the confidence of the image information corresponding to the target matching result according to the response information.
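The confidence update described above could be sketched as follows; the step size, clamping, and feedback format are assumptions not specified by the patent.

```python
# Hypothetical confidence update for a response-content-template-library entry.

def update_confidence(entry, feedback_positive, step=0.05):
    """Raise or lower the stored confidence of a template-library image entry
    according to the user's feedback on the response content, clamped to [0, 1]."""
    if feedback_positive:
        entry["confidence"] = min(1.0, entry["confidence"] + step)
    else:
        entry["confidence"] = max(0.0, entry["confidence"] - step)

entry = {"content": "apple", "confidence": 0.90}
update_confidence(entry, feedback_positive=True)   # positive feedback raises it
update_confidence(entry, feedback_positive=False)  # negative feedback lowers it
```

Entries whose confidence falls below the preset confidence threshold would then no longer be used to answer directly.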
Optionally, when the terminal device cannot acquire response content for the voice information, the method further includes:
and responding the voice information according to the pre-stored emergency response content.
Optionally, after the image information to be recognized is obtained, the method further includes:
when an acquisition request for acquiring an interactive video sent by the terminal equipment is received, acquiring the interactive video and sending the interactive video to the terminal equipment; the interactive video is as follows: the video containing the voice information and the image information to be recognized;
and receiving response content aiming at the voice information and the image information to be identified, which is sent by the terminal equipment.
The embodiment of the application also provides another information interaction method, which comprises the following steps:
receiving a take-over request carrying voice information sent by an associated smart device, wherein the take-over request is used to indicate that response content for the voice information is to be acquired, the voice information is voice information to be responded to that was acquired by the smart device, and the take-over request is sent when the smart device judges that the response content for the voice information cannot be determined;
and acquiring response content aiming at the voice information, and sending the response content to the intelligent equipment.
An embodiment of the present application further provides a robot, including: a processor, a memory, and a microphone;
the microphone is used for acquiring voice information to be responded and storing the voice information to the memory;
the processor is used for acquiring the voice information from the memory and judging whether response content aiming at the voice information can be determined or not; if not, sending a take-over request carrying the voice information to the associated terminal equipment; receiving response content aiming at the voice information sent by the terminal equipment; responding the voice message according to the response content; and the succession request is used for indicating the terminal equipment to acquire response content aiming at the voice information.
Optionally, the robot further comprises: a speaker and/or a display screen;
the processor is used for playing the response content through the loudspeaker and/or displaying the response content through the display screen.
Optionally, the robot further comprises a camera module; the camera module is used for collecting image information to be identified and storing the image information to be identified to the memory;
the processor is specifically configured to:
when the voice information is used for indicating to identify the image information, acquiring the image information to be identified from the memory, and judging whether response content aiming at the voice information and the image information to be identified can be determined; sending a take-over request carrying the voice information and the image information to be recognized to associated terminal equipment, wherein the take-over request is used for indicating the terminal equipment to acquire response contents aiming at the voice information and the image information to be recognized; and receiving response content aiming at the voice information and the image information to be identified, which is sent by the terminal equipment.
Optionally, the processor is specifically configured to:
judging, according to a pre-stored response content template library, whether the identification content corresponding to the image information to be identified can be determined, and if not, judging that the response content for the voice information and the image information to be identified cannot be determined, wherein the response content template library is used to store the correspondence between image information and identification content; or,
the processor is specifically configured to:
when the voice information contains a category keyword characterizing the image information to be recognized, judge, according to the category keyword and a pre-stored expertise field library, whether the response content indicated by the voice information belongs to a field of expertise, and if not, judge that the response content for the voice information and the image information to be recognized cannot be determined, wherein the expertise field library is used to store the correspondence between fields of expertise and keywords.
Optionally, the processor is further configured to:
and when response contents aiming at the voice information and the image information to be identified, which are sent by the terminal equipment, are received, updating the response content template library according to the image information to be identified and the response contents.
Optionally, the processor is specifically configured to:
and matching the image information to be recognized with each image information in the response content template library, judging whether a target matching result with the matching degree larger than a preset matching degree threshold exists in the matching result, and if not, judging that the recognition content corresponding to the image information to be recognized cannot be determined.
Optionally, the processor is further configured to:
and when the target matching result exists in the matching result, determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified, and responding the voice information according to the response content.
Optionally, the response content template library further includes confidence levels of the respective image information; the processor is further configured to:
before determining response contents for the voice information and the image information to be recognized, when the confidence degree of the image information corresponding to the target matching result is greater than a preset confidence degree threshold value, determining the recognition contents of the image information corresponding to the target matching result in the response content template library as the response contents for the voice information and the image information to be recognized.
Optionally, the processor is further configured to:
after the voice information is responded, if response information aiming at the response content is received, the confidence degree of the image information corresponding to the target matching result is updated according to the response information.
Optionally, the processor is further configured to:
and when the terminal equipment cannot acquire response content aiming at the voice information, responding the voice information according to the pre-stored emergency response content.
Optionally, the robot further comprises: a camera module; the camera module is used for acquiring an interactive video containing the voice information and the image information to be identified and storing the interactive video to the memory;
the processor is further configured to:
after the image information to be identified is obtained and an obtaining request for obtaining an interactive video sent by the terminal equipment is received, obtaining the interactive video from the memory and sending the interactive video to the terminal equipment; and receiving response content aiming at the voice information and the image information to be identified, which is sent by the terminal equipment.
The embodiment of the application provides a terminal device, which comprises: a processor and a memory;
the processor is used for receiving a take-over request carrying voice information sent by the associated intelligent equipment, acquiring response content aiming at the voice information and sending the response content to the intelligent equipment;
the take-over request is used to indicate that response content for the voice information is to be acquired, the voice information is voice information to be responded to that was acquired by the smart device, and the take-over request is sent when the smart device judges that the response content for the voice information cannot be determined.
The embodiment of the application provides an information interaction system, which comprises: a robot and a terminal device associated with the robot;
the robot is used to acquire voice information to be responded to and judge whether response content for the voice information can be determined; if the response content cannot be determined, to send a take-over request carrying the voice information to the terminal device; to receive response content for the voice information sent by the terminal device; and to respond to the voice information according to the response content; the take-over request is used to instruct the terminal device to acquire response content for the voice information;
and the terminal equipment is used for receiving a take-over request carrying voice information sent by the robot, acquiring response content aiming at the voice information and sending the response content to the robot.
An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the information interaction method provided by the embodiments of the present application. The method comprises the following steps:
acquiring voice information to be responded to;
judging whether response content for the voice information can be determined;
if not, sending a take-over request carrying the voice information to an associated terminal device, wherein the take-over request is used to instruct the terminal device to acquire response content for the voice information;
receiving response content for the voice information sent by the terminal device;
and responding to the voice information according to the response content.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the other information interaction method provided by the embodiments of the present application. The method comprises the following steps:
receiving a take-over request carrying voice information sent by an associated smart device, wherein the take-over request is used to indicate that response content for the voice information is to be acquired, the voice information is voice information to be responded to that was acquired by the smart device, and the take-over request is sent when the smart device judges that the response content for the voice information cannot be determined;
and acquiring response content for the voice information, and sending the response content to the smart device.
According to the information interaction method and the robot provided by the embodiments of the present application, when it is judged that the response content for the voice information cannot be determined, a take-over request is sent to the associated terminal device, response content for the voice information sent by the terminal device is received, and the voice information is responded to according to that content. The terminal device can determine the response content for the voice information in various ways, and requesting its assistance when the response content cannot be determined locally improves the rate at which user questions are answered and improves the user experience. Of course, not all of the advantages described above need to be achieved at the same time by any one product or method practicing the present application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of an information interaction method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of another information interaction method based on the embodiment of FIG. 1;
fig. 3 is a schematic flowchart of another information interaction method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a robot provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an information interaction system according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the described embodiments are merely a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to improve the response rate of user questions and improve user experience, the embodiment of the application provides an information interaction method and a robot. The present application will be described in detail below with reference to specific examples.
Fig. 1 is a schematic flowchart of an information interaction method according to an embodiment of the present application. The method is applied to intelligent equipment, and the intelligent equipment can be a robot or other intelligent equipment. The method of the present embodiment may include steps S101 to S105 as follows:
step S101: and acquiring the voice information to be responded.
The voice information may be voice data, or a voice recognition result obtained by performing voice recognition on the voice data, where the voice recognition result is character data. For example, the voice information may be "what word this is", "what fruit this is", or "what weather today" or the like.
When acquiring the voice information to be responded to, the smart device may receive it directly, or may obtain it from a device other than the smart device serving as the execution subject. For example, voice information to be responded to that is input by the user may be received directly.
Specifically, step S101 may be to receive voice information to be responded, which is input by the first object. The first object may be a child, an adult, or others.
The smart device may include a microphone (also called a sound pickup), which can collect voice information to be responded to in real time; the smart device can then acquire the voice information collected by the microphone.
Step S102: it is judged whether or not the response content to the above-mentioned voice information can be determined, and if not, step S103 is executed.
Specifically, the voice information may be matched with each question in a preset first template library, and when the matching is unsuccessful, it may be determined that the response content for the voice information cannot be determined; when the matching is successful, it may be determined that the response content for the voice information can be determined, and at this time, the answer corresponding to the question successfully matched in the first template library may be determined as the response content for the voice information. The first template library may include answers corresponding to the respective questions. The above-mentioned question may be understood as a sentence to be answered, and is not limited to a question sentence.
Alternatively, it may be judged whether a pre-stored prohibited word is present in the voice information, and if so, it may be judged that the response content for the voice information cannot be determined.
Alternatively, it may be judged whether the current time is within a preset period during which interaction is allowed, and if not, it may be judged that the response content for the voice information cannot be determined. The current time is the moment at which the voice information was acquired. For example, if the time at which the voice information is received is 23:00 and the preset allowed interaction period is 8:00 to 17:00, the response content for the voice information cannot be determined.
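One possible implementation of the allowed-period check in this example (the period bounds, function name, and representation of the current time are assumptions of the sketch):

```python
# Hypothetical check of whether the moment the voice information was acquired
# falls within the preset period during which interaction is allowed.

from datetime import datetime, time

ALLOWED_START = time(8, 0)   # preset allowed interaction period: 8:00...
ALLOWED_END = time(17, 0)    # ...to 17:00

def within_allowed_period(now):
    """True if `now` falls inside the preset allowed interaction period."""
    return ALLOWED_START <= now.time() <= ALLOWED_END

print(within_allowed_period(datetime(2018, 4, 26, 23, 0)))   # False
print(within_allowed_period(datetime(2018, 4, 26, 10, 30)))  # True
```

When the check returns False, the take-over request (with the corresponding description information) would be sent instead of answering locally.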
Step S103: and sending a take-over request carrying the voice information to the associated terminal equipment.
The take-over request is used to instruct the terminal device to acquire response content for the voice information. The smart device serving as the execution subject may be associated with the terminal device in advance. For example, during association the terminal device may send an association request to the smart device; after receiving it, the smart device may verify the information carried in the request and, when the verification passes, send verification-passed information to the terminal device, completing the association between the two devices.
In this step, the take-over request may be sent to the terminal device through a wired or wireless network.
The terminal device may obtain response content for the voice message when receiving the succession request. Specifically, when acquiring the response content, the terminal device may play or show the voice information to the user, and receive the response content input by the user for the voice information. The response content may be at least one of voice data, character data, and image data. The user may be understood as a second object, which is different from the first object.
When the terminal device obtains the response content for the voice information, the voice information may also be matched with each question in a preset second template library, and when the matching is successful, an answer corresponding to the successfully matched question in the second template library is used as the response content. The second template library may include answers corresponding to the respective questions, and the questions and the corresponding answers may be character data or voice data. In the present embodiment, the second template library is a template library having a larger data size than the first template library. In another embodiment, the second template library may further include image data corresponding to each question. For example, in the second template library, the answer to the question "how to weather today" is "clear weather", and the corresponding image data is an image representing clear sky.
The take-over request may also carry information other than the voice information, for example description information indicating why the response content could not be determined. For example, when the voice information cannot be matched successfully against the first template library, the description information is set to 0; when a prohibited word is present in the voice information, it is set to 1; and when the current time is not within the allowed interaction period, it is set to 2.
When the terminal device receives the take-over request, it can process the voice information according to the description information carried in the request. For example, when the description information is 0, the terminal device may determine the response content for the voice information from the second template library; when the description information is 1 or 2, the terminal device may display prompt information for the voice information.
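On the terminal side, the branching on the description information could be sketched as follows. The codes follow the example above; the function name, fallback behavior, and prompt wording are hypothetical.

```python
# Hypothetical terminal-side handling of the description information codes.

def handle_takeover(request, second_template_library):
    """Process a take-over request according to its description information."""
    info = request["description_info"]
    voice = request["voice_info"]
    if info == 0:
        # No match in the smart device's first template library: consult the
        # terminal's larger second template library (or, failing that, the user).
        return second_template_library.get(voice, "ask the user for an answer")
    # 1 = prohibited word present, 2 = outside the allowed interaction period:
    # display prompt information instead of answering.
    return "prompt: cannot respond to '%s' (reason code %d)" % (voice, info)

second_library = {"how is the weather today": "sunny"}
print(handle_takeover({"description_info": 0,
                       "voice_info": "how is the weather today"}, second_library))
```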
Step S104: and receiving response content aiming at the voice information sent by the terminal equipment.
The response content may be voice data, character data, or image data. For example, the terminal device may send the received voice input by the user as the response content to the intelligent device, where the response content received by the intelligent device is the voice data; the image shot by the terminal device can also be used as response content to be sent to the intelligent device, and the response content received by the intelligent device is image data at the moment; the answer determined from the second template library can be used as the response content to be sent to the intelligent device, and the response content received by the intelligent device is character data.
Step S105: responding to the voice information according to the response content.
This step may specifically be playing the response content, or displaying the response content, so as to respond to the voice information.
In another specific implementation, relevant information of the response content may be acquired, and the response content together with the relevant information may be played or displayed. For example, when the voice information is "which animal in the world has the largest body", the acquired response content is "the blue whale", and information about the weight, body length, and the like of the blue whale can be acquired as the relevant information. The relevant information may be acquired according to keywords contained in the response content and/or the voice information.
In another specific implementation, the response content may be modified and the modified response content played or displayed. For example, illegal words included in the response content may be deleted, based on pre-stored illegal words.
As can be seen from the above, in the present embodiment, when it is determined that the response content for the voice information cannot be determined, a take-over request is sent to the associated terminal device, the response content for the voice information sent by the terminal device is received, and the voice information is responded to according to the response content. The terminal device can determine the response content for the voice information in various ways; when the smart device cannot determine the response content, it requests the terminal device to assist, so that the response rate to user questions and the user experience are improved.
The above embodiments will be described below by way of specific examples.
Child A actively initiates voice information to robot B for interaction; for example, the child asks "why can fish swim". The guardian C can receive, through a mobile phone, the voice information sent by robot B, and all information of the interaction between child A and robot B can be provided to guardian C.
Robot B responds to the voice information of child A. Robot B can match the voice information against each question in the first template library; when the matching is unsuccessful, robot B sends a take-over request to guardian C to request guardian C's intervention.
After receiving the take-over request, the guardian C directly carries out voice response aiming at the voice information of the child A. The mobile phone can send the response content of the guardian C to the robot, and the robot B can play the response content to the child after receiving the response content sent by the mobile phone.
When the voice information is matched with each question in the first template library and the matching is successful, the answer corresponding to the successfully matched question in the first template library can be played to the child A.
In another embodiment provided by the present application, the embodiment shown in fig. 2 can be obtained by modifying the embodiment shown in fig. 1. This embodiment is applied to a smart device, which may be a robot or other smart device. The present embodiment includes the following steps S201 to S206:
Step S201: acquiring the voice information to be responded to.
This step is the same as step S101 in the embodiment shown in fig. 1, and specific description may refer to the embodiment shown in fig. 1, which is not described in detail here.
Step S202: when the voice information is used for indicating recognition of image information, acquiring the image information to be recognized.
Before this step, it may be determined whether the voice information is used to indicate recognition of image information. When the voice information is a speech recognition result (i.e., character data) obtained by performing speech recognition on voice data, the operation of determining whether the voice information indicates recognition of image information may specifically be: detecting whether the voice information carries substantial words, and when the detection result is no, determining that the voice information is used for indicating recognition of image information; or detecting whether the voice information matches a preset sentence pattern, and when the detection result is yes, determining that the voice information is used for indicating recognition of image information. The substantial words may include: noun subjects and/or noun objects, and the like.
For example, when the voice information is "how is the weather today", the sentence includes the noun subject "weather", so the voice information can be considered to carry substantial words. When the voice information is "what fruit is this", the sentence does not include a noun subject, so the voice information can be considered not to carry substantial words; such voice information indicates recognition of image information.
The preset sentence pattern may be: a subjectless sentence, or a pronoun subject plus an interrogative word. For example, "what fruit is this" conforms to the preset sentence pattern.
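A toy version of the substantial-word heuristic above might look like the following. Both word lists are illustrative assumptions, not from the patent; a real system would use part-of-speech tagging rather than fixed sets.

```python
# Hypothetical heuristic: a question that contains an interrogative word
# but no substantial noun (subject/object) is treated as indicating
# image recognition, per the "what fruit is this" example.
SUBSTANTIAL_NOUNS = {"weather", "time", "news"}   # illustrative only
QUERY_WORDS = {"what", "which", "how"}            # illustrative only

def indicates_image_recognition(voice_text):
    words = set(voice_text.lower().rstrip("?").split())
    has_query = bool(words & QUERY_WORDS)
    has_substantial = bool(words & SUBSTANTIAL_NOUNS)
    # Interrogative sentence with no substantial word -> look at the camera.
    return has_query and not has_substantial
```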
The image information to be recognized may be an image to be recognized, or may be a feature of the image to be recognized obtained after feature extraction is performed on the image to be recognized.
The smart device may include a camera module. When the voice information indicates that image information is to be recognized, an image acquisition instruction can be sent to the camera module, and the camera module acquires an image upon receiving the instruction. The smart device can then obtain the image to be recognized acquired by the camera module and perform feature extraction on it to obtain the features of the image to be recognized. For example, when the smart device receives the voice information "what fruit is this", it starts the camera module to acquire an image; the acquired image may contain a physical apple or an apple card that the user has placed within the camera module's acquisition range.
The intelligent device can also acquire the image information to be identified from other devices.
Step S203: judging whether response content for the voice information and the image information to be recognized can be determined.
This step may include various implementations; for example, image recognition may be performed according to the image information to be recognized, and whether the response content for the voice information and the image information to be recognized can be determined is judged according to the image recognition result.
Step S204: sending a take-over request carrying the voice information and the image information to be recognized to the associated terminal device.
The take-over request is used for instructing the terminal device to acquire response content for the voice information and the image information to be recognized. The smart device as the execution subject may be associated with the terminal device in advance.
In this step, the take-over request may be sent to the terminal device through a wired network or a wireless network.
When receiving the take-over request, the terminal device may acquire response content for the voice information and the image information to be recognized. Specifically, the terminal device may recognize the image information to be recognized to obtain an image recognition result, and send the image recognition result to the smart device as the response content. When the image information to be recognized is the image itself, the terminal device may play or display the voice information to the user, display the image to be recognized to the user, and receive response content input by the user for the voice information and the image to be recognized.
When the terminal device recognizes the image information to be recognized, it can recognize the image according to a pre-trained deep learning network, so that a more accurate image recognition result can be obtained. The deep learning network can be trained on a large number of sample images; during training, the object with the largest area or the object closest to the foreground in each sample image may be labeled. In this embodiment, computation-intensive operations are offloaded to the terminal device, which reduces the computation load of the smart device, increases its processing speed, and improves the real-time performance of the interaction process.
The above take-over request may also carry description information indicating the reason why the response content cannot be determined; for example, the description information may be information indicating "an object in the image cannot be recognized", or the like.
Step S205: receiving the response content for the voice information and the image information to be recognized sent by the terminal device.
The response content may be voice data or text data. The response content can be understood as content obtained by identifying the image information to be identified.
Step S206: responding to the voice information according to the response content.
This step is the same as step S105 in the embodiment shown in fig. 1, and specific description may refer to the embodiment shown in fig. 1, which is not described in detail here.
As can be seen from the above, in the present embodiment, when it is determined that the response content for the voice information and the image information to be recognized cannot be determined, a take-over request is sent to the associated terminal device, the response content sent by the terminal device is received, and the voice information is responded to according to the response content. When the smart device cannot determine the response content, it requests the terminal device to assist, so that the response rate to user questions and the user experience are improved.
In this embodiment, the smart device actively analyzes whether it has the ability to answer a question; when it finds that it cannot answer a certain question, it actively reminds a third party to access, so that the voice interaction continues. Determining the response content according to both the voice information and the image information to be recognized can enhance the voice interaction process.
In another embodiment of the present application, for the embodiment shown in fig. 2, the judgment in step S203 of whether the response content for the voice information and the image information to be recognized can be determined may specifically include the following modes:
First mode: judging, according to a pre-stored response content template library, whether the recognition content corresponding to the image information to be recognized can be determined; if not, judging that the response content for the voice information and the image information to be recognized cannot be determined.
The response content template library is used for storing correspondences between image information and recognition content. It may be pre-established and then stored in the smart device.
When the image information to be recognized is the image itself, the image information stored in the library is an image; when the image information to be recognized is an image feature obtained by feature extraction, the image information stored in the library is an image feature.
For example, the response content template library may store correspondences between various fruit images and fruit names; there may be multiple images for each fruit, for example images taken from multiple angles. The library may also store correspondences between Chinese character images and Chinese characters, or between colors and color names.
In order to improve the efficiency when matching the response content template library, the above-mentioned respective corresponding relationships in the response content template library may belong to different fields. For example, the correspondence between various fruit images and fruit names belongs to the field of fruit recognition, the correspondence between various Chinese character images and Chinese characters may belong to the field of Chinese character recognition, and the correspondence between various colors and color names may belong to the field of color recognition.
The answer content template library can also be used for storing the corresponding relation between the questions and the answers. The problem in the correspondence relationship may be in the form of voice data or character data. The answer in the corresponding relation may be in the form of voice data or character data.
Second mode: when the voice information contains a category keyword representing the image information to be recognized, judging, according to the category keyword and a pre-stored expertise field library, whether the response content indicated by the voice information belongs to a field of expertise; if not, judging that the response content for the voice information and the image information to be recognized cannot be determined.
The expertise field library is used for storing correspondences between fields of expertise and keywords.
When the voice information to be responded to is acquired, whether the voice information contains a category keyword representing the image information to be recognized can be detected. The category keyword may be a noun or a noun phrase. For example, when the voice information is "what fruit is this", "what word is this", or "what color is this", the words "fruit", "word", and "color" are category keywords representing the image information to be recognized.
The fields of expertise in the expertise field library may include the field categories in the response content template library. For example, the fields of expertise may include a fruit recognition field, a Chinese character recognition field, a color recognition field, and the like. The keywords corresponding to a field of expertise are words that express the meaning of that field, and each field may correspond to multiple keywords. For example, keywords for the fruit recognition field may include "fruit", "fruits", and the like.
When judging, according to the category keyword and the pre-stored expertise field library, whether the response content indicated by the voice information belongs to a field of expertise, the category keyword may specifically be matched against the keywords in the expertise field library; when the matching is unsuccessful, it is judged that the response content indicated by the voice information does not belong to a field of expertise.
When the matching is successful, the image information to be recognized may be matched, according to the target field of expertise corresponding to the successfully matched keyword, against the correspondences belonging to that target field in the response content template library. When this matching succeeds, the recognition content of the successfully matched correspondence in the response content template library is determined as the response content for the voice information and the image information to be recognized, and the voice information is responded to according to the response content.
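The two-stage check described above can be sketched as follows. Everything here is a hypothetical illustration: the field names, data layout, and the per-entry matcher functions (which stand in for a real image-matching computation) are assumptions, not the patent's implementation.

```python
# Stage 1: category keyword -> target field of expertise (hypothetical).
EXPERTISE_FIELDS = {
    "fruit": "fruit_recognition",
    "word": "character_recognition",
    "color": "color_recognition",
}

def determine_response(category_keyword, image, template_library, threshold=0.7):
    """Return the recognition content, or None when a take-over is needed."""
    field = EXPERTISE_FIELDS.get(category_keyword)
    if field is None:
        return None  # keyword not in any field of expertise
    # Stage 2: match the image against correspondences in that field only.
    best_label, best_score = None, 0.0
    for entry in template_library.get(field, []):
        score = entry["match"](image)  # matching degree in [0, 1]
        if score > best_score:
            best_label, best_score = entry["label"], score
    return best_label if best_score > threshold else None
```

Restricting stage 2 to one field is what gives the efficiency benefit mentioned earlier: only the correspondences of the target field are matched, not the whole library.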
Therefore, in this embodiment, whether the response content can be determined is judged according to the response content template library or the expertise field library, which is simple and easy to implement.
In another embodiment of the present application, in the first mode above, the step of judging, according to the pre-stored response content template library, whether the recognition content corresponding to the image information to be recognized can be determined may specifically include:
and matching the image information to be recognized with each image information in the response content template library, judging whether a target matching result with the matching degree larger than a preset matching degree threshold exists in the matching result, and if not, judging that the recognition content corresponding to the image information to be recognized cannot be determined.
When the target matching result does not exist among the matching results, the matching fails. When the response content for the voice information and the image information to be recognized sent by the terminal device is received, the response content template library can be updated according to the image information to be recognized and the response content.
Because the response content template library stores correspondences between image information and recognition content, when the library is updated, the response content can specifically be taken as the recognition content, the image information to be recognized taken as the image information, and the pair added to the response content template library.
The response content template library is updated each time response content sent by the terminal device is received, so that the library becomes richer and richer. In subsequent matching, the response content for the voice information and the image information to be recognized is more easily determined from the library; this gives the smart device a learning capability and further improves the response rate to user questions.
When the determination result is that the target matching result exists, that is, when the target matching result exists in the matching result, the identification content of the image information corresponding to the target matching result in the response content template library may be determined as the response content for the voice information and the image information to be identified, and the voice information may be responded according to the response content.
The preset matching degree threshold is a preset value, for example 70% or 80%.
For example, the image information to be recognized is an image a, and the response content template library includes the following correspondence: image 1-apple, image 2-apple, image 3-banana, image 4-banana. And matching the image A with the images 1 to 4 respectively to obtain matching degrees which are respectively as follows: 81%, 40%, 30% and 33%. When the preset matching degree threshold is 70%, since 81% > 70%, it may be determined that there is a target matching result in the matching results, where the matching degree is greater than the preset matching degree threshold. The target matching result is the matching of the image a and the image 1, and at this time, the apple corresponding to the image 1 can be determined as the response content for the voice information and the image a.
When the preset matching degree threshold is 90%, because 81%, 40%, 30% and 33% are all less than 90%, it may be determined that matching fails, a take-over request may be sent to the terminal device, and response content "strawberry" sent by the terminal device is received, it may be determined that the fruit in the image a is a strawberry, and at this time, the corresponding relationship between the image a and the strawberry may be stored in the response content template library.
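The match-or-take-over flow in the strawberry example can be sketched as below. This is a hypothetical illustration: the `matcher` argument stands in for a real matching-degree computation, and the library layout is an assumption.

```python
# Hypothetical sketch: return a stored label when some entry's matching
# degree exceeds the threshold; otherwise store the terminal-provided
# response content as a new correspondence (image A -> "strawberry").
def match_or_update(library, image, matcher, terminal_response, threshold=0.7):
    for entry in library:
        if matcher(image, entry["image"]) > threshold:
            return entry["label"]
    # Matching failed: a take-over request would be sent here, and the
    # terminal device's response content is stored for future matches.
    library.append({"image": image, "label": terminal_response})
    return terminal_response
```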
In order to improve the accuracy of the matching result when the image information to be recognized is matched against each piece of image information in the response content template library, the library may also include a confidence for each piece of image information. In that case, before the response content is determined, it is further checked whether the confidence of the image information corresponding to the target matching result is greater than a preset confidence threshold; only when it is, the recognition content of that image information in the response content template library is determined as the response content for the voice information and the image information to be recognized.
The preset confidence threshold may be a preset value, for example, 0.5 or other values.
For example, the image information to be recognized is an image a, and the response content template library includes the following correspondence and confidence: image 1 (confidence 0.6) -apple, image 2 (confidence 0.4) -apple, image 3 (confidence 0.6) -banana, image 4 (confidence 0.5) -banana. And matching the image A with the images 1 to 4 respectively to obtain matching degrees which are respectively as follows: 81%, 40%, 30% and 33%. When the preset matching degree threshold is 70%, because 81% > 70%, it may be determined that there is a target matching result (i.e., the matching between the image a and the image 1) whose matching degree is greater than the preset matching degree threshold in the matching results; moreover, since the confidence of the image 1 is 0.6, when the preset confidence threshold is 0.5, it is known that 0.6>0.5, the apple corresponding to the image 1 can be determined as the response content for the voice information and the image information to be recognized.
After the voice information is responded to, if the smart device receives response information for the response content, the confidence of the image information corresponding to the target matching result is updated according to the response information.
The response information may be a response of the user to the response content, such as "you answered right", "I see", "you are really smart", "you did not say it right", "this is not …", and the like; it may also be the next voice information that the user continues to input, for example, asking what the next object is.
When the confidence of the image information corresponding to the target matching result is updated according to the response information: when the response information indicates a positive response, the confidence is increased; when the response information indicates a negative response, the confidence is decreased. More specifically, the confidence may be increased or decreased according to a preset rule, for example by 10%.
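The confidence update can be sketched as below. The text leaves the exact rule open, so this assumes an additive step of 0.1 (the "10%" in the example) clamped to the [0, 1] range; both choices are assumptions.

```python
# Hypothetical confidence update per the positive/negative response rule.
def update_confidence(confidence, positive, step=0.1):
    if positive:
        return round(min(1.0, confidence + step), 2)
    return round(max(0.0, confidence - step), 2)
```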
When the terminal device cannot acquire the response content for the voice information, the voice information can be responded to according to pre-stored emergency response content. Specifically, the emergency response content for the voice information may be determined from the pre-stored emergency response contents, and the voice information responded to according to the determined emergency response content.
For example, the emergency response content may include: "The card may be unclear; please show it to me once more"; "Hmm, I have not learned that question yet; I will tell you once I learn it"; and the like.
When the emergency response content for the voice information is determined from the pre-stored emergency response contents, one emergency response content can be selected from the pre-stored emergency response contents according to words contained in the voice information. For example, when the voice message is "what fruit this is", the emergency response contents related to the fruit may be selected from the respective emergency response contents in response to the voice message.
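The keyword-based selection of an emergency response can be sketched as follows; the keywords and reply texts are illustrative assumptions.

```python
# Hypothetical keyword -> emergency reply pairs, checked in order.
EMERGENCY_RESPONSES = [
    ("fruit", "The card may be unclear; please show it to me once more."),
    ("word", "Please hold the card a little closer so I can see the word."),
]
DEFAULT_EMERGENCY = "I have not learned that question yet; I will tell you once I do."

def pick_emergency_response(voice_text):
    text = voice_text.lower()
    for keyword, reply in EMERGENCY_RESPONSES:
        if keyword in text:
            return reply
    return DEFAULT_EMERGENCY
```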
Therefore, even if the smart device cannot determine the response content for the voice information, it can respond with the emergency response content, preventing a poor user experience.
In another embodiment of the present application, in the embodiment shown in fig. 2, after the image information to be recognized is acquired, when an acquisition request for the interactive video sent by the terminal device is received, the interactive video is acquired and sent to the terminal device, and response content for the voice information and the image information to be recognized sent by the terminal device is received.
The interactive video is a video containing the voice information and the image information to be recognized. The smart device can capture the interactive video between the user and itself in real time and store it.
When response contents for the voice information and the image information to be recognized sent by the terminal device are received, the smart device does not need to execute the steps S203 and S204.
The terminal device may send the acquisition request to the intelligent device after receiving an input operation of a user, and play the interactive video when receiving the interactive video. And when the terminal equipment receives response content input by the user aiming at the voice information and the image information to be identified, the terminal equipment sends the response content to the intelligent equipment. In this embodiment, the user may directly intervene in the interaction process to provide the response content.
The user in this embodiment may be different from the user who inputs the voice information. For example, the intelligent device receives voice information to be responded input by the first object and receives response content provided by the second object and transmitted by the terminal device.
In this embodiment, the third party may monitor the interaction process through the terminal device, and seamlessly intervene in real time, thereby maintaining the continuity of the interaction.
The present application will be described in detail below using specific examples.
Child A actively initiates voice information to robot B for interaction. The guardian C can receive, through a mobile phone, the voice information sent by robot B, and all information of the interaction between child A and robot B can be provided to guardian C.
Robot B responds to child A's voice information. When the question raised by child A is an open question, robot B acquires an image containing, for example, a card held up by the child or a picture book. The robot may attempt to recognize the content of the picture or card. For example, child A asks: "What is this word?" The robot tries to recognize the content in the acquired image according to the pre-stored response content template library. If the confidence of the successfully matched image information in the template library is greater than 0.5, the robot tells the child the word by voice; if the confidence is less than 0.5, it sends a take-over request to guardian C to request guardian C's intervention.
After receiving the take-over request, guardian C directly answers, by voice, child A's voice information and the acquired image to be recognized. Robot B can record the interaction process between child A and guardian C, store the image to be recognized and the response content into the response content template library, and set the confidence of the image to the maximum value.
After robot B plays the response content to the child, if the response content is adopted, i.e. the user wishes to communicate further, the current response content is considered a good answer, and the confidence of the successfully matched image information in the template library is increased by 10%. If the response content is not adopted, i.e. the user does not wish to communicate further, the current answer is considered inappropriate, and the confidence of the successfully matched image information in the template library is decreased by 10%.
Initially, the confidence of each piece of image information in the template library may be a preset value, for example 0.5. Each time the recognition content corresponding to an image is adopted, its confidence is increased once; each time it is not adopted, its confidence is decreased once.
After the robot B receives the voice information actively initiated by the child A, the guardian C can directly access the robot B to replace the robot B to carry out interaction. The mobile phone can send an acquisition request to the robot B after receiving the input operation of the guardian C. When receiving the acquisition request, the robot can send the interactive video to the mobile phone and play the interactive video to the guardian C through the mobile phone. After receiving the answer input by the guardian, the mobile phone can directly send the answer to the robot B. The robot B plays the received answer to the child a.
Fig. 3 is a schematic flowchart of another information interaction method according to an embodiment of the present application. This method embodiment is applied to a terminal device, which may be a computer, a smartphone, or another device. The method of this embodiment comprises the following steps:
Step S301: receiving a take-over request carrying voice information sent by an associated smart device.
The take-over request is used for indicating acquisition of response content for the voice information; the voice information is the voice information to be responded to acquired by the smart device; and the take-over request is sent by the smart device when it judges that the response content for the voice information cannot be determined.
The terminal device as the execution subject may be associated with the smart device in advance. In this step, the take-over request may be received through a wired network or a wireless network.
Step S302: acquiring response content for the voice information, and sending the response content to the smart device.
When the take-over request is received, the terminal device may acquire the response content for the voice information. Specifically, the terminal device may play or display the voice information to the user and receive response content input by the user for the voice information. The response content may be at least one of voice data, character data, and image data. The user here may be understood as a second object, which is different from the first object.
When the terminal device obtains the response content for the voice information, the voice information may also be matched with each question in a preset second template library, and when the matching is successful, an answer corresponding to the successfully matched question in the second template library is used as the response content. The second template library may include answers corresponding to the respective questions, and the questions and the corresponding answers may be character data or voice data. In another embodiment, the second template library may further include image data corresponding to each question.
The take-over request may also carry information other than the voice information, for example, description information indicating the type of failure when the response content cannot be determined. When the terminal device receives the take-over request, it may process the voice information accordingly based on the description information carried in the request.
Therefore, in this embodiment, when receiving the take-over request, the terminal device may determine the response content for the voice information in multiple ways and send the response content to the smart device, so that the smart device can respond to the voice information according to the response content. Because the smart device requests the terminal device's assistance when it cannot determine the response content itself, the response rate to user questions can be improved, improving user experience.
In another embodiment of the present application, on the basis of the embodiment shown in fig. 3, the terminal device may receive a take-over request carrying the voice information and image information to be recognized, sent by the associated smart device, acquire response content for the voice information and the image information to be recognized, and send the response content to the smart device. The take-over request is used for indicating acquisition of response content for the voice information and the image information to be recognized; the image information to be recognized is collected when the voice information indicates that image information is to be recognized; and the take-over request is sent when the smart device judges that the response content for the voice information and the image information to be recognized cannot be determined.
When receiving the take-over request, the terminal device may acquire response content for the voice information and the image information to be recognized. Specifically, when acquiring the response content, the terminal device may recognize the image information to be recognized to obtain an image recognition result, and send the image recognition result to the smart device as the response content. Alternatively, the terminal device may play or display the voice information to the user, display the image to be recognized to the user, and receive response content input by the user for the voice information and the image to be recognized.
When the terminal device recognizes the image information to be recognized, the image to be recognized may be recognized by a pre-trained deep learning network, so that a more accurate image recognition result can be obtained. The deep learning network may be trained on a large number of sample images, in which the object with the largest area, or the object closest to the front of the image, may be labeled during training. In this embodiment, the computation-heavy operations are offloaded to the terminal device, which reduces the computational load on the smart device, increases its processing speed, and improves the real-time performance of the interaction.
In another embodiment of the present application, on the basis of the embodiment shown in fig. 3, the terminal device is further configured to send an acquisition request for acquiring an interactive video to the smart device, acquire response content for the voice information and the image information to be recognized, and send the response content to the smart device. The interactive video is a video containing the voice information and the image information to be recognized.
Specifically, the terminal device may send the acquisition request to the smart device when receiving an input operation from the user.
Fig. 4 is a schematic structural diagram of a robot according to an embodiment of the present disclosure. This embodiment corresponds to the embodiment of the method shown in fig. 1. The robot in this embodiment may include: a processor 401, a memory 402 and a microphone 403.
The memory may include a random access memory (RAM) or a non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In this embodiment, the microphone 403 is configured to collect voice information to be responded, and store the voice information in the memory 402;
a processor 401, configured to obtain the voice information from the memory 402 and judge whether response content for the voice information can be determined; if not, send a take-over request carrying the voice information to an associated terminal device; receive response content for the voice information sent by the terminal device; and respond to the voice information according to the response content. The take-over request is used for instructing the terminal device to acquire the response content for the voice information.
The memory 402 may also be used to store the response content received by the processor 401.
In this embodiment, the robot may further include a network module (not shown in the figure), and the network module may be configured to connect to a network and communicate with the terminal device through the network according to the control instruction of the processor.
In another embodiment of the present application, on the basis of the embodiment shown in fig. 4, the robot may further include a speaker and/or a display screen (not shown in the figure).
The processor 401 is configured to play the response content through a speaker, and/or display the response content through a display screen.
In another embodiment of the present application, on the basis of the embodiment shown in fig. 4, the robot further includes a camera module (not shown in the figure); and the camera module is used for acquiring the image information to be identified and storing the image information to be identified to the memory. Specifically, the camera module can collect the image information to be identified when receiving a collection instruction sent by the processor. The processor can send a collection instruction to the camera module when the voice information is determined to be used for indicating the image information to be identified.
The processor 401 is specifically configured to:
when the voice information is used for indicating that image information is to be identified, acquiring the image information to be identified from the memory, and judging whether response content for the voice information and the image information to be identified can be determined; if not, sending a take-over request carrying the voice information and the image information to be identified to the associated terminal device, wherein the take-over request is used for instructing the terminal device to acquire response content for the voice information and the image information to be identified; and receiving the response content for the voice information and the image information to be identified sent by the terminal device.
In another embodiment of the present application, based on the embodiment shown in fig. 4, the processor is specifically configured to:
judging, according to a pre-stored response content template library, whether the identification content corresponding to the image information to be identified can be determined, and if not, judging that the response content for the voice information and the image information to be identified cannot be determined; the response content template library is used for storing the correspondence between image information and identification content; alternatively,
the processor 401 is specifically configured to:
when the voice information contains a category keyword representing the image information to be recognized, judging whether response content indicated by the voice information belongs to an excellence field or not according to the category keyword and a pre-stored excellence field library, and if not, judging that the response content aiming at the voice information and the image information to be recognized cannot be determined; the excellence field library is used for storing corresponding relations between the excellence fields and the keywords.
In this embodiment, the memory may be further configured to store the response content template library.
In another embodiment of the present application, based on the embodiment shown in fig. 4, the processor 401 is further configured to: and when response contents aiming at the voice information and the image information to be identified, which are sent by the terminal equipment, are received, updating the response content template library according to the image information to be identified and the response contents.
In another embodiment of the present application, based on the embodiment shown in fig. 4, the processor 401 is specifically configured to:
and matching the image information to be recognized with each image information in the response content template library, judging whether a target matching result with the matching degree larger than a preset matching degree threshold exists in the matching result, and if not, judging that the recognition content corresponding to the image information to be recognized cannot be determined.
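The matching step above can be sketched with feature vectors standing in for image information; using cosine similarity as the matching degree, and the specific threshold value, are assumptions for illustration only:

```python
def cosine_similarity(a, b):
    """Matching degree between two feature vectors, in [0, 1] for non-negative inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def best_match(query, templates, threshold=0.9):
    """Return (template_id, score) of the best match above the threshold, else None."""
    scored = [(tid, cosine_similarity(query, vec)) for tid, vec in templates.items()]
    tid, score = max(scored, key=lambda t: t[1])
    return (tid, score) if score > threshold else None
```

A `None` result corresponds to the case where no target matching result exceeds the preset matching degree threshold, i.e. the identification content cannot be determined.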
In another embodiment of the present application, based on the embodiment shown in fig. 4, the processor is further configured to:
and when the matching result has the target matching result, determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified, and responding the voice information according to the response content.
In another embodiment of the present application, on the basis of the embodiment shown in fig. 4, the answer content template library further includes confidence levels of the respective image information; the processor 401 is further configured to:
before determining response contents aiming at the voice information and the image information to be recognized, when the confidence coefficient of the image information corresponding to the target matching result is larger than a preset confidence coefficient threshold value, determining the recognition contents of the image information corresponding to the target matching result in a response content template library as the response contents aiming at the voice information and the image information to be recognized.
In another embodiment of the present application, based on the embodiment shown in fig. 4, the processor 401 is further configured to: after the voice information is responded, if response information aiming at the response content is received, the confidence degree of the image information corresponding to the target matching result is updated according to the response information.
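The confidence gate and the feedback-driven confidence update described in the last two embodiments can be sketched together as follows. The threshold, the example entries, and the simple increment/decrement update rule are illustrative assumptions:

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed preset confidence threshold

# Hypothetical response content template library entries with per-image confidence.
library = {
    "cat":  {"content": "This is a cat.",  "confidence": 0.9},
    "fern": {"content": "This is a fern.", "confidence": 0.5},
}

def answer_if_confident(template_id):
    """Return the identification content only when its confidence exceeds the threshold."""
    entry = library[template_id]
    return entry["content"] if entry["confidence"] > CONFIDENCE_THRESHOLD else None

def apply_feedback(template_id, positive, step=0.1):
    """Raise or lower the stored confidence after the user reacts to the answer."""
    entry = library[template_id]
    delta = step if positive else -step
    entry["confidence"] = min(1.0, max(0.0, entry["confidence"] + delta))
```

Positive feedback gradually promotes low-confidence entries past the threshold, so answers learned from the terminal device become usable locally over time.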
In another embodiment of the present application, based on the embodiment shown in fig. 4, the processor 401 is further configured to: and when the terminal equipment cannot acquire response content aiming at the voice information, responding the voice information according to the pre-stored emergency response content.
In another embodiment of the present application, on the basis of the embodiment shown in fig. 4, the robot further includes: a camera module; the camera module (not shown in the figure) is used for collecting an interactive video containing voice information and image information to be identified and storing the interactive video into the memory;
the processor 401 is further configured to:
after the image information to be identified is obtained and an obtaining request for obtaining the interactive video sent by the terminal equipment is received, obtaining the interactive video from the memory and sending the interactive video to the terminal equipment; and receiving response content aiming at the voice information and the image information to be identified, which is sent by the terminal equipment.
Since the robot embodiment is obtained based on the method embodiment shown in fig. 1, and has the same technical effect as the method, the technical effect of the robot embodiment is not described herein again. For the robot embodiment, since it is basically similar to the method embodiment, it is described relatively simply, and the relevant points can be referred to the partial description of the method embodiment.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. This embodiment corresponds to the method embodiment shown in fig. 3. The terminal device in this embodiment includes: a processor 501 and a memory 502;
the processor 501 is configured to receive a take-over request carrying voice information sent by an associated smart device, acquire response content for the voice information, and send the response content to the smart device;
The take-over request is used for indicating acquisition of response content for the voice information; the voice information is the voice information to be responded to, acquired by the smart device; and the take-over request is sent by the smart device when it judges that the response content for the voice information cannot be determined.
Since the above terminal device embodiment is obtained based on the method embodiment described in fig. 3, and has the same technical effect as the method, the technical effect of the terminal device embodiment is not described herein again. For the terminal device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for relevant points, reference may be made to part of the description of the method embodiment.
Fig. 6 is a schematic structural diagram of an information interaction system according to an embodiment of the present application. This system embodiment corresponds to the method embodiment shown in fig. 1 and 3. The system comprises: a robot 601 and a terminal device 602 associated with the robot 601.
The robot 601 is configured to acquire voice information to be responded to, and judge whether response content for the voice information can be determined; if not, send a take-over request carrying the voice information to the terminal device; receive response content for the voice information sent by the terminal device; and respond to the voice information according to the response content; the take-over request is used for instructing the terminal device to acquire the response content for the voice information;

the terminal device 602 is configured to receive the take-over request carrying the voice information sent by the robot 601, acquire response content for the voice information, and send the response content to the robot 601.
In this embodiment, the robot can send a take-over request to the associated terminal device when judging that the response content for the voice information cannot be determined, receive the response content for the voice information sent by the terminal device, and respond to the voice information according to the response content. The terminal device can determine the response content for the voice information in various ways; requesting the terminal device's assistance when the response content cannot be determined locally improves the response rate to user questions and thus the user experience.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 may be specifically configured to play the response content through a speaker and/or display the response content through a display screen.
In another embodiment of the present application, in the embodiment shown in fig. 6, when the voice information is used to instruct to recognize the image information, the robot 601 may further be configured to:
acquiring image information to be identified; judging whether response content for the voice information and the image information to be identified can be determined; if not, sending a take-over request carrying the voice information and the image information to be identified to the associated terminal device; and receiving response content for the voice information and the image information to be identified sent by the terminal device; the take-over request is used for instructing the terminal device to acquire response content for the voice information and the image information to be identified;
the terminal device 602 is specifically configured to:
and receiving a take-over request which is sent by the robot 601 and carries the voice information and the image information to be recognized, and acquiring response contents aiming at the voice information and the image information to be recognized.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is specifically configured to:
judging, according to a pre-stored response content template library, whether the identification content corresponding to the image information to be identified can be determined, and if not, judging that the response content for the voice information and the image information to be identified cannot be determined; the response content template library is used for storing the correspondence between image information and identification content; alternatively,
when the voice information contains a category keyword representing the image information to be recognized, judging whether response content indicated by the voice information belongs to an excellence field or not according to the category keyword and a pre-stored excellence field library, and if not, judging that the response content aiming at the voice information and the image information to be recognized cannot be determined; the excellence field library is used for storing corresponding relations between the excellence fields and the keywords.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is further configured to:
and when response contents aiming at the voice information and the image information to be identified, which are sent by the terminal equipment, are received, updating the response content template library according to the image information to be identified and the response contents.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is specifically configured to:
and matching the image information to be recognized with each image information in the response content template library, judging whether a target matching result with the matching degree larger than a preset matching degree threshold exists in the matching result, and if not, judging that the recognition content corresponding to the image information to be recognized cannot be determined.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is further configured to:
and when the matching result has the target matching result, determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified, and responding the voice information according to the response content.
In another embodiment of the present application, in the embodiment shown in fig. 6, the answer content template library further includes confidence levels of the respective image information; the robot 601 is also used to:
before determining response contents aiming at the voice information and the image information to be recognized, when the confidence coefficient of the image information corresponding to the target matching result is larger than a preset confidence coefficient threshold value, determining the recognition contents of the image information corresponding to the target matching result in a response content template library as the response contents aiming at the voice information and the image information to be recognized.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is further configured to:
after the voice information is responded, if response information aiming at the response content is received, the confidence degree of the image information corresponding to the target matching result is updated according to the response information.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is further configured to:
and when the terminal equipment cannot acquire response content aiming at the voice information, responding the voice information according to the pre-stored emergency response content.
In another embodiment of the present application, in the embodiment shown in fig. 6, the robot 601 is further configured to:
after acquiring the image information to be identified, when receiving an acquisition request for acquiring an interactive video sent by the terminal device, acquiring the interactive video and sending it to the terminal device; and receiving response content for the voice information and the image information to be identified sent by the terminal device; wherein the interactive video is a video containing the voice information and the image information to be identified;
the terminal device 602 is further configured to send an acquisition request for acquiring the interactive video to the robot 601, acquire response content for the voice information and the image information to be recognized, and send the response content to the robot 601.
The embodiment of the application provides a computer-readable storage medium, a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the information interaction method provided by the embodiment of the application is realized. The method comprises the following steps:
acquiring voice information to be responded;
judging whether response content for the voice information can be determined;
if not, sending a take-over request carrying the voice information to the associated terminal device; the take-over request is used for instructing the terminal device to acquire response content for the voice information;
receiving response content aiming at voice information sent by terminal equipment;
and responding the voice information according to the response content.
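The five steps above, as performed on the device side, can be sketched as a single handler. The local answer table and the callback standing in for the network exchange with the terminal device are assumptions for illustration:

```python
def handle_voice(voice_text, local_answers, send_take_over_request):
    """Answer locally when possible; otherwise hand off to the terminal device.

    local_answers: dict mapping known questions to response content.
    send_take_over_request: callable standing in for the take-over round trip.
    """
    answer = local_answers.get(voice_text)
    if answer is None:
        # Response content cannot be determined locally: send the take-over
        # request and use the response content returned by the terminal device.
        answer = send_take_over_request(voice_text)
    return answer
```

The returned value is what the device would then play or display as its response to the voice information.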
In this embodiment, a take-over request can be sent to the associated terminal device when it is judged that the response content for the voice information cannot be determined, the response content for the voice information sent by the terminal device is received, and the voice information is responded to according to the response content. The terminal device can determine the response content for the voice information in various ways; requesting its assistance when the response content cannot be determined locally improves the response rate to user questions and thus the user experience.
The embodiment of the application provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements another information interaction method provided by the embodiment of the application. The method comprises the following steps:
receiving a take-over request carrying voice information sent by associated intelligent equipment; wherein the take-over request is used for indicating to acquire response content aiming at the voice information, the voice information is the voice information to be responded acquired by the intelligent equipment, and the take-over request is sent when the intelligent equipment judges that the response content aiming at the voice information cannot be determined;
and acquiring response content aiming at the voice information, and sending the response content to the intelligent equipment.
In this embodiment, when receiving the take-over request, the terminal device may determine the response content for the voice information in multiple ways and send the response content to the smart device, so that the smart device can respond to the voice information according to the response content. When the smart device cannot determine the response content, it requests the terminal device to assist in determining it, which improves the response rate to user questions and the user experience.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (24)

1. An information interaction method, characterized in that the method comprises:
acquiring voice information to be responded;
detecting whether the voice information carries substantial words or not, and determining that the voice information is used for indicating to identify the image information when the detection result is negative; or detecting whether the voice information is a preset sentence pattern, and when the detection result is yes, determining that the voice information is used for indicating to identify the image information; wherein the substantial words include: noun subject and/or noun object;
acquiring image information to be identified;
judging whether response content aiming at the voice information and the image information to be recognized can be determined;
if not, sending a take-over request carrying the voice information and the image information to be identified to the associated terminal equipment, wherein the take-over request is used for indicating the terminal equipment to acquire response contents aiming at the voice information and the image information to be identified;
receiving response contents aiming at the voice information and the image information to be identified, which are sent by the terminal equipment;
and responding the voice information according to the response content.
2. The method according to claim 1, wherein the step of determining whether or not the response contents for the voice information and the image information to be recognized can be determined comprises:
judging whether the identification content corresponding to the image information to be identified can be determined or not according to a pre-stored response content template library, and if not, judging that the response content aiming at the voice information and the image information to be identified cannot be determined; the response content template library is used for storing the corresponding relation between the image information and the identification content; alternatively,
when the voice information contains a category keyword representing the image information to be recognized, judging whether response content indicated by the voice information belongs to an excellence field or not according to the category keyword and a pre-stored excellence field library, and if not, judging that the response content aiming at the voice information and the image information to be recognized cannot be determined; the excellence field library is used for storing the corresponding relation between the excellence field and the keywords.
3. The method according to claim 2, wherein upon receiving response content for the voice information and the image information to be recognized, which is sent by the terminal device, the method further comprises:
and updating the response content template library according to the image information to be identified and the response content.
4. The method according to claim 2, wherein the step of determining whether the identification content corresponding to the image information to be identified can be determined according to a pre-stored response content template library comprises:
and matching the image information to be recognized with each image information in the response content template library, judging whether a target matching result with the matching degree larger than a preset matching degree threshold exists in the matching result, and if not, judging that the recognition content corresponding to the image information to be recognized cannot be determined.
5. The method of claim 4, wherein when the target matching result exists in the matching results, the method further comprises:
and determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified, and responding the voice information according to the response content.
6. The method of claim 5, wherein the library of reply content templates further comprises a confidence level for each image information; before determining the response content for the voice information and the image information to be recognized, the method further includes:
and when the confidence degree of the image information corresponding to the target matching result is greater than a preset confidence degree threshold value, determining the identification content of the image information corresponding to the target matching result in the response content template library as the response content aiming at the voice information and the image information to be identified.
7. The method of claim 6, wherein after responding to the voice information, the method further comprises:
if feedback information for the response content is received, updating the confidence level of the image information corresponding to the target matching result according to the feedback information.
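Claims 6 and 7 together describe a confidence-gated answer with feedback-driven updating. A minimal sketch follows; the threshold value, the fixed update step, and the clamping to [0, 1] are all illustrative assumptions, since the patent does not specify an update rule.

```python
# Illustrative sketch of claims 6-7: each library entry carries a confidence
# level; a matched entry is only used when its confidence exceeds a preset
# threshold, and user feedback on the response raises or lowers that
# confidence for future interactions.

CONFIDENCE_THRESHOLD = 0.6

def usable(entry):
    """Claim 6: answer from the library only when confidence is high enough."""
    return entry["confidence"] > CONFIDENCE_THRESHOLD

def update_confidence(entry, feedback_positive, step=0.1):
    """Claim 7: adjust the matched entry's confidence according to feedback,
    clamped to [0, 1]. The step size is an assumption."""
    delta = step if feedback_positive else -step
    entry["confidence"] = min(1.0, max(0.0, entry["confidence"] + delta))
    return entry["confidence"]

entry = {"recognition_content": "apple", "confidence": 0.65}
assert usable(entry)                               # above threshold: answer locally
update_confidence(entry, feedback_positive=False)  # user indicates a wrong answer
assert not usable(entry)                           # now below threshold: take over next time
```

The design point is that a wrong answer pushes the entry below the threshold, so the next similar query triggers the claim 1 take-over path instead of repeating the mistake.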
8. The method according to claim 1, wherein when the terminal device cannot acquire the response content for the voice information, the method further comprises:
responding to the voice information according to pre-stored emergency response content.
9. The method according to claim 1, wherein after acquiring the image information to be recognized, the method further comprises:
when an acquisition request for an interactive video sent by the terminal device is received, acquiring the interactive video and sending it to the terminal device, the interactive video being a video containing the voice information and the image information to be recognized;
receiving the response content for the voice information and the image information to be recognized sent by the terminal device.
10. An information interaction method, characterized in that the method comprises:
receiving a take-over request carrying voice information and image information to be recognized sent by an associated smart device; wherein the take-over request is used to indicate that response content for the voice information and the image information to be recognized is to be acquired; the voice information is voice information to be responded to collected by the smart device; the image information to be recognized is collected when the smart device detects that the voice information carries no substantive words, or when it detects that the voice information matches a preset sentence pattern; the take-over request is sent when the smart device determines that it cannot determine the response content for the voice information and the image information to be recognized; and the substantive words comprise a noun subject and/or a noun object;
acquiring the response content for the voice information and the image information to be recognized, and sending the response content to the smart device.
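The trigger conditions in claims 10 and 11 (capture an image when the utterance carries no noun subject/object, or when it fits a preset sentence pattern) can be sketched as follows. The tiny noun list and the two patterns are stand-in assumptions; a real implementation would use a part-of-speech tagger or dependency parser to find noun subjects and objects.

```python
# Illustrative sketch of the claims 10/11 trigger: capture image information
# either when the utterance carries no "substantive words" (a noun subject
# and/or noun object) or when it matches a preset sentence pattern such as
# "what is this". Noun detection here is a toy lookup for demonstration only.

import re

NOUNS = {"apple", "dog", "book"}                 # stand-in for noun detection
PRESET_PATTERNS = [r"^what is this\b", r"^what am i holding\b"]

def carries_substantive_words(utterance):
    """True if the utterance contains a noun usable as subject or object."""
    return any(w in NOUNS for w in re.findall(r"[a-z']+", utterance.lower()))

def matches_preset_pattern(utterance):
    return any(re.search(p, utterance.lower()) for p in PRESET_PATTERNS)

def should_capture_image(utterance):
    """Capture image information when the voice information lacks substantive
    words or fits a preset sentence pattern."""
    return (not carries_substantive_words(utterance)) or matches_preset_pattern(utterance)

assert should_capture_image("What is this?")            # preset pattern
assert should_capture_image("Tell me about it")         # no noun subject/object
assert not should_capture_image("Where is the dog?")    # has a noun: voice alone suffices
```

The intuition matches the claim language: "what is this?" is unanswerable from audio alone, so the smart device turns on the camera to supply the missing referent.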
11. A robot, comprising: a processor, a memory, and a microphone;
the microphone is used for collecting voice information to be responded to and storing the voice information in the memory;
the processor is used for acquiring the voice information from the memory and determining whether response content for the voice information can be determined; if not, sending a take-over request carrying the voice information to an associated terminal device; receiving the response content for the voice information sent by the terminal device; and responding to the voice information according to the response content; wherein the take-over request is used to instruct the terminal device to acquire the response content for the voice information;
the robot further comprises a camera module; the camera module is used for collecting image information to be recognized and storing the image information to be recognized in the memory;
the processor is specifically configured to:
detect whether the voice information carries substantive words, and when the detection result is negative, determine that the voice information is used to indicate image recognition; or detect whether the voice information matches a preset sentence pattern, and when the detection result is affirmative, determine that the voice information is used to indicate image recognition; acquire the image information to be recognized; determine whether response content for the voice information and the image information to be recognized can be determined; send a take-over request carrying the voice information and the image information to be recognized to the associated terminal device, the take-over request being used to instruct the terminal device to acquire the response content for the voice information and the image information to be recognized; and receive the response content for the voice information and the image information to be recognized sent by the terminal device; wherein the substantive words comprise a noun subject and/or a noun object.
12. The robot of claim 11, further comprising: a speaker and/or a display screen;
the processor is used for playing the response content through the loudspeaker and/or displaying the response content on the display screen.
13. The robot of claim 11, wherein the processor is specifically configured to:
determine, according to a pre-stored response content template library, whether recognition content corresponding to the image information to be recognized can be determined, and if not, determine that the response content for the voice information and the image information to be recognized cannot be determined; wherein the response content template library is used to store correspondences between image information and recognition content; or,
the processor is specifically configured to:
when the voice information contains a category keyword characterizing the image information to be recognized, determine, according to the category keyword and a pre-stored field-of-expertise library, whether the response content indicated by the voice information falls within a field of expertise, and if not, determine that the response content for the voice information and the image information to be recognized cannot be determined; wherein the field-of-expertise library is used to store correspondences between fields of expertise and keywords.
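The second branch of claim 13 checks a category keyword against a stored field-of-expertise library. A minimal sketch of that check follows; the library contents and names are assumptions for demonstration only.

```python
# Illustrative sketch of the alternative judgment in claim 13: decide whether
# the category keyword extracted from the voice information falls within one
# of the robot's stored fields of expertise. If it does not, the robot cannot
# determine the response content itself and must send a take-over request.

EXPERTISE_LIBRARY = {
    "plants": {"flower", "tree", "leaf"},
    "animals": {"dog", "cat", "bird"},
}

def within_expertise(category_keyword):
    """True if any field of expertise covers the keyword."""
    return any(category_keyword in keywords
               for keywords in EXPERTISE_LIBRARY.values())

assert within_expertise("flower")          # covered: answer locally
assert not within_expertise("carburetor")  # outside every field: take over
```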
14. The robot of claim 13, wherein the processor is further configured to:
when the response content for the voice information and the image information to be recognized sent by the terminal device is received, update the response content template library according to the image information to be recognized and the response content.
15. The robot of claim 13, wherein the processor is specifically configured to:
match the image information to be recognized against each piece of image information in the response content template library, determine whether the matching results include a target matching result whose matching degree is greater than a preset matching degree threshold, and if not, determine that the recognition content corresponding to the image information to be recognized cannot be determined.
16. The robot of claim 15, wherein the processor is further configured to:
when the target matching result exists among the matching results, determine the recognition content of the image information corresponding to the target matching result in the response content template library as the response content for the voice information and the image information to be recognized, and respond to the voice information according to the response content.
17. The robot of claim 16, wherein the response content template library further includes a confidence level for each piece of image information; and the processor is further configured to:
before determining the response content for the voice information and the image information to be recognized, when the confidence level of the image information corresponding to the target matching result is greater than a preset confidence threshold, determine the recognition content of the image information corresponding to the target matching result in the response content template library as the response content for the voice information and the image information to be recognized.
18. The robot of claim 17, wherein the processor is further configured to:
after responding to the voice information, if feedback information for the response content is received, update the confidence level of the image information corresponding to the target matching result according to the feedback information.
19. The robot of claim 18, wherein the processor is further configured to:
when the terminal device cannot acquire the response content for the voice information, respond to the voice information according to pre-stored emergency response content.
20. The robot of claim 11, wherein the camera module is further configured to collect an interactive video containing the voice information and the image information to be recognized, and store the interactive video in the memory;
the processor is further configured to:
after acquiring the image information to be recognized, when an acquisition request for the interactive video sent by the terminal device is received, acquire the interactive video from the memory and send it to the terminal device; and receive the response content for the voice information and the image information to be recognized sent by the terminal device.
21. A terminal device, comprising: a processor and a memory;
the processor is used for receiving a take-over request carrying voice information and image information to be recognized sent by an associated smart device, acquiring response content for the voice information and the image information to be recognized, and sending the response content to the smart device;
wherein the take-over request is used to indicate that the response content for the voice information and the image information to be recognized is to be acquired; the voice information is voice information to be responded to collected by the smart device; the image information to be recognized is collected when the smart device detects that the voice information carries no substantive words, or when it detects that the voice information matches a preset sentence pattern; the take-over request is sent when the smart device determines that it cannot determine the response content for the voice information and the image information to be recognized; and the substantive words comprise a noun subject and/or a noun object.
22. An information interaction system, comprising: a robot and a terminal device associated with the robot;
the robot is used for collecting voice information to be responded to; detecting whether the voice information carries substantive words, and when the detection result is negative, determining that the voice information is used to indicate image recognition; or detecting whether the voice information matches a preset sentence pattern, and when the detection result is affirmative, determining that the voice information is used to indicate image recognition; acquiring image information to be recognized; determining whether response content for the voice information and the image information to be recognized can be determined; if not, sending a take-over request carrying the voice information and the image information to be recognized to the terminal device; receiving the response content for the voice information and the image information to be recognized sent by the terminal device; and responding to the voice information according to the response content; wherein the take-over request is used to instruct the terminal device to acquire the response content for the voice information and the image information to be recognized, and the substantive words comprise a noun subject and/or a noun object;
the terminal device is used for receiving the take-over request carrying the voice information and the image information to be recognized sent by the robot, acquiring the response content for the voice information and the image information to be recognized, and sending the response content to the robot.
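The claim 22 system (robot tries locally, falls back to a take-over request, then learns the supplied answer per claim 14) can be sketched end to end. All class and method names below are illustrative assumptions, not APIs claimed by the patent.

```python
# Illustrative sketch of the claim 22 interaction flow: the robot first tries
# to determine the response content locally from its template library; when it
# cannot, it sends a take-over request (voice + image) to the associated
# terminal device, relays the content supplied there, and updates its library
# (claim 14) so the next similar query is answered locally.

class TerminalDevice:
    """Stands in for the guardian's terminal that answers take-over requests."""
    def handle_take_over(self, voice, image):
        return f"guardian's answer about {image}"

class Robot:
    def __init__(self, terminal, template_library):
        self.terminal = terminal
        self.library = template_library   # image -> recognition content

    def respond(self, voice, image):
        content = self.library.get(image)     # local attempt (claim 13 style)
        if content is None:
            # Cannot determine: take-over request to the terminal (claim 22).
            content = self.terminal.handle_take_over(voice, image)
            self.library[image] = content     # claim 14: update the library
        return content

robot = Robot(TerminalDevice(), {"img_apple": "an apple"})
assert robot.respond("what is this", "img_apple") == "an apple"
answer = robot.respond("what is this", "img_toy")      # unknown: take over
assert robot.respond("what is this", "img_toy") == answer  # now local
```

The library update is what distinguishes this flow from a plain remote lookup: each take-over teaches the robot, so repeated questions about the same object stop requiring the terminal device.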
23. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-9.
24. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of claim 10.
CN201810386235.7A 2018-04-26 2018-04-26 Information interaction method and robot Active CN110415688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810386235.7A CN110415688B (en) 2018-04-26 2018-04-26 Information interaction method and robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810386235.7A CN110415688B (en) 2018-04-26 2018-04-26 Information interaction method and robot

Publications (2)

Publication Number Publication Date
CN110415688A CN110415688A (en) 2019-11-05
CN110415688B true CN110415688B (en) 2022-02-08

Family

ID=68345955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810386235.7A Active CN110415688B (en) 2018-04-26 2018-04-26 Information interaction method and robot

Country Status (1)

Country Link
CN (1) CN110415688B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314116A (en) * 2021-06-10 2021-08-27 杭州搏世智能科技有限公司 Nursing type intelligence-developing robot with voice conversation function

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106297782A (en) * 2016-07-28 2017-01-04 北京智能管家科技有限公司 A kind of man-machine interaction method and system
CN106457563A (en) * 2014-04-17 2017-02-22 软银机器人欧洲公司 Method of performing multi-modal dialogue between a humanoid robot and user, computer program product and humanoid robot for implementing said method
CN106873773A (en) * 2017-01-09 2017-06-20 北京奇虎科技有限公司 Robot interactive control method, server and robot
CN107016402A (en) * 2017-02-20 2017-08-04 北京光年无限科技有限公司 A kind of man-machine interaction method and device for intelligent robot
CN107278302A (en) * 2017-03-02 2017-10-20 深圳前海达闼云端智能科技有限公司 A kind of robot interactive method and interaction robot
CN107452383A (en) * 2016-05-31 2017-12-08 华为终端(东莞)有限公司 A kind of information processing method, server, terminal and information processing system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103024530A (en) * 2012-12-18 2013-04-03 天津三星电子有限公司 Intelligent television voice response system and method
US9589305B2 (en) * 2014-12-30 2017-03-07 Facebook, Inc. Techniques for graph based natural language processing
CN104715045A (en) * 2015-03-25 2015-06-17 魅族科技(中国)有限公司 Information transmitting method and terminal
KR102558437B1 (en) * 2015-11-27 2023-07-24 삼성전자주식회사 Method For Processing of Question and answer and electronic device supporting the same
CN105549841A (en) * 2015-12-02 2016-05-04 小天才科技有限公司 Voice interaction method, device and equipment


Also Published As

Publication number Publication date
CN110415688A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN109947984A (en) A kind of content delivery method and driving means for children
CN111046819A (en) Behavior recognition processing method and device
CN110837758B (en) Keyword input method and device and electronic equipment
WO2024000867A1 (en) Emotion recognition method and apparatus, device, and storage medium
CN111881707A (en) Image reproduction detection method, identity verification method, model training method and device
CN111026949A (en) Question searching method and system based on electronic equipment
CN111325082A (en) Personnel concentration degree analysis method and device
CN111144344B (en) Method, device, equipment and storage medium for determining person age
CN112101231A (en) Learning behavior monitoring method, terminal, small program and server
CN109167913B (en) Language learning type camera
CN111611365A (en) Flow control method, device, equipment and storage medium of dialog system
CN110415688B (en) Information interaction method and robot
CN111915111A (en) Online classroom interaction quality evaluation method and device and terminal equipment
CN113657509A (en) Teaching training improving method and device, terminal and storage medium
CN110895691B (en) Image processing method and device and electronic equipment
CN117496984A (en) Interaction method, device and equipment of target object and readable storage medium
CN111079504A (en) Character recognition method and electronic equipment
CN109087694A (en) A kind of method and private tutor's equipment for helping student to take exercises
CN111081227B (en) Recognition method of dictation content and electronic equipment
CN111091821B (en) Control method based on voice recognition and terminal equipment
CN109766413B (en) Searching method applied to family education equipment and family education equipment
CN112528790A (en) Teaching management method and device based on behavior recognition and server
CN113253835A (en) Man-machine interaction equipment control method and device
CN112528797A (en) Question recommendation method and device and electronic equipment
CN111369985A (en) Voice interaction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant