CN115213884A - Interaction control method and device for robot, storage medium and robot - Google Patents

Interaction control method and device for robot, storage medium and robot

Info

Publication number
CN115213884A
CN115213884A (Application CN202110729750.2A)
Authority
CN
China
Prior art keywords
robot
data
target
environment
interaction information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110729750.2A
Other languages
Chinese (zh)
Inventor
张站朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Beijing Technologies Co Ltd
Original Assignee
Cloudminds Beijing Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Beijing Technologies Co Ltd filed Critical Cloudminds Beijing Technologies Co Ltd
Priority to CN202110729750.2A priority Critical patent/CN115213884A/en
Publication of CN115213884A publication Critical patent/CN115213884A/en
Pending legal-status Critical Current

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The disclosure relates to an interaction control method and device for a robot, a storage medium, and a robot. The method includes: acquiring multi-modal data of the robot to be controlled in a target environment, where the data of each modality represents data from one type of data source; performing perception and recognition on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment; predicting target interaction information for the robot based on the semantic-level environment features corresponding to each modality and an object semantic relationship network corresponding to the target environment; and controlling the robot to interact based on the target interaction information. The disclosed method can improve the accuracy of robot interaction.

Description

Interaction control method and device for robot, storage medium and robot
Technical Field
The present disclosure relates to the field of robotics, and in particular to an interaction control method and apparatus for a robot, a storage medium, and a robot.
Background
With the development of robotics, service robots appear more and more often in daily life. A service robot performs its work by interacting with objects in the environment, for example with people or physical objects.
However, when a service robot in the related art interacts with objects in an environment, it cannot accurately recognize the environment features because of the complexity and diversity of those objects, and therefore suffers from poor interaction accuracy.
Disclosure of Invention
The purpose of the present disclosure is to provide an interaction control method and device for a robot, a storage medium, and a robot that solve the problem of poor robot interaction accuracy.
In order to achieve the above object, in a first aspect, the present disclosure provides an interaction control method for a robot, the method including:
acquiring multi-modal data of the robot to be controlled in a target environment, where the data of each modality represents data from one type of data source;
performing perception and recognition on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment;
predicting target interaction information for the robot based on the semantic-level environment features corresponding to each modality and an object semantic relationship network corresponding to the target environment; and
controlling the robot to interact based on the target interaction information.
Optionally, after acquiring the multi-modal data in the target environment, the method further includes:
performing optimization processing on the data of each modality to obtain optimized multi-modal data;
performing fusion correction processing on the optimized multi-modal data to obtain fusion-corrected multi-modal data;
and the step of performing perception and recognition on the multi-modal data to obtain the semantic-level environment features corresponding to each modality in the target environment includes:
performing perception and recognition on the fusion-corrected multi-modal data to obtain the semantic-level environment features corresponding to each modality in the target environment.
Optionally, predicting the target interaction information for the robot based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment includes:
inputting the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment into a target interaction information prediction model to obtain the target interaction information for the robot.
Optionally, predicting the target interaction information for the robot based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment includes:
aligning the semantic-level environment features corresponding to each modality to obtain cross-semantic environment features corresponding to each modality;
fusing the cross-semantic environment features corresponding to each modality to obtain a fused environment feature for the target environment; and
inputting the fused environment feature and the object semantic relationship network corresponding to the target environment into the target interaction information prediction model to obtain the target interaction information for the robot.
Optionally, the training process of the target interaction information prediction model includes:
acquiring a plurality of sample data, where each sample data includes sample objects corresponding to the target environment and an object semantic relationship network; and
training an initial neural network model based on the plurality of sample data until a preset training condition is met, then stopping training and outputting the target interaction information prediction model.
Optionally, the robot includes a flexible screen limb component, an environmental state sensor, a visual sensor, a voice sensor, and a force sensor, and the multi-modal data includes tactile data collected by the flexible screen limb component, environmental state data collected by the environmental state sensor, visual data collected by the visual sensor, voice data collected by the voice sensor, and force data collected by the force sensor.
Optionally, the target interaction information includes multi-modal target interaction information, the target interaction information of each modality carries corresponding timing information, and controlling the robot to interact based on the target interaction information includes:
controlling the robot to interact based on the target interaction information of each modality and the timing information carried by the target interaction information of each modality.
Optionally, the target interaction information includes an image to be displayed, and controlling the robot to interact based on the target interaction information includes:
controlling the flexible screen limb component to display the image to be displayed.
Optionally, the target interaction information includes position information of the flexible screen limb components, and controlling the flexible screen limb component to display the image to be displayed includes:
acquiring the association between each image region included in the image to be displayed and a preset display position;
acquiring, based on the association between each image region included in the image to be displayed and the preset display position, the target image region corresponding to the position information of each flexible screen limb component; and
controlling each flexible screen limb component to display the image of its target image region.
In a second aspect, the present disclosure also provides an interaction control device for a robot, the device including: a multi-modal data acquisition module configured to acquire multi-modal data of the robot to be controlled in a target environment, where the data of each modality represents data from one type of data source;
a perception and recognition module configured to perform perception and recognition on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment;
a prediction module configured to predict target interaction information for the robot based on the semantic-level environment features corresponding to each modality and an object semantic relationship network corresponding to the target environment; and
a control module configured to control the robot to interact based on the target interaction information.
In a third aspect, the present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
In a fourth aspect, the present disclosure further provides an interaction control device for a robot, including:
a memory having a computer program stored thereon; and
a processor configured to execute the computer program in the memory to perform the steps of the method of the first aspect.
In a fifth aspect, the present disclosure also provides a robot, including actuators, multiple types of sensors, and a processing device connected to the multiple types of sensors and the actuators, where the actuators include flexible screen limb components arranged on the robot body;
the processing device is configured to: acquire multi-modal data, where the multi-modal data includes a combination of at least two of tactile data acquired through the flexible screen limb components and environmental state data, visual data, voice data, and force data acquired respectively through the multiple types of sensors; perform perception and recognition on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment; predict target interaction information for the robot based on the semantic-level environment features corresponding to each modality and an object semantic relationship network corresponding to the target environment; and control the corresponding actuator to perform the interaction operation based on the target interaction information.
According to the above technical solution, after the multi-modal data of the robot in the target environment is acquired, perception and recognition are first performed on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment; target interaction information for the robot is then predicted based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment; and finally the robot is controlled to interact based on the target interaction information. Because semantic-level environment features take into account the semantic information carried by the environment features, ambiguity in the recognized environment features can be reduced; the semantic-level environment features therefore improve the accuracy of the predicted target interaction information, and thus the accuracy of the robot's interaction.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, but do not constitute a limitation of the disclosure. In the drawings:
fig. 1 is a schematic flowchart of an interaction control method of a robot according to an embodiment.
Fig. 2 is a schematic flowchart of step S13 in the embodiment.
Fig. 3 is a flowchart illustrating a training process of a target interaction information prediction model according to an embodiment.
Fig. 4 is a schematic flowchart of controlling a flexible screen limb part to display an image to be displayed according to an embodiment.
Fig. 5 is a schematic flowchart of another robot interaction control method according to an embodiment.
Fig. 6 is a schematic structural diagram of an interaction control device of a robot according to an embodiment.
Fig. 7 is a schematic structural diagram of an interaction control device of another robot according to an embodiment.
Fig. 8 is a schematic structural diagram of an interaction control device of another robot according to an embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Referring to fig. 1, fig. 1 is a flowchart illustrating an interaction control method of a robot according to an exemplary embodiment of the present disclosure. As shown in fig. 1, the interaction control method includes steps S11 to S14. Specifically:
S11, multi-modal data of the robot in the target environment is acquired, where the data of each modality represents data from one type of data source.
The target environment is the environment in which the robot to be controlled is currently located. It will be appreciated that different service robots serve different environments. For example, a greeting robot mainly serves at the entrance of a venue, an education robot mainly works in a classroom, and a conference-room service robot mainly works in a conference room. Taking the greeting robot as an example, when it serves the entrance of a certain venue, the target environment is the environment at the entrance of that venue.
In the present disclosure, each source or form of information may be referred to as a modality, so data from different data sources can be regarded as data of different modalities; in other words, the data of one modality represents data from one type of data source. For example, data from different sources may be acquired by different sensors, so data acquired by several different sensors can be called multi-modal data. It will be appreciated that different robots may carry different types of sensors depending on the tasks they need to perform, so the multi-modal data they acquire also differs.
In some embodiments, the robot may carry basic sensors such as a tactile sensor, an environmental state sensor, a visual sensor, a voice sensor, and a force sensor, in which case the data of the plurality of modalities may include tactile data collected by the tactile sensor, environmental state data collected by the environmental state sensor, visual data collected by the visual sensor, voice data collected by the voice sensor, and force data collected by the force sensor. The vision sensor may be a 3D depth camera, a laser radar, or the like.
In other embodiments, the robot may carry a flexible screen limb part in addition to basic sensors such as an environmental state sensor, a visual sensor, a voice sensor, and a force sensor, and the flexible screen limb part is used as a touch sensor. In this case, the multi-modal data may include tactile data collected by the flexible screen limb member, environmental status data collected by the environmental status sensor, visual data collected by the visual sensor, voice data collected by the voice sensor, and force data collected by the force sensor. Wherein, the flexible screen limb part can be understood as a flexible touch screen shell covered on the robot limb part.
In this embodiment, the robot may be a humanoid robot or a robot of another form. The housing of the robot's limb parts can be wholly or partially made of flexible touch screens, forming the robot's "skin".
It can be understood that with the flexible screen limb components, tactile data can be acquired whenever an object in the environment contacts any limb of the robot, which simplifies the tactile data acquisition process and makes it more convenient.
In addition, the flexible screen limb component in the embodiment supports, but is not limited to, single-point touch, multi-point touch, sliding touch, and the like.
There are various ways to obtain the multi-modal data of the robot in the target environment.
As an embodiment, the multimodal data may be obtained entirely from sensors carried by the robot.
In addition, considering special cases where a sensor carried by the robot fails or data acquisition permission is limited, so that not all the required data of the various modalities can be obtained from the robot's own sensors, in another embodiment the multi-modal data may be partly acquired from the sensors carried by the robot and partly acquired from sensors installed in the target environment.
Specifically, for the data obtained from sensors installed in the target environment, a communication connection may first be established with those sensors, and the data of the corresponding modality is then acquired from them.
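For illustration only (all class and function names here are hypothetical, not part of the disclosure), a minimal Python sketch of this acquisition logic might read each modality from the robot's own sensor and fall back to a sensor installed in the target environment when the on-board reading fails or is unavailable:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class MultiModalData:
    """One sample of multi-modal data; each field is one modality (None if unavailable)."""
    tactile: Optional[list] = None       # e.g. touch points from the flexible screen limb
    environment: Optional[dict] = None   # e.g. temperature, humidity
    visual: Optional[bytes] = None       # e.g. an encoded camera frame
    voice: Optional[bytes] = None        # e.g. a short audio clip
    force: Optional[list] = None         # e.g. joint force/torque readings


def acquire_multimodal_data(
    robot_sensors: Dict[str, Callable[[], object]],
    environment_sensors: Dict[str, Callable[[], object]],
) -> MultiModalData:
    """Read each modality from the robot's own sensor, falling back to a sensor
    installed in the target environment if the on-board sensor fails or is absent."""
    sample = MultiModalData()
    for modality in ("tactile", "environment", "visual", "voice", "force"):
        reader = robot_sensors.get(modality)
        try:
            value = reader() if reader else None
        except Exception:       # on-board sensor failure or missing permission
            value = None
        if value is None:       # fall back to the environment-installed sensor
            fallback = environment_sensors.get(modality)
            value = fallback() if fallback else None
        setattr(sample, modality, value)
    return sample


if __name__ == "__main__":
    # Toy readers standing in for real sensor drivers.
    robot = {"visual": lambda: b"frame-bytes", "voice": lambda: None}  # voice sensor down
    env = {"voice": lambda: b"audio-from-room-microphone"}
    print(acquire_multimodal_data(robot, env))
```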
S12, perception and recognition are performed on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment.
Semantic-level environment features are environment features that take semantic information into account, where environment features can be understood as the objects in the environment and their attributes. For example, without semantic information, the word "apple" may refer either to a fruit to be eaten or to a mobile phone brand. Therefore, to keep the recognized environment features unambiguous and avoid driving the robot into incorrect interaction behaviour, in this embodiment the multi-modal data is first perceived and recognized to obtain semantic-level environment features corresponding to each modality in the target environment.
The perception and recognition of the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment can be carried out in various ways.
Optionally, a compact hash coding method for multi-modal representation may be used to process the multi-modal data. The method can take into account correlation constraints within and between modalities; an orthogonal regularization method is then applied to the resulting hash-coded features to reduce their redundancy, finally yielding the semantic-level environment features corresponding to each modality in the target environment.
Optionally, the multi-modal data may also be processed with a partial multi-modal sparse coding model regularized by an adaptive similarity structure to obtain the semantic-level environment features corresponding to each modality in the target environment.
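For illustration only, and greatly simplified relative to the hash-coding and sparse-coding approaches named above, the following hypothetical sketch shows the general shape of the output: each modality is turned into a feature vector paired with explicit semantic labels, so that downstream steps work with disambiguated semantics rather than raw signals. All labels and numbers are invented.

```python
from typing import Dict, List, NamedTuple


class SemanticFeature(NamedTuple):
    modality: str
    embedding: List[float]       # numeric feature vector for downstream models
    semantics: Dict[str, str]    # explicit semantic labels resolving ambiguity


def perceive(multimodal_data: Dict[str, object]) -> List[SemanticFeature]:
    """Turn raw per-modality data into semantic-level environment features.
    Real systems would use trained recognizers; here each branch is a stub."""
    features = []
    for modality, raw in multimodal_data.items():
        if raw is None:
            continue
        if modality == "visual":
            semantics = {"object": "person", "age": "about 25", "expression": "pleased"}
        elif modality == "voice":
            semantics = {"utterance": "hello robot", "intent": "consultation"}
        elif modality == "tactile":
            semantics = {"contact": "handshake", "body_part": "right hand"}
        else:
            semantics = {"raw": str(raw)}
        # A fixed-size embedding stands in for the learned multi-modal encoding.
        embedding = [float(len(str(raw))), float(len(semantics))]
        features.append(SemanticFeature(modality, embedding, semantics))
    return features


if __name__ == "__main__":
    print(perceive({"visual": "frame", "voice": "hello robot", "tactile": [0.2, 0.4]}))
```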
S13, target interaction information for the robot is predicted based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment.
The object semantic relationship network corresponding to the target environment can be constructed from scene information and prior knowledge about the target environment. The target interaction information is the information used to control the robot's interaction.
In some embodiments, the prediction yields only one piece of interaction information for the robot; in that case it can be directly taken as the target interaction information. In other embodiments, the prediction may yield several candidate pieces of interaction information; in that case a preset comprehensive decision module can be used to make a comprehensive decision among them and select the optimal one as the target interaction information.
As an example, assume the target environment is a conference room. From the scene information it is known that a conference table and teacups are usually present, that the teacups are placed on the conference table with a spacing of 50 cm between them, and that when attendees are present tea should be poured into the cups. In addition, prior knowledge says that the tea should not exceed 80% of the cup and that the lid should be replaced after pouring. In this case, an object semantic relationship network among objects such as the conference table, teacups, tea, and cup lids can be constructed from the scene information and prior knowledge of the target environment.
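The conference-room relations above can be pictured as a small labelled graph. The following Python sketch is illustrative only (the disclosure does not prescribe a data structure); it stores the example relations and thresholds from the preceding paragraph in a plain adjacency dictionary.

```python
from collections import defaultdict


class ObjectSemanticNetwork:
    """A minimal labelled relation graph between objects in the target environment."""

    def __init__(self):
        self.relations = defaultdict(list)  # subject -> [(relation, object, attributes)]

    def add_relation(self, subject, relation, obj, **attributes):
        self.relations[subject].append((relation, obj, attributes))

    def query(self, subject):
        return self.relations.get(subject, [])


# Built from scene information and prior knowledge of the conference-room example.
net = ObjectSemanticNetwork()
net.add_relation("teacup", "placed_on", "conference_table", spacing_cm=50)
net.add_relation("teacup", "filled_with", "tea", max_fill_ratio=0.8)   # prior knowledge
net.add_relation("cup_lid", "covers", "teacup", when="after_filling")  # prior knowledge
net.add_relation("attendee", "served_with", "tea", trigger="attendee_present")

if __name__ == "__main__":
    for rel in net.query("teacup"):
        print("teacup", *rel)
```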
In this embodiment, after the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment are obtained, the target interaction information for the robot can be predicted by semantic reasoning.
The target interaction information is instruction information for controlling the robot to interact with objects in the environment. For example, in a conference-room environment, the target interaction information may be a water-adding instruction, specifically: add tea to the first cup, fill it to 75% of its capacity, and replace the lid of the first cup after filling. As another example, in a greeting environment, the target interaction information may be keeping a smile, bending down, and extending a hand to shake hands with the guest.
The target interaction information for the robot can be predicted in various ways based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment.
In one implementation, the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment can be input directly into the target interaction information prediction model to obtain the target interaction information for the robot.
In another implementation, the semantic-level environment features corresponding to each modality may first be processed, and the processed environment features and the object semantic relationship network corresponding to the target environment are then input into the target interaction information prediction model to obtain the target interaction information for the robot. In this case, referring to fig. 2, step S13 may include the following steps:
S131, the semantic-level environment features corresponding to each modality are aligned to obtain cross-semantic environment features corresponding to each modality.
Alignment means identifying the correspondences between components and elements of different modalities, so that the learned cross-semantic environment feature representations of the modalities are more accurate and provide finer-grained cues for the subsequent target interaction information prediction model.
Optionally, specific alignment methods include, but are not limited to, learning a common embedding space in a maximum-margin fashion, combining local alignment (e.g., aligning a visual object with a word, or with a touched body part) and global alignment (e.g., aligning a picture with a sentence, or with the action corresponding to a touch, such as a handshake or hug). The aligned cross-semantic representations can further improve the prediction quality of the target interaction information prediction model.
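For concreteness, one plausible reading of the maximum-margin alignment idea above is a triplet-style hinge loss over a shared embedding space. The sketch below is illustrative only; the embeddings and margin are arbitrary, and it is not the patented training procedure.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def max_margin_alignment_loss(anchor, positive, negatives, margin=0.2):
    """Hinge loss that pulls an aligned cross-modal pair (anchor, positive) together
    and pushes unaligned candidates (negatives) at least `margin` further away."""
    pos_score = dot(anchor, positive)
    loss = 0.0
    for neg in negatives:
        loss += max(0.0, margin + dot(anchor, neg) - pos_score)
    return loss


if __name__ == "__main__":
    # Local alignment example: a visual-object embedding vs. word embeddings.
    visual_apple = [0.9, 0.1, 0.0]          # embedding of the detected object
    word_apple_fruit = [0.8, 0.2, 0.0]      # aligned vocabulary entry
    distractors = [[0.1, 0.1, 0.9],         # "apple" as a phone brand
                   [0.0, 1.0, 0.0]]         # unrelated word
    print(max_margin_alignment_loss(visual_apple, word_apple_fruit, distractors))
```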
S132, the cross-semantic environment features corresponding to each modality are fused to obtain a fused environment feature for the target environment.
Fusion integrates the models and features of the different modalities, so it yields more comprehensive features, improves the robustness of the model, and allows the model to keep working effectively when the information of some modality is missing. For example, the model can still work when the user's facial expression cannot be seen, or when the expression appears neutral while the speech sounds depressed. As another example, speech recognition may currently indicate that the person is happy while visual recognition of the expression indicates sadness; in that case, besides the person's emotion, the context information of the current scene must also be fused in, since the happy speech may, for instance, belong to another person.
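For illustration only, the sketch below shows one simple fusion operator consistent with the description above: a confidence-weighted average over whichever modalities are present, so a missing modality is simply skipped. The disclosure does not fix a particular fusion operator; this choice is an assumption made for the example.

```python
from typing import Dict, List, Optional


def fuse_features(
    features: Dict[str, Optional[List[float]]],
    confidences: Dict[str, float],
) -> List[float]:
    """Confidence-weighted fusion of per-modality feature vectors.
    Missing modalities (value None) are simply skipped, so a fused feature can
    still be produced when, e.g., the visual channel is unavailable."""
    present = {m: v for m, v in features.items() if v is not None}
    if not present:
        raise ValueError("no modality available to fuse")
    dim = len(next(iter(present.values())))
    total_weight = sum(confidences.get(m, 1.0) for m in present)
    fused = [0.0] * dim
    for modality, vector in present.items():
        w = confidences.get(modality, 1.0) / total_weight
        for i, x in enumerate(vector):
            fused[i] += w * x
    return fused


if __name__ == "__main__":
    feats = {"visual": [0.9, 0.1], "voice": [0.2, 0.8], "tactile": None}  # tactile missing
    conf = {"visual": 0.5, "voice": 1.0}  # e.g. trust speech more when the face is occluded
    print(fuse_features(feats, conf))
```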
S133, the fused environment feature and the object semantic relationship network corresponding to the target environment are input into the target interaction information prediction model to obtain the target interaction information for the robot.
After the fused environment feature is obtained, it can be input together with the object semantic relationship network corresponding to the target environment into the target interaction information prediction model to obtain the target interaction information for the robot.
It can be understood that both the per-modality semantic-level environment features and the fused environment feature are semantic-level features, so the same target interaction information prediction model can be used to process either of them.
The following describes a training process of the target interaction information prediction model with reference to fig. 3. As shown in fig. 3, the training process of the target interaction information prediction model includes the following steps:
and S21, acquiring a plurality of sample data.
Each sample data comprises a sample object corresponding to the target environment and an object semantic relationship network, and the sample object carries corresponding semantic information.
As one embodiment, the sample objects in the target environment may be determined by human selection and labeled with semantic information.
S22, the initial neural network model is trained based on the plurality of sample data until a preset training condition is met; training then stops and the target interaction information prediction model is output.
Optionally, the preset training condition may be a preset number of iterations. Correspondingly, judging whether the preset training condition is met amounts to judging whether the current iteration count is greater than the preset number of iterations; when it is, the preset condition is considered met.
In one embodiment, the initial neural network model may be a reinforcement-learning-based Markov decision chain.
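For illustration of the stopping condition only, the toy loop below trains a placeholder one-parameter model and stops once the iteration count exceeds the preset number; it is not the reinforcement-learning Markov-decision-chain model itself, and all numbers are invented.

```python
import random


def train_prediction_model(samples, max_iterations=1000, learning_rate=0.01):
    """Toy training loop: stop when the current iteration count exceeds the
    preset number of iterations (the 'preset training condition')."""
    # A single linear weight stands in for the interaction-prediction model.
    weight = 0.0
    iteration = 0
    while iteration <= max_iterations:          # preset training condition
        features, target = random.choice(samples)
        prediction = weight * features
        gradient = 2.0 * (prediction - target) * features   # squared-error gradient
        weight -= learning_rate * gradient
        iteration += 1
    return weight                               # the trained prediction model


if __name__ == "__main__":
    # Each sample pairs a scalar "environment feature" with a target interaction score.
    data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
    print(train_prediction_model(data))
```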
S14, the robot is controlled to interact based on the target interaction information.
It should be noted that the interaction control method for the robot provided by the embodiment of the present disclosure may be executed only in the robot, may also be executed only in the server, and may also be executed partially in the robot and partially in the server.
In one embodiment, when the interaction control method is executed locally on the robot, the robot may include actuators, multiple types of sensors, and a processing device connected to the sensors and actuators, where the actuators include flexible screen limb components arranged on the robot body; a flexible screen limb component is obtained by covering a limb part of the robot with a flexible screen. In this case, the processing device is configured to: acquire multi-modal data, where the multi-modal data includes a combination of at least two of tactile data acquired through the flexible screen limb components and environmental state data, visual data, voice data, and force data acquired respectively through the multiple types of sensors; perform perception and recognition on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment; predict target interaction information for the robot based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment; and control the corresponding actuators to perform the interaction based on the target interaction information.
Of course, data for some of the multiple modalities may also be obtained from sensors installed in the target environment.
In another embodiment, when the interaction control method is performed at a server, the server may perform steps S11 to S14. In this case, in step S11 the server may acquire the multi-modal data from the sensors carried by the robot, or from both the robot's sensors and sensors installed in the target environment. In step S14, the server controls the robot to interact based on the target interaction information; specifically, the server first sends the target interaction information to the robot, and the robot's local actuators perform the interaction. In this way, the processing of the multi-modal data is carried out by a server with stronger computing power, and only the target interaction information needs to be sent to the actuators for the interaction, which reduces the hardware requirements on the robot.
In yet another embodiment, when the interaction control method is performed partly on the robot and partly at the server, any of steps S11 to S14 may be executed by the robot and the remaining steps by the server; when two adjacent steps are executed at the server and on the robot respectively, the intermediate data can be transmitted over the network between the robot and the server. In this way, when the robot's local computation suddenly fails but its network and actuator functions remain normal, the processing can be carried out by the normally running and more powerful server, so the robot's interaction control function can still be realized even when a local fault occurs.
Furthermore, to enhance transmission privacy and security, in some embodiments, the network between the robot local and the server may be a private network.
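For illustration only, the split execution described above can be pictured as a pipeline in which a prefix of the steps runs on one side and the remainder on the other, with the intermediate result passed over the network. The sketch below fakes the network with an in-process queue and uses placeholder step functions; none of the names come from the disclosure.

```python
from queue import Queue


def s11_acquire():            # placeholder for multi-modal data acquisition
    return {"visual": "frame", "voice": "hello"}

def s12_perceive(data):       # placeholder for perception and recognition
    return {m: f"semantic({v})" for m, v in data.items()}

def s13_predict(features):    # placeholder for interaction-information prediction
    return {"voice": "greet the user", "limb": "extend hand"}

def s14_execute(target_info): # placeholder for driving the actuators
    print("robot executes:", target_info)


STEPS = [s11_acquire, s12_perceive, s13_predict, s14_execute]


def run_split(split_at: int) -> None:
    """Run steps[0:split_at] on one side, ship the intermediate result over a
    (simulated) network, and run the remaining steps on the other side."""
    network = Queue()                 # stands in for the robot/server connection
    result = None
    for step in STEPS[:split_at]:     # executed locally on the robot
        result = step(result) if result is not None else step()
    network.put(result)               # transmit the intermediate data
    result = network.get()
    for step in STEPS[split_at:]:     # executed remotely
        result = step(result) if result is not None else step()


if __name__ == "__main__":
    run_split(split_at=2)   # e.g. acquisition + perception on the robot, rest remotely
```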
With the above technical solution, after the multi-modal data of the robot to be controlled in the target environment is acquired, perception and recognition are first performed on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment; target interaction information for the robot is then predicted based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment; and finally the robot is controlled to interact based on the target interaction information. Because semantic-level environment features take into account the semantic information carried by the environment features, ambiguity in the recognized environment features can be reduced; the semantic-level environment features therefore improve the accuracy of the predicted target interaction information, and thus the accuracy of the robot's interaction.
As explained above, in some embodiments the robot may include a flexible screen limb component, an environmental state sensor, a visual sensor, a voice sensor, and a force sensor, and the multi-modal data includes tactile data collected by the flexible screen limb component, environmental state data collected by the environmental state sensor, visual data collected by the visual sensor, voice data collected by the voice sensor, and force data collected by the force sensor.
In this embodiment, the robot can interact in various ways. Optionally, when the robot includes flexible screen limb components, it may interact by displaying images on the flexible screen limbs; optionally, it may interact by changing the shape of the flexible screen limbs; optionally, it may interact by outputting voice; and optionally, several of these interaction manners may be performed simultaneously. In this case, step S14 may include one or more of the following:
when the target interaction information includes an image to be displayed, controlling the flexible screen limb component to display the image to be displayed; or
when the target interaction information includes position movement information, controlling the robot to move based on the position movement information; or
when the target interaction information includes limb movement information, controlling the flexible screen limb components of the robot to move based on the limb movement information; or
when the target interaction information includes voice information, controlling the robot to output the content corresponding to the voice information in audio form.
It is understood that, besides acquiring tactile data, the flexible screen limb component can also be used for image display; in that case, if the target interaction information includes an image to be displayed, the flexible screen limb component can be controlled to display it.
The flexible screen limb part can display the image to be displayed in various display modes.
In some embodiments, each flexible screen limb may individually display all or part of the image to be displayed, or all the flexible screen limbs together may display all or part of the image as a whole. The image to be displayed may also be shown in picture-in-picture form.
In this embodiment, using the flexible screen limb components to display the image increases the diversity of image display compared with the related-art approach of displaying only on a screen at the robot's chest.
In other embodiments, it is considered that the robot's limb interactions may change the positions of the flexible screen limb components, for example when waving a hand up or down. In this case, if each flexible screen limb component always displays the image of the same image region, then after the component's position changes the image it displays may no longer merge with the images displayed by the other components into one coherent picture, causing image misalignment. For example, suppose the flexible screen on the robot's arm displays the upper region of the image to be displayed at one moment; after the robot makes a downward waving motion, if the arm screen still displays that upper region, the image on the arm screen and the image on the leg screen become misaligned and can no longer form a complete picture. To avoid such display misalignment and improve the interaction effect, the target interaction information may include the position information of the flexible screen limb components. In this case, referring to fig. 4, controlling the flexible screen limb component to display the image to be displayed may specifically include steps S141 to S143. Specifically:
and S141, acquiring the association relation between each image area included in the image to be displayed and a preset display position.
It will be appreciated that for any one image to be displayed, different image regions may be demarcated. For example, the pixels may be divided into one or more adjacent pixels, and the one or more adjacent pixels may be used as an image display area.
In this embodiment, when the image to be displayed is completely displayed, the display position corresponding to each image area in the image to be displayed may be preset, so as to obtain the association relationship between each image area included in the image to be displayed and the preset display position. Wherein, the display position corresponding to the image area is a position in a two-dimensional or three-dimensional coordinate system.
And S142, acquiring target image areas corresponding to the position information of the flexible screen limb parts based on the incidence relation between each image area included in the image to be displayed and a preset display position.
And S143, controlling the flexible screen limb part to display the image of the target image area.
After the target image area corresponding to the position information of each flexible screen limb part is obtained, the flexible screen limb parts can be controlled to display the image of the target image area.
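For illustration only, steps S141 to S143 can be pictured with the small lookup sketch below, under the assumption (made for this example, not stated in the disclosure) that display positions are 3-D coordinates and that each limb component displays the image region whose preset position is nearest to the limb's current position.

```python
import math
from typing import Dict, Tuple

Position = Tuple[float, float, float]


def nearest_region(region_to_position: Dict[str, Position],
                   limb_position: Position) -> str:
    """Pick the image region whose preset display position is closest to the
    current position of a flexible-screen limb component (step S142)."""
    return min(region_to_position,
               key=lambda r: math.dist(region_to_position[r], limb_position))


def assign_regions(region_to_position: Dict[str, Position],   # association from step S141
                   limb_positions: Dict[str, Position]          # reported limb positions
                   ) -> Dict[str, str]:
    """Return, for each limb component, the image region it should display (S142/S143)."""
    return {limb: nearest_region(region_to_position, pos)
            for limb, pos in limb_positions.items()}


if __name__ == "__main__":
    regions = {"upper": (0.0, 0.0, 1.5), "middle": (0.0, 0.0, 1.0), "lower": (0.0, 0.0, 0.5)}
    limbs = {"arm_screen": (0.1, 0.0, 0.6), "leg_screen": (0.0, 0.1, 0.4)}
    # After a downward wave, the arm screen is near the lower region, so the
    # displayed sub-image changes and the overall picture stays aligned.
    print(assign_regions(regions, limbs))
```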
The position information of the flexible screen limb components included in the target interaction information can take various forms.
Alternatively, the location information may be location information before the interaction and location information after the interaction. In this case, the flexible screen limb part of the robot may display different images before and after interaction, respectively.
Alternatively, the location information may be real-time location information during the interaction. Under the condition, the flexible screen limb part of the robot can dynamically display different images in real time in the interaction process, so that the interaction effect of the robot is further improved.
It can be understood that the above only details how the robot is controlled to interact based on an image to be displayed included in the target interaction information. As noted earlier, the target interaction information may be multi-modal, i.e. it may take various forms and include information other than an image to be displayed, for example voice information to be output, limb movement information, or position movement information. In that case, besides controlling the flexible screen limb components to display the image, the voice output module may also be controlled to output voice information (i.e. the robot is controlled to output the content corresponding to the voice information in audio form), the flexible screen limb components may be controlled to move according to the limb movement information, or the robot may be controlled to move based on the position movement information, and so on. The position movement information is the position information for moving the robot as a whole. The limb movement information is the information for moving the flexible screen limb components, for example how a limb moves and to which position; it may correspond to any action the robot can perform, such as shaking hands, dancing, or grasping.
In some embodiments, when the target interaction information includes multi-modal target interaction information, the target interaction information of each modality may further carry corresponding timing information. In this case, step S14 may specifically include: controlling the robot to interact based on the target interaction information of each modality and the timing information it carries.
The timing information indicates when each piece of target interaction information is to be executed. In this embodiment, when the target interaction information of each modality carries timing information, the robot can be controlled to interact based on the target interaction information of each modality and the timing information it carries.
Optionally, the timing information carried by the target interaction information of different modalities may differ, so that the robot executes them sequentially, for example first waving a hand, then issuing a greeting voice, and then displaying the consultation questions as an image.
Optionally, the timing information carried by the target interaction information of different modalities may also be the same, so that the robot executes them simultaneously, for example issuing a greeting voice while waving a hand and displaying the consultation questions at the same time.
Optionally, the timing information may also be partly the same and partly different, for example issuing a greeting voice while waving a hand, and displaying the consultation questions after the wave and the greeting are finished.
By adopting the method of the embodiment, the interactive behavior of the robot can be closer to the real human interactive process.
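For illustration only, the minimal scheduling sketch below assumes each piece of target interaction information carries an integer timing slot (an assumption for this example; the disclosure does not prescribe a representation): actions sharing a slot run together and slots run in order, covering the simultaneous, sequential, and mixed cases above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def schedule(actions: List[Tuple[str, str, int]]) -> List[List[str]]:
    """Group target interaction information by its timing slot.
    Each action is (modality, description, timing_slot); actions with the same
    slot are executed simultaneously, and slots are executed in increasing order."""
    slots: Dict[int, List[str]] = defaultdict(list)
    for modality, description, slot in actions:
        slots[slot].append(f"{modality}: {description}")
    return [slots[s] for s in sorted(slots)]


if __name__ == "__main__":
    plan = [
        ("limb",  "wave hand",                      0),
        ("voice", "say greeting",                   0),   # same slot, so simultaneous
        ("image", "display consultation questions", 1),   # later slot, so afterwards
    ]
    for step, group in enumerate(schedule(plan)):
        print(f"t={step}:", " | ".join(group))
```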
Referring to fig. 5, fig. 5 is a flowchart illustrating an interaction control method of a robot according to still another exemplary embodiment of the present disclosure. As shown in fig. 5, the method includes steps S31 to S36. Specifically:
S31, multi-modal data of the robot in the target environment is acquired, where the data of each modality represents data from one type of data source.
Step S31 is similar to step S11 and is not described again here.
S32, the data of each modality is optimized to obtain optimized multi-modal data.
In this embodiment, one denoising, filtering, or optimization algorithm, or a combination of several, may be used to optimize the data of each modality and obtain the optimized multi-modal data, thereby improving data quality.
S33, fusion correction processing is performed on the optimized multi-modal data to obtain fusion-corrected multi-modal data.
In this embodiment, the fusion correction process is mainly performed to ensure consistency and complementarity of data of multiple modalities.
For example, in some cases different sensors use different origins or reference points, so the data they collect about the same object may appear offset or partially missing. To express the data of the same object more accurately, fusion correction processing can be applied to the optimized multi-modal data to obtain fusion-corrected multi-modal data.
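For illustration only, one deliberately simplified form of such a correction is shown below: each sensor's 3-D observations are shifted from the sensor's own origin into a shared robot-base frame by a fixed translation. A real system would also handle rotation and time synchronisation; the offsets and points here are invented.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def to_base_frame(points: List[Point], sensor_offset: Point) -> List[Point]:
    """Translate points measured in a sensor's own frame into the robot-base frame."""
    ox, oy, oz = sensor_offset
    return [(x + ox, y + oy, z + oz) for x, y, z in points]


def fuse_corrected(observations: Dict[str, List[Point]],
                   offsets: Dict[str, Point]) -> Dict[str, List[Point]]:
    """Apply the per-sensor correction so that data from different sensors
    describing the same object become directly comparable."""
    return {name: to_base_frame(pts, offsets[name]) for name, pts in observations.items()}


if __name__ == "__main__":
    obs = {
        "depth_camera": [(0.10, 0.00, 1.20)],   # the same cup, seen by two sensors
        "lidar":        [(0.40, -0.20, 1.20)],
    }
    offsets = {"depth_camera": (0.30, -0.20, 0.00), "lidar": (0.00, 0.00, 0.00)}
    print(fuse_corrected(obs, offsets))   # both now report roughly (0.4, -0.2, 1.2)
```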
S34, perception and recognition are performed on the fusion-corrected multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment.
In this embodiment, after the fusion-corrected multi-modal data is obtained, perception and recognition can be performed on it to obtain the semantic-level environment features corresponding to each modality in the target environment.
S35, target interaction information for the robot is predicted based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment.
S36, the robot is controlled to interact based on the target interaction information.
Steps S34 to S36 are similar to steps S12 to S14, and are not described herein again.
With the method of this embodiment, after the multi-modal data of the robot to be controlled in the target environment is acquired, the data of each modality is first optimized, the optimized multi-modal data then undergoes fusion correction, and perception and recognition are finally performed on the fusion-corrected multi-modal data. This improves the quality of the data used for perception and recognition, and therefore further improves the accuracy of the robot's subsequent interaction.
The interaction control method of the robot according to the embodiment of the present application is described below with reference to a specific example in a service reception or consultation environment, where the method is applied to a robot.
In a public place such as an airport, train station, or subway station, the robot provides reception and business consultation services. A user waves to call the robot and says, "Robot, hello." Through the video captured by its vision sensor, the robot sees a person approaching together with the accompanying gesture, and it receives the voice signal; these constitute the multi-modal data. The robot performs perception and recognition on the multi-modal data to obtain semantic-level environment features for each modality: a man (gender recognized visually) of about 25 (age recognized visually) is approaching with a pleased expression (expression recognized visually), making a handshake gesture toward the robot with his right hand (gesture recognized visually), and saying "Hello robot, I have something to consult about" (speech recognition). The robot then predicts its target interaction information based on these semantic-level environment features and the object semantic relationship network of the target environment: output the voice "Good morning, happy to serve you", perform the limb action of extending a hand to meet the user's handshake, and display images with a friendly, warm, professional-service theme on the flexible screen limb components. The robot then acts accordingly: it says "Good morning, happy to serve you" while extending its hand to meet the user's handshake, and at the same time the robot's body "skin" displays images with a friendly, warm, professional-service theme.
After the robot's hand and the user's hand clasp, the robot continues to acquire the user's palm print through the flexible screen limb component of its hand while the user's face appears in the video captured by the vision sensor. The multi-modal data is again perceived and recognized to obtain semantic-level environment features for each modality, namely which specific user the face recognition and palm-print recognition identify. The robot then predicts its target interaction information based on the semantic-level environment features of each modality and the object semantic relationship network of the target environment, and interacts accordingly.
For example, the flexible display "skin" shows a theme in the user's favourite colour (the hand being shaken can be displayed as a delicate jade-like hand, a cartoon hand, or a monster hand), the flexible screen "skin" displays a list of questions the user may currently need to consult about (a question-and-answer display), and the robot asks a question by voice (for example, whether the user wants to know the weather at the destination).
After finishing the question, the robot may release the handshake and return to a natural standing service posture while continuing to receive voice data (for example, the user answers "Yes"); it then performs perception and recognition on the voice data, predicts the target interaction information, and interacts based on it.
For example, the robot answers the user's question by voice, "It will be raining when you reach your destination", displays on its chest flexible screen a selection of umbrellas available for purchase, and displays on its waist a rainy view of the destination. It then asks by voice: "Is there anything else I can help you with?"
Finally, the robot resumes receiving voice data (for example, the user says "No") and continues the interaction: it says "Have a pleasant trip, goodbye" while waving goodbye.
Referring to fig. 6, an exemplary embodiment of the present disclosure further provides an interaction control device 400 for a robot, including the following modules.
The multi-modal data acquisition module 410 is configured to acquire multi-modal data of the robot to be controlled in the target environment, where the data of each modality represents data from one type of data source.
The perception and recognition module 420 is configured to perform perception and recognition on the multi-modal data to obtain semantic-level environment features corresponding to each modality in the target environment.
The prediction module 430 is configured to predict target interaction information for the robot based on the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment.
The control module 440 is configured to control the robot to interact based on the target interaction information.
Optionally, the apparatus 400 further comprises: an optimization processing module configured to optimize the data of each modality to obtain optimized multi-modal data; and a correction processing module configured to perform fusion correction processing on the optimized multi-modal data to obtain fusion-corrected multi-modal data. In this case, the perception and recognition module 420 is further configured to perform perception and recognition on the fusion-corrected multi-modal data to obtain the semantic-level environment features corresponding to each modality in the target environment.
Optionally, the prediction module 430 is further configured to input the semantic-level environment features corresponding to each modality and the object semantic relationship network corresponding to the target environment into the target interaction information prediction model to obtain the target interaction information for the robot.
Optionally, the prediction module 430 is further configured to align the semantic-level environment features corresponding to each modality to obtain cross-semantic environment features corresponding to each modality; fuse the cross-semantic environment features corresponding to each modality to obtain a fused environment feature for the target environment; and input the fused environment feature and the object semantic relationship network corresponding to the target environment into the target interaction information prediction model to obtain the target interaction information for the robot.
Optionally, the apparatus 400 further includes a training module configured to acquire a plurality of sample data, where each sample data includes sample objects corresponding to the target environment and an object semantic relationship network; and to train the initial neural network model based on the plurality of sample data until a preset training condition is met, then stop training and output the target interaction information prediction model.
Optionally, the robot includes a flexible screen limb component, an environmental state sensor, a visual sensor, a voice sensor, and a force sensor, and the multi-modal data includes tactile data collected by the flexible screen limb component, environmental state data collected by the environmental state sensor, visual data collected by the visual sensor, voice data collected by the voice sensor, and force data collected by the force sensor.
Optionally, the target interaction information includes multi-modal target interaction information, and the target interaction information of each modality carries corresponding time sequence information, in this case, the control module 440 is further configured to control the robot to interact based on the target interaction information of each modality and the time sequence information carried by the target interaction information of each modality.
Optionally, the target interaction information includes an image to be displayed, in which case, the control module 440 is further configured to control the flexible screen limb component to display the image to be displayed.
Optionally, the target interaction information includes position information of the flexible screen limb component. In this case, the control module 440 includes a first obtaining sub-module, a second obtaining sub-module, and a control sub-module, where:
the first obtaining sub-module is configured to obtain an association relationship between each image area included in the image to be displayed and a preset display position;
the second obtaining sub-module is configured to obtain, based on the association relationship between each image area included in the image to be displayed and the preset display position, a target image area corresponding to the position information of each flexible screen limb component; and
the control sub-module is configured to control the flexible screen limb component to display the image of the target image area.
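The sub-modules above map regions of the image to be displayed onto the physical positions of the flexible screen limb components. The sketch below illustrates one such mapping under assumed data shapes (a crop rectangle per display position); it is not the patent's implementation.

```python
import numpy as np

def split_image_by_position(image: np.ndarray, region_map: dict, limbs: list) -> dict:
    """Select, for each flexible screen limb component, the image area associated
    with its display position.  `region_map` maps a position label to a
    (top, left, height, width) crop; both structures are illustrative assumptions."""
    targets = {}
    for limb_id, position in limbs:
        top, left, h, w = region_map[position]
        targets[limb_id] = image[top:top + h, left:left + w]
    return targets

# Usage: two limb screens showing the left and right halves of a 240x320 image.
image = np.zeros((240, 320, 3), dtype=np.uint8)
region_map = {"left_arm": (0, 0, 240, 160), "right_arm": (0, 160, 240, 160)}
targets = split_image_by_position(
    image, region_map, [("limb_0", "left_arm"), ("limb_1", "right_arm")])
```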
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 7 is a block diagram illustrating an interaction control device 500 of a robot according to an exemplary embodiment, where the interaction control device 500 of the robot may be a part of the robot, for example. As shown in fig. 7, the interactive control apparatus 500 of the robot may include: a processor 501 and a memory 502. The interactive control device 500 of the robot may further comprise one or more of a multimedia component 503, an input/output (I/O) interface 504, and a communication component 505.
The processor 501 is configured to control the overall operation of the interactive control device 500 of the robot, so as to complete all or part of the steps in the above-mentioned interaction control method of the robot. The memory 502 is used to store various types of data to support the operation of the interactive control device 500 of the robot; such data may include, for example, instructions for any application or method operating on the interactive control device 500 of the robot, as well as application-related data such as contact data, messages, pictures, audio, video, and the like. The memory 502 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia component 503 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; a received audio signal may further be stored in the memory 502 or transmitted through the communication component 505. The audio component further comprises at least one speaker for outputting audio signals. The I/O interface 504 provides an interface between the processor 501 and other interface modules, such as a keyboard, a mouse, or buttons; these buttons may be virtual or physical. The communication component 505 is used for wired or wireless communication between the interactive control device 500 of the robot and other equipment. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, 5G, NB-IoT, eMTC, or the like, or a combination of one or more thereof, which is not limited herein. The corresponding communication component 505 may accordingly include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the interaction control device 500 of the robot may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, and is used to perform the above-mentioned interaction control method of the robot.
In another exemplary embodiment, there is also provided a computer readable storage medium including program instructions, which when executed by a processor, implement the steps of the above-described interaction control method of a robot. For example, the computer readable storage medium may be the memory 502 including the program instructions executable by the processor 501 of the interaction control apparatus 500 of the robot to perform the interaction control method of the robot described above.
Fig. 8 is a block diagram illustrating an interactive control device 600 of a robot in accordance with an exemplary embodiment. For example, the interactive control device 600 of the robot may be provided as a server. Referring to fig. 8, the interactive control device 600 of the robot includes a processor 622, the number of which may be one or more, and a memory 632 for storing a computer program executable by the processor 622. The computer program stored in memory 632 may include one or more modules that each correspond to a set of instructions. Further, the processor 622 may be configured to execute the computer program to perform the above-described interactive control method of the robot.
Additionally, the interactive control device 600 of the robot may further include a power component 626 and a communication component 650. The power component 626 may be configured to perform power management of the interactive control device 600 of the robot, and the communication component 650 may be configured to enable communication, e.g., wired or wireless communication, of the interactive control device 600 of the robot. In addition, the interactive control device 600 of the robot may further include an input/output (I/O) interface 658. The interactive control device 600 of the robot may operate based on an operating system stored in the memory 632, such as Windows Server™, Mac OS X™, Unix™, Linux™, and so on.
In another exemplary embodiment, there is also provided a computer readable storage medium including program instructions, which when executed by a processor, implement the steps of the above-described interaction control method of a robot. For example, the computer readable storage medium may be the memory 632 including the program instructions, which are executable by the processor 622 of the robot interaction control apparatus 600 to perform the robot interaction control method described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned method of interactive control of a robot when executed by the programmable apparatus.
In another exemplary embodiment, a robot is also provided, the robot comprising an actuator, a plurality of types of sensors, and a processing device coupled to the plurality of types of sensors and the actuator, the actuator comprising a flexible screen limb member disposed on a body of the robot; the processing device is used for: acquiring data of multiple modes, wherein the data of the multiple modes comprises a combination of at least two data of tactile data acquired through a flexible screen limb part, environmental state data, visual data, voice data and force data acquired through multiple types of sensors respectively; sensing and identifying data of multiple modes to obtain semantic environment characteristics corresponding to the modes in a target environment; predicting target interaction information corresponding to the robot based on semantic level environment features corresponding to various modalities and an object semantic relation network corresponding to a target environment; and controlling the corresponding actuator to carry out interactive operation based on the target interactive information.
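Putting the steps of the processing device together, an end-to-end control loop might be skeletonized as follows; every method body is a stand-in chosen for illustration, and the feature shapes and the number of actuator commands are assumptions rather than values taken from the disclosure.

```python
import numpy as np

class InteractionController:
    """Skeleton of the processing-device loop described above: acquire
    multi-modal data, perceive, predict, then drive the actuators.  Every method
    body is a stand-in; the disclosure does not prescribe any of them."""

    def acquire(self) -> dict:
        # At least two modalities, as required above; random vectors as stand-ins.
        return {"tactile": np.random.randn(8), "visual": np.random.randn(64)}

    def perceive(self, data: dict) -> dict:
        # Semantic-level features per modality (here, a trivial truncation).
        return {m: v[:4] for m, v in data.items()}

    def predict(self, feats: dict, graph_embedding: np.ndarray) -> np.ndarray:
        x = np.concatenate(list(feats.values()) + [graph_embedding])
        return np.tanh(x[:3])                 # e.g. three actuator commands

    def step(self, graph_embedding: np.ndarray) -> np.ndarray:
        commands = self.predict(self.perceive(self.acquire()), graph_embedding)
        # A real robot would forward `commands` to the flexible screen limb
        # component and the other actuators here.
        return commands

controller = InteractionController()
controller.step(graph_embedding=np.zeros(4))
```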
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that the various specific technical features described in the above embodiments may be combined in any suitable manner without departing from the scope of the present disclosure. To avoid unnecessary repetition, the disclosure does not separately describe the various possible combinations.
In addition, any combination of the various embodiments of the present disclosure may likewise be made, and such combinations should also be regarded as content disclosed by the present disclosure, as long as they do not depart from the spirit of the present disclosure.

Claims (13)

1. An interactive control method for a robot, the method comprising:
obtaining data of multiple modes of the robot under control in a target environment, wherein the data of one mode represents data from one type of data source;
performing perception identification on the data of the multiple modes to obtain semantic environment characteristics corresponding to the various modes in the target environment;
predicting target interaction information corresponding to the robot based on the semantic level environment characteristics corresponding to the various modalities and the object semantic relation network corresponding to the target environment;
and controlling the robot to interact based on the target interaction information.
2. The method of interactive control of a robot of claim 1, wherein after the obtaining data of the plurality of modalities in the target environment, the method further comprises:
optimizing the data of the plurality of modes respectively to obtain optimized data of the plurality of modes;
performing fusion correction processing on the optimized data of the plurality of modes to obtain fusion corrected data of the plurality of modes;
the sensing and identifying the data of the multiple modes to obtain semantic environmental characteristics corresponding to the various modes in the target environment includes:
and performing perception identification on the data subjected to fusion correction of the multiple modes to obtain semantic-level environmental features corresponding to the various modes in the target environment.
3. The interaction control method of a robot according to claim 1, wherein predicting the target interaction information corresponding to the robot based on the semantic-level environment features corresponding to the modalities and the object semantic relationship network corresponding to the target environment comprises:
and inputting the semantic level environment characteristics corresponding to the various modes and the object semantic relation network corresponding to the target environment into a target interaction information prediction model to obtain target interaction information corresponding to the robot.
4. The interaction control method of a robot according to claim 1, wherein predicting the target interaction information corresponding to the robot based on the semantic-level environment features corresponding to the modalities and the object semantic relationship network corresponding to the target environment comprises:
aligning the semantic-level environmental features corresponding to the various modalities to obtain cross-semantic environmental features corresponding to the various modalities;
performing fusion processing on the cross-semantic environment features corresponding to the various modes to obtain fusion environment features under the target environment;
and inputting the fusion environment characteristics and the object semantic relation network corresponding to the target environment into the target interaction information prediction model to obtain target interaction information corresponding to the robot.
5. The interaction control method for a robot according to claim 3 or 4, wherein the training process of the target interaction information prediction model includes:
acquiring a plurality of sample data, wherein each sample data comprises a sample object corresponding to the target environment and an object semantic relationship network;
and training an initial neural network model based on the plurality of sample data, stopping the training and outputting the target interaction information prediction model when a preset training condition is met.
6. A method for interactive control of a robot according to any of claims 1-4, characterized in that the robot comprises a flexible screen limb part, an environmental status sensor, a vision sensor, a speech sensor and a force sensor, and that the data of the plurality of modalities comprises tactile data collected by the flexible screen limb part, environmental status data collected by the environmental status sensor, visual data collected by the vision sensor, speech data collected by the speech sensor and force data collected by the force sensor.
7. The interaction control method of the robot according to claim 6, wherein the target interaction information includes multi-modal target interaction information, each modal target interaction information carries corresponding timing information, and the controlling the robot to interact based on the target interaction information includes:
and controlling the robot to interact based on the target interaction information of each mode and the time sequence information carried by the target interaction information of each mode.
8. The interaction control method of the robot according to claim 6, wherein the controlling the robot to interact based on the target interaction information comprises one or more of the following steps:
under the condition that the target interaction information comprises an image to be displayed, controlling the flexible screen limb part to display the image to be displayed; or
under the condition that the target interaction information comprises position movement information, controlling the robot to move based on the position movement information; or
under the condition that the target interaction information comprises limb movement information, controlling the flexible screen limb part of the robot to move based on the limb movement information; or
under the condition that the target interaction information comprises voice information, controlling the robot to output content corresponding to the voice information in an audio form.
9. The interaction control method of a robot according to claim 8, wherein the target interaction information includes position information of a flexible screen limb, and the controlling the flexible screen limb to display the image to be displayed in the case where the target interaction information includes the image to be displayed includes:
acquiring an association relation between each image area included in the image to be displayed and a preset display position;
acquiring a target image area corresponding to the position information of each flexible screen limb part based on the association relation between each image area included in the image to be displayed and the preset display position;
and controlling the flexible screen limb part to display the image of the target image area.
10. An interactive control device for a robot, comprising:
the multi-modal data acquisition module is used for acquiring data of multiple modalities of the robot under control in a target environment, wherein the data of one modality represents data from one type of data source;
the perception identification module is used for perceiving and identifying the data of the multiple modes to obtain semantic environmental characteristics corresponding to the various modes in the target environment;
the prediction module is used for predicting target interaction information corresponding to the robot based on semantic environment characteristics corresponding to the various modalities and an object semantic relation network corresponding to the target environment;
and the control module is used for controlling the robot to interact based on the target interaction information.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
12. An interactive control device for a robot, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 9.
13. A robot is characterized by comprising an actuator, a plurality of types of sensors and a processing device connected with the sensors and the actuator, wherein the actuator comprises a flexible screen limb part arranged on a robot main body;
the processing device is used for:
acquiring data of multiple modes, wherein the data of the multiple modes comprise a combination of at least two of tactile data acquired through the flexible screen limb part, environmental state data, visual data, voice data and force sense data acquired through the multiple types of sensors respectively;
performing perception identification on the data of the multiple modes to obtain semantic environment characteristics corresponding to the various modes in the target environment;
predicting target interaction information corresponding to the robot based on the semantic level environment characteristics corresponding to the various modalities and the object semantic relation network corresponding to the target environment;
and controlling the corresponding actuator to carry out interactive operation based on the target interactive information.
CN202110729750.2A 2021-06-29 2021-06-29 Interaction control method and device for robot, storage medium and robot Pending CN115213884A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729750.2A CN115213884A (en) 2021-06-29 2021-06-29 Interaction control method and device for robot, storage medium and robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729750.2A CN115213884A (en) 2021-06-29 2021-06-29 Interaction control method and device for robot, storage medium and robot

Publications (1)

Publication Number Publication Date
CN115213884A true CN115213884A (en) 2022-10-21

Family

ID=83606131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729750.2A Pending CN115213884A (en) 2021-06-29 2021-06-29 Interaction control method and device for robot, storage medium and robot

Country Status (1)

Country Link
CN (1) CN115213884A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117697769A (en) * 2024-02-06 2024-03-15 成都威世通智能科技有限公司 Robot control system and method based on deep learning
CN117697769B (en) * 2024-02-06 2024-04-30 成都威世通智能科技有限公司 Robot control system and method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination