WO2023082949A1 - Agent control method and apparatus, electronic device, program, and storage medium - Google Patents

Agent control method and apparatus, electronic device, program, and storage medium Download PDF

Info

Publication number
WO2023082949A1
WO2023082949A1 (PCT/CN2022/125695; CN2022125695W)
Authority
WO
WIPO (PCT)
Prior art keywords
digital twin
agent
target task
world
control
Prior art date
Application number
PCT/CN2022/125695
Other languages
French (fr)
Chinese (zh)
Inventor
黄晓庆
马世奎
彭飞
Original Assignee
达闼科技(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 达闼科技(北京)有限公司 filed Critical 达闼科技(北京)有限公司
Publication of WO2023082949A1 publication Critical patent/WO2023082949A1/en

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras

Definitions

  • The embodiments of the present application relate to the technical field of intelligent control, and in particular to an agent control method and apparatus, an electronic device, a program, and a storage medium.
  • In the field of artificial intelligence, the data collected by smart devices is typically used as the input for learning and training, and the output is used to control the actions of an agent, for example by collecting RGBD (RGB-Depth Map, i.e. RGB color plus depth map) information as input data.
  • RGBD information is usually obtained by a camera performing image acquisition and recognition. However, the data acquired by the camera includes not only the RGBD information but also a variety of unnecessary parameters, such as lighting and shadow conditions and image data of nearby obstacles. To obtain the target RGBD information, the captured images must therefore be screened and processed, which inevitably involves a large amount of computation; that is, when RGBD information is used as the input data for learning and training, data acquisition is difficult and high computing power is required of the data processing equipment.
  • The large amount of data to be processed also leads to slow training convergence, and in some execution processes there is the further problem of complex migration between virtual and real data during computation. Because the data processing is so complex, the control efficiency of this training and learning process over the agent is low.
  • The purpose of the embodiments of the present application is to provide an agent control method, apparatus, electronic device, and storage medium that reduce the complexity of data processing and thereby improve the control efficiency of the agent.
  • An embodiment of the present application provides an agent control method including the following steps: obtaining a target task; generating, according to environment data of a digital twin world, the pose of the agent, and a reinforcement learning network, a control instruction for controlling a digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and controlling, according to the control instruction for completing the target task, the agent to execute the target task.
  • An embodiment of the present application also provides an agent control apparatus, including: an acquisition module configured to acquire a target task; a generation module configured to generate, according to environment data of a digital twin world, the pose of the agent, and a reinforcement learning network, a control instruction for controlling a digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and an execution module configured to control the agent to execute the target task according to the control instruction for completing the target task.
  • An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above agent control method.
  • An embodiment of the present application also provides a computer program that implements the above agent control method when executed by a processor.
  • An embodiment of the present application also provides a computer-readable storage medium storing a computer program that implements the above agent control method when the program is executed by a processor.
  • In the embodiments of the present application, the physical world is simulated by the digital twin world, and the digital twin world contains a digital twin corresponding to the agent in the physical world. Operating the digital twin with control instructions in the digital twin world simulates the result of using those control instructions to operate the agent, and suitable control instructions for making the agent perform the target task are obtained through training. There is no need to preprocess input parameters such as RGBD data, the complexity of computing the control instructions output to the agent is reduced, and the control efficiency of the agent is improved.
  • In some embodiments, generating, according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, the control instruction for controlling the digital twin to complete the target task includes: inputting the pose of the agent and a spatial semantic map representing the environment data into the reinforcement learning network, which outputs control instructions for controlling the actions of the digital twin; the reinforcement learning network is trained, based on the results of the digital twin executing the control instructions, to obtain the control instruction that completes the target task. That is, simulation training is carried out in the digital twin world using the environment data, the pose of the agent, and the reinforcement learning network, with continuous adjustment according to feedback until the control instruction that completes the target task is obtained.
  • In some embodiments, the initial control instruction output by the reinforcement learning network is generated from prior data, where the prior data is obtained from the actions of a user controlling the digital twin through an interactive device.
  • The prior data is data that achieves, or comes close to achieving, the target task; using it as the initial control instruction reduces the number of training iterations and the complexity of data processing.
  • In some embodiments, the digital twin world is loaded on a cloud server, and generating the control instruction for controlling the digital twin to complete the target task according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: generating, through interaction with the cloud server, the control instruction for controlling the digital twin to complete the target task according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network. Loading the digital twin world on the cloud greatly reduces the computing requirements on the agent itself and the complexity of the device setup; at the same time, the data processing capability of a cloud server is generally high, which further improves the efficiency of obtaining the control instruction that completes the target task.
  • In some embodiments, after the target task is obtained and before the control instruction for controlling the digital twin to complete the target task is generated, the method further includes: disabling the rendering function; after the control instruction for completing the target task is generated, the method further includes: enabling the rendering function.
  • The rendering function is used for display to the user and generally occupies considerable computing resources; the data produced before the control instruction that completes the target task is generated is generally of no practical use to the user, so the rendering function is disabled during this period and the device's data processing resources are devoted entirely to generating control instructions, which improves the efficiency of control instruction generation.
  • After the control instruction is obtained, the rendering function is enabled so that the process of the digital twin executing the control instruction is visualized for the user, who can then observe the simulated execution of the control instruction.
  • Fig. 1 is a flowchart of the agent control method provided according to an embodiment of the present application;
  • Fig. 2 is a schematic diagram of the agent control apparatus provided according to an embodiment of the present application;
  • Fig. 3 is a schematic diagram of an electronic device provided according to an embodiment of the present application.
  • The terms “first” and “second” in the embodiments of the present application are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined with “first” or “second” may explicitly or implicitly include at least one such feature.
  • The terms “including” and “having” and any variations thereof are intended to cover non-exclusive inclusion. For example, a system, product, or device comprising a series of components or units is not limited to the listed components or units, but may also include components or units that are not listed, or other components or units inherent in such a product or device.
  • “Plurality” means at least two, for example two or three, unless otherwise specifically defined.
  • An embodiment of the present application relates to a method for controlling an agent. The specific process is shown in Figure 1.
  • Step 101: acquire a target task.
  • Step 102: generate, according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, control instructions for controlling the digital twin to complete the target task, where the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin.
  • Step 103: control the agent to execute the target task according to the control instruction for completing the target task.
  • In this embodiment, the physical world is simulated by the digital twin world, and the digital twin world contains a digital twin corresponding to the agent in the physical world. Operating the digital twin with control instructions in the digital twin world simulates the result of using those instructions to operate the agent, and suitable control instructions are obtained through training so that the agent performs the target task.
  • The agent control method of this embodiment is described in detail below. The following content is provided only to aid understanding and is not required to implement this solution. In the following, “training” always refers to the process of obtaining the control instruction that completes the target task.
  • In step 101, a target task is acquired.
  • The target task may be obtained from the user, from another interactive device, or from the cloud; the target task is, for example, a task involving spatial position, such as moving or grasping a specified item.
  • The target task does not necessarily involve a three-dimensional positional relationship; it may also be independent of three-dimensional position, such as (two-dimensional) image recognition, audio processing, or image-to-text conversion, as long as it can be performed by the robot.
  • In step 102, control instructions for controlling the digital twin to complete the target task are generated according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network; the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin.
  • The digital twin world is obtained by mapping the real physical world, transforming the physical environment into digital content for display, and can simulate the positional relationships of objects in the physical world and related environmental information.
  • As for how the digital twin world is obtained, it may be built by a modeler or obtained by directly scanning the physical world. The agent in the physical world may be a robot, and the digital twin world contains a digital twin corresponding to the agent (robot) that can simulate the agent's behavior in the digital twin world. Because the digital twin world is a digital embodiment of the physical world, the interaction between the digital twin and its surroundings as it moves about the digital twin world can simulate the consequences of the agent performing the same activity in the physical world. The digital twin world models the geometric structure corresponding to the physical world, spatial positions, the physical structural constraints of the agent, and physical characteristics such as friction coefficients and gravity.
  • In some examples, the control instructions for controlling the digital twin to complete the target task are generated as follows: the pose of the agent and the spatial semantic map representing the environment data are input into the reinforcement learning network, which outputs control instructions for controlling the actions of the digital twin; the reinforcement learning network is trained, based on the results of the digital twin executing the control instructions, to obtain the control instruction that completes the target task.
  • By inputting the pose of the agent in the current physical world into the reinforcement learning network, the network can also obtain the spatial semantic map representing the environment data of the digital twin world. Because the digital twin corresponds to the agent in the digital twin world, the pose of the agent in the physical world acquired by the reinforcement learning network is the initial state of the digital twin, and the reinforcement learning network outputs control instructions for changing the actions of the digital twin. The digital twin changes its actions in the digital twin world according to the control instructions of the reinforcement learning network; the network obtains the result produced by the digital twin after it acts on a control instruction, compares the result with the target task, and adaptively adjusts the control instruction according to this comparison, until the digital twin completes the target task in simulation under a particular control instruction of the reinforcement learning network. That control instruction is then the control instruction used to control the agent to perform the target task. In other words, simulation training is carried out in the digital twin world using the environment data, the pose of the agent, and the reinforcement learning network, with continuous adjustment until the control instruction that completes the target task is obtained.
  • The results of the digital twin acting on the control instructions include, but are not limited to, the chassis and whole-body posture of the digital twin, whether a collision has occurred, and whether the target task has been completed; the content of the control instructions includes, but is not limited to, controlling the movement of the digital twin, limb movements, and so on.
  • The reinforcement learning network has different interfaces for obtaining the relevant information. A state observation interface collects the state of the digital twin world, including the chassis and whole-body pose of the agent and the spatial semantic map; for example, when the target task is picking up a cup, it can collect the distance to the target cup. An action control interface outputs the control instructions of the reinforcement learning network and applies them to the digital twin world, for example controlling the digital twin's body movement and limb movements. A feedback interface collects the result feedback when the digital twin performs actions in the digital twin world according to the control instructions, for example whether a collision occurred and whether the target task was completed. A minimal sketch of these interfaces follows.
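  • As an illustration only, the three interfaces described above can be organized like a standard reinforcement-learning environment. The sketch below is an assumption, not the application's implementation; the class and method names (TwinWorldEnv, observe, apply_action, feedback, reset_to_checkpoint) are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Observation:
    """State collected through the state observation interface."""
    chassis_pose: Tuple[float, float, float]   # x, y, yaw of the agent's chassis
    joint_positions: List[float]               # whole-body pose, e.g. 7 arm joint angles
    semantic_map: Dict[str, dict]              # spatial semantic map of the twin world


class TwinWorldEnv:
    """Hypothetical wrapper around the digital twin world (names are illustrative)."""

    def observe(self) -> Observation:
        """State observation interface: read the current state of the twin world."""
        raise NotImplementedError

    def apply_action(self, joint_deltas: List[int]) -> None:
        """Action control interface: apply a control instruction to the digital twin."""
        raise NotImplementedError

    def feedback(self) -> Dict[str, bool]:
        """Feedback interface: report collision status and task completion."""
        return {"collision": False, "task_done": False}

    def reset_to_checkpoint(self) -> None:
        """Return the digital twin to its pose before the last instruction (used on failure)."""
        raise NotImplementedError
```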
  • During training, the reinforcement learning network gradually outputs control instructions to the digital twin in the digital twin world according to the target task, so that the digital twin becomes able to complete the target task.
  • For example, the reinforcement learning network may divide the target task into multiple sub-steps. After the control instruction for each sub-step is sent to the digital twin, the network obtains the feedback produced by the digital twin executing that instruction and judges whether the sub-step has been completed, thereby gradually acquiring a set of control instructions that completes the target task.
  • If a sub-step fails, the reinforcement learning network adjusts according to the feedback from the digital twin executing the control instruction. For example, if a collision with the environment occurs while a movement instruction is being executed, the digital twin can be returned to its initial position before the movement instruction, the movement instruction can be updated by shortening the movement distance or adjusting the movement angle, and the digital twin then executes the updated instruction, until the sub-step is completed. Completing the sub-step of a movement instruction means, for example, reaching the destination of the movement instruction, or reaching it without colliding with the surrounding environment.
  • The reinforcement learning network obtains the result of the digital twin successfully executing a sub-step, for example through feedback from the digital twin or by monitoring the digital twin world and observing that the sub-step has been completed. Once the network knows that a sub-step has been executed successfully, it can proceed to train the control instruction for the next sub-step. If the sub-step is the last one of the target task, or the target task consists of only that one step, the control instructions of all successfully executed sub-steps are combined to obtain the control instruction that completes the target task. In other words, by having the digital twin execute the control instructions output by the reinforcement learning network and by repeated trial and error, a set of control instructions that completes the target task is obtained; a simplified sketch of this loop follows.
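  • A minimal sketch of the trial-and-error loop over sub-steps, assuming the hypothetical TwinWorldEnv interface above; the policy object and the sub-step decomposition are stand-ins for whatever the reinforcement learning network actually does, not the application's algorithm.

```python
def train_on_task(env, policy, sub_steps, max_tries_per_step=100):
    """Collect one control instruction per sub-step by trial and error in the twin world."""
    plan = []                                   # accumulated instructions that complete the task
    for goal in sub_steps:                      # e.g. ["approach cup", "close gripper", "lift"]
        for _attempt in range(max_tries_per_step):
            obs = env.observe()
            action = policy.act(obs, goal)      # control instruction proposed by the network
            env.apply_action(action)
            result = env.feedback()
            policy.update(obs, action, result)  # adjust according to the simulated outcome
            if result["collision"]:
                env.reset_to_checkpoint()       # go back to the pose before this instruction
                continue
            if result["task_done"]:
                plan.append(action)             # keep the instruction that completed the sub-step
                break
        else:
            raise RuntimeError(f"sub-step {goal!r} not solved in {max_tries_per_step} tries")
    return plan                                 # combined instructions for the whole target task
```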
  • In some examples, the initial control instruction output by the reinforcement learning network is generated from prior data, where the prior data is obtained from the actions of a user controlling the digital twin through an interactive device. That is, to reduce the number of adjustments to the control instructions of the reinforcement learning network, or to reduce its memory usage during computation, the initial control instruction is generated from prior data: manual control instructions obtained from the actions of a user controlling the digital twin through an interactive device, or control instructions from historical records that completed, or came close to completing, the target task.
  • This makes the process of obtaining the control instruction that completes the target task through reinforcement learning training more efficient, reduces the amount of intermediate debugging, and reduces the memory used by data operations.
  • The interactive device includes a mouse, a keyboard, a motion-sensing device, or any combination thereof. The prior data is therefore data that achieves, or comes close to achieving, the target task, and using it as the initial control instruction reduces the number of training iterations and the complexity of data processing.
  • For example, instructions input by the trainer through the mouse, keyboard, or motion-sensing device can be obtained to control the digital twin to interact with the environment, objects, or other data in the digital twin world, generating high-quality expert control instructions. Compared with control instructions generated independently by the reinforcement learning network, the control instructions obtained from the trainer greatly improve the completion rate of the target task.
  • Left to itself, the reinforcement learning network can only generate control instructions randomly according to the target task, or generate different types of control instructions from partial label information; that is, it cannot guarantee that the initially generated control instructions are relevant to the target task. With the control instructions obtained from the trainer used as prior data, training starts from instructions that are highly correlated with the target task, which greatly reduces how much the control instructions need to be adjusted and reduces the storage space and time required for computation.
  • For example, suppose the trainer has input control instructions with which the digital twin successfully picked up the cup at position a1. When the target task is to pick up the cup at position a2, a query finds that executable control instructions already exist for the similar task of picking up the cup at a1; training on the basis of those trainer-provided control instructions, compared with directly training the control instruction for picking up the cup at a2, significantly reduces the required computation time, lowers the computational complexity, and improves the user experience.
  • In some examples, the spatial semantic map includes the poses of objects in the digital twin world, 3D collision boxes, object classification information, and object material information.
  • The pose of each object in the digital twin world is used to simulate the positions of the objects surrounding the agent in its physical-world environment; the 3D collision boxes specify or constrain collision relationships in the digital twin world so that the movement of objects is closer to the physical world, reflecting, for example, the physical structure of each object; and the object material information is used to simulate fine-grained physical characteristics of the physical environment before and after the agent moves, such as friction coefficients and sliding. One possible data layout is sketched below.
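  • Purely as an illustration of how such a map could be laid out, the structure below is an assumption; the field names (pose, collision_box, category, material) and units are not specified by the application.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ObjectEntry:
    pose: Tuple[float, float, float, float, float, float]  # x, y, z, roll, pitch, yaw
    collision_box: Tuple[float, float, float]               # 3D collision box extents (m)
    category: str                                            # object classification, e.g. "cup"
    material: Dict[str, float]                               # material info, e.g. friction


# Spatial semantic map: object id -> its entry in the digital twin world.
SemanticMap = Dict[str, ObjectEntry]

example_map: SemanticMap = {
    "cup_01": ObjectEntry(
        pose=(1.2, 0.4, 0.9, 0.0, 0.0, 0.0),
        collision_box=(0.08, 0.08, 0.12),
        category="cup",
        material={"friction": 0.6},
    )
}
```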
  • In some examples, the reinforcement learning network includes a DQN (Deep Q Network) model; the input of the DQN model is an RGBD image together with the pose of the agent and the spatial semantic map, and the output of the DQN model is the action of each joint of the robotic arm.
  • Taking the DQN model as an example, the input of the model is an RGBD image and the output is the movement of each joint of the robotic arm; each joint's movement in a frame is one of three actions (for example, rotating by -1°, 0°, or +1°), and these three actions are represented as [-1, 0, 1] in the network.
  • The robotic arm in this example has a total of 7 joints, so for each frame the DQN takes an RGBD image as input and outputs a 7×3 array.
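  • As a rough illustration of a network with this input/output shape, the PyTorch sketch below is an assumed architecture, not the one used by the application; only the 7×3 output (one row of Q-values per joint, one column per discrete action in [-1, 0, 1]) follows from the text above.

```python
import torch
import torch.nn as nn


class JointDQN(nn.Module):
    """Illustrative DQN: an RGBD frame in, a 7x3 grid of Q-values out (7 joints x 3 actions)."""

    def __init__(self, num_joints: int = 7, num_actions: int = 3):
        super().__init__()
        self.backbone = nn.Sequential(           # 4 input channels: R, G, B, depth
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)), nn.Flatten(),
        )
        self.head = nn.Linear(64 * 7 * 7, num_joints * num_actions)
        self.num_joints, self.num_actions = num_joints, num_actions

    def forward(self, rgbd: torch.Tensor) -> torch.Tensor:
        q = self.head(self.backbone(rgbd))                    # (batch, 21)
        return q.view(-1, self.num_joints, self.num_actions)  # (batch, 7, 3)


# Choosing one of the three actions {-1, 0, +1} for each joint in a single frame:
q_values = JointDQN()(torch.zeros(1, 4, 120, 160))            # dummy RGBD frame
joint_actions = q_values.argmax(dim=-1) - 1                   # tensor of 7 values in {-1, 0, 1}
```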
  • In some examples, the prior data is obtained as follows: operation instructions for controlling the robotic arm, input by the user on the basis of the collected RGBD image, are received through the interactive device; the action of each joint of the robotic arm is obtained from those operation instructions; and the RGBD image and the action of each joint are saved as prior data.
  • For example, the trainer completes a cup-grasping task by observing the collected RGBD images and operating a keyboard, mouse, or motion-sensing device to control the robotic arm. While the task is being completed, the rotation of each joint is recorded automatically, and these rotations are combined with the RGBD images to form the prior data, which serves as the initial data for the DQN.
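  • A minimal sketch of how such teleoperation frames might be recorded as prior data; the file format and the helper names (get_rgbd_frame, read_joint_deltas, task_done) are assumptions for illustration only.

```python
import json
from typing import Callable, List


def record_prior_data(
    get_rgbd_frame: Callable[[], list],          # returns the current RGBD frame as nested lists
    read_joint_deltas: Callable[[], List[int]],  # 7 values in {-1, 0, 1} from the operator input
    task_done: Callable[[], bool],               # True once the trainer has finished the task
    out_path: str = "prior_data.jsonl",
) -> None:
    """Save (RGBD frame, joint actions) pairs while the trainer teleoperates the digital twin."""
    with open(out_path, "w") as f:
        while not task_done():
            frame = get_rgbd_frame()
            actions = read_joint_deltas()        # rotation of each joint for this frame
            f.write(json.dumps({"rgbd": frame, "joint_actions": actions}) + "\n")
```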
  • Note that this embodiment can target different tasks and reinforcement learning networks, and is not limited to acquiring RGBD images, or to acquiring only RGBD image information and agent poses.
  • In some examples, the digital twin world is loaded on a cloud server, and generating the control instructions for controlling the digital twin to complete the target task according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: generating those control instructions through interaction with the cloud server. That is, processing the digital twin world requires high-complexity hardware support and occupies considerable computing resources; loading the digital twin world on the cloud server reduces the computing power required of the agent's device, and the relatively strong computing power of the cloud server improves the efficiency with which the control instruction that completes the target task is generated.
  • The reinforcement learning network may also be located on the cloud server, which further reduces the data computing resources required of the agent and improves the efficiency of generating the control instruction that completes the target task. A sketch of this interaction is given below.
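  • Illustration only: one way the agent-side device could delegate instruction generation to a cloud-hosted twin world. The endpoint, payload fields, and the use of HTTP are assumptions; the application does not specify the interaction protocol.

```python
import requests


def request_control_instructions(server_url: str, target_task: str, agent_pose: list) -> list:
    """Ask the cloud-hosted digital twin world to generate instructions for the target task."""
    resp = requests.post(
        f"{server_url}/generate_instructions",    # hypothetical endpoint on the cloud server
        json={"task": target_task, "agent_pose": agent_pose},
        timeout=600,                              # training in the twin world may take a while
    )
    resp.raise_for_status()
    return resp.json()["instructions"]            # control instructions to execute on the agent
```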
  • In step 103, the agent is controlled to execute the target task according to the control instruction for completing the target task.
  • The agent receives the control instruction and executes it to complete the target task. As noted above, loading the digital twin world on the cloud greatly reduces the computing requirements on the agent itself and reduces the complexity of the device setup; since the data processing capability of cloud servers is generally high, this further improves the efficiency of obtaining the control instruction that completes the target task.
  • In some examples, after the target task is obtained and before the control instruction for controlling the digital twin to complete the target task is generated, the method further includes disabling the rendering function; after the control instruction for controlling the digital twin to complete the target task is generated, the method further includes enabling the rendering function.
  • The rendering function is used for display to the user and, before the control instruction that completes the target task is obtained, generally occupies considerable computing resources; the data produced before that control instruction is generated is generally of no practical use to the user.
  • During training, the training process data can therefore be placed in storage that simply guarantees access, for example handled by the CPU (central processing unit), without being rendered and displayed, which lowers the training complexity; since rendering takes a long time, reducing rendering also improves training efficiency. When training is complete or nearly complete, the training process data is rendered and displayed, which on the one hand lets users perceive the training results and on the other hand allows checking whether the control instructions obtained from training conform to human behavioral habits; the usual habit, for example, is to pick up a cup with its rim facing up.
  • If the control instructions obtained in training do achieve the purpose of picking up the cup but the cup ends up with its rim facing down, this does not conform to the behavioral habits of ordinary people; such results are hard to detect accurately during training itself, but they are easy to spot, and then further optimize, by observing the rendered execution. A sketch of the rendering toggle follows.
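  • A sketch of how the rendering toggle might wrap the training call, reusing the train_on_task sketch above and assuming a simulator handle with set_rendering and replay methods (illustrative names, not an API from the application).

```python
def generate_instructions_without_rendering(sim, env, policy, sub_steps):
    """Disable rendering while instructions are trained, then render the result for the user."""
    sim.set_rendering(False)                     # free rendering resources during training
    try:
        plan = train_on_task(env, policy, sub_steps)
    finally:
        sim.set_rendering(True)                  # re-enable display once instructions exist
    sim.replay(plan)                             # visualize the digital twin executing the plan
    return plan
```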
  • In some examples, after the agent is controlled to perform the target task according to the control instruction for completing the target task, the method further includes: when the agent fails to perform the target task, receiving an auxiliary instruction input by the user through the interactive device, the auxiliary instruction being used to control the agent so that it successfully executes the target task; and, after the target task is executed successfully, updating the prior data according to the actions of the joints of the robotic arm during execution of the auxiliary instruction.
  • That is, if failures occur during later use after the reinforcement learning network has converged, human intervention can provide assistance and generate new prior data. This prior data can be used to update the DQN network model so that the agent is more robust the next time it faces the same situation, achieving the purpose of learning from failure. A sketch of this update step follows.
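  • One hypothetical way to fold the human-assisted recovery back into the prior data and the DQN; the replay-buffer update shown here is an assumption about how the "learning from failure" step could be realized, and fit_on is a placeholder for the actual DQN update.

```python
def learn_from_failure(dqn, replay_buffer, assisted_frames):
    """Append frames recorded during human assistance and fine-tune the DQN on them.

    assisted_frames: list of (rgbd_frame, joint_actions) pairs recorded while the user
    steered the robotic arm through the situation the agent had failed on.
    """
    for rgbd, joint_actions in assisted_frames:
        replay_buffer.append({"rgbd": rgbd, "joint_actions": joint_actions})
    dqn.fit_on(replay_buffer)                    # placeholder for the actual DQN update step
```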
  • In some examples, the digital twin world is updated synchronously with the physical world in real time. Because the digital twin world simulates movement in the physical world for the purpose of feedback training, if the physical world changes, the data of the digital twin world must change synchronously to ensure that the simulation results in the digital twin world match the actual motion states and outcomes in the physical world.
  • For example, this embodiment can use 3D reconstruction technology to virtually reconstruct the real physical world, obtaining a digital twin world that restores the real world at a 1:1 scale, and add to it the digital twin corresponding to the agent in the physical world.
  • For instance, ElasticFusion can be used to scan the environment with a depth camera to obtain the digital twin world, and the result can be refined manually.
  • The trainer can control the digital twin in the digital twin world through a keyboard and mouse so that it completes target tasks (such as grasping a cup, pouring a drink, or opening a cabinet door), generating sufficient prior data for a specific task; the reinforcement learning network is then trained starting from this prior data. The training process takes place in the digital twin world, and after training converges, the reinforcement learning network can be used to control the real-world agent to complete the corresponding task.
  • In this way, the physical world is simulated by the digital twin world, which contains a digital twin corresponding to the agent in the physical world; operating the digital twin with control instructions in the digital twin world simulates the result of operating the agent with those instructions, and suitable control instructions for making the agent perform the target task are obtained through training.
  • The agent may be a robot; that is, through simulation in the digital twin world, the complexity of controlling the robot is reduced and the control efficiency of the robot is improved.
  • The control of the agent can also be divided into three stages. First, digital twin technology is used to achieve a 1:1 simulation mapping between the physical world and the digital twin world, with the virtual world updated synchronously in real time. Second, in the digital twin world, the reinforcement learning network takes the spatial semantic map of the twin world and the pose of the agent as input for training and decision-making, and controls the digital twin corresponding to the agent. Finally, the behavior of the digital twin is synchronized to control the agent in the physical world. This effectively avoids the complexity of training directly on RGB-D data, so the algorithm converges quickly; at the same time, the algorithm output does not directly control the physical equipment, which effectively reduces the cost of virtual-to-real migration.
  • The division of the above methods into steps is only for clarity of description; in implementation, steps may be combined into one step, or a step may be split into multiple steps, and as long as the same logical relationship is included, they all fall within the protection scope of this patent. Adding insignificant modifications to, or introducing insignificant designs into, an algorithm or process without changing its core design also falls within the protection scope of this patent.
  • An embodiment of the present application relates to an agent control apparatus, as shown in Fig. 2, including:
  • An acquisition module 201 configured to acquire a target task.
  • A generation module 202 configured to generate, according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, control instructions for controlling the digital twin to complete the target task, where the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin.
  • An execution module 203 configured to control the agent to execute the target task according to the control instruction for completing the target task.
  • The agent control apparatus of this embodiment is described in detail below. The following content is provided only to aid understanding and is not required to implement this solution.
  • In some examples, generating the control instructions for controlling the digital twin to complete the target task includes: inputting the pose of the agent and the spatial semantic map representing the environment data into the reinforcement learning network, which outputs control instructions for controlling the actions of the digital twin; the reinforcement learning network is trained, based on the results of the digital twin executing the control instructions, to obtain the control instruction that completes the target task.
  • the initial control instruction output by the reinforcement learning network is generated based on prior data; wherein the prior data is obtained based on the actions of the user controlling the digital twin through the interactive device.
  • the spatial semantic map includes: poses of objects in the digital twin world, 3D collision boxes, object classification information, and object material information.
  • In some examples, the digital twin world is loaded on the cloud server, and generating the control instructions for controlling the digital twin to complete the target task according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: generating those control instructions through interaction with the cloud server.
  • In some examples, the reinforcement learning network includes a DQN network model; the input of the DQN model is an RGBD image together with the pose of the agent and the spatial semantic map, and the output of the DQN model is the action of each joint of the robotic arm.
  • In some examples, after the control instructions for controlling the digital twin to complete the target task are generated, the rendering function is enabled.
  • the digital twin world is updated synchronously with the physical world in real time.
  • The agent may be a robot.
  • In this embodiment, the physical world is simulated by the digital twin world, which contains a digital twin corresponding to the agent in the physical world; operating the digital twin with control instructions in the digital twin world simulates the result of operating the agent with those instructions, so that suitable control instructions for making the agent perform the target task are obtained.
  • This embodiment is a system embodiment corresponding to the above method embodiment and can be implemented in cooperation with it.
  • the relevant technical details mentioned in the foregoing embodiments are still valid in this embodiment, and will not be repeated here in order to reduce repetition.
  • the relevant technical details mentioned in this embodiment can also be applied in the above embodiments.
  • The modules involved in this embodiment are logical modules.
  • A logical unit may be a single physical unit, a part of a physical unit, or a combination of multiple physical units.
  • units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, but this does not mean that there are no other units in this embodiment.
  • An embodiment of the present application relates to an electronic device, as shown in Fig. 3, including at least one processor 301 and a memory 302 communicatively connected to the at least one processor 301, where the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can perform the above agent control method.
  • the memory and the processor are connected by a bus
  • the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors and various circuits of the memory together.
  • the bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and therefore will not be further described herein.
  • the bus interface provides an interface between the bus and the transceivers.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing means for communicating with various other devices over a transmission medium.
  • the data processed by the processor is transmitted on the wireless medium through the antenna, further, the antenna also receives the data and transmits the data to the processor.
  • The processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory can be used to store data that the processor uses when performing operations.
  • An embodiment of the present application relates to a computer program; when the computer program is executed by a processor, the agent control method described in any of the above embodiments is implemented.
  • An embodiment of the present application relates to a computer-readable storage medium storing a computer program; the above method embodiments are implemented when the computer program is executed by a processor.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present application relate to the technical field of intelligent control. Disclosed are an agent control method and device, an electronic device, a program, and a storage medium. The method comprises: obtaining a target task; generating, according to environment data of a digital twin world, the pose of an agent, and a reinforcement learning network, a control instruction for controlling a digital twin to complete the target task, the digital twin world being obtained by means of simulation mapping of a physical world, the digital twin being located in the digital twin world, and the agent being located in the physical world and corresponding to the digital twin; and controlling, according to the control instruction for completing the target task, the agent to execute the target task.

Description

Agent Control Method and Apparatus, Electronic Device, Program, and Storage Medium
This application is based on, and claims priority to, the Chinese patent application with application number 202111329240.2, filed on November 10, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of intelligent control, and in particular to an agent control method and apparatus, an electronic device, a program, and a storage medium.
Background
In the field of artificial intelligence, the data collected by smart devices is typically used as the input for learning and training, and the output is used to control the actions of an agent, for example by collecting RGBD (RGB-Depth Map, i.e. RGB color plus depth map) information as input data.
RGBD information is usually obtained by a camera performing image acquisition and recognition. However, the data acquired by the camera includes not only the RGBD information but also a variety of unnecessary parameters, such as lighting and shadow conditions and image data of nearby obstacles. To obtain the target RGBD information, the captured images must therefore be screened and processed, which inevitably involves a large amount of computation; that is, when RGBD information is used as the input data for learning and training, data acquisition is difficult and high computing power is required of the data processing equipment. The large amount of data to be processed also leads to slow training convergence, and in some execution processes there is the further problem of complex migration between virtual and real data during computation. Because the data processing is so complex, the control efficiency of this training and learning process over the agent is low.
Technical Solution
The purpose of the embodiments of the present application is to provide an agent control method, apparatus, electronic device, and storage medium that reduce the complexity of data processing and thereby improve the control efficiency of the agent.
To solve the above technical problem, an embodiment of the present application provides an agent control method including the following steps: obtaining a target task; generating, according to environment data of a digital twin world, the pose of the agent, and a reinforcement learning network, a control instruction for controlling a digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and controlling, according to the control instruction for completing the target task, the agent to execute the target task.
An embodiment of the present application also provides an agent control apparatus, including: an acquisition module configured to acquire a target task; a generation module configured to generate, according to environment data of a digital twin world, the pose of the agent, and a reinforcement learning network, a control instruction for controlling a digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and an execution module configured to control the agent to execute the target task according to the control instruction for completing the target task.
An embodiment of the present application also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above agent control method.
An embodiment of the present application also provides a computer program that implements the above agent control method when executed by a processor.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program that implements the above agent control method when the program is executed by a processor.
In the embodiments of the present application, the physical world is simulated by the digital twin world, and the digital twin world contains a digital twin corresponding to the agent in the physical world. Operating the digital twin with control instructions in the digital twin world simulates the result of using those control instructions to operate the agent, and suitable control instructions for making the agent perform the target task are obtained through training. There is no need to preprocess input parameters such as RGBD data, the complexity of computing the control instructions output to the agent is reduced, and the control efficiency of the agent is improved.
In some embodiments, generating, according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, the control instruction for controlling the digital twin to complete the target task includes: inputting the pose of the agent and a spatial semantic map representing the environment data into the reinforcement learning network, which outputs control instructions for controlling the actions of the digital twin; the reinforcement learning network is trained, based on the results of the digital twin executing the control instructions, to obtain the control instruction that completes the target task. That is, simulation training is carried out in the digital twin world using the environment data, the pose of the agent, and the reinforcement learning network, with continuous adjustment according to feedback until the control instruction that completes the target task is obtained.
In some embodiments, the initial control instruction output by the reinforcement learning network is generated from prior data, where the prior data is obtained from the actions of a user controlling the digital twin through an interactive device. The prior data is data that achieves, or comes close to achieving, the target task, and using it as the initial control instruction reduces the number of training iterations and the complexity of data processing.
In some embodiments, the digital twin world is loaded on a cloud server, and generating the control instruction for controlling the digital twin to complete the target task according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: generating, through interaction with the cloud server, the control instruction for controlling the digital twin to complete the target task according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network. Loading the digital twin world on the cloud greatly reduces the computing requirements on the agent itself and the complexity of the device setup; at the same time, the data processing capability of a cloud server is generally high, which further improves the efficiency of obtaining the control instruction that completes the target task.
In some embodiments, after the target task is obtained and before the control instruction for controlling the digital twin to complete the target task is generated, the method further includes: disabling the rendering function; after the control instruction for completing the target task is generated, the method further includes: enabling the rendering function. The rendering function is used for display to the user and generally occupies considerable computing resources, and the data produced before the control instruction that completes the target task is generated is generally of no practical use to the user, so the rendering function is disabled during this period and the device's data processing resources are devoted entirely to generating control instructions, which improves the efficiency of control instruction generation. After the control instruction is obtained, the rendering function is enabled so that the process of the digital twin executing the control instruction is visualized for the user, who can then observe the simulated execution of the control instruction.
Brief Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments. Elements with the same reference numerals in the drawings represent similar elements, and unless otherwise stated, the figures are not drawn to scale.
Fig. 1 is a flowchart of the agent control method provided according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the agent control apparatus provided according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an electronic device provided according to an embodiment of the present application.
本发明的实施方式Embodiments of the present invention
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合附图对本申请的各实施方式进行详细的阐述。然而,本领域的普通技术人员可以理解,在本申请各实施方式中,为了使读者更好地理解本申请而提出了许多技术细节。但是,即使没有这些技术细节和基于以下各实施方式的种种变化和修改,也可以实现本申请所要求保护的技术方案。以下各个实施例的划分是为了描述方便,不应对本申请的具体实现方式构成任何限定,各个实施例在不矛盾的前提下可以相互结合相互引用。In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, various implementations of the present application will be described in detail below in conjunction with the accompanying drawings. However, those of ordinary skill in the art can understand that, in each implementation manner of the present application, many technical details are provided for readers to better understand the present application. However, even without these technical details and various changes and modifications based on the following implementation modes, the technical solution claimed in this application can also be realized. The division of the following embodiments is for the convenience of description, and should not constitute any limitation to the specific implementation of the present application, and the embodiments can be combined and referred to each other on the premise of no contradiction.
The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or the number of the technical features indicated. Accordingly, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, the terms "include" and "have", and any variants thereof, are intended to cover a non-exclusive inclusion. For example, a system, product, or device comprising a series of components or units is not limited to the listed components or units; it may optionally include components or units that are not listed, or components or units that are inherent to the product or device. In the description of the present application, "a plurality of" means at least two, for example two or three, unless explicitly and specifically defined otherwise.
An embodiment of the present application relates to an agent control method. The specific flow is shown in Fig. 1.
Step 101: acquire a target task.
Step 102: generate, from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, control instructions for controlling the digital twin to complete the target task. The digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin.
Step 103: control the agent to execute the target task according to the control instructions for completing the target task.
In this embodiment, the physical world is simulated by the digital twin world, in which a digital twin corresponding to the agent in the physical world exists. By driving the digital twin with control instructions inside the digital twin world, the outcome of applying those instructions to the agent can be simulated, and suitable control instructions for making the agent execute the target task are obtained through training. There is no need to preprocess input parameters such as RGBD data, the complexity of processing the control instructions output for the agent is reduced, and the control efficiency of the agent is improved.
Implementation details of the agent control method of this embodiment are described below. The following details are provided only to facilitate understanding and are not required for implementing the solution. In the following, "training" refers to the process of obtaining the control instructions that complete the target task.
In step 101, a target task is acquired. The target task may be obtained from a user, from another interactive device, or from the cloud. The target task is, for example, a task related to spatial position, such as moving or grasping a specified object. The target task is not limited to tasks requiring a three-dimensional positional relationship; it may also be unrelated to three-dimensional position, such as (two-dimensional) image recognition, audio processing, or image-to-text conversion, as long as the robot can perform it.
In step 102, control instructions for controlling the digital twin to complete the target task are generated from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network. The digital twin world is obtained by simulation mapping of the physical world; the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin. Specifically, the digital twin world is mapped from the real physical world: the environment of the physical world is converted into digital content for presentation, so that object positional relationships and related environmental information in the physical world can be simulated. This embodiment does not restrict how the digital twin world is obtained; it may, for example, be built by a modeler or obtained by directly scanning the physical world. The agent in the physical world may be a robot, and a digital twin corresponding to the agent (robot) exists in the digital twin world, where the agent's behavior can be simulated. Because the digital twin world is a digital embodiment of the physical world, the interactions of the digital twin with its surroundings in the digital twin world can simulate the outcomes the agent would cause when performing the same activities in the physical world. The digital twin world involves the geometric structures and spatial positions corresponding to the physical world, the physical structural constraints of the agent, and the simulation of physical properties such as friction coefficient and gravity.
In one example, the control instructions for controlling the digital twin to complete the target task are generated from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network as follows: the pose of the agent and a spatial semantic map representing the environment data are input into the reinforcement learning network, which outputs control instructions for controlling the actions of the digital twin; based on the results of the digital twin executing those instructions, the reinforcement learning network is trained to obtain the control instructions that complete the target task. That is, the pose of the agent in the current physical world is input into the reinforcement learning network, which can also obtain the spatial semantic map representing the environment data of the digital twin world. Since the digital twin corresponds to the agent in the digital twin world, the agent's pose in the physical world acquired by the reinforcement learning network serves as the initial pose of the digital twin, and the reinforcement learning network outputs control instructions that change the digital twin's actions. The digital twin changes its actions in the digital twin world according to these instructions; the reinforcement learning network obtains the result of each action change, compares it with the target task, and adaptively adjusts the control instructions based on that comparison, until the digital twin completes the target task in simulation under a given set of instructions, which then become the control instructions for making the agent execute the target task. In other words, simulation training is performed in the digital twin world using the environment data, the agent's pose, and the reinforcement learning network, with continual adjustment based on feedback until the control instructions that complete the target task are obtained.
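The closed loop described above (observe the twin world, issue an instruction, compare the result with the target task, adjust, repeat) can be summarized with the minimal Python sketch below. All names (plan_in_twin_world, twin_world, policy, and their methods) are illustrative assumptions rather than an API defined by this application.

```python
def plan_in_twin_world(twin_world, policy, agent_pose, target_task, max_steps=10_000):
    """Iterate in simulation until the twin completes the task; return the instruction sequence."""
    twin_world.reset_twin_to(agent_pose)          # the twin starts from the real agent's pose
    twin_world.set_task(target_task)
    state = twin_world.observe()                  # agent pose + spatial semantic map
    executed = []                                 # control instructions issued so far
    for _ in range(max_steps):
        command = policy.act(state)               # RL network proposes a control instruction
        feedback = twin_world.step(command)       # the twin executes it in the twin world
        policy.learn(state, command, feedback)    # adjust based on the comparison with the task
        executed.append(command)
        state = feedback.next_state
        if feedback.task_done:
            return executed                       # instructions to replay on the physical agent
    raise RuntimeError("target task not completed within the step budget")
```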
It can be understood that, in addition to obtaining the result of the digital twin changing its actions according to a given control instruction, the reinforcement learning network can simultaneously obtain the change in the spatial semantic map caused by that action change, and combine it with the action result to judge whether the target task has been completed.
The result of the digital twin changing its actions according to a control instruction includes, but is not limited to, the pose of the digital twin's chassis and whole-body limbs, whether a collision occurred, and whether the target task was completed; the content of a control instruction includes, but is not limited to, controlling the movement of the digital twin and the motion of its limbs. In one example, the reinforcement learning network has different interfaces for obtaining relevant information: a state observation interface for collecting the state of the digital twin world, covering the agent's chassis and whole-body limb poses and the spatial semantic map (for example, when the target task is picking up a cup, the distance to the target cup can be collected); an action control interface for outputting the control instructions of the reinforcement learning network into the digital twin world, such as controlling the movement of the digital twin or its limbs; and a feedback interface for collecting the result feedback when the digital twin executes an action according to a control instruction, such as whether a collision occurred or whether the target task was completed.
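A minimal sketch of how the three interfaces named above could be grouped is given below. The class and field names are assumptions introduced only for illustration; the application does not prescribe this layout.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class StepFeedback:
    collision: bool           # whether the twin collided with its surroundings
    task_done: bool           # whether the target task (or sub-goal) was completed
    twin_pose: tuple          # chassis and whole-body limb pose after the action


class TwinWorldInterfaces(Protocol):
    """Hypothetical grouping of the state observation, action control, and feedback interfaces."""

    def observe_state(self) -> dict: ...            # state observation: pose + spatial semantic map
    def apply_command(self, command) -> None: ...   # action control: move chassis / limbs
    def read_feedback(self) -> StepFeedback: ...    # feedback: collision, task completion, etc.
```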
In some examples, after the digital twin acquires the agent's pose and takes it as its own initial pose, the reinforcement learning network, based on the digital twin world, starts issuing control instructions to the digital twin step by step according to the target task so that the task can be completed. The reinforcement learning network may divide the target task into multiple sub-steps; after the control instruction corresponding to each sub-step is sent to the digital twin, the feedback from executing that instruction is obtained to judge whether the sub-step has been completed, so that a set of control instructions capable of completing the target task is acquired gradually. It can be understood that, during adjustment based on feedback, not only can the control instruction of each sub-step be adjusted so that the digital twin can complete that sub-step; if, for a certain sub-step, no workable control instruction is obtained within a (presettable) period, the sub-step itself may be unreasonable and can be adjusted or discarded, and the preceding and following steps can then be adjusted accordingly, which this embodiment does not limit. In addition, adjustment of the sub-steps can be considered when complexity is too high, the occupied computation space is too large, or the error rate exceeds a preset threshold; that is, adjustment can be considered whenever a preset condition is not satisfied, and this embodiment imposes no specific limits. The trial-and-error handling of a single sub-step is illustrated in the sketch after the next paragraph.
For the execution of a given sub-step, in one example the reinforcement learning network adjusts based on the feedback from the digital twin executing the control instruction. For instance, if a collision with the environment occurs while a movement instruction is being executed, the digital twin can be restored to its initial position before that instruction, the instruction can be updated by reducing the movement distance or adjusting the movement angle, and the digital twin then executes the updated instruction, until the sub-step is completed, for example the destination of the movement instruction is reached, or reached without colliding with the surroundings. The reinforcement learning network then obtains the result of the digital twin successfully executing that sub-step, for example through feedback from the digital twin, or by monitoring the digital twin world and finding that the sub-step has been completed. After the reinforcement learning network learns that the sub-step was executed successfully, it can proceed to train the control instruction for the next sub-step; if that sub-step was the last one needed to complete the target task, or the target task contains only that single step, the successfully executed control instructions of all sub-steps can be combined to obtain the control instructions that complete the target task. That is, as the digital twin executes the control instructions output by the reinforcement learning network, a set of control instructions capable of completing the target task is obtained through repeated trial and error.
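As a rough illustration of the collision-rollback-and-retry loop for one sub-step, consider the following sketch. The names (solve_substep, save_state, restore_state, and command.scaled) are hypothetical, and scaling the command is just one possible stand-in for "reduce the movement distance or adjust the angle".

```python
def solve_substep(twin_world, command, max_trials=50):
    """Retry one sub-step, shrinking the move after each collision, until it succeeds."""
    for _ in range(max_trials):
        checkpoint = twin_world.save_state()      # remember the pose before trying
        feedback = twin_world.step(command)
        if feedback.task_done:                    # sub-goal of this sub-step reached
            return command
        if feedback.collision:
            twin_world.restore_state(checkpoint)  # roll back to the pre-move pose
            command = command.scaled(0.8)         # e.g. reduce distance or adjust the angle
    return None                                   # caller may re-plan or drop this sub-step
```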
In one example, the initial control instruction output by the reinforcement learning network is generated from prior data, where the prior data is obtained from the actions of the user controlling the digital twin through an interactive device. That is, to reduce the number of adjustments to the reinforcement learning network's control instructions, or to reduce its memory footprint, the initial control instruction is generated from prior data: manually entered control instructions, or control instructions from historical records that completed or nearly completed the target task. When the initial control instruction is generated from prior data, it already comes close to completing the target task, so the process by which the reinforcement learning network is trained to obtain the control instructions that complete the target task is more efficient, intermediate debugging iterations are reduced, and the memory occupied by data computation is lowered. The interactive device includes one of, or any combination of, a mouse, a keyboard, and a motion-sensing device. The prior data is thus data that achieves, or nearly achieves, the target task; using it as the initial control instruction reduces the number of training iterations and the complexity of data processing.
For example, in the digital twin world, instructions entered by a trainer via mouse, keyboard, motion-sensing devices, and the like can be obtained to control the digital twin's interaction with the environment, objects, or other digital entities in the digital twin world, generating high-quality, professional prior data and thereby improving the learning efficiency and quality of the reinforcement learning network. Control instructions obtained from a trainer achieve a much higher completion rate for the target task than instructions generated autonomously by the reinforcement learning network. Without intervention from external control instructions (such as the trainer's instructions here), the reinforcement learning network may generate control instructions randomly according to the target task, or generate instructions of different categories from partial label information; in other words, the relevance of the initially generated instructions to the target task cannot be guaranteed. If the relevance is low, a large number of instructions will need adjustment during training, demanding large processing space and long processing time. If control instructions obtained from the trainer are available, they can serve as prior data, and training on the basis of trainer-entered instructions that are highly relevant to the target task greatly reduces the need for adjustment and lowers the storage space and time required for computation. For example, for the task of picking up a cup at position a1, a trainer entered control instructions and the digital twin completed the task successfully; when the target task becomes picking up a cup at position a2, a query finds that executable control instructions exist for the similar task at a1, and training on top of those trainer-entered instructions significantly reduces the computation time compared with training the pick-up-at-a2 instructions from scratch, lowering computational complexity and improving the user experience.
In one example, the spatial semantic map includes: the pose of each object in the digital twin world, its 3D collision box, object classification information, and object material information. The poses of the objects in the digital twin world are used to simulate the positions of the surrounding objects in the environment where the agent is located in the physical world; the 3D collision boxes are used to specify or constrain collision relationships in the digital twin world so that they better approximate motion in the physical world; the object classification information includes, for example, the physical structure of the object; and the object material information is used to simulate the detailed physical characteristics of the environment before and after the agent moves, such as friction coefficient and sliding.
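One way such a map entry could be represented in code is sketched below, assuming a simple list of per-object records; the field names, pose convention, and example values are illustrative assumptions, not part of the application.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticObject:
    """One entry of the spatial semantic map (names and conventions are assumed)."""
    pose: tuple            # (x, y, z, roll, pitch, yaw) of the object in the twin world
    collision_box: tuple   # 3D collision box extents (dx, dy, dz)
    category: str          # object classification, e.g. "cup"
    material: dict = field(default_factory=dict)  # e.g. {"friction": 0.4}


semantic_map = [
    SemanticObject(pose=(1.2, 0.3, 0.8, 0.0, 0.0, 0.0),
                   collision_box=(0.08, 0.08, 0.12),
                   category="cup",
                   material={"friction": 0.4}),
]
```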
In one example, the reinforcement learning network includes a DQN (Deep Q Network) model. The input of the DQN model is an RGBD image that contains the pose of the agent and the spatial semantic map, and the output of the DQN model is the action of each joint of the robotic arm. Taking a DQN model for a robotic arm autonomously grasping a cup as an example, the input of the model is an RGBD image and the output is the action of each joint of the arm, where each joint's action is one of [rotate 1° counterclockwise, hold still, rotate 1° clockwise], represented in the network as [-1, 0, 1]. The robotic arm in this example has 7 joints, so for each frame the DQN takes one RGBD image as input and outputs a 7×3 array.
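A minimal PyTorch sketch of such a network is shown below. The 4-channel RGBD input, 7 joints, and 3 actions per joint follow the text; the convolutional layer sizes and the greedy action selection are assumptions added only to make the example concrete.

```python
import torch
import torch.nn as nn


class JointDQN(nn.Module):
    """Minimal DQN sketch for the 7-joint arm described above (layer sizes are assumptions)."""

    def __init__(self, num_joints=7, actions_per_joint=3):
        super().__init__()
        self.features = nn.Sequential(               # 4-channel RGBD input
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((7, 7)), nn.Flatten(),
        )
        self.head = nn.Linear(64 * 7 * 7, num_joints * actions_per_joint)
        self.num_joints = num_joints
        self.actions_per_joint = actions_per_joint

    def forward(self, rgbd):                         # rgbd: (batch, 4, H, W)
        q = self.head(self.features(rgbd))
        return q.view(-1, self.num_joints, self.actions_per_joint)  # one 7x3 array per frame


# greedy action per joint: indices 0, 1, 2 map to -1 (counterclockwise), 0 (hold), +1 (clockwise)
q_values = JointDQN()(torch.zeros(1, 4, 120, 160))
joint_actions = q_values.argmax(dim=-1) - 1          # tensor of values in {-1, 0, 1}
```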
In one example, the prior data is obtained as follows: operation instructions for controlling the robotic arm, entered by the user based on the captured RGBD images, are received through the interactive device; the actions of each joint of the robotic arm during execution of those operation instructions are recorded; and the RGBD images and the joint actions are saved as prior data. For example, regarding the acquisition and use of prior data, in an already-built digital twin world the trainer observes the captured RGBD images and operates a keyboard, mouse, or motion-sensing device to control the robotic arm to complete the cup-grasping task; during the task, the rotation of each joint is recorded automatically, and these rotations, combined with the RGBD images, form the prior data that serves as the initial data of the DQN.
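The recording step could look roughly like the sketch below, which logs (RGBD frame, joint deltas) pairs while a trainer tele-operates the twin arm. The object names, the capture/read methods, and the JSON-lines file format are all assumptions made for illustration.

```python
import json


def record_demonstration(twin_world, input_device, out_path="prior_data.jsonl"):
    """Save (RGBD frame id, joint deltas) pairs while a trainer controls the twin arm."""
    with open(out_path, "a", encoding="utf-8") as f:
        while not twin_world.task_done():
            frame = twin_world.capture_rgbd()         # the image the trainer is looking at
            command = input_device.read_command()     # keyboard / mouse / motion-sensing input
            twin_world.apply_command(command)
            f.write(json.dumps({"frame_id": frame.id,
                                "joint_deltas": command.joint_deltas}) + "\n")
```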
In addition, for different target tasks and reinforcement learning networks, this embodiment is not limited to necessarily acquiring RGBD images, nor to acquiring only RGBD image information and the agent's pose.
In one example, the digital twin world is hosted on a cloud server, and generating the control instructions for controlling the digital twin to complete the target task from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: generating those control instructions through interaction with the cloud server, based on the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network. Processing the digital twin world requires highly complex equipment support and considerable computing resources; hosting it on a cloud server reduces the computing-power requirements on the agent device, and the strong computing power of the cloud server improves the efficiency of generating the control instructions that complete the target task. The reinforcement learning network may also reside on the cloud server, which further reduces the data-computation resources the agent needs and further improves the generation efficiency of the control instructions that complete the target task.
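On the agent side, the interaction with such a cloud server might reduce to a single request that submits the task and pose and receives the planned instructions back. The sketch below assumes a hypothetical REST endpoint (/plan) and response field names; the application does not specify any particular protocol.

```python
import requests


def request_control_instructions(server_url, target_task, agent_pose):
    """Ask the cloud-hosted twin world to plan and return the control instructions."""
    resp = requests.post(f"{server_url}/plan",
                         json={"task": target_task, "pose": agent_pose},
                         timeout=600)
    resp.raise_for_status()
    return resp.json()["control_instructions"]
```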
In step 103, the agent is controlled to execute the target task according to the control instructions for completing the target task. In some examples, after the reinforcement learning network has generated the control instructions for completing the target task through simulation training, the agent receives and executes those instructions to complete the target task. Hosting the digital twin world in the cloud greatly reduces the data-computation requirements on the agent itself and the complexity of the device setup; since cloud servers generally have strong data-processing capability, it also further improves the efficiency of obtaining the control instructions that complete the target task.
In one example, after the target task is acquired and before the control instructions for controlling the digital twin to complete the target task are generated, the method further includes disabling the rendering function; after those control instructions are generated, the method further includes enabling the rendering function. Rendering is used to present results to the user and generally consumes considerable computing resources, and the data produced before the control instructions are obtained is of little practical use to the user. Therefore, the rendering function is disabled: the training process is not shown to the user, the space required for data computation is reduced, and the device's processing resources are devoted entirely to generating control instructions, improving generation efficiency. After the control instructions for completing the target task are obtained, rendering is enabled while the digital twin executes them, so that execution is visualized and the user can accurately perceive the execution process. For example, manual intervention can then be applied based on the observed execution, improving the efficiency of generating the control instructions that complete the target task. Specifically, during training the process data can simply be kept in storage that supports access, for example handled by the CPU (central processing unit), without being rendered and displayed, which lowers training complexity; since rendering takes a long time, reducing it also improves training efficiency. When training is complete or nearly complete, the process data is rendered and displayed: on the one hand, the user can genuinely perceive the training results; on the other hand, it can be observed whether the trained control instructions match human behavioral habits. For example, when executing an instruction to pick up a cup, people normally pick it up with the opening facing upward; the control instructions obtained in a given training run may indeed accomplish picking up the cup, but with the opening facing downward, which does not match normal behavior. The training process itself cannot reliably detect such non-conforming results, whereas they are easy to spot and further optimize through observation after rendering.
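One simple way to realize the "render off during training, render on for playback" pattern is a context manager around the training phase, as sketched below. The set_rendering and replay methods are assumed names for whatever toggle the simulator actually exposes.

```python
from contextlib import contextmanager


@contextmanager
def rendering_disabled(twin_world):
    """Turn rendering off for the training phase and back on afterwards (names assumed)."""
    twin_world.set_rendering(False)
    try:
        yield twin_world
    finally:
        twin_world.set_rendering(True)


# usage: train head-less, then replay the result with rendering on
# with rendering_disabled(world):
#     instructions = plan_in_twin_world(world, policy, pose, task)
# world.replay(instructions)
```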
In one example, after the agent is controlled to execute the target task according to the control instructions for completing the target task, the method further includes: when the agent fails to execute the target task, receiving an auxiliary instruction entered by the user through the interactive device, the auxiliary instruction being used to control the agent to execute the target task successfully; and after the target task is executed successfully, updating the prior data according to the actions of each joint of the robotic arm during execution of the auxiliary instruction. If a failure case arises during later use after the reinforcement learning network has converged, manual intervention can provide assistance for the failed situation and generate prior data once more; this prior data can be fed back into the DQN model, improving the agent's robustness the next time it faces such a situation and achieving the goal of learning from failure.
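A rough outline of that learn-from-failure loop is sketched below: a human finishes the failed task through the interactive device, the demonstration is recorded, and the prior data and policy are updated. Every name here (add_prior_data, retrain, and the device methods) is an assumption made purely for illustration.

```python
def recover_with_assist(twin_world, policy, input_device):
    """After a real-world failure, record a human-assisted demo and fold it back into the model."""
    demo = []
    while not twin_world.task_done():
        frame = twin_world.capture_rgbd()
        command = input_device.read_command()     # human assist via the interactive device
        twin_world.apply_command(command)
        demo.append((frame.id, command.joint_deltas))
    policy.add_prior_data(demo)                   # update the prior data with the new demonstration
    policy.retrain()                              # improve robustness for the next encounter
```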
In one example, the digital twin world is updated synchronously with the physical world in real time. Because the purpose of the digital twin world is to simulate motion processes in the physical world for feedback training, any change in the physical world requires a synchronized update of the digital twin world's data, so that the simulation results in the digital twin world match the actual motion states and outcomes in the physical world.
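In the simplest case, such synchronization could be a polling loop that mirrors observed changes into the twin world, as in the sketch below; the sensor and update methods are hypothetical, and a real system might instead push updates event-driven.

```python
import time


def sync_twin_world(twin_world, sensors, period_s=0.1):
    """Keep the twin world aligned with the physical world (a simplified polling sketch)."""
    while True:
        update = sensors.read_changes()        # object poses that changed in the physical world
        if update:
            twin_world.apply_update(update)    # mirror the change into the digital twin world
        time.sleep(period_s)
```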
In one example, this embodiment may virtually reconstruct the real physical world using 3D reconstruction techniques to obtain a digital twin world that restores the real world at 1:1 scale, and add into it a digital twin corresponding to the agent in the physical world. Alternatively, ElasticFusion can be used with a depth camera to scan the environment and obtain the digital twin world, with the result then refined manually. In the digital twin world, a trainer can control the digital twin via keyboard and mouse to complete target tasks (such as grasping a cup, pouring a drink, or opening a cabinet door), generating sufficient prior data for a specific task, after which the reinforcement learning network is trained on the basis of that prior data. The training process takes place in the digital twin world. Once training has converged, the reinforcement learning network can be used to control the real-world agent to complete the corresponding task.
In this embodiment, the physical world is simulated by the digital twin world, in which a digital twin corresponding to the agent in the physical world exists. Driving the digital twin with control instructions in the digital twin world simulates the outcome of applying those instructions to the agent, and suitable control instructions are obtained through training so that the agent executes the target task. There is no need to preprocess input parameters such as RGBD data, the complexity of data computation for the control instructions output for the agent is reduced, and the control efficiency of the agent is improved. The agent may be a robot; that is, through simulation in the digital twin world, the complexity of controlling the robot is reduced and its control efficiency is improved.
In some embodiments, agent control can also be divided into three stages: first, digital twin technology is used to realize a 1:1 simulation mapping between the physical world and the digital twin world, with the virtual world updated synchronously in real time; second, in the digital twin world, the reinforcement learning network takes the spatial semantic map of the twin world and the agent's pose as input for training and decision-making, and controls the digital twin corresponding to the agent; finally, the behavior of the digital twin is used to synchronously control the agent in the physical world. This effectively avoids the complexity of training directly on RGB-D data, the algorithm converges quickly, and because the algorithm output does not directly control the physical device, the cost of sim-to-real transfer is effectively reduced.
The division of steps in the above methods is only for clarity of description; during implementation, steps may be merged into one step, or a step may be split into multiple steps, and as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this patent.
An embodiment of the present application relates to an agent control apparatus, as shown in Fig. 2, including:
an acquisition module 201, configured to acquire a target task;
a generation module 202, configured to generate, from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, control instructions for controlling the digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of the physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and
an execution module 203, configured to control the agent to execute the target task according to the control instructions for completing the target task.
Implementation details of the agent control apparatus of this embodiment are described below. The following details are provided only to facilitate understanding and are not required for implementing the solution.
For the generation module 202, in one example, generating the control instructions for controlling the digital twin to complete the target task from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: inputting the pose of the agent and a spatial semantic map representing the environment data into the reinforcement learning network, which outputs control instructions for controlling the actions of the digital twin; and training the reinforcement learning network, based on the results of the digital twin executing those instructions, to obtain the control instructions that complete the target task.
In one example, the initial control instruction output by the reinforcement learning network is generated from prior data, where the prior data is obtained from the actions of the user controlling the digital twin through an interactive device.
In one example, the spatial semantic map includes: the pose of each object in the digital twin world, its 3D collision box, object classification information, and object material information.
In one example, the digital twin world is hosted on a cloud server, and generating the control instructions for controlling the digital twin to complete the target task from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network includes: generating those control instructions through interaction with the cloud server, based on the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network.
In one example, after the target task is acquired and before the control instructions for controlling the digital twin to complete the target task are generated, the apparatus further disables the rendering function.
In one example, the reinforcement learning network includes a DQN network model; the input of the DQN network model is an RGBD image that contains the pose of the agent and the spatial semantic map, and the output of the DQN network model is the action of each joint of the robotic arm.
For the execution module 203, in one example, after the control instructions for controlling the digital twin to complete the target task are generated, the apparatus further enables the rendering function.
In addition, the digital twin world is updated synchronously with the physical world in real time.
In the physical world, the agent may be a robot.
In this embodiment, the physical world is simulated by the digital twin world, in which a digital twin corresponding to the agent in the physical world exists. Driving the digital twin with control instructions in the digital twin world simulates the outcome of applying those instructions to the agent, from which suitable control instructions are obtained so that the agent executes the target task. There is no need to preprocess input parameters such as RGBD data, the complexity of processing the control instructions output for the agent is reduced, and the control efficiency of the agent is improved.
It is readily seen that this embodiment is a system embodiment corresponding to the above embodiments, and it can be implemented in cooperation with them. The relevant technical details mentioned in the above embodiments remain valid in this embodiment and are not repeated here; correspondingly, the relevant technical details mentioned in this embodiment can also be applied to the above embodiments.
It is worth mentioning that all the modules involved in this embodiment are logical modules; in practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, units that are not closely related to solving the technical problem proposed in the present application are not introduced in this embodiment, which does not mean that no other units exist in this embodiment.
An embodiment of the present application relates to an electronic device, as shown in Fig. 3, including at least one processor 301, and a memory 302 communicatively connected to the at least one processor 301, where the memory 302 stores instructions executable by the at least one processor 301, and the instructions are executed by the at least one processor 301 so that the at least one processor 301 can perform the above agent control method.
The memory and the processor are connected by a bus, which may include any number of interconnected buses and bridges and connects the various circuits of one or more processors and the memory together. The bus may also connect together various other circuits such as peripherals, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna, and the antenna further receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions, while the memory may be used to store data used by the processor when performing operations.
An embodiment of the present application relates to a computer program which, when executed by a processor, implements the agent control method of any of the above embodiments.
An embodiment of the present application relates to a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method embodiments.
That is, those skilled in the art can understand that all or some of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware; the program is stored in a storage medium and includes several instructions to cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art can understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present application.

Claims (14)

  1. An agent control method, comprising:
    acquiring a target task;
    generating, from environment data of a digital twin world, a pose of an agent, and a reinforcement learning network, control instructions for controlling a digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of a physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and
    controlling the agent to execute the target task according to the control instructions for completing the target task.
  2. The agent control method according to claim 1, wherein generating, from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, the control instructions for controlling the digital twin to complete the target task comprises:
    inputting the pose of the agent and a spatial semantic map representing the environment data into the reinforcement learning network, the reinforcement learning network outputting control instructions for controlling actions of the digital twin; and
    training the reinforcement learning network, according to results of the digital twin executing the control instructions, to obtain the control instructions for completing the target task.
  3. The agent control method according to claim 2, wherein the reinforcement learning network comprises a deep Q network (DQN) model;
    an input of the DQN model is an RGBD image comprising the pose of the agent and the spatial semantic map, and an output of the DQN model is an action of each joint of a robotic arm.
  4. The agent control method according to claim 2 or 3, wherein an initial control instruction output by the reinforcement learning network is generated from prior data;
    wherein the prior data is obtained from actions of a user controlling the digital twin through an interactive device.
  5. The agent control method according to claim 4, wherein the prior data is obtained by:
    receiving, through the interactive device, operation instructions for controlling the robotic arm entered by the user based on captured RGBD images;
    recording actions of each joint of the robotic arm while the robotic arm executes the operation instructions; and
    saving the RGBD images and the actions of each joint of the robotic arm as the prior data.
  6. The agent control method according to claim 4 or 5, wherein, after controlling the agent to execute the target task according to the control instructions for completing the target task, the method further comprises:
    when the agent fails to execute the target task, receiving an auxiliary instruction entered by the user through the interactive device, the auxiliary instruction being used to control the agent to execute the target task successfully; and
    after the target task is executed successfully, updating the prior data according to the actions of each joint of the robotic arm during execution of the auxiliary instruction.
  7. The agent control method according to any one of claims 2 to 6, wherein the spatial semantic map comprises:
    a pose of each object in the digital twin world, a 3D collision box, object classification information, and object material information.
  8. The agent control method according to any one of claims 1 to 7, wherein the digital twin world is hosted on a cloud server;
    generating, from the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, the control instructions for controlling the digital twin to complete the target task comprises:
    generating, through interaction with the cloud server and according to the environment data of the digital twin world, the pose of the agent, and the reinforcement learning network, the control instructions for controlling the digital twin to complete the target task.
  9. The agent control method according to any one of claims 1 to 8, wherein, after acquiring the target task and before generating the control instructions for controlling the digital twin to complete the target task, the method further comprises:
    disabling a rendering function; and
    after generating the control instructions for controlling the digital twin to complete the target task, the method further comprises:
    enabling the rendering function.
  10. The agent control method according to any one of claims 1 to 9, wherein the digital twin world is updated synchronously with the physical world in real time.
  11. An agent control apparatus, comprising:
    an acquisition module, configured to acquire a target task;
    a generation module, configured to generate, from environment data of a digital twin world, a pose of an agent, and a reinforcement learning network, control instructions for controlling a digital twin to complete the target task, wherein the digital twin world is obtained by simulation mapping of a physical world, the digital twin is located in the digital twin world, and the agent is located in the physical world and corresponds to the digital twin; and
    an execution module, configured to control the agent to execute the target task according to the control instructions for completing the target task.
  12. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor so that the at least one processor can perform the agent control method according to any one of claims 1 to 10.
  13. A computer program which, when executed by a processor, implements the agent control method according to any one of claims 1 to 10.
  14. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the agent control method according to any one of claims 1 to 10.
PCT/CN2022/125695 2021-11-10 2022-10-17 Agent control method and apparatus, electronic device, program, and storage medium WO2023082949A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111329240.2 2021-11-10
CN202111329240.2A CN114310870A (en) 2021-11-10 2021-11-10 Intelligent agent control method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023082949A1 (en)

Family

ID=81045682

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/125695 WO2023082949A1 (en) 2021-11-10 2022-10-17 Agent control method and apparatus, electronic device, program, and storage medium

Country Status (2)

Country Link
CN (1) CN114310870A (en)
WO (1) WO2023082949A1 (en)

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN116388893A (en) * 2023-06-02 2023-07-04 中国信息通信研究院 High-precision electromagnetic environment digital twin method and electronic equipment

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN114310870A (en) * 2021-11-10 2022-04-12 达闼科技(北京)有限公司 Intelligent agent control method and device, electronic equipment and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN111461338A (en) * 2020-03-06 2020-07-28 北京仿真中心 Intelligent system updating method and device based on digital twin
CN111680893A (en) * 2020-05-25 2020-09-18 北京科技大学 Digital twin system of multi-self-addressing robot picking system and scheduling method
CN112668687A (en) * 2020-12-01 2021-04-16 达闼机器人有限公司 Cloud robot system, cloud server, robot control module and robot
EP3865257A1 (en) * 2020-02-11 2021-08-18 Ingenieurbüro Hannweber GmbH Device and method for monitoring and controlling a technical working system
CN114310870A (en) * 2021-11-10 2022-04-12 达闼科技(北京)有限公司 Intelligent agent control method and device, electronic equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102083007B (en) * 2011-01-25 2013-03-20 上海交通大学 GSM-R (Global System for Mobile Communications-Railway) cluster scheduling service analysis system and implementing method thereof
EP2685810B1 (en) * 2011-03-17 2020-09-09 Mirobot Ltd. Human assisted milking robot and method
CN107870600B (en) * 2017-10-17 2018-10-19 广东工业大学 A kind of transparent monitoring method in intelligence workshop and system
US20190122146A1 (en) * 2017-10-23 2019-04-25 Artificial Intelligence Foundation, Inc. Dynamic and Intuitive Aggregation of a Training Dataset
CN109829543B (en) * 2019-01-31 2020-05-26 中国科学院空间应用工程与技术中心 Space effective load data flow online anomaly detection method based on ensemble learning
WO2021092263A1 (en) * 2019-11-05 2021-05-14 Strong Force Vcn Portfolio 2019, Llc Control tower and enterprise management platform for value chain networks
CN111461431B (en) * 2020-03-31 2022-05-27 广东工业大学 Optimization method and system based on screw locking process in mobile phone manufacturing
CN112171669B (en) * 2020-09-21 2021-10-08 西安交通大学 Brain-computer cooperation digital twin reinforcement learning control method and system
CN112440281A (en) * 2020-11-16 2021-03-05 浙江大学 Robot trajectory planning method based on digital twins
CN112632778B (en) * 2020-12-22 2023-07-18 达闼机器人股份有限公司 Operation method and device of digital twin model and electronic equipment
CN113111006A (en) * 2021-05-06 2021-07-13 上海三一重机股份有限公司 Debugging method and system for operating machine control system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
EP3865257A1 (en) * 2020-02-11 2021-08-18 Ingenieurbüro Hannweber GmbH Device and method for monitoring and controlling a technical working system
CN111461338A (en) * 2020-03-06 2020-07-28 北京仿真中心 Intelligent system updating method and device based on digital twin
CN111680893A (en) * 2020-05-25 2020-09-18 北京科技大学 Digital twin system of multi-self-addressing robot picking system and scheduling method
CN112668687A (en) * 2020-12-01 2021-04-16 达闼机器人有限公司 Cloud robot system, cloud server, robot control module and robot
CN114310870A (en) * 2021-11-10 2022-04-12 达闼科技(北京)有限公司 Intelligent agent control method and device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116388893A (en) * 2023-06-02 2023-07-04 中国信息通信研究院 High-precision electromagnetic environment digital twin method and electronic equipment
CN116388893B (en) * 2023-06-02 2023-08-08 中国信息通信研究院 High-precision electromagnetic environment digital twin method and electronic equipment

Also Published As

Publication number Publication date
CN114310870A (en) 2022-04-12

Similar Documents

Publication Publication Date Title
CN110026987B (en) Method, device and equipment for generating grabbing track of mechanical arm and storage medium
WO2023082949A1 (en) Agent control method and apparatus, electronic device, program, and storage medium
US11580724B2 (en) Virtual teach and repeat mobile manipulation system
US20180260685A1 (en) Hierarchical robotic controller apparatus and methods
CN109483534B (en) Object grabbing method, device and system
US11823048B1 (en) Generating simulated training examples for training of machine learning model used for robot control
JPWO2003019475A1 (en) Robot device, face recognition method, and face recognition device
WO2020058669A1 (en) Task embedding for device control
CN108229678B (en) Network training method, operation control method, device, storage medium and equipment
WO2014201422A2 (en) Apparatus and methods for hierarchical robotic control and robotic training
CN114516060A (en) Apparatus and method for controlling a robotic device
CN110524531A (en) A kind of robot control system and its workflow based on Internet of Things cloud service
Gonzalez et al. Asap: A semi-autonomous precise system for telesurgery during communication delays
US20240118667A1 (en) Mitigating reality gap through training a simulation-to-real model using a vision-based robot task model
Ogawara et al. Acquiring hand-action models in task and behavior levels by a learning robot through observing human demonstrations
WO2023051706A1 (en) Gripping control method and apparatus, and server, device, program and medium
CN116977506A (en) Model action redirection method, device, electronic equipment and storage medium
CN210121851U (en) Robot
Liu et al. Real-world robot reaching skill learning based on deep reinforcement learning
Jagersand Image based predictive display for tele-manipulation
CN111360819B (en) Robot control method and device, computer device and storage medium
US20230154160A1 (en) Mitigating reality gap through feature-level domain adaptation in training of vision-based robot action model
Yang et al. Data-driven Grasping and Pre-grasp Manipulation Using Hierarchical Reinforcement Learning with Parameterized Action Primitives
Saghour Vision-based robotic dual arm manipulation of soft objects
CN115357115A (en) Object interaction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22891746

Country of ref document: EP

Kind code of ref document: A1