CN117773934A - Language-guide-based object grabbing method and device, electronic equipment and medium - Google Patents

Info

Publication number
CN117773934A
Authority
CN
China
Prior art keywords
language
position information
sample
action
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311871791.0A
Other languages
Chinese (zh)
Inventor
赵东东
阎石
孙万胜
陆福相
周兴文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou University
Original Assignee
Lanzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou University
Priority to CN202311871791.0A
Publication of CN117773934A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02: Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Manipulator (AREA)

Abstract

The application provides a language-guidance-based object grabbing method and device, an electronic device, and a medium, belonging to the technical field of artificial intelligence. The method includes: acquiring an environment state image and a language instruction; determining target object position information from the environment state image according to the language instruction, the target object position information being the object position information of the target object to be grabbed; performing action generation on the target object position information through a preset action generation network to obtain the tail end position information of the mechanical arm; and controlling the mechanical arm to grab the target object according to the tail end position information, thereby improving object grabbing accuracy. The application is funded by the National Natural Science Foundation of China (projects U22B2040 and 62233003).

Description

Language-guide-based object grabbing method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for capturing objects based on language guidance, an electronic device, and a medium.
Background
In the related art, a mechanical arm is used instead of a human to perform object grabbing tasks. A traditional mechanical arm control method needs to know the information of the working scene in advance; once moved to a new working scene, it cannot accurately grab objects. A deep-learning-based mechanical arm control method requires a large amount of data to pretrain a network, and the pretrained network then controls the mechanical arm to complete the grabbing task. However, such control methods generalize poorly and cannot adapt to new scenes, which reduces the accuracy of object grabbing.
Disclosure of Invention
The embodiment of the application mainly aims to provide an object grabbing method and device based on language guidance, electronic equipment and medium, and aims to improve the accuracy of object grabbing.
To achieve the above object, a first aspect of an embodiment of the present application proposes an object capturing method based on language guidance, the method including:
acquiring an environment state image and a language instruction;
determining target object position information from the environment state image according to the language instruction; the target object position information is object position information of a target object to be grabbed;
performing action generation on the target object position information through a preset action generation network to obtain tail end position information of the mechanical arm;
and controlling the mechanical arm to grasp the object according to the tail end position information.
In some embodiments, the determining target object position information from the environmental status image according to the language instruction includes:
performing language coding on the language instruction to obtain language coding characteristics;
extracting image features of the environmental state image to obtain environmental state features;
and determining the target object position information according to the language coding features and the environment state features.
In some embodiments, the action generating network is trained according to the following steps:
acquiring a sample language instruction and sample object state information;
performing action generation on the sample language instruction and the sample object state information through a first preset model to obtain a sample initial action, and obtaining an initial reward obtained by the mechanical arm executing the sample initial action;
updating the sample object state information according to the sample initial action to obtain candidate object state information;
performing action generation on the candidate object state information through a second preset model to obtain a sample candidate action, and obtaining candidate rewards according to the candidate object state information and the sample candidate action;
updating first model parameters of the first preset model according to the initial rewards and the candidate rewards;
updating second model parameters of the second preset model according to the first model parameters;
and taking the updated first preset model or the updated second preset model as the action generation network.
In some embodiments, updating the second model parameters of the second preset model according to the first model parameters includes:
taking the first model parameter as the second model parameter;
or,
performing parameter control on the first model parameters to obtain control data; and updating the second model parameters according to the control data.
In some embodiments, obtaining an initial reward for the robotic arm performing the sample initial action includes:
acquiring execution result data of the mechanical arm executing the sample initial action;
determining a first reward according to the execution result data;
determining a second reward according to the sample initial action and the sample object state information;
and determining the initial reward according to the first reward and the second reward.
In some embodiments, the language encoding the language instruction to obtain a language encoding feature includes:
global vector coding is carried out on the language instruction to obtain a language embedded vector;
and extracting vector features of the language embedding vector to obtain the language coding features.
In some embodiments, the determining the target object location information based on the linguistic coding feature and the environmental state feature comprises:
performing first feature alignment on the language coding feature and the environment state feature to obtain an object intention feature;
and performing second feature alignment on the object intention feature and the environment state feature to obtain the target object position information.
To achieve the above object, a second aspect of the embodiments of the present application proposes an object gripping device based on language guidance, the device comprising:
the acquisition module is used for acquiring the environment state image and the language instruction;
the position determining module is used for determining target object position information from the environment state image according to the language instruction; the target object position information is object position information of a target object to be grabbed;
the action generation module is used for performing action generation on the target object position information through a preset action generation network to obtain the tail end position information of the mechanical arm;
and the control module is used for controlling the mechanical arm to grasp the object according to the tail end position information.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the language-guide-based object capturing method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer readable storage medium storing a computer program, which when executed by a processor, implements the language-guide-based object gripping method described in the first aspect.
According to the language-guidance-based object grabbing method and device, the electronic device, and the computer-readable storage medium provided by the embodiments of the present application, the mechanical arm is guided to execute the object grabbing task by a language instruction, based on the acquired environment state image and language instruction. The target object position information is determined from the environment state image according to the language instruction; the language instruction enables the mechanical arm to understand the object intention, so that the object position information of the target object to be grabbed is determined from the environment state image based on that intention, interference from other environment information is avoided, and object grabbing accuracy is improved. Action generation is performed on the target object position information through a preset action generation network, so that the model can adapt to the environment information and determine the action to be executed by the mechanical arm, yielding the tail end position information of the mechanical arm; the tail end of the mechanical arm is then moved to the spatial position represented by the tail end position information to grab the object. Controlling the mechanical arm according to the tail end position information thus achieves accurate grabbing of the object.
Drawings
FIG. 1 is a schematic diagram of a robotic arm simulation environment provided by an embodiment of the present application;
FIG. 2 is a flow chart of an object grabbing method based on language guidance provided in an embodiment of the present application;
FIG. 3 is a flowchart of step S220 in FIG. 2;
FIG. 4 is a flowchart of step S310 in FIG. 3;
FIG. 5 is a flowchart of step S330 in FIG. 3;
FIG. 6 is a schematic diagram of a feature alignment process provided by an embodiment of the present application;
FIG. 7 is a flow chart of an action generating network training process provided by an embodiment of the present application;
FIG. 8 is a flowchart of step S720 in FIG. 7;
FIG. 9 is a flowchart of step S760 in FIG. 7;
FIG. 10 is a schematic illustration of an object gripping process provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of an object gripping device based on language guidance according to an embodiment of the present application;
fig. 12 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
With the rapid development of artificial intelligence technology, mechanical arms have been widely used in various industries to replace manual work in object grabbing tasks. For example, in the agricultural field, they replace workers in judging fruit maturity and picking fruit, and in spraying pesticides, protecting workers from pesticide injury; in the manufacturing field, they complete large-batch, repetitive, heavy-load workshop production work; in the medical field, they complete tasks such as drug delivery and surgical assistance; in the military field, they complete dangerous tasks such as weapon manufacture and ammunition fuze assembly; in the aerospace field, they replace astronauts in eight kinds of on-orbit tasks such as carrying cargo outside the cabin, inspecting the state outside the cabin, and maintaining large equipment outside the cabin.
In the related art, mechanical arm control methods based on traditional approaches need to know the working scene and model information in advance and can only complete single, simple tasks; when moved to a new scene, the prior work must be redone, at great cost. Deep-learning-based mechanical arm control methods typically require a large amount of data to pre-train a network to complete a series of mechanical arm sub-tasks. These methods have no autonomous exploration ability and cannot adapt to new scenes, resulting in reduced object grabbing accuracy.
Based on the above, the embodiment of the application provides an object grabbing method based on language guidance, an object grabbing device based on language guidance, electronic equipment and a computer readable storage medium, which aim to improve the accuracy of object grabbing through language guidance.
The language-guide-based object capturing method, the language-guide-based object capturing device, the electronic device and the computer-readable storage medium provided in the embodiments of the present application are specifically described through the following embodiments, and the language-guide-based object capturing method in the embodiments of the present application is described first.
The embodiment of the application provides an object grabbing method based on language guidance, relating to the technical field of artificial intelligence. The object grabbing method based on language guidance provided by the embodiment of the application can be applied to a terminal, a server, or software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server may be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data, and artificial intelligence platforms; the software may be an application implementing the object grabbing method based on language guidance, but is not limited to the above forms.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Referring to fig. 1, the embodiment of the application builds a mechanical arm simulation environment and performs the object grabbing task in it. An object model, a mechanical arm model, a clamping jaw model, a camera model, a table model, and other auxiliary equipment are loaded into the simulation environment, and grabbing of the object model is achieved through interaction of the mechanical arm model and the clamping jaw model. The mechanical arm model adopts a UR5e mechanical arm; the camera model is used to acquire object image data of the object model and may adopt a Kinect depth camera. The simulation environment is built on the Robot Operating System (ROS). The mechanical arm model, clamping jaw model, and camera model are imported into ROS in the Unified Robot Description Format (URDF), and a MoveIt configuration file is generated by the setup assistant in ROS. In this configuration file, the clamping jaw model is connected to the tail end of the mechanical arm model as a whole. Based on the built simulation environment, the object is loaded onto the desktop, and the mechanical arm is controlled to grab the object in a proper pose.
Fig. 2 is an optional flowchart of a language guidance-based object capturing method provided in an embodiment of the present application, and is applied to a robotic arm, where the method in fig. 2 may include, but is not limited to, steps S210 to S240.
Step S210, acquiring an environment state image and a language instruction;
step S220, determining target object position information from the environment state image according to the language instruction; the target object position information is object position information of a target object to be grabbed;
step S230, performing action generation on the target object position information through a preset action generation network to obtain the tail end position information of the mechanical arm;
and step S240, controlling the mechanical arm to grasp the object according to the tail end position information.
In step S210 of some embodiments, in order to achieve object grabbing, the position of the object needs to be perceived using a perceiving device. During the grabbing task the state of the object changes constantly, and the state at each moment is represented by a height map. Object perception is achieved through a depth camera fixed above the object, which collects the object height map. The depth camera comprises a color camera for collecting color images and a depth sensor for collecting depth images. The two kinds of image data, color and depth, are fused to obtain the environment state image. The environment state image is an RGB-D image used to represent the object state, the clamping jaw state, and the like in the working scene of the mechanical arm.
The specific acquisition process of the environment state image is as follows: the color image and the depth image are projected onto a three-dimensional point cloud, which is orthographically back-projected upward along the direction of gravity to construct a height-map image representation with color channels and a bottom-height channel. Characteristic values are extracted from the color image and the depth image respectively to obtain color information and depth information. Fusing the color information and the depth information gives the image spatial feature vectors beyond the color and depth features. The pose state of the mechanical arm is predicted from the feature vectors and iteratively optimized in a loop, thereby completing the grabbing task.
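As an illustration of this height-map construction, the following is a minimal sketch assuming a pinhole camera model looking straight down at the table; the intrinsics (fx, fy, cx, cy), workspace bounds, and grid resolution are illustrative parameters, not values from this application.

```python
import numpy as np

def build_heightmap(color, depth, fx, fy, cx, cy, bounds, resolution):
    """Project an RGB-D frame to a point cloud, then orthographically
    back-project it upward (against gravity) into a height map with
    color channels and a bottom-height channel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Pinhole deprojection: pixel (u, v) with depth z -> camera-frame point.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = color.reshape(-1, 3)

    # Keep only points inside the tabletop workspace.
    (x0, x1), (y0, y1), (z0, z1) = bounds
    mask = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
            (points[:, 1] >= y0) & (points[:, 1] < y1))
    points, colors = points[mask], colors[mask]

    # Orthographic back-projection: the highest point in each grid cell wins.
    rows = ((points[:, 1] - y0) / resolution).astype(int)
    cols = ((points[:, 0] - x0) / resolution).astype(int)
    n_rows, n_cols = int((y1 - y0) / resolution), int((x1 - x0) / resolution)
    height = np.zeros((n_rows, n_cols), dtype=np.float32)
    rgb = np.zeros((n_rows, n_cols, 3), dtype=colors.dtype)
    for r, c, p, col in zip(rows, cols, points, colors):
        elevation = z1 - p[2]          # height above the table, assuming the
        if elevation > height[r, c]:   # camera looks straight down
            height[r, c] = elevation
            rgb[r, c] = col
    return rgb, height                 # color channels + bottom-height channel
```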
The language instruction is unstructured natural language used to describe the underlying intention toward the object. For example, the language instruction may be "grab the object". The embodiments of the application do not limit the form of the language instruction, which may be text or speech; speech may be obtained as a verbal description and converted into text. Language instructions may also be used to control clamping jaw opening, closing, or initialization.
The embodiment of the application controls the opening and closing of the clamping jaw through a serial port. The clamping jaw is connected to a host computer; the serial port is set to port='/dev/ttyUSB0' with a baud rate of 115200 bits/second, 8 data bits, 1 stop bit, and no parity bit. Through a language instruction, the serial port parameters corresponding to the clamping jaw are initialized to [0x01, 0x06, 0x01, 0x00, 0xA5, 0x48, 0x4D]. After initialization, an instruction is sent to open the clamping jaw, whose serial port parameters in the open state are [0x01, 0x06, 0x01, 0x03, 0xE8, 0x78, 0x88]; an instruction is sent to close the clamping jaw, whose serial port parameters in the closed state are [0x01, 0x06, 0x01, 0x03, 0x00, 0x78, 0x36].
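For illustration, a minimal sketch of this serial-port control is given below, assuming the pyserial package; the byte frames are copied from the text as given, not extended or corrected.

```python
import serial

PORT = '/dev/ttyUSB0'
INIT_FRAME  = bytes([0x01, 0x06, 0x01, 0x00, 0xA5, 0x48, 0x4D])  # initialize
OPEN_FRAME  = bytes([0x01, 0x06, 0x01, 0x03, 0xE8, 0x78, 0x88])  # open jaw
CLOSE_FRAME = bytes([0x01, 0x06, 0x01, 0x03, 0x00, 0x78, 0x36])  # close jaw

def make_jaw_port():
    # 115200 bit/s, 8 data bits, 1 stop bit, no parity, as specified above.
    return serial.Serial(port=PORT, baudrate=115200,
                         bytesize=serial.EIGHTBITS,
                         stopbits=serial.STOPBITS_ONE,
                         parity=serial.PARITY_NONE)

with make_jaw_port() as jaw:
    jaw.write(INIT_FRAME)   # initialize the clamping jaw
    jaw.write(OPEN_FRAME)   # open the clamping jaw
    jaw.write(CLOSE_FRAME)  # close the clamping jaw to grasp
```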
The three-dimensional position of the object is predicted based on the height map. Specifically, if the language instruction indicates grabbing an object, the area of the target object to be grabbed is determined from the environment state image according to the language instruction, yielding the target object position information, i.e., the object position information of the target object. Conditioning on the language instruction allows finer-grained control of the mechanical arm while reducing the ambiguity of the working scene. A specific calculation method for the target object position information is described below.
Referring to fig. 3, in some embodiments, step S220 may include, but is not limited to, steps S310 to S330:
step S310, language coding is carried out on the language instruction to obtain language coding characteristics;
step S320, extracting image features of the environmental state image to obtain environmental state features;
step S330, determining the position information of the target object according to the language coding feature and the environment state feature.
In step S310 of some embodiments, the language instruction is language-coded to convert the language instruction into a vector that can be processed by a computer, resulting in a language-coded feature. The language encoding feature is a vectorized representation of the language instructions.
In step S320 of some embodiments, in order to extract a more abstract object state feature representation and improve the object recognition accuracy of the mechanical arm, image features are extracted from the environment state image through a preset neural network, obtaining the environment state features. The preset neural network may be a Faster RCNN. The environment state features characterize the spatial state information of the objects, such as shape, position, size, and the distribution among objects.
In step S330 of some embodiments, the area where the target object is located is determined from the environment state image according to the language coding features and the environment state features; the two-dimensional coordinates of that area are mapped according to the intrinsic and extrinsic parameters of the depth camera and converted into the mechanical arm coordinate system, obtaining the three-dimensional position coordinates of the target object, which are used as the target object position information.
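A minimal sketch of this coordinate mapping follows, assuming a pinhole camera model with intrinsic matrix K and a 4x4 camera-to-arm extrinsic transform; the variable names are illustrative, and the calibration values would come from the actual camera.

```python
import numpy as np

def pixel_to_arm_frame(u, v, depth, K, T_cam_to_arm):
    """Deproject pixel (u, v) with measured depth into the camera frame,
    then transform the point into the mechanical-arm base frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Camera-frame 3D point from the pinhole model (homogeneous coordinates).
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth, 1.0])
    # Extrinsics: homogeneous transform from camera frame to arm base frame.
    return (T_cam_to_arm @ p_cam)[:3]
```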
In the steps S310 to S330, the area where the target object is located may be determined from the environmental status image by the language instruction, so as to adjust the pose status of the mechanical arm according to the area where the target object is located, thereby reducing the interference of irrelevant environmental variables. The grasping process of the mechanical arm can be guided by language prompts based on language instructions so as to control the mechanical arm in a finer granularity.
Referring to fig. 4, in some embodiments, step S310 may include, but is not limited to, steps S410 to S420:
step S410, global vector coding is carried out on language instructions to obtain language embedded vectors;
step S420, extracting vector features of the language embedding vector to obtain language coding features.
In step S410 of some embodiments, to convert the language instruction into processable information, word-embedding encoding is performed: the language instruction is converted into vectors in a high-dimensional space, obtaining the language embedding vector. The encoding adopts global vector coding (GloVe, Global Vectors for Word Representation), and the language instruction may be encoded by a pretrained global vector coding model. The language instruction comprises a plurality of words, and the language embedding vector is the word-vector representation of these words.
The global vector coding model obtains word frequency statistics from a general corpus and learns word representations from those statistics. The global vector coding model is trained as follows: a text corpus is obtained, comprising a plurality of sentences, each comprising words. The frequency with which every two words appear in the same sentence (context) is counted, and the co-occurrence matrix is determined from these frequencies. The words are vectorized by a preset model to obtain word vectors. Loss data is calculated from the co-occurrence matrix and the word vectors, and the model parameters of the preset model are adjusted by minimizing the loss data, yielding the global vector coding model. The calculation method of the loss data is shown in formula (1).
J = ∑_{i,j=1}^{N} f(X_{i,j})(v_i^T v_j + b_i + b_j − log X_{i,j})² formula (1)
wherein J is the loss data; f(·) is a weight function; X is the co-occurrence matrix statistically extracted from the text corpus; X_{i,j} represents how frequently word i and word j occur in the same context; v_i is the word vector of word i; v_j is the word vector of word j; b_i and b_j are bias terms; and N represents the number of words in the text corpus. The weight function is shown in formula (2).
f(x) = (x/x_max)^α when x < x_max, and f(x) = 1 otherwise formula (2)
wherein x represents the input parameter (a co-occurrence count); x_max represents the maximum value of the input parameter; and α is the weighting exponent.
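For illustration, a minimal numpy sketch of formulas (1) and (2) is given below, following the text's single set of word vectors v and biases b; the exponent α = 0.75 and cutoff x_max = 100 are the values commonly used with GloVe, not values stated in this application.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # Formula (2): down-weight rare pairs, cap the weight of frequent ones.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(v, b, X):
    """v: (N, d) word vectors; b: (N,) biases; X: (N, N) co-occurrence matrix."""
    eps = 1e-10                      # guard against log(0) for non-co-occurring pairs
    dots = v @ v.T                   # v_i . v_j for all word pairs
    residual = dots + b[:, None] + b[None, :] - np.log(X + eps)
    return np.sum(glove_weight(X) * residual ** 2)   # formula (1)
```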
In step S420 of some embodiments, vector feature extraction is performed on the language embedding vector through a GRU (Gated Recurrent Unit) network to extract the hidden context features in the language embedding vector, capture long-term dependencies, and obtain the language coding features. The GRU network comprises an update gate and a reset gate: the update gate controls how much the hidden state of the previous time step and the input of the current time step influence the hidden state of the current time step, and the reset gate controls how much of the hidden state of the previous time step is used when computing the candidate hidden state. Specifically, the candidate hidden state is computed from the reset gate, the hidden state of the previous time step, and the input of the current time step; the hidden state of the current time step is then computed by the update gate from the candidate hidden state and the hidden state of the previous time step. The language coding features are obtained from the hidden states of the time steps.
Through the above steps S410 to S420, the semantic information of the language instruction can be extracted, so that the state of the clamping jaw is controlled based on the semantic information to achieve object grabbing.
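A minimal PyTorch sketch of steps S410 to S420 follows, assuming pretrained GloVe weights are available as a tensor; the vocabulary size and dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn

class InstructionEncoder(nn.Module):
    def __init__(self, glove_weights, hidden_dim=256):
        super().__init__()
        # Frozen GloVe table: one row per vocabulary word (step S410).
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        # GRU with update/reset gates captures long-term context (step S420).
        self.gru = nn.GRU(glove_weights.shape[1], hidden_dim, batch_first=True)

    def forward(self, token_ids):
        vectors = self.embed(token_ids)         # (batch, seq_len, embed_dim)
        outputs, last_hidden = self.gru(vectors)
        return last_hidden.squeeze(0)           # language coding feature

# Usage with random stand-in GloVe weights:
encoder = InstructionEncoder(torch.randn(5000, 300))
feature = encoder(torch.randint(0, 5000, (1, 6)))   # e.g. tokens of "grab the bottle"
```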
Referring to fig. 5, in some embodiments, step S330 may include, but is not limited to, steps S510 to S520:
step S510, performing first feature alignment on the language coding feature and the environment state feature to obtain an object intention feature;
step S520, performing second feature alignment on the object intention feature and the environment state feature to obtain the target object position information.
In step S510 of some embodiments, as shown in fig. 6, after the language instruction is language-coded through the GRU network and image features are extracted from the environment state image through the Faster RCNN network, the language coding feature and the environment state feature are multiplied to perform the first feature alignment, so that the clamping jaw state matches the language instruction, yielding the object intention feature. The object intention feature describes the action that the instruction intends the mechanical arm to perform under the object state characterized by the environment state image. For example, if the language instruction is to grab an object, the object intention feature may be that the clamping jaw is in a closed state; if the language instruction is to put down an object, the object intention feature may be that the clamping jaw is in an open state.
In step S520 of some embodiments, the object intention feature is activated by the softmax function, and the activated object intention feature is multiplied by the environment state feature to perform the second feature alignment, so as to match the area of the target object mentioned in the language instruction from the environment state image, match the object to be grabbed on the desktop, avoid grabbing other objects, and obtain the target object position information. For example, if the language instruction is to grab the bottle, the bottle can be identified from the environment state image through feature alignment, obtaining the three-dimensional position information of the bottle so that it is accurately grabbed.
In the above steps S510 to S520, through the first feature alignment, different types of language instructions can be matched to the corresponding actions of the mechanical arm, realizing language-action matching; through the second feature alignment, the target object can be identified from the environment state image, so that the mechanical arm performs the corresponding action on it. Guiding the mechanical arm to grab the object through language instructions improves object grabbing accuracy.
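The two alignments can be sketched as follows in PyTorch, assuming the language coding feature and the per-region environment state features have already been projected to a common dimension; the shapes and the reduction choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def align_features(lang_feat, env_feat):
    """lang_feat: (batch, d); env_feat: (batch, n_regions, d).
    Returns the index of the region matching the instruction."""
    # First alignment: elementwise product matches instruction and scene (S510).
    intent = lang_feat.unsqueeze(1) * env_feat        # object intention feature
    # Activate the intention feature and align it with the scene again (S520).
    weights = F.softmax(intent.sum(dim=-1), dim=-1)   # (batch, n_regions)
    scores = (weights.unsqueeze(-1) * env_feat).sum(dim=-1)
    # The region with the largest response is taken as the target object.
    return scores.argmax(dim=-1)
```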
In step S230 of some embodiments, the target object position information is input into the action generation network for action generation, obtaining the tail end position information of the mechanical arm. The action generation network is a deep Q network. The tail end position information may be the three-dimensional coordinates of the tail end of the mechanical arm, or the three-dimensional coordinates together with the rotation angle of the tail end.
In order to enable the mechanical arm to quickly adapt to new scenes and strengthen its ability to explore them, the embodiment of the application constructs the action generation network through deep reinforcement learning, adjusting the behavior of the agent (the mechanical arm) through its continuous interaction with the environment (the working space of the mechanical arm). On the basis of reinforcement learning, the unstructured natural language of the language prompt is fused into reinforcement learning so that the mechanical arm can clearly understand human intention and execute related tasks, yielding a language-conditioned visuomotor policy. Conditioning the mechanical arm on instructions at run time allows the reinforcement learning policy to be controlled at a finer granularity while reducing scene ambiguity.
Referring to fig. 7, in some embodiments, the training process of the action generating network may include, but is not limited to, steps S710 to S770:
step S710, obtaining a sample language instruction and sample object state information;
step S720, performing action generation on the sample language instruction and the sample object state information through a first preset model to obtain a sample initial action, and obtaining an initial reward obtained by the mechanical arm executing the sample initial action;
Step S730, updating the state information of the sample object according to the initial action of the sample to obtain the state information of the candidate object;
step S740, performing action generation on the candidate object state information through a second preset model to obtain a sample candidate action, and obtaining candidate rewards according to the candidate object state information and the sample candidate action;
step S750, updating first model parameters of a first preset model according to the initial rewards and the candidate rewards;
step S760, updating the second model parameters of the second preset model according to the first model parameters;
in step S770, the updated first preset model or the updated second preset model is used as the action generating network.
In step S710 of some embodiments, training data is collected on the simulation platform ROS. The training data comprises a plurality of samples, each comprising a sample language instruction and sample object state information. The sample language instruction is in speech or text form; the sample object state information is in height-map form and is used to represent the position of the object relative to the tail end of the mechanical arm. One sample is randomly selected from the training data for the subsequent reinforcement learning process.
The training data collection process is as follows: an object is loaded onto the desktop in the simulation environment, and its three-dimensional position information is transmitted to the mechanical arm. After receiving the three-dimensional position information, the mechanical arm executes actions approaching the object; once it is close, an instruction to close the clamping jaw is sent to clamp the object, and the depth camera records the action of the mechanical arm successfully grabbing the object, yielding the sample object state information. The sample object state information is a height map of the mechanical arm successfully grabbing the object and reflects the relationship among the clamping jaw state, the tail end position information of the mechanical arm, and the object position information.
In step S720 of some embodiments, in order to alleviate the overestimation problem in conventional Q learning and improve training stability and convergence, the embodiments of the present application use two neural networks for reinforcement learning: a first preset model serving as the source network and a second preset model serving as the target network. The first preset model performs action selection and Q-value updating; in each training step it selects an action according to the current state and calculates that action's Q value. The second preset model provides the target Q value of the next action selected according to the next state, which is used to update the model parameters of the first preset model.
The first preset model and the second preset model are initialized from a pretrained expert model. The expert model is a two-layer linear network trained on sample object state information collected on the simulation platform. The model parameters of the first preset model are the first model parameters, and those of the second preset model are the second model parameters. The model parameters of the expert model are copied to the first model parameters and the second model parameters respectively to initialize the two models.
The sample language instruction is vectorized by the GloVe method to obtain an embedding vector, and features are extracted from the embedding vector through the GRU network to obtain language coding features. Features are extracted from the sample object state information through the Faster RCNN network to obtain object state features. The target object is matched from the sample object state information according to the language coding features and the object state features, obtaining the three-dimensional position information of the target object. In the design of the mechanical arm's reinforcement learning algorithm, its decision process is modeled as a Markov decision process; depending on the working scene, the actions of the mechanical arm may be continuous or discrete, and here the mechanical arm is generally required to take discrete actions. The three-dimensional position information is input into the first preset model, which selects one action from the discrete action space as the sample initial action; this action moves the mechanical arm to a preset position, namely the tail end position output by the first preset model. The initial reward obtained by the mechanical arm executing the sample initial action is then acquired.
In step S730 of some embodiments, performing the sample initial action updates the sample object state information to the next state, resulting in candidate object state information. If the sample object state information is the object state information of the current time step, the candidate object state information is the object state information of the next time step of the current time step.
In step S740 of some embodiments, action generation is performed on the candidate object state information through the second preset model to obtain the sample candidate action, which is the action with the largest Q value in the action space under the object state represented by the candidate object state information. The candidate reward is obtained according to the candidate object state information and the sample candidate action; its calculation method is shown in formula (3).
∑_{s′∈S} p(s′|s,a) max_{a′} Q(s′,a′) formula (3)
wherein S is the state space; s and s′ are object state information, s′ being the new state reached by taking action a in state s, i.e., s represents the sample object state information and s′ the candidate object state information; a is an action and a′ the sample candidate action; p(s′|s,a) is the probability of transitioning to the new state s′ by taking action a in state s; and max_{a′} Q(s′,a′) is the maximum Q value over actions a′ taken in the new state s′.
In step S750 of some embodiments, the first preset model adds the initial reward and the candidate reward using the Bellman equation to obtain the target reward, and the first model parameters of the first preset model are updated by maximizing the target reward. The target reward is the total reward obtained by taking the sample initial action under the sample object state information. Its calculation method is shown in formula (4).
Q(s,a) = r(s,a) + γ∑_{s′∈S} p(s′|s,a) max_{a′} Q(s′,a′) formula (4)
wherein r(s,a) is the initial reward; Q(s,a) is the target reward; and γ is a weight factor measuring the relative importance of the initial reward and the candidate reward.
In order to increase the generalization of the mechanical arm's actions so that it can better explore the environment during training, noise is introduced to randomize the actions. The noise is Gaussian noise, a random value drawn from the Gaussian distribution N(0, σ²), where 0 is the mean and σ² the variance. The initial reward, the candidate reward, and the Gaussian noise are added to obtain the target reward.
In step S760 of some embodiments, the second model parameters of the second preset model are updated with the first model parameters every preset time period. The preset time length can be set according to actual conditions.
In step S770 of some embodiments, when the convergence end condition is reached, the most recently updated first preset model or second preset model is used as the action generation network. The convergence end condition may be that the fluctuation of the target reward over multiple iterations is small and stable.
In the above steps S710 to S770, deep reinforcement learning is combined with natural language processing by embedding the language model into the reinforcement learning model, so that the mechanical arm can clearly understand the object intention and execute complex tasks. The resulting action generation network generates the actions of the mechanical arm, realizing accurate object grabbing, improving the intelligence of the mechanical arm, and providing guidance for applying reinforcement learning in actual production.
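For illustration, one training update over steps S720 to S760 might look like the following PyTorch sketch; the discount factor, noise scale, and network interfaces are illustrative assumptions, not values from this application.

```python
import torch
import torch.nn.functional as F

def dqn_update(source_net, target_net, optimizer, batch, gamma=0.99, sigma=0.1):
    """source_net / target_net: modules mapping states to (batch, n_actions) Q values.
    batch: (state, action, reward, next_state), with action a LongTensor of indices."""
    state, action, reward, next_state = batch
    # Q value of the executed action under the first (source) model.
    q = source_net(state).gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Expected max next-step Q from the second (target) model, formula (3).
        next_q = target_net(next_state).max(dim=1).values
        noise = sigma * torch.randn_like(reward)      # Gaussian exploration noise
        # Target reward: initial reward + noise + discounted candidate reward (formula (4)).
        target = reward + noise + gamma * next_q
    loss = F.mse_loss(q, target)                      # minimize the Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```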
Referring to fig. 8, in some embodiments, step S720 may include, but is not limited to, steps S810 to S840:
step S810, obtaining execution result data of the mechanical arm executing the sample initial action;
step S820, determining a first reward according to the execution result data;
step S830, determining a second reward according to the sample initial action and the sample object state information;
step S840, determining an initial prize according to the first prize and the second prize.
In step S810 of some embodiments, the execution result data characterizes whether or not the object was successfully grabbed after the mechanical arm performed the sample initial action.
In step S820 of some embodiments, the initial reward is divided into two parts: the first part is the first reward and the second part is the second reward. The initial reward measures the gap between the mechanical arm's decision and its actual execution, i.e., the gap between the mechanical arm tail end position output by the action generation network and the target tail end position in actual execution. If the execution result data indicates that the mechanical arm successfully grabbed the object after executing the sample initial action, the first reward is set to 1.5. If the execution result data indicates that the mechanical arm did not successfully grab the object, the sample initial action needs to be punished, and the first reward is set to −0.1.
In step S830 of some embodiments, the first reward is sparse and cannot measure how well the mechanical arm performs the action, so a penalty term is added. The second reward is a negative distance-norm penalty term that measures the extent to which the arm completes the motion based on the distance between the tail end position of the mechanical arm and the object position. The smaller the distance, the better the executed action, the smaller the added penalty, and the larger the total reward; the larger the distance, the larger the penalty and the smaller the total reward. The sample initial action reflects the tail end position of the mechanical arm, and the sample object state information reflects the object position, which can be obtained by mapping the height map with the intrinsic and extrinsic parameters of the depth camera. Denoting the tail end position given by the sample initial action as grasp_pos and the object position as obj_pos, the second reward is −‖obj_pos − grasp_pos‖, where ‖obj_pos − grasp_pos‖ is the distance between the tail end position of the mechanical arm and the object position and may be a Euclidean distance, a Mahalanobis distance, or the like.
In step S840 of some embodiments, the first reward and the second reward are added to obtain the initial reward. The calculation method of the initial reward is shown in formula (5).
r(s,a) = 1.5 − ‖obj_pos − grasp_pos‖ if the grab succeeds, and r(s,a) = −0.1 − ‖obj_pos − grasp_pos‖ otherwise formula (5)
Through the above steps S810 to S840, the initial reward can be obtained. It helps the mechanical arm better adjust its behavior in the next state during learning, reducing the gap between decision and execution. Through continuous feedback and adjustment, the mechanical arm gradually masters tasks such as grabbing objects and opening doors.
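A minimal sketch of the initial reward of formula (5), using the Euclidean norm as the distance:

```python
import numpy as np

def initial_reward(success, grasp_pos, obj_pos):
    first = 1.5 if success else -0.1                      # first reward (S820)
    second = -np.linalg.norm(np.asarray(obj_pos) -
                             np.asarray(grasp_pos))       # second reward (S830)
    return first + second                                 # initial reward (S840)
```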
Referring to fig. 9, in some embodiments, step S760 may include, but is not limited to, step S910 or step S920:
step S910, taking the first model parameter as a second model parameter;
step S920, performing parameter control on the first model parameters to obtain control data; and updating the second model parameters according to the control data.
In step S910 of some embodiments, the first model parameter is denoted as θ, the second model parameter is denoted as θ ', and the first model parameter θ is completely copied to the second model parameter θ'. The first model parameter and the second model parameter are both weight parameters of the neural network, which are typically expressed as a function of the state s and the action a.
In step S920 of some embodiments, the second model parameters are updated in a soft-update manner. Instead of copying the first model parameters to the second preset model directly, soft updating copies them only partially, so that the network parameters of the second preset model change smoothly, achieving stable learning. The degree of copying is controlled by a hyperparameter τ with a value between 0 and 1. Multiplying τ by the first model parameters gives the control data; subtracting τ from 1 gives the weight of the second model parameters. The weight is multiplied by the second model parameters, and the result is added to the control data to obtain the updated second model parameters. The soft-update method is shown in formula (6).
θ′ = τθ + (1 − τ)θ′ formula (6)
wherein τ is the hyperparameter; θ is the first model parameter; and θ′ is the second model parameter.
Through the above steps S910 to S920, the two neural networks can be obtained, improving the stability and convergence of training and accelerating the training process of the model.
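Both update modes can be sketched in PyTorch as follows; the value τ = 0.005 is an illustrative choice, not a value from this application.

```python
import torch

@torch.no_grad()
def hard_update(source_net, target_net):
    # Step S910: copy the first model parameters completely.
    target_net.load_state_dict(source_net.state_dict())

@torch.no_grad()
def soft_update(source_net, target_net, tau=0.005):
    # Step S920, formula (6): theta' = tau * theta + (1 - tau) * theta'.
    for p_src, p_tgt in zip(source_net.parameters(), target_net.parameters()):
        p_tgt.mul_(1.0 - tau).add_(tau * p_src)
```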
In step S240 of some embodiments, the mechanical arm is controlled to move to the position characterized by the tail end position information, and the clamping jaw is closed to grab the target object. The process of the mechanical arm grabbing the object is shown in fig. 10; as can be seen from fig. 10, the mechanical arm successfully grabs the target object in the simulation environment.
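For illustration, executing the grasp in the ROS/MoveIt setup described earlier might look like the following sketch; the planning-group name "manipulator", the top-down orientation, and the close_jaw helper (wrapping the serial-port frames shown earlier) are assumptions, not details from this application.

```python
import sys
import moveit_commander
import geometry_msgs.msg

def execute_grasp(end_position, close_jaw):
    """end_position: (x, y, z) tail end coordinates from the action generation network."""
    moveit_commander.roscpp_initialize(sys.argv)
    arm = moveit_commander.MoveGroupCommander("manipulator")
    target = geometry_msgs.msg.Pose()
    target.position.x, target.position.y, target.position.z = end_position
    target.orientation.w = 1.0          # top-down grasp orientation assumed
    arm.set_pose_target(target)
    arm.go(wait=True)                   # move the tail end to the target position
    arm.stop()
    close_jaw()                         # close the clamping jaw to grab the object
```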
Referring to fig. 11, the embodiment of the present application further provides an object capturing device based on language guidance, which may implement the above-mentioned object capturing method based on language guidance, where the object capturing device based on language guidance includes:
an acquisition module 1110, configured to acquire an environmental status image and a language instruction;
a position determining module 1120, configured to determine target object position information from the environmental status image according to the language instruction; the target object position information is object position information of a target object to be grabbed;
the action generation module 1130 is configured to perform action generation on the target object position information through a preset action generation network to obtain the tail end position information of the mechanical arm;
The control module 1140 is configured to control the mechanical arm to perform object capturing on the target object according to the end position information.
The specific implementation manner of the object capturing device based on language guidance is basically the same as the specific embodiment of the object capturing method based on language guidance, and will not be described herein.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the object grabbing method based on language guidance when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 12, fig. 12 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1210 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solutions provided by the embodiments of the present application;
memory 1220 may be implemented in the form of read-only memory (ReadOnlyMemory, ROM), static storage, dynamic storage, or random access memory (RandomAccessMemory, RAM). Memory 1220 may store an operating system and other application programs, and when implementing the technical solutions provided in the embodiments of the present application by software or firmware, relevant program codes are stored in memory 1220 and invoked by processor 1210 to perform the language-guide-based object grabbing method of the embodiments of the present application;
An input/output interface 1230 for implementing information input and output;
the communication interface 1240 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
bus 1250 for transferring information between the various components of the device (e.g., processor 1210, memory 1220, input/output interface 1230, and communication interface 1240);
wherein processor 1210, memory 1220, input/output interface 1230 and communication interface 1240 are communicatively coupled to each other within the device via bus 1250.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the object grabbing method based on language guidance when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a logical functional division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. The couplings, direct couplings, or communication connections shown or discussed between components may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing a program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Preferred embodiments of the present application have been described above with reference to the accompanying drawings; this description does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. An object grabbing method based on language guidance, which is characterized by comprising the following steps:
acquiring an environment state image and a language instruction;
determining target object position information from the environment state image according to the language instruction; the target object position information is object position information of a target object to be grabbed;
performing action generation on the target object position information through a preset action generation network to obtain end-effector position information of the mechanical arm;
and controlling the mechanical arm to grab the target object according to the end-effector position information.
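By way of illustration only, the following Python sketch shows how the four steps of claim 1 might be composed. Every name in it (`grounding_net`, `action_net`, `arm.move_to`, and so on) is a hypothetical placeholder, not the implementation of this application; the grounding and action generation networks are elaborated in claims 2 to 7.

```python
# Minimal pipeline sketch for claim 1. All helper objects are hypothetical
# placeholders standing in for the networks described in claims 2-7.

def grab_object(env_image, instruction, grounding_net, action_net, arm):
    # Steps 1-2: from the environment state image and the language
    # instruction, determine the position of the target object to grab.
    target_pos = grounding_net(env_image, instruction)
    # Step 3: generate end-effector position information from the target
    # object position via the preset action generation network.
    end_effector_pos = action_net(target_pos)
    # Step 4: control the mechanical arm to grab the target object.
    arm.move_to(end_effector_pos)
    arm.close_gripper()
    return end_effector_pos
```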
2. The object grabbing method based on language guidance according to claim 1, wherein the determining target object position information from the environment state image according to the language instruction comprises:
performing language coding on the language instruction to obtain language coding features;
extracting image features of the environmental state image to obtain environmental state features;
and determining the position information of the target object according to the language coding features and the environment state features.
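One way to realize claim 2 is sketched below under stated assumptions: a GRU text encoder, a small CNN image encoder, and concatenation fusion, none of which are asserted by the application; all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class GroundingNet(nn.Module):
    # Sketch of claim 2 under assumptions: GRU text encoder, CNN image
    # encoder, concatenation fusion. All sizes are illustrative.

    def __init__(self, vocab_size=1000, embed_dim=64, feat_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_enc = nn.GRU(embed_dim, feat_dim, batch_first=True)
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, feat_dim, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(2 * feat_dim, 3)  # (x, y, z) object position

    def forward(self, image, token_ids):
        # Language coding: embed the tokens and keep the final GRU state.
        _, h = self.text_enc(self.embed(token_ids))
        # Environment state features from the environment state image.
        img_feat = self.img_enc(image)
        # Determine the target object position from both feature sets.
        fused = torch.cat([h[-1], img_feat], dim=-1)
        return self.head(fused)
```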
3. The object grabbing method based on language guidance according to claim 1, wherein the action generation network is trained according to the following steps:
acquiring a sample language instruction and sample object state information;
performing action generation on the sample language instruction and the sample object state information through a first preset model to obtain a sample initial action, and obtaining an initial reward for the mechanical arm executing the sample initial action;
updating the sample object state information according to the sample initial action to obtain candidate object state information;
performing action generation on the candidate object state information through a second preset model to obtain a sample candidate action, and obtaining a candidate reward according to the candidate object state information and the sample candidate action;
updating first model parameters of the first preset model according to the initial reward and the candidate reward;
updating second model parameters of the second preset model according to the first model parameters;
and taking the updated first preset model or the updated second preset model as the action generation network.
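Claim 3 reads like a temporal-difference scheme with an online model and a target model. The sketch below is one plausible reading under that assumption, with a hypothetical `env.step` interface and models that return an action together with a value estimate; it is not asserted to be the patented training rule.

```python
import torch
import torch.nn as nn

def train_step(first_model, second_model, optimizer, env, instruction,
               state, gamma=0.99, tau=0.005):
    # First preset model generates the sample initial action; the arm
    # executes it, yielding the initial reward and the candidate state.
    action, value = first_model(instruction, state)
    initial_reward, candidate_state = env.step(action)

    # Second preset model generates the candidate action and its value,
    # from which the candidate reward target is formed.
    with torch.no_grad():
        _, candidate_value = second_model(instruction, candidate_state)
        target = initial_reward + gamma * candidate_value

    # Update the first model parameters using both rewards.
    loss = nn.functional.mse_loss(value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Update the second model parameters from the first (see claim 4).
    with torch.no_grad():
        for p2, p1 in zip(second_model.parameters(), first_model.parameters()):
            p2.mul_(1.0 - tau).add_(tau * p1)
    return candidate_state
```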
4. The object grabbing method based on language guidance according to claim 3, wherein the updating second model parameters of the second preset model according to the first model parameters comprises:
taking the first model parameters as the second model parameters;
or,
performing parameter control on the first model parameters to obtain control data; and updating the second model parameters according to the control data.
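The two branches of claim 4 admit a common pair of readings: a hard parameter copy, and a controlled update such as Polyak averaging (the latter is an assumption about what "parameter control" means). A minimal sketch under those assumptions:

```python
import torch

@torch.no_grad()
def hard_update(second_model, first_model):
    # Branch 1: take the first model parameters as the second model parameters.
    second_model.load_state_dict(first_model.state_dict())

@torch.no_grad()
def soft_update(second_model, first_model, tau=0.005):
    # Branch 2: "parameter control" read as Polyak averaging; the control
    # data is a convex combination of the old and new parameters.
    for p2, p1 in zip(second_model.parameters(), first_model.parameters()):
        p2.copy_((1.0 - tau) * p2 + tau * p1)
```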
5. The object grabbing method based on language guidance according to claim 3, wherein the obtaining an initial reward for the mechanical arm executing the sample initial action comprises:
acquiring execution result data of the mechanical arm executing the sample initial action;
determining a first reward according to the execution result data;
determining a second reward based on the sample initial action and the sample object state information;
and determining the initial reward according to the first reward and the second reward.
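A minimal sketch of the reward composition in claim 5, assuming a success bonus for the execution result and a distance-shaping penalty derived from the action and the object state; both terms are illustrative assumptions, not the application's specific reward design.

```python
import numpy as np

def initial_reward(grasp_succeeded, action_pos, object_pos,
                   success_bonus=10.0, shaping_scale=1.0):
    # First reward: from the execution result data (grasp success or not).
    first = success_bonus if grasp_succeeded else 0.0
    # Second reward: from the sample initial action and the object state,
    # here a penalty on the end-effector-to-object distance.
    second = -shaping_scale * float(np.linalg.norm(
        np.asarray(action_pos) - np.asarray(object_pos)))
    # The initial reward combines both terms.
    return first + second
```

For example, `initial_reward(True, [0.1, 0.2, 0.3], [0.1, 0.2, 0.25])` yields the success bonus minus a small distance penalty.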
6. The object grabbing method based on language guidance according to claim 2, wherein the performing language coding on the language instruction to obtain language coding features comprises:
performing global vector coding on the language instruction to obtain a language embedding vector;
and extracting vector features of the language embedding vector to obtain the language coding features.
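"Global vector coding" plausibly refers to a GloVe-style lookup of pretrained word vectors; the sketch below adopts that reading (an assumption), with an LSTM as the vector feature extractor. Here `pretrained_vectors` is a `(vocab_size, embedding_dim)` tensor loaded from a pretrained vector file.

```python
import torch
import torch.nn as nn

class LanguageEncoder(nn.Module):
    # Sketch of claim 6, reading "global vector coding" as a GloVe-style
    # pretrained embedding lookup; this reading and all sizes are assumptions.

    def __init__(self, pretrained_vectors, hidden_dim=128):
        super().__init__()
        # Global vector coding: look up fixed pretrained word vectors to
        # obtain the language embedding vector for each token.
        self.embed = nn.Embedding.from_pretrained(pretrained_vectors,
                                                  freeze=True)
        # Vector feature extraction over the embedded token sequence.
        self.lstm = nn.LSTM(pretrained_vectors.size(1), hidden_dim,
                            batch_first=True)

    def forward(self, token_ids):
        vectors = self.embed(token_ids)   # language embedding vectors
        _, (h, _) = self.lstm(vectors)
        return h[-1]                      # language coding features
```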
7. The object grabbing method based on language guidance according to claim 2, wherein the determining the target object position information according to the language coding features and the environment state features comprises:
performing first feature alignment on the language coding features and the environment state features to obtain object intention features;
and performing second feature alignment on the object intention features and the environment state features to obtain the target object position information.
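One plausible realization of the two successive feature alignments in claim 7 is a pair of cross-attention blocks; the choice of cross-attention and all dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class TwoStageAlignment(nn.Module):
    # Sketch of claim 7 realizing each "feature alignment" as a
    # cross-attention block; dimensions are illustrative.

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.align1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.align2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 3)   # regress the target object position

    def forward(self, lang_feat, env_feat):
        # First alignment: language coding features attend to the
        # environment state features, yielding object intention features.
        intent, _ = self.align1(lang_feat, env_feat, env_feat)
        # Second alignment: intention features attend to the environment
        # state features again to localize the target object.
        aligned, _ = self.align2(intent, env_feat, env_feat)
        return self.head(aligned.mean(dim=1))  # target object position
```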
8. An object grabbing device based on language guidance, characterized in that the device comprises:
the acquisition module is used for acquiring the environment state image and the language instruction;
the position determining module is used for determining target object position information from the environment state image according to the language instruction; the target object position information is object position information of a target object to be grabbed;
the action generation module is used for performing action generation on the target object position information through a preset action generation network to obtain end-effector position information of the mechanical arm;
and the control module is used for controlling the mechanical arm to grab the target object according to the end-effector position information.
9. An electronic device, characterized in that it comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the object grabbing method based on language guidance according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the object grabbing method based on language guidance according to any one of claims 1 to 7.
CN202311871791.0A 2023-12-29 2023-12-29 Language-guide-based object grabbing method and device, electronic equipment and medium Pending CN117773934A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311871791.0A CN117773934A (en) 2023-12-29 2023-12-29 Language-guide-based object grabbing method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311871791.0A CN117773934A (en) 2023-12-29 2023-12-29 Language-guide-based object grabbing method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN117773934A true CN117773934A (en) 2024-03-29

Family

ID=90388952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311871791.0A Pending CN117773934A (en) 2023-12-29 2023-12-29 Language-guide-based object grabbing method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN117773934A (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111645073A (en) * 2020-05-29 2020-09-11 武汉理工大学 Robot visual semantic navigation method, device and system
CN112102405A (en) * 2020-08-26 2020-12-18 东南大学 Robot stirring-grabbing combined method based on deep reinforcement learning
KR20210054355A (en) * 2019-11-05 2021-05-13 경기대학교 산학협력단 Vision and language navigation system
CN112809689A (en) * 2021-02-26 2021-05-18 同济大学 Language-guidance-based mechanical arm action element simulation learning method and storage medium
CN112947458A (en) * 2021-02-26 2021-06-11 同济大学 Robot accurate grabbing method based on multi-mode information and computer readable medium
CN114029963A (en) * 2022-01-12 2022-02-11 北京具身智能科技有限公司 Robot operation method based on visual and auditory fusion
CN116197909A (en) * 2023-03-03 2023-06-02 中国科学院深圳先进技术研究院 Training method and device for mechanical arm grabbing model, electronic equipment and storage medium
CN116242359A (en) * 2023-02-08 2023-06-09 华南理工大学 Visual language navigation method, device and medium based on scene fusion knowledge
US20230182296A1 (en) * 2020-05-14 2023-06-15 Google Llc Training and/or utilizing machine learning model(s) for use in natural language based robotic control
CN116576861A (en) * 2023-05-17 2023-08-11 华南理工大学 Visual language navigation method, system, device and storage medium
CN116901110A (en) * 2023-06-15 2023-10-20 达闼机器人股份有限公司 Mobile robot control method, mobile robot control device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Vecerik et al. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards
Levine et al. End-to-end training of deep visuomotor policies
CN107150347B (en) Robot perception and understanding method based on man-machine cooperation
WO2018219198A1 (en) Man-machine interaction method and apparatus, and man-machine interaction terminal
US11714996B2 (en) Learning motor primitives and training a machine learning system using a linear-feedback-stabilized policy
KR102387305B1 (en) Method and device for learning multimodal data
WO2019155052A1 (en) Generative neural network systems for generating instruction sequences to control an agent performing a task
Araki et al. Autonomous acquisition of multimodal information for online object concept formation by robots
Araki et al. Online learning of concepts and words using multimodal LDA and hierarchical Pitman-Yor Language Model
US20210158141A1 (en) Control input scheme for machine learning in motion control and physics based animation
CN106406518A (en) Gesture control device and gesture recognition method
WO2021190433A1 (en) Method and device for updating object recognition model
Huang et al. Learning to pour
Canal et al. Gesture based human multi-robot interaction
Wodziński et al. Sequential classification of palm gestures based on A* algorithm and MLP neural network for quadrocopter control
Zhao et al. Model accelerated reinforcement learning for high precision robotic assembly
CN110348359B (en) Hand gesture tracking method, device and system
CN114758399A (en) Expression control method, device, equipment and storage medium of bionic robot
Zhang et al. Self-valuing learning and generalization with application in visually guided grasping of complex objects
CN117773934A (en) Language-guide-based object grabbing method and device, electronic equipment and medium
Roesler et al. Action learning and grounding in simulated human–robot interactions
Vanc et al. Context-aware robot control using gesture episodes
Takano et al. Synthesis of kinematically constrained full-body motion from stochastic motion model
Belkin World modeling for intelligent autonomous systems
US11731279B2 (en) Systems and methods for automated tuning of robotics systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination