CN116852353A - Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning - Google Patents

Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning

Info

Publication number
CN116852353A
CN116852353A
Authority
CN
China
Prior art keywords
mechanical arm
value
action
pushing
grabbing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310744751.3A
Other languages
Chinese (zh)
Inventor
沈捷
李鑫
曹恺
王莉
陈旭
吴宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202310744751.3A priority Critical patent/CN116852353A/en
Publication of CN116852353A publication Critical patent/CN116852353A/en
Pending legal-status Critical Current


Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1661 Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1679 Programme controls characterised by the tasks executed

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for grasping multiple target objects with a mechanical arm in dense scenes based on deep reinforcement learning. A mechanical arm and an RGB-D camera are added in the CoppeliaSim simulation software to build a scene similar to the real environment. The image data captured by the camera are orthographically transformed and rotated to generate height maps at different angles, which are input into a full convolutional neural network composed of a feature extraction module DenseNet121 and two parallel networks, a pushing network and a grasping network; the network outputs pixel-level Q-value maps, meaningful pixels are screened out with a mask function, and the maximum Q value of the pushing action at time t, Q_p^max(t), and the maximum Q value of the grasping action, Q_g^max(t), are obtained. An action is selected according to the maximum Q value in the current state, and the push reward value r_p is determined by whether the difference of the average relative distances of all objects in the scene before and after the push is greater than a threshold. For grasping in a real scene, the push-grasp collaborative policy network trained at the simulation end is applied to the physical platform, and the mechanical arm makes decisions according to the action selection rule in the real environment to complete the grasping task. The method enables the mechanical arm to concentrate on effective grasps and pushes to promote efficient learning of the model, can evaluate the influence of each candidate push on the object density in the working space, provides sufficient space for grasping, and improves the grasp success rate.

Description

Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning
Technical Field
The invention relates to the field of mechanical arm grasping in complex scenes, and in particular to a method for grasping multiple target objects with a mechanical arm in dense scenes based on deep reinforcement learning.
Background
Grasping objects is one of the basic directions in robotics research and plays an important role in many application scenarios. As work tasks become more difficult, the mechanical arm faces many obstacles in completing grasping tasks, so achieving accurate grasping of all objects by a mechanical arm in an unstructured operating environment remains challenging. Although object grasping can be addressed effectively with deep-learning-based object recognition, detection and pose estimation, these methods consume a large amount of time collecting and producing large-scale sample data, and because they target only known object models they still suffer from poor recognition ability and large positioning errors when handling multi-object tasks. Reinforcement learning lets an agent interact continuously with the environment and uses a preset reward as the feedback signal, without manual labelling, so combining deep learning with reinforcement learning for mechanical arm grasping has been considered; however, when handling closely stacked objects it is difficult to find suitable collision-free grasp points on the objects. Inspired by human behaviour, the synergy between pushing and grasping has become a way of coping with this situation and can effectively improve the grasp success rate. Nevertheless, this type of approach still suffers from low sample efficiency during training, and some reward functions only consider feature information in the local workspace, so the model easily predicts actions that are ineffective from a global perspective.
Therefore, designing a new push-grasp cooperative method based on deep reinforcement learning that improves training efficiency and, through an effective reward function, helps the robot raise its grasp success rate is a problem worth studying.
Disclosure of Invention
The invention provides a method for grasping multiple target objects with a mechanical arm in dense scenes based on deep reinforcement learning. Deep reinforcement learning is used to obtain a good push-grasp collaborative policy; a mask function is designed to screen effective actions and improve training efficiency; and, in order to provide a larger grasping space for the robot, the difference of the average relative distances of all objects in the working space is introduced as the indicator of the push reward function, so that the reward of a pushing action is linked to the change of object density in the environment. The method enables the mechanical arm to concentrate on effective grasps and pushes to promote efficient learning of the model, can evaluate the influence of each candidate push on the object density in the working space, provides sufficient space for grasping, and improves the grasp success rate.
1. The invention provides a method for capturing multiple target objects by a dense scene mechanical arm based on deep reinforcement learning, which comprises the following steps:
step S1: building an object-dense scene in the virtual simulation software CoppeliaSim for training the mechanical arm to complete the grasping task;
step S2: the image data captured by the camera are orthographically transformed and rotated to generate height maps at different angles; the height maps are input into a full convolutional neural network composed of a feature extraction module DenseNet121 and two parallel networks, a pushing network and a grasping network, which outputs pixel-level Q-value maps; meaningful pixels are screened out with a mask function, and the maximum Q value of the pushing action at time t, Q_p^max(t), and the maximum Q value of the grasping action, Q_g^max(t), are obtained;
step S3: selecting an action according to the maximum Q value in the current state, wherein the push reward value r_p is determined by whether the difference D_improved of the average relative distances of all objects in the scene before and after the push is greater than a threshold;
step S4: building a dense grasping environment in a real scene, applying the push-grasp collaborative policy network trained at the simulation end to the physical platform, and having the mechanical arm make decisions according to the action selection rule in the real environment to complete the grasping task.
2. Optionally, step S1 specifically includes the following steps:
step S1-1: adding a mechanical arm model and a two-finger gripper model in the CoppeliaSim simulation environment;
step S1-2: adding an RGB-D camera in the CoppeliaSim simulation environment and setting the working space in which the mechanical arm grasps;
step S1-3: adding simulated training objects, by random dropping, into the working space set in step S1-2.
3. Optionally, step S2 specifically includes the following steps:
step S2-1: capturing an RGB image I_c and a depth image I_d of the workspace with an RGB-D camera, obtaining a colour top view H_c and a depth-height map H_d as the state s_t through orthographic transformation and denoising, and generating a feature map H_cd by fusing the two images;
step S2-2: rotating the feature map H_cd generated in step S2-1 clockwise 16 times in turn, each rotation by 22.5°, and obtaining the 16 rotated feature maps H_cdk through affine transformation;
step S2-3: inputting the 16 groups of feature maps generated in step S2-2 into the full convolutional neural network composed of the feature extraction module DenseNet121 and two parallel networks, a pushing network and a grasping network, and outputting, for each action at time t, a group of pixel-level Q-value maps in which each pixel corresponds to a pushing or grasping position and angle;
step S2-4: to address the low sample efficiency of the mechanical arm during training, a mask function for screening effective actions is designed so that the mechanical arm can concentrate on effective action positions and the model learns efficiently; based on the depth-height map H_d, a grasping mask m_g and a pushing mask m_p are generated, the Q-value maps output by the full convolutional network are screened with the masks, and the maximum Q values of the pushing and grasping actions are calculated as
Q_p^max(t) = max(m_p ⊙ Q_p), Q_g^max(t) = max(m_g ⊙ Q_g),
where ⊙ denotes the Hadamard (element-wise) product, m is the pushing mask m_p or the grasping mask m_g, and Q is the corresponding pixel-level Q-value map output by the network; meaningful pixels are screened out with the mask function, and the maximum Q value of the pushing action at time t, Q_p^max(t), and of the grasping action, Q_g^max(t), are obtained.
4. Optionally, step S3 specifically includes the following steps:
step S3-1: generating a mask image from the obtained depth-height map H_d and applying image erosion to it; for each object region, obtaining the pixel coordinates (u, v) of the centre of its circumscribed rectangle and converting them by coordinate transformation into the coordinates (x, y) in the mechanical arm coordinate system, thereby obtaining the position coordinates of all objects in the working space; the average relative distance between the objects is defined as
D_t = 2/(n(n-1)) · Σ_{i<j} ‖o_i − o_j‖,
where o_i and o_j are the positions of objects i and j, n is the number of all objects in the working space, and D_t is the average relative distance of all objects in the scene at time t;
step S3-2: combining the maximum Q value of the pushing action, Q_p^max(t), and of the grasping action, Q_g^max(t), predicted in step S2-4: if Q_p^max(t) > Q_g^max(t) the mechanical arm executes a pushing action, otherwise it executes a grasping action; the optimal action is denoted a_t;
step S3-3: in order to evaluate the effect of each candidate push on the object density in the workspace, a push reward function is designed that uses the difference of the average relative distances of all objects as its indicator: the push is rewarded when D_improved = D_t − D_{t-1} exceeds the preset threshold η (η = 0.01), since D_improved judges globally whether the mechanical arm has effectively separated the objects before and after the push, and the push at the previous moment is considered beneficial when D_improved exceeds this threshold; at the same time, to prevent the mechanical arm from pushing objects out of the working space where the camera can no longer capture them, the push is rewarded or penalised according to whether the objects remain inside the working space.
5. Optionally, step S4 specifically includes the following steps:
step S4-1: performing eye-to-hand calibration with a calibration plate and an RGB-D camera, calculating the transformation matrix of the camera, and transforming two-dimensional pixel coordinates (u, v) in the camera coordinate system into three-dimensional space coordinates (x, y, z) in the mechanical arm coordinate system;
step S4-2: transplanting the push-grasp collaborative policy network trained at the simulation end onto the physical platform; the image data acquired by the camera are input into the full convolutional network, which outputs the action type, the two-dimensional pixel coordinates (u*, v*) corresponding to the maximum Q value and the optimal rotation angle θ; combining the transformation of step S4-1, the three-dimensional coordinates (x*, y*, z*) of the mechanical arm in real space are output and the mechanical arm jaw is rotated about the z-axis by the angle θ;
step S4-3: the mechanical arm jaw is rotated to the angle θ corresponding to the feature map with the maximum Q value, the inverse kinematics solver of the mechanical arm computes the joint angles required to move the end effector to the coordinates (x*, y*, z*), and the mechanical arm is driven to execute the optimal action a_t;
Step S4-4: repeating the steps S4-1 to S4-3 until the grabbing task of all objects in the working space is completed.
6. Compared with the prior art, the invention has the following beneficial effects:
(1) To address the low sample efficiency of the mechanical arm during training, a mask function for screening effective actions is designed on the basis of the generated depth-height map, so that the mechanical arm can concentrate on effective action positions and the model learns efficiently;
(2) A new push reward function is designed: to evaluate the effect of each candidate push on the object density in the working space, the push reward function uses the difference of the average relative distances of all objects as its indicator, which guides the network output and provides sufficient space for grasping.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a diagram of a robot arm grabbing scene in a dense scene of the present invention;
FIG. 3 is a color map and a depth map obtained in the simulation environment of the present invention;
FIG. 4 is a diagram of a full convolutional network of the present invention;
FIG. 5 is a diagram illustrating a pixel level Q map generation process according to the present invention;
FIG. 6 is a push and grasp heat map of the initial output of the present invention;
FIG. 7 is a push mask and a grasp mask of the present invention;
FIG. 8 is a push and grasp heatmap of the present invention after masking function screening;
FIG. 9 is a diagram of example push and grasp rewards of the present invention;
FIG. 10 is a graph comparing the grasp success rate of the proposed method (ours) with the classical method (VPG);
fig. 11 is a block diagram of a real scene pushing and grabbing system according to the present invention.
Detailed Description
The method for grasping multiple target objects with a mechanical arm in dense scenes based on deep reinforcement learning provided by the invention is described below in detail and completely with reference to the drawings and the technical scheme in the embodiments of the invention. The detailed explanation is as follows:
FIG. 1 is an overall flow chart of the method of the present invention, comprising the following steps:
Step S1: building an object-dense scene in the virtual simulation software CoppeliaSim for training the mechanical arm to complete the grasping task;
Step S1-1: reinforcement learning learns by interacting with the environment through continuous trial and error and using preset rewards as feedback; to avoid damaging a real mechanical arm, a dense training scene is built in the virtual simulation software CoppeliaSim, a UR5 mechanical arm and an RG2 gripper are imported into the CoppeliaSim software, and a D435 camera is added to acquire object information in the working space;
Step S1-2: a 0.448 m × 0.448 m working space is set in the simulation environment, object blocks are added into this space by random dropping to simulate an unstructured dense scene, and the mechanical arm performs actions to collect samples and train the policy network; the experimental platform environment is shown in Fig. 2.
Step S2: the image data captured by the camera are orthographically transformed and rotated to generate height maps at different angles; the height maps are input into a full convolutional neural network composed of a feature extraction module DenseNet121 and two parallel networks, a pushing network and a grasping network, which outputs pixel-level Q-value maps; meaningful pixels are screened out with a mask function, and the maximum Q value of the pushing action at time t, Q_p^max(t), and of the grasping action, Q_g^max(t), are obtained. The specific steps are as follows:
Step S2-1: a RealSense D435 camera captures a 640×480 RGB image I_c and depth image I_d of the workspace; through orthographic transformation the images are converted from camera coordinates to robot coordinates and, after denoising, a 224×224 colour top view H_c and depth-height map H_d are obtained as the state s_t (the height maps are shown in Fig. 3); the feature map H_cd is generated by fusing the two images;
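As an illustration of step S2-1, the sketch below shows one way the colour top view H_c and the depth-height map H_d could be produced by projecting the RGB-D capture onto a top-down grid; the function name, the workspace tuple and the 2 mm cell size are illustrative assumptions, not code from the patent.

```python
import numpy as np

def make_heightmaps(colors, points_world, workspace, resolution=0.002):
    """Project RGB-D data onto a top-down grid: colors is (N, 3) uint8, points_world
    is (N, 3) world coordinates of the same pixels, workspace = (x_min, x_max,
    y_min, y_max, z_min). Returns the colour top view H_c and depth-height map H_d."""
    x_min, x_max, y_min, y_max, z_min = workspace
    size = int(round((x_max - x_min) / resolution))        # 0.448 m / 0.002 m = 224 cells
    H_c = np.zeros((size, size, 3), dtype=np.uint8)
    H_d = np.zeros((size, size), dtype=np.float32)

    # keep only points that fall inside the workspace
    keep = ((points_world[:, 0] >= x_min) & (points_world[:, 0] < x_max) &
            (points_world[:, 1] >= y_min) & (points_world[:, 1] < y_max))
    pts, rgb = points_world[keep], colors[keep]

    u = ((pts[:, 0] - x_min) / resolution).astype(int)     # grid column
    v = ((pts[:, 1] - y_min) / resolution).astype(int)     # grid row
    H_c[v, u] = rgb                                        # colour top view
    np.maximum.at(H_d, (v, u), pts[:, 2] - z_min)          # keep max height per cell
    return H_c, H_d
```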
Step S2-2: the feature map H_cd generated in step S2-1 is rotated clockwise 16 times in turn, each rotation by 22.5°, and affine transformation yields the 16 rotated feature maps H_cdk = {H_cd0, H_cd1, H_cd2, ..., H_cd15};
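The rotation of step S2-2 amounts to a sequence of affine warps; a small sketch (using OpenCV purely for illustration) follows. A fused map with more than four channels would be warped channel by channel.

```python
import cv2

def rotated_stack(H_cd, n_rotations=16):
    """Rotate the fused feature map H_cd clockwise n_rotations times (22.5 deg steps)
    using affine warps, returning the list [H_cd0, ..., H_cd15]."""
    h, w = H_cd.shape[:2]
    centre = (w / 2.0, h / 2.0)
    stack = []
    for k in range(n_rotations):
        angle = -k * (360.0 / n_rotations)          # negative angle = clockwise in OpenCV
        M = cv2.getRotationMatrix2D(centre, angle, 1.0)
        stack.append(cv2.warpAffine(H_cd, M, (w, h)))
    return stack
```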
Step S2-3: the 16 groups of feature graphs generated in the step S2-2 are respectively input into a feature extraction module DenseNet121 and two parallel pushing networksAnd grab network->The method comprises the steps of forming a full convolution neural network, and outputting a group of pixel-level Q value graphs corresponding to each action at t time, wherein each pixel point corresponds to the pushing or grabbing position and angle;
step S2-3-1: constructing a model-free deep Q learning framework, modeling into a feedforward full convolution network through a Q function, fitting the Q function by using the full convolution network, and enabling the full convolution neural network to be composed of a feature extraction module DenseNet121 and two parallel pushing networksAnd grab network->The same structure is shared by the grabbing network and the pushing network, wherein the structure of the full convolution network is shown in figure 4, and the structure comprises two convolution layers of batch normalization, a ReLU activation function and a convolution kernel of 1 multiplied by 1 and a double bilinear interpolation upsampling layer;
step S2-3-2: the feature atlas H obtained in the step S2-2 cdk In the input full convolution network, the color channel (RGB) of the height map and the depth channel (DDD) of the single channel copied into three channels are respectively madeFor input and output of the DenseNet module, feature stitching is performed according to channel level, and the feature extraction module outputs a 20×20×2048-dimensional feature vector I corresponding to time t t . The feature vectors are then input into the grabbing network and the pushing network respectively, and the Q value map with the total output of 32 pixel levels of 224×224 size comprises 16 prediction maps Q of pushing prediction pt ={Q p1 ,Q p2 ,...,Q p15 Predicted 16-sheet Zhang Yuce graph Q gt ={Q g1 ,Q g2 ,...,Q g15 The Q-value map of 32 pixel level map represents the predictive value of pushing or grabbing actions performed at the corresponding positions and rotation angles, the Q-value prediction process is shown in fig. 5, and the Q-value map visualization information of the output pixel level is shown in fig. 6;
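A minimal PyTorch sketch of the network shape described in steps S2-3-1 and S2-3-2 is given below: two DenseNet121 trunks for the colour (RGB) and copied-depth (DDD) inputs, channel-wise concatenation of their features, and two parallel heads built from batch normalisation, ReLU, 1×1 convolutions and bilinear upsampling. The intermediate width of 64 channels and the upsampling factor are illustrative choices rather than values stated in the patent.

```python
import torch
import torch.nn as nn
import torchvision

class PushGraspFCN(nn.Module):
    """Shared DenseNet121 feature trunks plus two parallel heads (push / grasp)
    that each output a single-channel, pixel-level Q map."""
    def __init__(self):
        super().__init__()
        self.color_trunk = torchvision.models.densenet121(weights=None).features
        self.depth_trunk = torchvision.models.densenet121(weights=None).features

        def head():
            return nn.Sequential(
                nn.BatchNorm2d(2048), nn.ReLU(inplace=True),
                nn.Conv2d(2048, 64, kernel_size=1, bias=False),
                nn.BatchNorm2d(64), nn.ReLU(inplace=True),
                nn.Conv2d(64, 1, kernel_size=1),
                nn.Upsample(scale_factor=32, mode='bilinear', align_corners=False),
            )
        self.push_head, self.grasp_head = head(), head()

    def forward(self, color, depth):
        # depth is the single channel copied to three channels (DDD) before this call
        feat = torch.cat([self.color_trunk(color), self.depth_trunk(depth)], dim=1)
        return self.push_head(feat), self.grasp_head(feat)   # two pixel-level Q maps
```

Each of the 16 rotated height maps is passed through the network once, yielding the 16 push and 16 grasp Q-value maps.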
step S2-4: in order to solve the problem that the sample efficiency of the mechanical arm is low during training, a mask function for screening effective actions is designed, so that the mechanical arm can concentrate on the effective action positions to promote efficient learning of the model;
Step S2-4-1: based on the depth-height map H_d generated in the top-view direction, a pixel is set to 255 if its height value is greater than the lowest height of the working space and to 0 otherwise, which yields the push mask m_p; on the basis of the push mask, the pixel coordinates of the centre of the rectangle circumscribing each object region are obtained, a circle of radius two pixels is drawn around each such centre, all pixels inside these circles are set to 255 and the remaining pixels to 0, which yields the grasp mask m_g; the visualised mask maps are shown in Fig. 7;
step S2-4-2: screening the Q value mapping diagram output by the full convolution network by using the mask diagram generated in the step S2-4-1, and calculating the maximum Q values corresponding to the pushing action and the grabbing action respectively:
Q_p^max(t) = max(m_p ⊙ Q_p), Q_g^max(t) = max(m_g ⊙ Q_g),
where ⊙ denotes the Hadamard (element-wise) product, m is the pushing mask m_p or the grasping mask m_g, and Q is the corresponding pixel-level Q-value map output by the network; meaningful pixels are screened out with the mask function, yielding the maximum Q value of the pushing action at time t, Q_p^max(t), and of the grasping action, Q_g^max(t); the screening results and their visualisation are shown in Fig. 8.
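The following sketch illustrates steps S2-4-1 and S2-4-2: building the push and grasp masks from the depth-height map and taking the Hadamard-masked maximum over the stack of rotated Q maps. Locating the circumscribed-rectangle centres with OpenCV connected components is an illustrative choice.

```python
import numpy as np
import cv2

def build_masks(H_d, floor_height=0.0):
    """Push mask: any pixel above the workspace floor. Grasp mask: 2-pixel-radius
    circles around the centre of each object's circumscribed rectangle."""
    m_p = np.where(H_d > floor_height, 255, 0).astype(np.uint8)
    m_g = np.zeros_like(m_p)
    n, _, stats, _ = cv2.connectedComponentsWithStats(m_p)
    for i in range(1, n):                        # label 0 is the background
        x, y, w, h, _ = stats[i]
        cv2.circle(m_g, (int(x + w // 2), int(y + h // 2)), 2, 255, thickness=-1)
    return m_p, m_g

def masked_max_q(Q_maps, mask):
    """Hadamard product of the (16, H, W) Q stack with the 0/1 mask, then global max.
    Returns (max Q value, (rotation index k, row v, column u))."""
    m = (mask > 0).astype(np.float32)
    masked = Q_maps * m
    idx = np.unravel_index(np.argmax(masked), masked.shape)
    return masked[idx], idx
```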
Step S3: an action is selected according to the maximum Q value in the current state, and the push reward value r_p is determined by whether the difference D_improved of the average relative distances of all objects in the scene before and after the push is greater than a threshold. The specific steps are as follows:
Step S3-1: based on the generated depth-height map H_d, a pixel is set to 255 if its height value is greater than the lowest height of the working space and to 0 otherwise, generating a mask map; the mask map is processed with image erosion, the pixel coordinates (u, v) of the centre of the circumscribed rectangle of each object region are obtained, and the coordinates (x, y) in the mechanical arm coordinate system are calculated from these pixel coordinates by the coordinate transformation
x = w_0 + hr·u, y = w_1 + hr·v,
where hr is the image resolution (each pixel of the acquired image covers about 4 mm², so hr = 0.002 m per pixel) and w_0, w_1 are respectively the minimum values along the x-axis and y-axis of the global coordinate system in the working space;
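Since the displayed transformation formula is not reproduced in this text, the sketch below spells out the linear pixel-to-robot mapping implied by hr, w_0 and w_1; the workspace minima used as defaults are illustrative values only.

```python
def pixel_to_robot(u, v, hr=0.002, w0=-0.224, w1=-0.448):
    """Map a height-map pixel (u, v) to robot-frame coordinates (x, y): hr is the
    metres-per-pixel resolution, (w0, w1) are the workspace minima along the x and y
    axes of the global frame (the default values here are illustrative)."""
    return w0 + hr * u, w1 + hr * v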
step S3-2: acquiring position coordinates of all objects in a working space, wherein the average relative distance between the objects is defined as:
D_t = 2/(n(n-1)) · Σ_{i<j} ‖o_i − o_j‖,
where o_i and o_j are the positions of objects i and j, n is the number of all objects in the working space, and D_t is the average relative distance of all objects in the scene at time t;
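A direct implementation of the average relative distance D_t as the mean distance over all object pairs (the reading used above) is sketched below.

```python
import numpy as np

def average_relative_distance(positions):
    """D_t: mean pairwise distance between the n objects in the workspace;
    positions is an (n, 2) or (n, 3) array of object coordinates."""
    n = len(positions)
    if n < 2:
        return 0.0
    dists = [np.linalg.norm(positions[i] - positions[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# D_improved = average_relative_distance(objects_after) - average_relative_distance(objects_before)
```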
Step S3-3: the optimal action is denoted a_t, with a = <m, d, θ>, where m is the action type of the mechanical arm, m ∈ {push, grasp} for a pushing or grasping action respectively, d = (x, y, z) is the three-dimensional coordinate of the action point of the mechanical arm, and θ is the rotation angle of the arm's end effector about the z-axis; (u, v) is the optimal pixel predicted by the Q-value map, θ is the optimal rotation angle predicted by the Q-value map, (x, y) is obtained from the optimal pixel (u, v) by the coordinate transformation of step S3-1, and z is the height value at pixel (u, v) of the depth-height map;
Step S3-4: combining the maximum Q value of the pushing action, Q_p^max(t), and of the grasping action, Q_g^max(t), predicted in step S2-4-2: if Q_p^max(t) > Q_g^max(t) the mechanical arm performs a pushing action, otherwise a grasping action. The action types are defined as follows. Pushing: the mechanical arm jaw is closed, placed at the predicted optimal push point, rotated to the optimal rotation angle of the Q-value map, and pushed 10 cm along a straight line in the optimal direction predicted by the Q-value map. Grasping: the mechanical arm jaw is fully opened, rotated to the optimal rotation angle of the Q-value map, moved to the optimal grasp point, lowered to a depth of 3 cm and closed to perform the grasp;
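Steps S3-2 to S3-4 reduce to comparing the two masked maxima and decoding the winning pixel and rotation index into the action tuple a_t = <m, d, θ>; a sketch building on the masked_max_q helper above follows.

```python
def select_action(push_best, grasp_best, n_rotations=16):
    """push_best / grasp_best are (q_value, (k, v, u)) pairs from masked_max_q;
    choose push when its best Q value is larger, otherwise grasp, and decode the
    rotation index k into the end-effector angle theta about the z-axis."""
    if push_best[0] > grasp_best[0]:
        m, (k, v, u) = 'push', push_best[1]
    else:
        m, (k, v, u) = 'grasp', grasp_best[1]
    theta = k * (360.0 / n_rotations)
    return m, (u, v), theta   # (x, y) follow from pixel_to_robot, z from H_d[v, u]
```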
Step S3-5: in order to evaluate the effect of each candidate push on the object density in the workspace, a push reward function is designed that uses the difference of the average relative distances of all objects as its indicator: the push is rewarded when D_improved = D_t − D_{t-1} exceeds the preset threshold η (set to η = 0.01), since D_improved judges globally whether the mechanical arm has effectively separated the objects before and after the push, and the push at the previous moment is considered beneficial when D_improved exceeds this threshold; at the same time, to prevent the mechanical arm from pushing objects out of the working space where the camera can no longer capture them, the push is rewarded or penalised according to whether the objects remain inside the working space;
In the simulation software, when the robot executes a grasping action the end effector first reaches the target position, then closes the jaw to complete the grasp and returns to the initial set position; if the distance between the jaw fingers is then greater than zero, the grasp is counted as successful, whereas if no object was grasped the fingers close to zero distance; the grasp reward function is set accordingly, rewarding the grasp only when the jaw opening remains greater than zero.
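The reward logic of step S3-5 and the grasp-success criterion just described can be sketched as follows; the 0.5 and 1.5 values follow the Fig. 9 example, while the penalty returned when an object leaves the workspace is an assumed value, since the text only states that a penalty is applied.

```python
def push_reward(d_improved, all_objects_inside, eta=0.01):
    """r_p: 0.5 when the push increases the average relative distance by more than eta
    (value per the Fig. 9 example); an assumed penalty of -0.5 when an object has been
    pushed out of the workspace; 0 otherwise."""
    if not all_objects_inside:
        return -0.5
    return 0.5 if d_improved > eta else 0.0

def grasp_reward(jaw_opening):
    """r_g: 1.5 when the closed jaw still has a non-zero opening after returning to the
    initial position, i.e. an object is held between the fingers; 0 otherwise."""
    return 1.5 if jaw_opening > 0.0 else 0.0
```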
Step S3-6: after the mechanical arm executes the optimal action a_t, the working scene changes correspondingly; steps S2-1 and S2-2 are executed again to acquire the height map at time t+1 as the state s_{t+1};
Step S3-7: according to the state s_{t+1} after the action is executed, the reward functions r_g and r_p are used to evaluate whether the action a_t has a positive effect on the multi-target grasping task, and the reward value r_t is calculated; an example of reward values is shown in Fig. 9: if an object is grasped successfully the reward is r_t = 1.5, and if objects are effectively separated by a push the reward is r_t = 0.5;
Step S3-8: the state-transition quadruple (s_t, a_t, s_{t+1}, r_t) is stored in the sample experience pool;
Step S3-9: from the reward value r_t calculated in step S3-7, the target value is computed as y_t = r_t + γ · Q(s_{t+1}, argmax_a Q(s_{t+1}, a; w_t); w_{t-1}), where γ ∈ [0, 1] is the discount factor reflecting how strongly future Q values influence the present (set to γ = 0.5), and w_t and w_{t-1} are respectively the network weights at times t and t−1;
step S3-10: the strategy network is trained by minimizing the Huber loss function:
L_t = ½(Q̂_t − Q_t)² if |Q̂_t − Q_t| ≤ 1, and L_t = |Q̂_t − Q_t| − ½ otherwise,
where Q̂_t is the estimated value output by the policy network at time t, Q_t is the true (target) value at time t computed from the estimated Q value at time t+1, and α is the deep-learning learning rate, which affects the step size and speed of policy-network convergence and is set to α = 0.0001;
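Because the displayed target and loss formulas are not reproduced in this text, the sketch below shows a standard deep Q-learning target with γ = 0.5 and a smooth-L1 (Huber) update with learning rate α = 1e-4, consistent with the surrounding definitions; the exact form used in the patent may differ.

```python
import torch
import torch.nn.functional as F

def td_target(r_t, q_next_max, gamma=0.5):
    """Target value y_t = r_t + gamma * max_a Q(s_{t+1}, a): the 'true' value at
    time t built from the reward and the best Q value estimated at time t+1."""
    return r_t + gamma * q_next_max

def train_step(q_pred, y_t, optimizer):
    """One policy-network update: Huber (smooth L1) loss between the predicted Q value
    of the executed action and the target y_t, followed by a gradient step."""
    target = torch.as_tensor(y_t, dtype=q_pred.dtype, device=q_pred.device).detach()
    loss = F.smooth_l1_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: optimizer = torch.optim.SGD(net.parameters(), lr=1e-4, momentum=0.9)
```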
Step S3-11: the pushing network and the grasping network are updated iteratively by minimising the Huber loss; training is stopped when the loss function converges and the grasp success rate curve stabilises. After repeated training, the new method reaches a success rate of approximately 94% for grasping multiple target objects in a dense scene; the comparison with the grasp success rate of the classical method VPG is shown in Fig. 10, and the method completes more accurate grasps with higher action efficiency than VPG.
Step S4: a dense grasping environment is built in a real scene, the push-grasp collaborative policy network trained at the simulation end is applied to the physical platform, and the mechanical arm makes decisions according to the action selection rule in the real environment to complete the grasping task. The specific steps are as follows:
Step S4-1: the real mechanical arm grasping system is shown in Fig. 11; eye-to-hand calibration is performed with a calibration plate and the RGB-D camera, the transformation matrix of the camera is calculated, and two-dimensional pixel coordinates (u, v) in the camera coordinate system are transformed into three-dimensional space coordinates (x, y, z) in the mechanical arm coordinate system;
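One way the calibration result of step S4-1 could be applied is sketched below: back-project the pixel through pinhole intrinsics K and map the camera-frame point into the robot base frame with the 4×4 matrix obtained from eye-to-hand calibration; the pinhole model and variable names are assumptions for illustration.

```python
import numpy as np

def pixel_to_robot_frame(u, v, depth, K, T_cam_to_robot):
    """Back-project pixel (u, v) with depth (metres) through the pinhole intrinsics K,
    then transform the camera-frame point into the robot base frame using the 4x4
    extrinsic matrix obtained from eye-to-hand calibration."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    p_cam = np.array([(u - cx) * depth / fx,
                      (v - cy) * depth / fy,
                      depth,
                      1.0])
    x, y, z, _ = T_cam_to_robot @ p_cam
    return x, y, z
```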
Step S4-2: a dense grasping environment is built in the real scene, and the push-grasp collaborative policy network trained at the simulation end is transplanted onto the physical platform; the image data acquired by the camera are input into the full convolutional network, which outputs the optimal action a_t with its action type, two-dimensional pixel coordinates (u*, v*) and optimal rotation angle θ*; combining the transformation of step S4-1, the action type, the three-dimensional coordinates (x*, y*, z*) of the mechanical arm in real space and the rotation angle θ* of the mechanical arm jaw about the z-axis are output;
Step S4-3: in a real scene, the action process of the mechanical arm is the same as that of a simulation environment, and the clamping jaw of the mechanical arm is controlled to rotate to an angle corresponding to a characteristic diagram with the maximum Q valueCalculating the motion of the tail end of the mechanical arm to the coordinate +.>The angles required by the rotation of all joints drive the mechanical arm to execute the optimal action a t
Step S44: repeating the steps S4-1 to S4-3 until the grabbing task of all objects in the working space is completed;
because the simulation end has a larger gap with the actual reality environment, after multiple tests of the real scene, the real end grabs multiple target objects in the dense scene and is lowered compared with the simulation end to about 90%, and the method provided by the invention is proved to be feasible.
The foregoing is a detailed description of the invention in connection with specific embodiments, and the invention is not limited to the embodiments described above. For those skilled in the related technical field, changes made within the original scope of the invention also fall within the scope of protection of the invention.

Claims (5)

1. A method for capturing multiple target objects by a dense scene mechanical arm based on deep reinforcement learning, comprising the following steps:
step S1: building an object-dense scene in the virtual simulation software CoppeliaSim for training the mechanical arm to complete the grasping task;
step S2: the image data captured by the camera are orthographically transformed and rotated to generate height maps at different angles; the height maps are input into a full convolutional neural network composed of a feature extraction module DenseNet121 and two parallel networks, a pushing network and a grasping network, which outputs pixel-level Q-value maps; meaningful pixels are screened out with a mask function, and the maximum Q value of the pushing action at time t, Q_p^max(t), and the maximum Q value of the grasping action, Q_g^max(t), are obtained;
step S3: selecting an action according to the maximum Q value in the current state, wherein the push reward value r_p is determined by whether the difference D_improved of the average relative distances of all objects in the scene before and after the push is greater than a threshold;
step S4: building a dense grasping environment in a real scene, applying the push-grasp collaborative policy network trained at the simulation end to the physical platform, and having the mechanical arm make decisions according to the action selection rule in the real environment to complete the grasping task.
2. The method for capturing multiple objects by using the dense scene manipulator based on deep reinforcement learning according to claim 1, wherein the step S1 specifically comprises the following steps:
step S1-1: adding a mechanical arm model and a two-finger gripper model in the CoppeliaSim simulation environment;
step S1-2: adding an RGB-D camera in the CoppeliaSim simulation environment and setting the working space in which the mechanical arm grasps;
step S1-3: adding simulated training objects, by random dropping, into the working space set in step S1-2.
3. The method for capturing multiple objects by using the dense scene manipulator based on deep reinforcement learning according to claim 1, wherein the step S2 specifically comprises the following steps:
step S2-1: capturing an RGB image I_c and a depth image I_d of the workspace with an RGB-D camera, obtaining a colour top view H_c and a depth-height map H_d as the state s_t through orthographic transformation and denoising, and generating a feature map H_cd by fusing the two images;
step S2-2: rotating the feature map H_cd generated in step S2-1 clockwise 16 times in turn, each rotation by 22.5°, and obtaining the 16 rotated feature maps H_cdk through affine transformation;
step S2-3: inputting the 16 groups of feature maps generated in step S2-2 into the full convolutional neural network composed of the feature extraction module DenseNet121 and two parallel networks, a pushing network and a grasping network, and outputting, for each action at time t, a group of pixel-level Q-value maps in which each pixel corresponds to a pushing or grasping position and angle;
step S2-4: to address the low sample efficiency of the mechanical arm during training, a mask function for screening effective actions is designed so that the mechanical arm can concentrate on effective action positions and the model learns efficiently; based on the depth-height map H_d, a grasping mask m_g and a pushing mask m_p are generated, the Q-value maps output by the full convolutional network are screened with the masks, and the maximum Q values of the pushing and grasping actions are calculated as
Q_p^max(t) = max(m_p ⊙ Q_p), Q_g^max(t) = max(m_g ⊙ Q_g),
where ⊙ denotes the Hadamard (element-wise) product, m is the pushing mask m_p or the grasping mask m_g, and Q is the corresponding pixel-level Q-value map output by the network; meaningful pixels are screened out with the mask function, and the maximum Q value of the pushing action at time t, Q_p^max(t), and of the grasping action, Q_g^max(t), are obtained.
4. The method for capturing multiple objects by using the dense scene manipulator based on deep reinforcement learning according to claim 1, wherein the step S3 specifically comprises the following steps:
step S3-1: generating a mask image from the obtained depth-height map H_d and applying image erosion to it; for each object region, obtaining the pixel coordinates (u, v) of the centre of its circumscribed rectangle and converting them by coordinate transformation into the coordinates (x, y) in the mechanical arm coordinate system, thereby obtaining the position coordinates of all objects in the working space; the average relative distance between the objects is defined as
D_t = 2/(n(n-1)) · Σ_{i<j} ‖o_i − o_j‖,
where o_i and o_j are the positions of objects i and j, n is the number of all objects in the working space, and D_t is the average relative distance of all objects in the scene at time t;
step S3-2: combining the maximum Q value of the pushing action, Q_p^max(t), and of the grasping action, Q_g^max(t), predicted in step S2-4: if Q_p^max(t) > Q_g^max(t) the mechanical arm executes a pushing action, otherwise it executes a grasping action; the optimal action is denoted a_t;
step S3-3: in order to evaluate the effect of each candidate push on the object density in the workspace, a push reward function is designed that uses the difference of the average relative distances of all objects as its indicator: the push is rewarded when D_improved = D_t − D_{t-1} exceeds the preset threshold η (η = 0.01), since D_improved judges globally whether the mechanical arm has effectively separated the objects before and after the push, and the push at the previous moment is considered beneficial when D_improved exceeds this threshold; at the same time, to prevent the mechanical arm from pushing objects out of the working space where the camera can no longer capture them, the push is rewarded or penalised according to whether the objects remain inside the working space.
5. The method for capturing multiple objects by using the dense scene manipulator based on deep reinforcement learning according to claim 1, wherein the step S4 specifically comprises the following steps:
step S4-1: performing eye-to-hand calibration with a calibration plate and an RGB-D camera, calculating the transformation matrix of the camera, and transforming two-dimensional pixel coordinates (u, v) in the camera coordinate system into three-dimensional space coordinates (x, y, z) in the mechanical arm coordinate system;
step S4-2: transplanting the push-grasp collaborative policy network trained at the simulation end onto the physical platform; the image data acquired by the camera are input into the full convolutional network, which outputs the action type, the two-dimensional pixel coordinates (u*, v*) corresponding to the maximum Q value and the optimal rotation angle θ; combining the transformation of step S4-1, the three-dimensional coordinates (x*, y*, z*) of the mechanical arm in real space are output and the mechanical arm jaw is rotated about the z-axis by the angle θ;
step S4-3: the mechanical arm jaw is rotated to the angle θ corresponding to the feature map with the maximum Q value, the inverse kinematics solver of the mechanical arm computes the joint angles required to move the end effector to the coordinates (x*, y*, z*), and the mechanical arm is driven to execute the optimal action a_t;
Step S4-4: repeating the steps S4-1 to S4-3 until the grabbing task of all objects in the working space is completed.
CN202310744751.3A 2023-06-21 2023-06-21 Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning Pending CN116852353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310744751.3A CN116852353A (en) 2023-06-21 2023-06-21 Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310744751.3A CN116852353A (en) 2023-06-21 2023-06-21 Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116852353A true CN116852353A (en) 2023-10-10

Family

ID=88229558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310744751.3A Pending CN116852353A (en) 2023-06-21 2023-06-21 Method for capturing multi-target object by dense scene mechanical arm based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116852353A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117697768A (en) * 2024-02-05 2024-03-15 季华实验室 Target grabbing method, robot, electronic equipment and storage medium
CN117697768B (en) * 2024-02-05 2024-05-07 季华实验室 Target grabbing method, robot, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination