CN111331607A - Automatic grabbing and stacking method and system based on mechanical arm - Google Patents

Automatic grabbing and stacking method and system based on mechanical arm

Info

Publication number
CN111331607A
Authority
CN
China
Prior art keywords
grabbing, stacking, network, palletizing, area
Prior art date
Legal status
Granted
Application number
CN202010260136.1A
Other languages
Chinese (zh)
Other versions
CN111331607B (en)
Inventor
张伟
张钧皓
宋然
马林
李贻斌
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010260136.1A priority Critical patent/CN111331607B/en
Publication of CN111331607A publication Critical patent/CN111331607A/en
Application granted granted Critical
Publication of CN111331607B publication Critical patent/CN111331607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1679 Programme controls characterised by the tasks executed
    • B25J9/1687 Assembly, peg and hole, palletising, straight line, weaving pattern movement
    • B25J9/1694 Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697 Vision controlled systems
    • B25J19/00 Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J19/02 Sensing devices
    • B25J19/04 Viewing devices
    • B65 CONVEYING; PACKING; STORING; HANDLING THIN OR FILAMENTARY MATERIAL
    • B65G TRANSPORT OR STORAGE DEVICES, e.g. CONVEYORS FOR LOADING OR TIPPING, SHOP CONVEYOR SYSTEMS OR PNEUMATIC TUBE CONVEYORS
    • B65G61/00 Use of pick-up or transfer devices or of manipulators for stacking or de-stacking articles not otherwise provided for

Abstract

The invention discloses an automatic grabbing and stacking method and system based on a mechanical arm. Images of the grabbing area and the stacking area of the objects to be stacked are acquired and input into an automatic grabbing and stacking network; the network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy. Combined with deep reinforcement learning, the automatic grabbing and stacking network adopts an optimal strategy that maximizes the expected sum of future rewards; the mechanical arm selects and grabs the required objects in the grabbing area according to the prediction result and places them at positions appropriate to the current and future states of the stacking area. With this technical scheme, the grabbing and stacking network (GSN) learns the grabbing strategy and the stacking strategy at the same time, so that the mechanical gripper can pick up objects to be stacked from the table and correctly stack them at a proper position.

Description

Automatic grabbing and stacking method and system based on mechanical arm
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to an automatic grabbing and stacking method and system based on a mechanical arm.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In the last decades, the gripping action of robot arms has reached a high level of precision in highly ordered environments such as automobile assembly and welding. In many task scenarios, however, a robotic arm system must handle unpredictable objects: a cluttered desktop can cause the gripping system to fail completely, and even when grabbing succeeds, a preset fixed stacking position can cause objects of different shapes to collide. Accordingly, the expanding retail industry urgently needs intelligent palletizing systems that can be used in warehouses.
The objective of reinforcement learning is to train an agent, through interaction with the environment, to maximize the expected value of the cumulative future reward; this corresponds to policy optimization in a Markov decision process (MDP). The Markov decision process can be represented by the tuple M = (S, G, A, r, γ), where s ∈ S is the defined state space, g ∈ G is the list of possible targets, a ∈ A is the action space, r is the state reward function, and γ ∈ (0, 1) is a discount factor.
Traditional reinforcement learning methods such as tabular reinforcement learning suffer from the "curse of dimensionality" when high-dimensional state and action spaces are encountered, which was previously difficult to resolve. With the rise of deep learning in recent years, combining deep neural networks with reinforcement learning has become an important means of overcoming this curse of dimensionality. Using deep neural networks, states can also be represented as images, which makes it more convenient to solve visual problems with reinforcement learning techniques.
At present, deploying deep reinforcement learning on a real robot, particularly a mechanical arm, remains difficult. This is mainly because reinforcement learning is essentially a continuous trial-and-error method that requires a large number of experiments; a real mechanical arm is easily damaged by extensive experimentation and needs a long time to collect samples. In addition, the action dimension of a mechanical arm is very high; for example, the UR5 has 6 joints, i.e., 6 degrees of freedom, which makes the robot difficult to control during learning.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an automatic grabbing and stacking method and system based on a mechanical arm, so that an intelligent agent can autonomously choose to grab objects and neatly stack them in another area relying only on visual state input.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
an automatic grabbing and stacking method based on a mechanical arm, comprising the following steps:
acquiring images of a stacking area and a grabbing area of an object to be stacked, and inputting the images into an automatic grabbing and stacking network;
the automatic grabbing and stacking network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy;
when the automatic grabbing and stacking network is combined with deep reinforcement learning, an optimal strategy that maximizes the expected sum of future rewards is adopted;
and the mechanical arm selects and grabs the required objects in the grabbing area according to the prediction result and places them at positions appropriate to the current and future states.
According to a further technical scheme, the automatic grabbing and stacking network comprises a grabbing network and a stacking network, the grabbing position and the stacking position are respectively predicted, and the features of the stacking-area image and the features of the grabbing-area image are fused so that information about the stacking area is transmitted to the grabbing network.
According to a further technical scheme, when the automatic grabbing and stacking network learns the grabbing strategy and the stacking strategy, auxiliary training is carried out based on task-related information, comprising:
predicting the number of objects left in the grabbing area by using the features extracted from the grabbing network sensing layer;
predicting a height of the pile at a pixel level using information obtained from a stacking network aware layer;
item-centered feature learning, to ensure that items disappearing from the grabbing area are similar, at the feature level, to items added to the stack.
According to a further technical scheme, the automatic grabbing and stacking network learns, using distributed prioritized experience replay, to grab articles of different sizes and stack them tightly in the stacking area.
In a further technical scheme, when the automatic grabbing and stacking network is combined with deep reinforcement learning, an optimal strategy that maximizes the expected sum of future rewards is adopted, selecting and grabbing the required objects in the grabbing area and stacking them at positions appropriate to the current and future states.
According to a further technical scheme, before the image of the grabbing area of the objects to be stacked is input into the automatic grabbing and stacking network, the image is processed: the 3-channel color data are combined with the depth data, orthographically projected to the top-down view, and rotated counterclockwise by different angles to generate new views.
According to a further technical scheme, for the representation of the stacking state of the stacking area, RGB images taken by a camera facing the stacking area are used.
According to a further technical scheme, two Q functions are modeled by the grabbing network and the stacking network: at each time step, the grabbing network evaluates the grabbing Q function for each pixel in the grabbing state, and the stacking network evaluates the stacking Q function for each position unit in the stacking state of the object.
According to a further technical scheme, the grabbing network and the stacking network extract features from raw image data; for the convolutional layers in the grabbing network and the stacking network, the high-level features of the object stacking state generated by the convolutional layers in the stacking network are fused with the high-level features of the grabbing state generated by the convolutional layers in the grabbing network;
for the grabbing network, the fused low-level features are processed by two convolutional layers and then fed into a bilinear upsampling layer; the same features are also used to predict the number of objects on the table through a global average pooling layer followed by an activation function and a linear layer.
A mechanical-arm grabbing and stacking system based on deep reinforcement learning comprises:
the mechanical arm, the camera and the control system;
the camera collects images of a grabbing area and a stacking area where objects to be stacked are placed, and inputs the images to an automatic grabbing and stacking network of the control system;
the automatic grabbing and stacking network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy;
and the mechanical arm selects and carries out grabbing according to the prediction result and then stacks the objects to be stacked.
The above one or more technical solutions have the following beneficial effects:
according to the technical scheme, the Grabbing and Stacking Network (GSN) learns the grabbing strategy and the stacking strategy at the same time, so that the mechanical clamp can pick up objects to be stacked from the table and correctly stack the objects at a proper position.
The present disclosure uses information obtained from a stacking network (SNet) aware layer to predict the height of a pile at the pixel level. This task helps the network to extract the stack's profile features, which contain useful information to evaluate the current state. Another is an item-centric feature learning task to ensure that items missing from the gripping area are similar to items added to the stack at a feature level. That is, ensuring that the item features captured from different perspectives (desktop image and image of pile) are close.
The present disclosure formulates the entire grab-and-place process as a Q-learning problem. The technical scheme of the application uses distributed prioritized experience replay to learn a strategy by which boxes of different sizes can be grabbed and closely stacked on a platform. Experiments were performed both in a simulation environment and in the real world to verify the effectiveness of the proposed method.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a schematic diagram of a system according to an embodiment of the present invention;
fig. 2 is a schematic diagram of network learning according to an embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
The embodiment discloses an automatic grabbing and stacking method based on mechanical arms, which comprises the following steps:
acquiring images of a stacking area and a grabbing area of an object to be stacked, and inputting the images into an automatic grabbing and stacking network;
the automatic grabbing and stacking network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy;
and the mechanical arm selects and carries out grabbing according to the prediction result and then stacking the objects to be stacked.
The mechanical-arm grabbing and stacking method based on deep reinforcement learning is a model-embedded method based on the DQN algorithm, enabling an intelligent agent to autonomously choose to grab an object and neatly stack it in another area relying only on visual state input. The multi-object grabbing and stacking task can thus be handled in an end-to-end manner.
The automatic grabbing and stacking network (GSN) proposed in the above embodiment consists of two parts: the grabbing network (GNet) and the stacking network (SNet), which predict the grabbing position and the stacking position separately. To convey information about the stacking area to the GNet, the features of the stacking-area photo and the features of the desktop photo are fused. Thus the GNet can consider not only which item is easy to pick up, but also which item is needed in the stacking area.
By learning the grabbing and stacking strategies simultaneously through the grabbing and stacking network (GSN), the mechanical gripper can pick a box from the table and stack it correctly on the platform.
To speed up the learning process, let the network focus more on task-related information, and provide additional training signals, three auxiliary tasks are introduced. The first is a desktop object-count prediction task, which uses features extracted from the GNet perception layer to predict how many items remain on the table. The second is the pile-height prediction task, which uses information obtained from the SNet perception layer to predict the height of the pile at the pixel level; this task helps the network extract the contour features of the stack, which contain useful information for evaluating the current state. The last is an item-centered feature learning task, which ensures that items disappearing from the desktop are similar, at the feature level, to items added to the stack; that is, the item features captured from different perspectives (the desktop image and the image of the pile) should be close.
The entire grabbing and stacking process is formulated as a Q-learning problem. Distributed prioritized experience replay is used to learn strategies that can grab boxes of different sizes and place them tightly on the platform. Experiments were performed both in a simulation environment and in the real world to verify the effectiveness of the proposed method.
In a specific implementation, the operation is performed using a robotic arm system equipped with a two-finger gripper. The manipulation process can be represented as a Markov decision process (MDP). At time step t with state s_t, the system photographs the stacking area and the grabbing area with two cameras, respectively. The robot then selects and executes an action a_t (grabbing and stacking an object) following a policy π_θ(s_t) parameterized by θ, which can be learned by training the deep network. The state is then updated to s_{t+1} with an immediate reward r(s_t, a_t). After training, an optimal policy can be found:
π* = argmax_θ E[ Σ_{t=1}^{T} γ^t r(s_t, a_t) ],

that is, to solve the reinforcement learning problem, the policy maximizes, by adjusting θ, the expected sum of future rewards over t = 1, 2, …, T, with discount factor γ.
The above framework provides a solution to such decision-making problems, but training is difficult because data collection is hard, and collecting a large amount of experience is critical to the performance of a reinforcement learning network. Compared with on-policy learning, off-policy learning can reuse the collected data multiple times, which helps training when data collection is difficult. To train the network efficiently, an off-policy Q-learning algorithm is employed, estimating the Q function by minimizing the Bellman error:
L = E[ ( r(s_t, a_t) + γ max_{a′} Q_θ(s_{t+1}, a′) − Q_θ(s_t, a_t) )² ]
after training, this strategy will act on the value function by maximizing the optimal state
Figure BDA0002438975820000072
(st,at) To select operations to form an optimal strategy
Figure BDA0002438975820000071
In other words, it will choose to be in state stAction a of generating the maximum jackpottI.e. to select the required stacking areaAnd place their grab bar in the appropriate position in the current and future states.
The grabbing state s_gt is modeled by a 4-channel RGB-D image of the table: the 3-channel color data are combined with the depth data, orthographically projected to the top-down view, and rotated counterclockwise by angles n × 22.5°, n ∈ {0, 1, 2, …, 7}; this strategy generates eight new views. For the stacking state representation s_st, an RGB image taken by the camera facing the stacking area is used; the stacking state can be fully represented by a 2-dimensional RGB picture.
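A minimal sketch of this eight-view construction, assuming numpy/scipy and pre-computed top-down projections `rgb_topdown` (H×W×3) and `depth_topdown` (H×W); the array names and bilinear interpolation are illustrative choices, not taken from the patent:

```python
import numpy as np
from scipy.ndimage import rotate

def build_grasp_states(rgb_topdown: np.ndarray, depth_topdown: np.ndarray):
    """Stack color and depth into a 4-channel top-down image and
    produce eight views rotated counterclockwise by n * 22.5 degrees."""
    rgbd = np.concatenate([rgb_topdown, depth_topdown[..., None]], axis=-1)  # H x W x 4
    views = []
    for n in range(8):
        # reshape=False keeps the original H x W frame; order=1 is bilinear
        views.append(rotate(rgbd, angle=n * 22.5, reshape=False, order=1))
    return views  # one view per discrete wrist rotation
```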
The representation of a grabbing action is defined as a_gt and that of a stacking action as a_st; one action is generated at each time step. The grabbing action a_gt contains a Cartesian motion command [x_g, y_g, z_g, θ_g], where [x_g, y_g, z_g] corresponds to the center of the gripper during grabbing and θ_g is the rotation angle of the wrist around the z-axis; the technical scheme divides 180° into 8 discrete θ_g rotations. For the stacking action a_st, the stacking area is divided into 14 positions along the x-axis, denoted s_i ∈ [0, 13]. These cells also represent the center of the palletized object, since after training the network learns to grab objects at their centers. Since most objects in the task are 3 cells wide, neither the leftmost nor the rightmost cell is contained in the operating space (an object placed at the edge would be partly out of view). The mechanical arm stacks the object not only at the x-coordinate f_x(s_i) (f_x being the discrete function mapping s_i to an x coordinate) but also at a z-coordinate inferred from s_st. The other commands (e.g., the y-coordinate and the gripper orientation) are fixed during the stacking operation, which simplifies the palletizing problem and facilitates stacking boxes densely.
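This discretization can be illustrated as follows; the cell width and workspace origin are hypothetical values, since the text only fixes the counts (8 rotations, 14 cells) and the exclusion of the edge cells:

```python
NUM_ROTATIONS = 8      # 180 degrees split into 8 discrete wrist angles theta_g
NUM_CELLS = 14         # stacking positions s_i in [0, 13] along the x axis
CELL_WIDTH = 0.03      # hypothetical cell width in meters (not given in the text)

def theta_from_index(rotation_index: int) -> float:
    """Discrete wrist rotation about the z axis for a grasp action."""
    return rotation_index * (180.0 / NUM_ROTATIONS)

def f_x(s_i: int, x_origin: float = 0.0) -> float:
    """Discrete mapping from stacking cell s_i to an x coordinate. The leftmost
    and rightmost cells are excluded because most boxes span 3 cells."""
    if not 1 <= s_i <= NUM_CELLS - 2:
        raise ValueError("edge cells are outside the operating space")
    return x_origin + s_i * CELL_WIDTH
```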
The reward in reinforcement learning comprises two parts: one evaluates the effect of stacking and the other evaluates the effect of grabbing. Because the stacked boxes should be packed tightly and level at the top, the technical scheme defines the stacking reward r_s as follows:

r_s = B⁻ − H⁺ − O⁺ − L,

where B⁻ denotes the reduction in bumpiness (evaluated by computing the variance of the column heights, with thresholds set to 0.3, 0 and 1), H⁺ denotes the increase of the maximum height (threshold set to 0 or 0.7), and O⁺ denotes the number of newly created holes; once a gap is covered, it becomes a hole that cannot be refilled. L is a binary value indicating whether the top of the stack is entirely level. These four values can be calculated by comparing s_{pt+1} and s_{pt}; the first three are different piecewise functions whose inputs are the images s_{pt+1} and s_{pt}, guiding the policy toward highly adaptable stacking. Inspired by early-termination strategies, if the learned policy fails to achieve a level stack, the metric L takes effect and a restart signal is sent, since status rewards at later times may be inaccurate.
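A hedged sketch of this stacking reward, computed from occupancy grids of the stacking area before and after a placement; the piecewise thresholds mentioned above are only partially legible in the source, so the raw quantities are combined directly here:

```python
import numpy as np

def column_heights(occ: np.ndarray) -> np.ndarray:
    """Per-column stack heights from a z-by-x occupancy grid (row 0 = bottom)."""
    heights = np.zeros(occ.shape[1])
    for j, col in enumerate(occ.T):
        filled = np.nonzero(col)[0]
        if filled.size:
            heights[j] = filled.max() + 1
    return heights

def count_holes(occ: np.ndarray) -> int:
    """Empty cells lying below the top filled cell of their column."""
    holes = 0
    for col in occ.T:
        filled = np.nonzero(col)[0]
        if filled.size:
            holes += int(np.sum(col[: filled.max() + 1] == 0))
    return holes

def stacking_reward(occ_before: np.ndarray, occ_after: np.ndarray) -> float:
    """Sketch of r_s = B- - H+ - O+ - L, without the piecewise thresholds."""
    h0, h1 = column_heights(occ_before), column_heights(occ_after)
    b_minus = float(np.var(h0) - np.var(h1))         # bumpiness reduction
    h_plus = max(0.0, float(h1.max() - h0.max()))    # max-height increase
    o_plus = float(count_holes(occ_after) - count_holes(occ_before))  # new holes
    l_flag = 0.0 if np.all(h1 == h1.max()) else 1.0  # 1 if the top is not level
    return b_minus - h_plus - o_plus - l_flag
```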
The grabbing reward r_g is defined by an equation (rendered as an image in the original) in terms of the grab result G and the distance D, where G = 0 denotes a failed grab and G = 1 a successful one, and D denotes the distance between the center of the object and the grabbing position; D is crucial for achieving high stacking accuracy.
The deep Q network is extended by incorporating different functions into the fully convolutional network (GNet). The two Q functions are modeled by two convolutional networks (GNet and SNet). At each time step, GNet evaluates the grabbing Q function for each pixel in s_gt, while SNet evaluates the stacking Q function for each position unit in s_st.
Both networks extract features from the raw image data using the first three blocks of ResNet-50. Since depth information is critical for accurate grabbing, the input layer of GNet is adjusted from 3 channels to 6 channels (i.e., RGB is changed to RGBDDD by concatenating the RGB channels with the depth channel replicated per channel). The input channels of the ResNet component in SNet remain unchanged. The picture representing s_gt (resolution 224×224) and the picture representing s_st (resolution 256×128) are both resized to 320×320, which generates feature maps of an appropriate size for the subsequent upsampling and auxiliary tasks.
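A sketch of the input construction and truncated backbone, assuming PyTorch/torchvision; interpreting the first three "cells" of ResNet-50 as its first three residual stages is an assumption:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_rgbddd(rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    """Concatenate RGB (B x 3 x H x W) with the depth channel replicated
    three times (B x 1 x H x W) to form the 6-channel RGBDDD input."""
    return torch.cat([rgb, depth.repeat(1, 3, 1, 1)], dim=1)

def make_gnet_trunk() -> nn.Module:
    """ResNet-50 feature extractor truncated after the third residual stage,
    with the stem widened from 3 to 6 input channels for RGBDDD."""
    backbone = resnet50(weights=None)
    backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
        backbone.layer1, backbone.layer2, backbone.layer3,
    )
```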
The subsequent convolutional layers in GNet and SNet share the same architecture, as shown in fig. 2. The convolution kernel size of each layer is 1×1, which helps reduce dimensionality and mitigate possible overfitting. To incorporate information about which objects are required by the palletization to form an ordered layout and are easy to grasp in a particular grabbing direction, the high-level features Φ_s of s_st generated by the convolutional layers in SNet are fused with the high-level features Φ_g of s_gt generated by the convolutional layers in GNet. Since GNet is a fully convolutional network that encodes position information in its feature maps, Φ_s and Φ_g cannot simply be concatenated along the channel dimension; instead, two linear layers convert Φ_s into channel-wise weights ω_g (between 0 and 1), which are then multiplied with Φ_g as follows:
Φ_m = λ·ω_g·Φ_g + (1 − λ)·Φ_g,
where Φ_m represents the fused features and λ is a scale factor balancing the original features (containing a preliminary prediction of the locations of easily grasped objects) against the weighted features (emphasizing those features of s_gt that are useful for selecting objects suited to the state of the stacking area). In this embodiment, λ is set to 0.25.
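A possible PyTorch realization of this fusion step; the global average pooling before the two linear layers and the sigmoid keeping ω_g in (0, 1) are assumptions consistent with, but not spelled out in, the text:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Convert SNet features into channel-wise weights and blend them into
    GNet features: phi_m = lambda * (w_g * phi_g) + (1 - lambda) * phi_g."""
    def __init__(self, s_channels: int, g_channels: int, lam: float = 0.25):
        super().__init__()
        self.lam = lam
        self.to_weights = nn.Sequential(      # two linear layers -> omega_g
            nn.Linear(s_channels, g_channels),
            nn.ReLU(inplace=True),
            nn.Linear(g_channels, g_channels),
            nn.Sigmoid(),                     # keep the weights in (0, 1)
        )

    def forward(self, phi_s: torch.Tensor, phi_g: torch.Tensor) -> torch.Tensor:
        pooled = phi_s.mean(dim=(2, 3))       # global average pool (assumption)
        w_g = self.to_weights(pooled)[:, :, None, None]
        return self.lam * w_g * phi_g + (1.0 - self.lam) * phi_g
```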
The aforementioned features are sent to different layers for different tasks. For GNet, the fused low-level features Φ_m are processed by two convolutional layers and then fed into a bilinear upsampling layer. Φ_g is also used to predict the number of objects on the table via a global average pooling layer followed by a ReLU activation and a linear layer, which helps perceive objects across regions of different scale and maintains sensitivity to smaller objects. In SNet, the high-level feature Φ_s is used not only to give GNet a broader perception but also to predict the Q value and the column-wise height values of each stacking cell. Using two separate linear-layer modules, the Q value Q_s(s_s, a_s) is estimated jointly from a value function V_s(s_s) and an advantage function A_s(s_s, a_s). For the stacking-area height prediction task, the column-by-column heights predicted when objects are stacked represent the upper boundary of the pile, which contains auxiliary information for predicting the stacking Q value.
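The joint estimation of Q_s from V_s and A_s can be sketched as a standard dueling head; the mean-centered combination below is the usual dueling formulation and is assumed here:

```python
import torch
import torch.nn as nn

class DuelingStackHead(nn.Module):
    """Estimate Q_s(s, a) from a state embedding via separate value and
    advantage branches, combined in the standard dueling form."""
    def __init__(self, feat_dim: int, num_cells: int = 14):
        super().__init__()
        self.value = nn.Linear(feat_dim, 1)              # V_s(s)
        self.advantage = nn.Linear(feat_dim, num_cells)  # A_s(s, a) per cell

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        v = self.value(feat)
        a = self.advantage(feat)
        # mean-centering the advantage is the usual identifiability fix
        return v + a - a.mean(dim=1, keepdim=True)
```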
Another auxiliary task, not shown in fig. 2, is the object-centric feature learning task. When the robot arm picks an object from the desktop and stacks it (desktop image and stack image in fig. 1), the features of the two scenes extracted by the network should change according to the features of the object that is removed from the desktop and appears on the stack. The task compares the perception-layer features Φ_g before and after a successful grab with the perception-layer features Φ_p before and after stacking. To compute these features, global average pooling and a ReLU nonlinearity are applied to the feature maps Φ_g and Φ_p. In this way, the perception modules in GNet and SNet gain the ability to identify the same object by similar features, which facilitates the feature fusion described above.
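A sketch of this object-centric comparison; the cosine objective below is a simplified stand-in for the N-pair loss named in the training details, and the tensor names are illustrative:

```python
import torch
import torch.nn.functional as F

def object_feature_loss(phi_g_before, phi_g_after, phi_p_before, phi_p_after):
    """Encourage the feature change caused by removing an object from the table
    to match the feature change caused by its appearance on the stack."""
    def pool(x):
        return F.relu(x.mean(dim=(2, 3)))  # global average pool + ReLU, per text
    removed = pool(phi_g_before) - pool(phi_g_after)    # object that left the table
    appeared = pool(phi_p_after) - pool(phi_p_before)   # object that joined the stack
    return 1.0 - F.cosine_similarity(removed, appeared, dim=1).mean()
```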
GNet and SNet are co-trained using a deep Q network as the Q-function approximator. Specifically, the grabbing Q function Q_g(s_p+g, a_g) is modeled as a fully convolutional network (GNet), and the stacking Q function Q_s(s_s, a_s) is modeled as a deep network (SNet). Double Q-learning is used to train GNet and SNet: compared with plain Q-learning, it adopts a target network and an improved maximum operator, making training more reliable. The target network shares the same architecture as the network in fig. 2 (without the auxiliary task modules), and its parameters are copied from the online learning model every 300 steps. For the maximum operator, double Q-learning uses the current Q_θ to select the maximizing action and the target Q_θ⁻ to evaluate it; that is, the loss function for both the grabbing Q function and the stacking Q function is:
L_i = E[ ( r + γ·Q_θ⁻(s_{t+1}, argmax_{a′} Q_θ(s_{t+1}, a′)) − Q_θ(s_t, a_t) )² ], i ∈ {g, s},
In each training round, the expected value E is calculated from a mini-batch of samples; the remaining variables are as described above. The parameterization of Q_g(s_gt, a_gt) allows the convolutional features to be shared across positions and orientations. In the technical scheme of the application, the Q value Q_g predicted by GNet represents grabbing, at a position where success is likely, the object required by the stacking policy. The reward for the Q function Q_g(s_p+g, a_g) should therefore include both the grab reward and the stacking reward. In this way, the feature-fusion layer from Φ_s to ω_g can self-adjust during training.
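A sketch of this double Q-learning loss with an illustrative batch container; the squared-error form follows the loss L_i above, while the γ value and field names are assumptions:

```python
from typing import NamedTuple
import torch
import torch.nn.functional as F

class Batch(NamedTuple):
    state: torch.Tensor
    action: torch.Tensor      # long tensor of chosen action indices
    reward: torch.Tensor
    next_state: torch.Tensor
    done: torch.Tensor        # 1.0 where the episode terminated

def double_q_loss(q_online, q_target, batch: Batch, gamma: float = 0.99) -> torch.Tensor:
    """Double Q-learning: the online network selects the next action,
    the target network evaluates it."""
    q_sa = q_online(batch.state).gather(1, batch.action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_a = q_online(batch.next_state).argmax(dim=1, keepdim=True)
        next_q = q_target(batch.next_state).gather(1, next_a).squeeze(1)
        target = batch.reward + gamma * (1.0 - batch.done) * next_q
    return F.mse_loss(q_sa, target)
```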
To speed up learning, a distributed learning framework is implemented. Sixteen samplers collect experience asynchronously. After collecting 200 samples, each sampler transfers the experience, with per-sample priorities, to the learner and copies the latest parameters from it. Meanwhile, the learner trains from a dual-priority experience replay buffer that stores separately prioritized experience indices for grabbing and stacking, alternating between high-priority samples of each kind; samples with larger prediction errors receive higher priority. The pseudo code of the learner training routine is listed in Algorithm 1.
[Algorithm 1: the learner training routine; presented as an image in the original.]
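Since Algorithm 1 appears only as an image, the following is a hedged reconstruction of the learner routine from the surrounding description (alternating dual-priority buffers, priority refresh from TD errors, target sync every 300 steps); the buffer API and the `double_q_loss_with_td` helper are hypothetical:

```python
def learner_loop(gnet, snet, gnet_target, snet_target,
                 grasp_buffer, stack_buffer, optimizer,
                 sync_every=300, total_steps=100_000):
    """Hedged reconstruction of Algorithm 1 (the original is an image)."""
    for step in range(total_steps):
        # alternate between grabbing and stacking experiences, as described above
        if step % 2 == 0:
            buffer, online, target = grasp_buffer, gnet, gnet_target
        else:
            buffer, online, target = stack_buffer, snet, snet_target
        batch, indices, weights = buffer.sample()           # prioritized sampling
        loss, td_errors = double_q_loss_with_td(online, target, batch, weights)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # larger prediction error -> higher replay priority
        buffer.update_priorities(indices, td_errors.abs())
        if step % sync_every == 0:
            gnet_target.load_state_dict(gnet.state_dict())  # refresh targets
            snet_target.load_state_dict(snet.state_dict())
```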
The system is trained in a simulated environment to improve efficiency. A UR5 robot equipped with a Robotiq 85 gripper was used in V-REP. Four types of boxes were designed, with sizes of 3×3×3, 3×9×3, 6×9×3 and 9×9×3 (in centimeters).
In addition to the Q-function estimation loss, each auxiliary task has its own loss during learning: the tasks of predicting the stacking-area height and the number of objects are trained with the smooth L1 loss, while the object-centric feature learning task uses the N-pair loss. GNet and SNet were trained simultaneously using stochastic gradient descent with a learning rate of 0.0001. Both Q-learning heads employ an ε-greedy exploration strategy, with ε initialized to 0.9 for SNet and 0.5 for GNet, then annealed to 0.05 during training.
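The exploration schedule can be sketched as follows; the linear shape and step count are assumptions, only the endpoints (0.9 and 0.5 annealed to 0.05) come from the text:

```python
def epsilon(step: int, start: float, end: float = 0.05,
            anneal_steps: int = 50_000) -> float:
    """Linearly anneal the exploration rate from `start` to `end`."""
    frac = min(1.0, step / anneal_steps)
    return start + frac * (end - start)

eps_snet = epsilon(step=10_000, start=0.9)  # SNet: epsilon starts at 0.9
eps_gnet = epsilon(step=10_000, start=0.5)  # GNet: epsilon starts at 0.5
```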
The system (GSN) of the present technical scheme is evaluated in a simulation environment and in real scenarios. Boxes of different sizes and colors are randomly scattered on a table, and the robot needs to grab and stack them one by one to form a stable pile. Three experiments were performed:
1) a comparative study between the reinforcement learning framework of the embodiment of the application and a supervised learning method;
2) an ablation study assessing the contribution of each component of the system of the application to overall performance;
3) a demonstration that the system of the application can be applied to real robots to perform pick-and-place tasks.
The present application uses a UR5 robotic arm and a Robotiq 85 gripper (with an attached RealSense camera) to perform the same test task on a real robot. In the real-world tests, stacking performance is evaluated by the height difference Hd between the highest and lowest surfaces of the pile; the stacking task is regarded as successful if Hd ≤ 2. The method achieves a 75% (15/20) success rate in the box-stacking task, while the supervised learning method achieves only a 15% (3/20) success rate.
In one embodiment, an autonomous grabbing and palletizing system based on a mechanical arm includes:
the mechanical arm, the camera and the control system;
the camera collects images of a grabbing area and a stacking area of an object to be stacked, and inputs the images to an automatic grabbing and stacking network of the control system;
the automatic grabbing and stacking network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy;
and the mechanical arm selects and carries out grabbing according to the prediction result and then stacks the objects to be stacked.
When the system works, reference is made to the specific steps of the automatic grabbing and stacking method based on the mechanical arm in the above embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present invention; it should be understood by those skilled in the art that various modifications and variations made on the basis of the technical solution of the present invention without inventive effort remain within its protection scope.

Claims (10)

1. An automatic grabbing and stacking method based on mechanical arms is characterized by comprising the following steps:
acquiring images of a stacking area and a grabbing area of an object to be stacked, and inputting the images into an automatic grabbing and stacking network;
the automatic grabbing and stacking network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy;
when the automatic grabbing and stacking network is combined with deep reinforcement learning, an optimal strategy that maximizes the expected sum of future rewards is adopted;
and the mechanical arm selects and grabs the required objects in the grabbing area according to the prediction result and places them at positions appropriate to the current and future states.
2. The method for automatic grabbing and palletizing based on the mechanical arm as claimed in claim 1, wherein the automatic grabbing and palletizing network comprises a grabbing network and a palletizing network, the grabbing position and the palletizing position are respectively predicted, and the features of images of the palletizing area and the features of images of the grabbing area are fused to transmit information about the palletizing area to the grabbing network.
3. The method for automatically grabbing and palletizing based on the mechanical arm as claimed in claim 1, wherein when an automatic grabbing and palletizing network learns a grabbing strategy and a palletizing strategy, training is performed based on task-related information, and the method comprises the following steps:
predicting the number of objects left in the grabbing area by using the features extracted from the grabbing network sensing layer;
predicting a height of the pile at a pixel level using information obtained from a stacking network aware layer;
item-centered feature learning, to ensure that items disappearing from the grabbing area are similar, at the feature level, to items added to the stack.
4. The mechanical-arm-based automatic grabbing and palletizing method according to claim 1, wherein the automatic grabbing and palletizing network learns, using distributed prioritized experience replay, to grab and closely stack articles of different sizes in the palletizing area.
5. The mechanical-arm-based automatic grabbing and palletizing method as in claim 1, wherein the image of the grabbing area of the objects to be palletized is processed before being input into the automatic grabbing and stacking network: the 3-channel color data are combined with the depth data, orthographically projected to the top-down view, and rotated counterclockwise by different angles to generate new views.
6. The robot-based autonomous grasping and palletizing method as set forth in claim 1, wherein for the representation of the palletizing state of the palletizing region, RGB images taken by a camera facing the palletizing region are used.
7. The robotic-arm-based autonomous grasping and palletizing method according to claim 1, characterized in that two Q-functions are modeled by a grasping network and a stacking network, the grasping network evaluating, at each time step, the grasping Q-function for each pixel in the grasping state, and the stacking network evaluating the stacking Q-function for each position unit in the stacking state of the object.
8. The mechanical-arm-based automatic grabbing and palletizing method according to claim 1, wherein the grabbing network and the palletizing network extract features from raw image data; for the convolutional layers in the grabbing network and the stacking network, the high-level features of the object stacking state generated by the convolutional layers in the stacking network are fused with the high-level features of the grabbing state generated by the convolutional layers in the grabbing network;
for the grabbing network, the fused low-level features are processed by two convolutional layers and then fed into a bilinear upsampling layer; the same features are also used to predict the number of objects on the table through a global average pooling layer followed by an activation function and a linear layer.
9. The robotic-arm-based autonomous grasping and palletizing method according to claim 1, wherein a deep Q-network is used as a Q-function approximator to train the grasping network and the palletizing network together.
10. An autonomous grabbing and palletizing system based on a mechanical arm, characterized by comprising:
the mechanical arm, the camera and the control system;
the camera collects images of a grabbing area and a stacking area where objects to be stacked are placed, and inputs the images to an automatic grabbing and stacking network of the control system;
the automatic grabbing and stacking network predicts a grabbing position and a stacking position according to the learned grabbing strategy and stacking strategy;
and the mechanical arm selects and carries out grabbing according to the prediction result and then stacks the objects to be stacked.
CN202010260136.1A 2020-04-03 2020-04-03 Automatic grabbing and stacking method and system based on mechanical arm Active CN111331607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010260136.1A CN111331607B (en) 2020-04-03 2020-04-03 Automatic grabbing and stacking method and system based on mechanical arm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010260136.1A CN111331607B (en) 2020-04-03 2020-04-03 Automatic grabbing and stacking method and system based on mechanical arm

Publications (2)

Publication Number Publication Date
CN111331607A true CN111331607A (en) 2020-06-26
CN111331607B CN111331607B (en) 2021-04-23

Family

ID=71176895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010260136.1A Active CN111331607B (en) 2020-04-03 2020-04-03 Automatic grabbing and stacking method and system based on mechanical arm

Country Status (1)

Country Link
CN (1) CN111331607B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018236753A1 (en) * 2017-06-19 2018-12-27 Google Llc Robotic grasping prediction using neural networks and geometry aware object representation
CN110539299A (en) * 2018-05-29 2019-12-06 北京京东尚科信息技术有限公司 Robot working method, controller and robot system
US20190385022A1 (en) * 2018-06-15 2019-12-19 Google Llc Self-supervised robotic object interaction
CN109344882A (en) * 2018-09-12 2019-02-15 浙江科技学院 Robot based on convolutional neural networks controls object pose recognition methods
CN109397285A (en) * 2018-09-17 2019-03-01 鲁班嫡系机器人(深圳)有限公司 A kind of assembly method, assembly device and assembly equipment
CN109514553A (en) * 2018-11-21 2019-03-26 苏州大学 A kind of method, system and the equipment of the mobile control of robot
CN110400345A (en) * 2019-07-24 2019-11-01 西南科技大学 Radioactive waste based on deeply study, which pushes away, grabs collaboration method for sorting

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ANDY ZENG ET AL: "Learning Synergies between Pushing and Grasping with Self-supervised Deep Reinforcement Learning", 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) *
E. JANG ET AL: "Grasp2Vec: Learning object representations from self-supervised grasping", arXiv:1811.06964 *
Y. JIANG ET AL: "Learning to place new objects", 2012 IEEE International Conference on Robotics and Automation (ICRA) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112643668A (en) * 2020-12-01 2021-04-13 浙江工业大学 Mechanical arm pushing and grabbing cooperation method suitable for intensive environment
CN112643668B (en) * 2020-12-01 2022-05-24 浙江工业大学 Mechanical arm pushing and grabbing cooperation method suitable for intensive environment
CN113592855A (en) * 2021-08-19 2021-11-02 山东大学 Heuristic deep reinforcement learning-based autonomous grabbing and boxing method and system
CN113592855B (en) * 2021-08-19 2024-02-13 山东大学 Autonomous grabbing and boxing method and system based on heuristic deep reinforcement learning
WO2023050589A1 (en) * 2021-09-30 2023-04-06 北京工业大学 Intelligent cargo box loading method and system based on rgbd camera
CN114454160A (en) * 2021-12-31 2022-05-10 中国人民解放军国防科技大学 Mechanical arm grabbing control method and system based on kernel least square soft Bellman residual reinforcement learning
CN114454160B (en) * 2021-12-31 2024-04-16 中国人民解放军国防科技大学 Mechanical arm grabbing control method and system based on kernel least square soft Belman residual error reinforcement learning
WO2024031831A1 (en) * 2022-08-09 2024-02-15 山东大学 Mechanical arm packing and unpacking collaboration method and system based on deep reinforcement learning

Also Published As

Publication number Publication date
CN111331607B (en) 2021-04-23

Similar Documents

Publication Publication Date Title
CN111331607B (en) Automatic grabbing and stacking method and system based on mechanical arm
JP6921151B2 (en) Deep machine learning methods and equipment for robot grip
DE102019130048B4 (en) A robotic system with a sack loss management mechanism
CN110785268B (en) Machine learning method and device for semantic robot grabbing
CN110238840B (en) Mechanical arm autonomous grabbing method based on vision
Zhang et al. Grasp for stacking via deep reinforcement learning
CN112297013B (en) Robot intelligent grabbing method based on digital twin and deep neural network
CN110298886B (en) Dexterous hand grabbing planning method based on four-stage convolutional neural network
CN111203878A (en) Robot sequence task learning method based on visual simulation
CN114641378A (en) System and method for robotic picking
JP2020082322A (en) Machine learning device, machine learning system, data processing system and machine learning method
CN110969660A (en) Robot feeding system based on three-dimensional stereoscopic vision and point cloud depth learning
CN115213896A (en) Object grabbing method, system and equipment based on mechanical arm and storage medium
CN113715016A (en) Robot grabbing method, system and device based on 3D vision and medium
JP2022187983A (en) Network modularization to learn high dimensional robot tasks
CN114789454A (en) Robot digital twin track completion method based on LSTM and inverse kinematics
Xue et al. Gesture-and vision-based automatic grasping and flexible placement in teleoperation
CN112288809B (en) Robot grabbing detection method for multi-object complex scene
CN116460843A (en) Multi-robot collaborative grabbing method and system based on meta heuristic algorithm
CN113762159B (en) Target grabbing detection method and system based on directional arrow model
JP2022187984A (en) Grasping device using modularized neural network
CN115631401A (en) Robot autonomous grabbing skill learning system and method based on visual perception
CN114998573A (en) Grabbing pose detection method based on RGB-D feature depth fusion
Khargonkar et al. SCENEREPLICA: Benchmarking Real-World Robot Manipulation by Creating Reproducible Scenes
CN117086862A (en) Six-degree-of-freedom flexible grabbing method for mechanical arm based on double-agent reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant