CN116061187B - Method for identifying, positioning and grabbing goods on goods shelves by composite robot - Google Patents

Method for identifying, positioning and grabbing goods on goods shelves by composite robot

Info

Publication number
CN116061187B
CN116061187B (application CN202310206998.XA)
Authority
CN
China
Prior art keywords
goods
commodity
image
shelf
target detection
Prior art date
Legal status
Active
Application number
CN202310206998.XA
Other languages
Chinese (zh)
Other versions
CN116061187A (en)
Inventor
Wu Bo (吴波)
Zhang Chunsheng (张春生)
Dong Qinpeng (董芹鹏)
Zheng Suibing (郑随兵)
Current Assignee
Ruiman Intelligent Technology Jiangsu Co ltd
Original Assignee
Ruiman Intelligent Technology Jiangsu Co ltd
Priority date
Filing date
Publication date
Application filed by Ruiman Intelligent Technology Jiangsu Co ltd filed Critical Ruiman Intelligent Technology Jiangsu Co ltd
Priority to CN202310206998.XA
Publication of CN116061187A
Application granted
Publication of CN116061187B
Legal status: Active
Anticipated expiration


Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 - Programme-controlled manipulators
    • B25J 9/16 - Programme controls
    • B25J 9/1602 - Programme controls characterised by the control system, structure, architecture
    • B25J 9/161 - Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J 9/1628 - Programme controls characterised by the control loop
    • B25J 9/163 - Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B25J 9/1656 - Programme controls characterised by programming, planning systems for manipulators
    • B25J 9/1661 - Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
    • B25J 18/00 - Arms
    • B25J 19/00 - Accessories fitted to manipulators, e.g. for monitoring, for viewing; Safety devices combined with or specially adapted for use in connection with manipulators
    • B25J 19/02 - Sensing devices
    • B25J 19/04 - Viewing devices
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/02 - Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The invention relates to a method for identifying, positioning and grabbing goods on a goods shelf by a composite robot, and belongs to the technical field of robot control. Commodity images of the shelf are collected by a binocular structured-light infrared camera mounted at the end of a mechanical arm, and each specification of each brand of each commodity is treated as one category. A target detection network is built with a deep learning method and, after training, performs real-time inference on the images collected by the camera; its predictions are verified with a mean hash algorithm and a three-histogram algorithm. When the target commodity is found, its position coordinates obtained from the RGB image are combined with its depth value from the depth image, converted into the world coordinate system, and transmitted to the mechanical arm, which executes the target grabbing task. The method effectively identifies and detects shelf goods of many kinds placed densely, is verified to be insensitive to illumination changes, and achieves high detection accuracy for shelf goods with small appearance differences.

Description

Method for identifying, positioning and grabbing goods on goods shelves by composite robot
Technical Field
The invention belongs to the technical field of robot control, and particularly relates to a method for identifying, positioning and grabbing goods on a goods shelf by a composite robot.
Background
Grabbing a target object is a common action in production and a basic capability required of a robot; correct identification and accurate positioning of the target object are the preconditions of a successful grab. For identification and positioning, conventional methods grab objects in a fixed area according to a program flow written in advance: the object types are few, and when the actual placement of an object deviates from the position set in the program, the grab easily fails.
Goods on shelves come in many varieties and are densely arranged; goods of the same brand with different flavors or capacities differ very little in appearance, with only the flavor and capacity noted on the package. Existing robot grabbing methods cannot accurately grab such varied shelf goods whose placement positions are not fixed.
Disclosure of Invention
To address this, the invention provides a method for a composite robot to identify, position and grab goods on shelves, used to accurately grab shelf goods of many types whose placement positions are not fixed.
In order to achieve the above purpose, the present invention provides the following technical solution: a method for identifying, positioning and grabbing goods on a goods shelf by a composite robot, comprising the following steps:
S1: a binocular structured-light infrared camera and a two-finger gripper are mounted at the tail end of the mechanical arm of the composite robot; the camera acquires shelf commodity image datasets covering different commodity types, light environments and viewing angles, from which a training set and a test set are generated;
each sample comprises an RGB image of the shelf commodity, a depth image and the image annotation; the annotation of each RGB image is stored as a txt file in which each row contains the category code of the target commodity, the u coordinate of its center point, the v coordinate of its center point, and the proportions of its horizontal length and vertical height in the image; each specification of each brand of each commodity is taken as one category with a unique category code;
s2: building a target detection network by using a deep learning method and training;
the input of the target detection network is an RGB image of a goods shelf commodity, and the class code and the commodity position of the commodity are output, wherein the commodity position is the coordinates (u, v) of a commodity center in the image;
the target detection network is a yolov5 network, with the Resize function that changes the commodity size deleted from its dataset-loading class DataLoader;
S3: building a target detection inference framework, and performing inference with the trained target detection network;
the inference framework connects the input of the binocular structured-light infrared camera with the target detection network loaded with the optimal trained weights; the RGB image and depth image acquired by the camera are obtained in real time, and the RGB image is input to the network to predict the category and position of the shelf commodity;
S4: verifying the prediction result of the target detection network with a mean hash algorithm and a three-histogram algorithm; when both checks pass, the prediction is accurate and the method continues to the next step; if either check fails, the prediction is wrong and the method returns to S3 to continue detecting the next frame of image;
a front-view RGB image is acquired in advance for each commodity category as its template;
mean hash verification means using a mean hash algorithm to calculate the Hamming distance between the current RGB image and the template of the shelf commodity category predicted by the target detection network; when the Hamming distance is smaller than 4 the prediction is correct, otherwise it is wrong;
three-histogram verification means using a three-histogram algorithm to calculate the similarity between the current RGB image and the template of the predicted category, taking the mean of the Bhattacharyya coefficients of the three channels as the similarity value; when the similarity is greater than 0.8 the prediction is correct, otherwise it is wrong;
S5: judging whether the current shelf commodity is the target commodity required by the user; if not, returning to S3 to detect the next frame of image; if so, acquiring the depth value Z of the shelf commodity from the current depth image, converting the coordinates of the commodity in the image into the world coordinate system, and transmitting them to the mechanical arm of the composite robot, whose two-finger gripper executes the target commodity grabbing task.
In step S3, the composite robot obtains in advance the types of goods placed on each layer of each shelf. When it receives a user's commodity demand, it determines the corresponding shelf and the layer where the commodity is located according to the target commodity type, moves to that shelf, moves its mechanical arm, and uses the binocular structured-light infrared camera to photograph the commodity images in the shelf layer where the target commodity is located, searching for the target commodity.
In the step S2, when training the target detection network, the loss function is set to be composed of three parts: a bounding box regression loss, a target confidence loss and a category loss; the bounding box regression loss CIoULoss is calculated with the CIoU Loss function, the target confidence loss BCELoss with the binary cross entropy loss function, and the category loss FocalLoss with the Focal Loss function; the total loss function Loss is: Loss = CIoULoss + BCELoss + FocalLoss.
In the step S5, the internal parameters $f_x$, $f_y$, $c_x$, $c_y$ of the binocular structured-light infrared camera are obtained in advance; the coordinates (u, v) of the shelf commodity in the image are then converted into the camera coordinate system to obtain the coordinates (X, Y, Z), where Z comes from the acquired depth value, using the conversion matrix:

$$Z\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}f_x&0&c_x\\0&f_y&c_y\\0&0&1\end{bmatrix}\begin{bmatrix}X\\Y\\Z\end{bmatrix}$$

The coordinates (X, Y, Z) of the shelf commodity in the camera coordinate system are then converted into the world coordinate system and transmitted to the mechanical arm; the origin of the world coordinate system is set at the center of the loading plane of the mechanical arm, with the z-axis perpendicular to the loading plane pointing outwards and the y-axis perpendicular to the horizontal plane pointing upwards.
In summary, the invention has the following advantages:
according to the method, goods on the goods shelf are grabbed based on machine vision, the coordinates of the target goods in the images are output by combining the deep learning method and the traditional image processing method, the depth value of the target goods in the images is output by using the depth images, the accurate coordinates of the target goods in the world coordinate system are output to be grabbed by the mechanical arm through coordinate conversion, the situation that grabbing fails when the actual placement position of the goods deviates from the programmed position is effectively avoided, and the grabbing accuracy of the mechanical arm is greatly improved;
according to the method, aiming at the characteristics of the goods to be identified, the target identification is carried out by combining the deep learning method and the traditional image processing method, so that goods with various goods on shelves and dense placement can be effectively identified and detected, the goods on the shelves are insensitive to illumination change and have high detection accuracy aiming at small appearance difference;
according to the method, the composite robot lifting mobile platform is used for loading the mechanical arm, so that the tasks of lifting of the mechanical arm in a long stroke, placing and operating of the mechanical arm in a horizontal direction after moving, steering and grabbing to a target position can be realized, and the working efficiency is further improved. The tail end grabbing mechanism simulates a device for grabbing objects by two fingers of a human body, adopts an independent high-precision motor to drive and feed back grabbing force, can grab goods on shelves of different sizes and shapes and is not harmful to the goods.
Drawings
FIG. 1 is a schematic diagram of a compound robot of the present invention performing merchandise capture;
FIG. 2 is a schematic view of a compound robotic positioning and gripping mechanism of the present invention;
FIG. 3 is a schematic diagram of a process for detecting and positioning a commodity by the composite robot;
FIG. 4 is a schematic diagram of the relationship of a pixel coordinate system, a camera coordinate system and a world coordinate system used in the present invention.
In the figure: 1. a goods shelf; 2. a composite robot lifting platform; 3. a composite robot moving platform; 4. a mechanical arm; 5. a binocular structured light infrared camera; 6. a two-finger hand grip.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
As shown in fig. 1 and fig. 2, the composite robot that performs identification, positioning and grabbing of goods on the goods shelf 1 comprises a lifting platform 2, a moving platform 3 and a mechanical arm 4. The mechanical arm 4 is mounted on the lifting platform 2, which moves it up and down in space; the lifting platform 2 is carried on the moving platform 3, which moves it forward, backward, left and right. A binocular structured-light infrared camera 5 and a two-finger gripper 6 are mounted at the tail end of the mechanical arm 4.
The method for identifying, positioning and grabbing goods on the goods shelf by the composite robot follows the main process shown in fig. 3 and is described below in 5 steps.
Step 1: acquire shelf commodity image datasets of different commodity types, light environments and viewing angles in advance with the binocular infrared camera.
The binocular infrared camera used in this embodiment is the D435i from Intel's RealSense series.
The shelf commodity image datasets of different types, light environments and viewing angles are collected as follows: the Intel RealSense D435i camera captures two front-view, one side-view and one back-view RGB image, with corresponding depth images, of each kind of shelf commodity, under both sufficient and insufficient light. LabelImg labeling software is used to draw boxes around the different shelf commodities in the RGB images and assign them the numbers 0, 1, 2, ..., n-1, where n is the total number of commodity categories; the labeling result of each RGB image is exported as a txt file. Each specification of each brand of each commodity is treated as one category and given its own code. Each row of the txt file represents one commodity in the image and contains, in order: the category code of the commodity, the u coordinate of its center point, the v coordinate of its center point, the proportion of its horizontal length in the image, and the proportion of its vertical height in the image. The (u, v) coordinates of a commodity are obtained from the RGB image, and its z coordinate from the depth image.
In this embodiment, each shelf layer holds different brands of the same kind of goods, and different layers hold different kinds. 10 shelf commodities were sampled; for each, 20 items of the same brand and specification were selected, and for each item 5 RGB images and 5 depth images were captured under sufficient- and insufficient-light environments, giving a shelf commodity dataset of 2000 images, split into a training set of 1800 images and a test set of 200 images. During image acquisition the camera was kept 30 cm from the commodity horizontally and 15 cm above the plane of each shelf layer vertically.
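As a concrete illustration of the annotation format above, the minimal sketch below parses one such txt file; the file name and the 1280 x 720 image size are assumptions for the example, not values fixed by the invention:

```python
from pathlib import Path

def parse_label_file(txt_path):
    """Parse a YOLO-style annotation file in which each row holds:
    class_id, u_center, v_center, width_ratio, height_ratio
    (the last four values normalized to [0, 1])."""
    boxes = []
    for line in Path(txt_path).read_text().splitlines():
        if not line.strip():
            continue
        class_id, u, v, w, h = line.split()
        boxes.append({
            "class_id": int(class_id),  # unique code per brand and specification
            "u_center": float(u),       # horizontal center / image width
            "v_center": float(v),       # vertical center / image height
            "w_ratio": float(w),        # box width / image width
            "h_ratio": float(h),        # box height / image height
        })
    return boxes

# Example: recover pixel coordinates of each commodity center in a 1280x720 image
for box in parse_label_file("shelf_0001.txt"):
    u_px, v_px = box["u_center"] * 1280, box["v_center"] * 720
    print(box["class_id"], (round(u_px), round(v_px)))
```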
Step 2: the target detection network is built and trained using a deep learning method.
In this embodiment, the shelf commodity image dataset made in step 1 is input directly to the target detection network for training; the network is yolov5, version 6.0. The RGB images are the network input (the depth images are retained for the later positioning step), and the outputs are the category code and position of the shelf commodity, the position being the coordinates (u, v) of the commodity center in the image.
Because convolutional neural networks have the characteristic of scale invariance, while different commodities often share the same appearance and differ only in capacity, the Resize function that rescales commodities in the training data is deleted from the dataset-loading class DataLoader of the target detection network, which improves the model's detection accuracy for commodities of different sizes. During training, the critical point of model overfitting is found from the accuracy-loss curves recorded by the visualization tool TensorBoard, and the weights saved at that point are stored as the optimal weights for the inference process of step 3.
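To make the data-loading change concrete, here is a minimal, hypothetical PyTorch-style sketch of a dataset with no Resize step; it is not the actual yolov5 DataLoader, whose internals differ:

```python
import cv2
import torch
from torch.utils.data import Dataset

class NativeResolutionShelfDataset(Dataset):
    """Illustrative dataset that applies no Resize: images keep their captured
    resolution, so size differences between otherwise identical packages
    (e.g. two bottle capacities of one brand) remain visible to the detector."""
    def __init__(self, image_paths):
        self.image_paths = image_paths

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img = cv2.imread(self.image_paths[idx])   # BGR, native size
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        return torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
```

Because every image comes from the same camera at a fixed shooting distance, all samples share one resolution, which is what makes batching without resizing workable.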
In this embodiment, when training the target detection network, the loss function consists of three parts: a bounding box regression loss, a target confidence loss and a category loss. The bounding box regression loss is calculated with the CIoU loss function and the target confidence loss with the binary cross entropy loss function. Because shelf goods span a huge number of categories, many of which look similar and are hard to distinguish, the method uses the Focal Loss function instead of the binary cross entropy loss function for the category loss, so that training focuses on hard-to-distinguish samples and the overall performance of the model improves. Focal Loss is calculated as follows:
$$L_{fl}=\begin{cases}-\alpha\,(1-p)^{\gamma}\log(p), & y=1\\ -(1-\alpha)\,p^{\gamma}\log(1-p), & y=0\end{cases}$$

Focal Loss can also be expressed as:

$$L_{fl}=-\alpha_t\,(1-p_t)^{\gamma}\log(p_t),\qquad p_t=\begin{cases}p, & y=1\\ 1-p, & y=0\end{cases}$$

where $L_{fl}$ is the Focal Loss value; $p$ is the predicted probability; $y$ is the label, with $y=0$ marking a negative sample and $y=1$ a positive sample in binary classification; $\alpha$ is the category weight that balances positive and negative samples, and adjusting $\alpha$ suppresses their number imbalance; $\gamma$ is the hard-sample weight that trades off hard and easy samples, and adjusting $\gamma$ controls the imbalance between easy and hard samples; the probability $p_t$ reflects closeness to the true class $y$, and the larger $p_t$ is, the more accurate the classification.
The total Loss function Loss of the invention is as follows:
Loss=CIoULoss+BCELoss+FocalLoss
wherein CIoULoss is the bounding box regression loss, BCELoss the target confidence loss, and FocalLoss the category loss.
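The compact PyTorch sketch below shows one standard way to implement the focal loss defined above; the defaults alpha = 0.25 and gamma = 2 come from the original Focal Loss paper, not from this patent:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: L_fl = -alpha_t * (1 - p_t)**gamma * log(p_t),
    where logits are raw scores and targets are 0/1 labels of the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class weight alpha_t
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()        # (1-p_t)^gamma damps easy samples
```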
In this embodiment, training runs for 90 epochs in total with a batch size of 16 and the Adam optimizer. Warmup training is used at the start so that an overly large initial learning rate does not damage the original weights, keeping the model stable. Specifically: over the first 5 epochs the learning rate of the bias parameters drops rapidly from 0.1 to 0.01 while that of the other parameters rises slowly from 0 to 0.01; from the 6th epoch the learning rate is updated with a cosine annealing schedule, so it follows a cosine curve.
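A rough sketch of such a schedule for the non-bias parameters follows; the exact ramp shapes inside yolov5 differ, so this only illustrates the warmup-then-cosine idea:

```python
import math

def lr_at(epoch, total_epochs=90, warmup_epochs=5, base_lr=0.01, final_lr=0.0):
    """Warmup followed by cosine annealing: ramp up to base_lr over the first
    warmup_epochs, then decay along a cosine curve (bias parameters would
    instead start high, at 0.1, and decay to base_lr during warmup)."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs   # linear warmup
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

# Example: inspect the schedule for the first 10 of the 90 epochs
for e in range(10):
    print(e, round(lr_at(e), 5))
```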
Step 3: build a target detection inference framework and perform inference with the trained target detection network.
In this embodiment, the target detection inference framework is built on the detection script of yolov5; the Python SDK for Intel's RealSense D435i camera is downloaded and connected to the framework, so that the framework can obtain the RGB and depth images collected by the camera in real time.
Inference with the trained target detection network proceeds as follows: the optimal weights obtained in step 2 are loaded into the inference framework, the RGB images collected by the binocular camera are fed to it directly in real time, and the network's inference yields the predicted category of the current shelf commodity and its two-dimensional coordinates (u, v) in the RGB image.
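Assuming the pyrealsense2 package as the camera's Python SDK, the real-time frame-grabbing loop of the inference framework could look roughly like this; `model` and the detection tuple format are placeholders for the trained yolov5 detector:

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)   # align depth to color so (u, v) indexes both

try:
    while True:
        frames = align.process(pipeline.wait_for_frames())
        color = np.asanyarray(frames.get_color_frame().get_data())
        depth = frames.get_depth_frame()
        for cls_id, u, v in model(color):            # hypothetical (class, u, v) tuples
            z = depth.get_distance(int(u), int(v))   # depth in meters at (u, v)
            print(cls_id, u, v, z)
finally:
    pipeline.stop()
```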
In this embodiment, each shelf layer holds different brands and specifications of the same kind of goods; for example, a mineral water layer holds mineral water of different brands and sizes. The composite robot obtains in advance the type of goods placed on each layer of each shelf, e.g. the first layer of a given shelf holds mineral water, the second carbonated drinks, and so on. When the robot receives a user's commodity demand, it first determines the shelf and layer where the commodity is located from the commodity type, moves to that shelf, and uses the binocular structured-light infrared camera 5 to photograph each commodity image on that layer. For each shot the camera 5 is brought to the set shooting pose: 30 cm from the goods horizontally and 15 cm above the plane of the photographed shelf layer vertically.
Step 4: verify the detection result of the target detection network with traditional image processing methods. The detection result is correct if and only if it passes the verification of both the mean hash algorithm and the three-histogram algorithm; if only the mean hash check passes, or neither check passes, the detection result is wrong and the system continues to detect the next frame of image.
Convolution is translation invariant, but a shelf commodity rotated by some angle, or seen after the light color has changed, may no longer be identified by the network; the detection result of the target detection network is therefore verified with traditional image processing methods, giving the network model higher robustness.
In this embodiment, after step 2 is finished, one front-view RGB image is collected for every shelf commodity as its template, and the similarity between the template image and the current RGB image of the commodity is the main basis for judging whether the detection result is correct. During network model inference, when the target detection network identifies the category of a shelf commodity, a mean hash algorithm is used to calculate the similarity between the commodity's RGB image and the corresponding template image: when the Hamming distance is smaller than 4 the prediction is correct, otherwise it is wrong. The mean hash algorithm is calculated as follows: first scale both images to 8 x 8 pixels and convert them to grayscale; calculate the mean pixel value of each grayscale image; mark each pixel 1 if its value is greater than or equal to the mean and 0 otherwise; then count how many bits of the two images differ, which gives the Hamming distance.
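A straightforward OpenCV implementation of this mean hash check might look as follows; the function names are illustrative:

```python
import cv2
import numpy as np

def average_hash(image_bgr):
    """Mean (average) hash: shrink to 8x8, convert to grayscale, then mark
    each pixel 1 if it is >= the image mean, giving a 64-bit fingerprint."""
    small = cv2.resize(image_bgr, (8, 8), interpolation=cv2.INTER_AREA)
    gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
    return (gray >= gray.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    return int(np.count_nonzero(hash_a != hash_b))

def ahash_check(detected_crop, template, threshold=4):
    """Acceptance rule from the text: the prediction passes when the detected
    commodity's crop is within Hamming distance 4 of its class template."""
    return hamming_distance(average_hash(detected_crop),
                            average_hash(template)) < threshold
```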
In this embodiment, during network model inference, after a shelf commodity passes the mean hash check, its similarity to the corresponding template photo is calculated with a three-histogram algorithm; when the similarity of the two images is greater than 0.8, the prediction is correct, i.e. the network's detection result has passed both the mean hash and the three-histogram verification. Similarity is calculated with the Bhattacharyya coefficient ρ:
$$\rho(H_1,H_2)=\sum_{i=1}^{N}\sqrt{H_1(i)\,H_2(i)}$$

where $H_1$ and $H_2$ are the histogram data of the source image and the candidate image, $i$ indexes the histogram bins, and $N$ is the total number of bins. Multiplying the two histograms bin by bin, taking the square root of each product and summing gives the Bhattacharyya similarity value, which ranges between 0 and 1.
The three-histogram algorithm is calculated as follows: separate the RGB channels of the two images, build the histogram of each channel, calculate the Bhattacharyya coefficient of the two images' histograms for each channel, and take the mean of the three channels' Bhattacharyya coefficients as the similarity value of the two images.
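An OpenCV sketch of this three-histogram comparison is given below; the 256-bin histograms are an assumption, since the bin count is not stated:

```python
import cv2
import numpy as np

def three_histogram_similarity(img_a, img_b, bins=256):
    """Split both images into their three color channels, histogram each
    channel, compute the per-channel Bhattacharyya coefficient and return
    the mean of the three coefficients as the similarity value."""
    scores = []
    for ch in range(3):
        ha = cv2.calcHist([img_a], [ch], None, [bins], [0, 256]).ravel()
        hb = cv2.calcHist([img_b], [ch], None, [bins], [0, 256]).ravel()
        ha /= ha.sum() or 1.0   # normalize so the coefficient lies in [0, 1]
        hb /= hb.sum() or 1.0
        scores.append(float(np.sum(np.sqrt(ha * hb))))  # Bhattacharyya coefficient
    return sum(scores) / 3.0

# Acceptance rule from the text: the prediction passes when similarity > 0.8
```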
Step 5: if the shelf commodity identified by the current target detection network is the target commodity required by the user, carry out coordinate conversion; otherwise return to step 3 and photograph the next commodity image with the binocular structured-light infrared camera 5. The coordinate conversion turns the two-dimensional image coordinates and depth value of the shelf commodity into coordinates in the world coordinate system and transmits them to the mechanical arm; the two-finger gripper 6 then executes the target commodity grabbing task.
As shown in fig. 4, the application scene of the invention involves three coordinate systems: the pixel coordinate system, the camera coordinate system and the world coordinate system. The pixel coordinate system is established in the image taken by the camera; its origin is at the upper left corner of the image, with the u axis horizontal to the right and the v axis vertical downwards. The camera coordinate system is the three-dimensional rectangular coordinate system O-XYZ established with the focusing center of the binocular structured-light infrared camera as the origin O and the optical axis as the Z axis. The origin o of the world coordinate system is set at the center of the loading plane of the mechanical arm, with the z-axis perpendicular to the loading plane pointing outwards, the y-axis perpendicular to the horizontal plane pointing upwards, and the x-axis completing a right-handed system. The world coordinate system defines objective positions in three-dimensional space and serves as the reference for measuring other points or other coordinate systems.
The commodity center position identified by the target detection network in the RGB image gives the coordinates (u, v) in the pixel coordinate system, and the depth value of the target commodity is obtained from the depth image. The conversion between the pixel coordinate system and the camera coordinate system is:

$$u=f_x\frac{X}{Z}+c_x,\qquad v=f_y\frac{Y}{Z}+c_y$$

or, written in matrix form:

$$Z\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}f_x&0&c_x\\0&f_y&c_y\\0&0&1\end{bmatrix}\begin{bmatrix}X\\Y\\Z\end{bmatrix}$$

where (X, Y, Z) are the coordinates of the target shelf commodity in the camera coordinate system, $f_x$ is the scaling factor of the pixel coordinates on the $u$ axis, $f_y$ is the scaling factor on the $v$ axis, and $f_x$, $f_y$, $c_x$, $c_y$ are all camera intrinsics. The coordinate Z in the camera coordinate system is obtained directly from the acquired depth value.
The coordinates (u, v) of the target shelf commodity in the pixel coordinate system are thus converted into the coordinates (X, Y, Z) in the camera coordinate system, which are then converted into the world coordinate system; the converted coordinates are transmitted to the mechanical arm to execute the target grabbing task.
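Both conversions can be sketched in a few lines of Python; the intrinsic values and the camera-to-world pose below are placeholders, not the patent's calibration:

```python
import numpy as np

def pixel_to_camera(u, v, z, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth z (meters) into the camera frame,
    inverting u = fx*X/Z + cx and v = fy*Y/Z + cy."""
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

def camera_to_world(p_cam, R_wc, t_wc):
    """Apply the camera-to-world pose: p_world = R_wc @ p_cam + t_wc.
    As the text notes, the rotation R_wc and translation t_wc would be read
    in real time from the arm's built-in functions; values here are dummies."""
    return R_wc @ p_cam + t_wc

# Example with assumed intrinsics and a dummy pose
fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0
p_cam = pixel_to_camera(u=350, v=200, z=0.30, fx=fx, fy=fy, cx=cx, cy=cy)
p_world = camera_to_world(p_cam, R_wc=np.eye(3), t_wc=np.array([0.0, 0.1, 0.2]))
print(p_cam, p_world)
```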
When the grasping pose of the shelf commodity is not of concern, the transformation matrix from the camera coordinate system to the world coordinate system is ${}^{A}T_{B}$, given by:

$${}^{A}P={}^{A}T_{B}\,{}^{B}P,\qquad {}^{A}T_{B}=\begin{bmatrix}{}^{A}R_{B} & {}^{A}t_{B}\\ 0 & 1\end{bmatrix}$$

When the grasping pose of the shelf commodity is of concern, the target commodity is regarded as carrying an object coordinate system C, and the required transformation matrix ${}^{A}T_{C}$ is the product of the camera-to-world transformation and the object-to-camera transformation:

$${}^{A}T_{C}={}^{A}T_{B}\,{}^{B}T_{C}$$

where A denotes the world coordinate system, B the camera coordinate system and C the object coordinate system; ${}^{A}P$ are the coordinates of the shelf commodity in the world coordinate system and ${}^{B}P$ its coordinates in the camera coordinate system (written homogeneously); ${}^{A}R_{B}$ is the pose (rotation) matrix of the camera coordinate system relative to the world coordinate system and ${}^{A}t_{B}$ its position (translation) vector; ${}^{B}T_{C}$ is the coordinate transformation matrix from the object coordinate system to the camera coordinate system. In this embodiment the pose matrices ${}^{A}R_{B}$, ${}^{B}R_{C}$ and position vectors ${}^{A}t_{B}$, ${}^{B}t_{C}$ are updated in real time with functions built into the mechanical arm.
The test platform and experimental environment of this embodiment are: Windows 10 Professional, an NVIDIA GeForce RTX 3060 Ti GPU with 8 GB of video memory, an Intel Core i5-12400 CPU, CUDA 11.3.1, PyTorch 1.12.0 and Python 3.8.1.
To verify its effectiveness, the method was tested, alongside a conventional method, a traditional-image-processing-only method and a deep-learning-only method, on a scene where 10 different types of commodities were placed at unfixed positions; the performance indices of the different methods are compared in Table 1 below.
Table 1. Comparison of the identification effect of different methods

Method                                Identification accuracy/%   FPS    Time/s
Conventional method                   20                          —      —
Traditional image processing only     40                          58.6   0.018
Deep learning only                    80                          44.2   0.026
Method herein                         99                          30.4   0.043
As the table shows, the method herein, which combines a deep learning method with a traditional image processing method, achieves the highest identification accuracy: all 10 commodity types in the test scene were identified correctly. Because it chains the two methods, its recognition speed is lower than the others': compared with the deep-learning-only method, FPS drops by 13.8 and the time to process each image grows by 17 ms. Given that accurate grabbing of many kinds of shelf goods at unfixed positions is achieved, the added time is short and does not affect the user experience.
Although embodiments of the invention have been shown and described, the above detailed description is illustrative only and not limiting. The particular features, structures, materials or characteristics described may be combined in any suitable manner in one or more embodiments or examples, and those skilled in the art may make modifications, substitutions and variations to the embodiments without departing from the principles and spirit of the invention, provided such modifications fall within the scope of the appended claims.

Claims (5)

1. A method for identifying, positioning and grabbing goods on a goods shelf by a composite robot, characterized by comprising the following steps:
S1: a binocular structured-light infrared camera and a two-finger gripper are mounted at the tail end of the mechanical arm of the composite robot; the camera acquires shelf commodity image datasets of different commodity types, light environments and viewing angles, from which a training set and a test set are generated;
each sample comprises an RGB image of the shelf commodity, a depth image and the image annotation; the annotation of each RGB image is stored as a txt file in which each row contains the category code of a shelf commodity, the u coordinate of its center point, the v coordinate of its center point, and the proportions of its horizontal length and vertical height in the image; each specification commodity of each brand of each commodity is taken as one category with a unique category code;
s2: building a target detection network by using a deep learning method and training;
the input of the target detection network is an RGB image of a goods shelf commodity, and the class code and the commodity position of the commodity are output, wherein the commodity position is the coordinates (u, v) of a commodity center in the image;
the target detection network is a yolov5 network, with the Resize function that changes the commodity size deleted from its dataset-loading class DataLoader;
S3: building a target detection inference framework, and performing inference with the trained target detection network;
the inference framework connects the input of the binocular structured-light infrared camera with the target detection network loaded with the optimal trained weights; the RGB image and depth image acquired by the camera are obtained in real time, and the RGB image is input into the target detection network to predict the category and position of the shelf commodity;
S4: verifying the prediction result of the target detection network with a mean hash algorithm and a three-histogram algorithm; when both checks pass verification, the prediction is accurate and the method continues to the next step; if either check fails, the prediction is wrong and the method returns to S3 to continue detecting the next frame of image;
a front-view RGB image is acquired in advance for each commodity category as its template;
the mean hash algorithm verification means using a mean hash algorithm to calculate the Hamming distance between the current RGB image and the shelf commodity template of the category predicted by the target detection network; when the Hamming distance is smaller than 4 the prediction is correct, otherwise the prediction is wrong;
the three-histogram algorithm verification means using a three-histogram algorithm to calculate the similarity of the current RGB image and the shelf commodity template of the category predicted by the target detection network, taking the mean of the Bhattacharyya coefficients of the three channels as the similarity value; when the similarity is greater than 0.8 the prediction is correct, otherwise the prediction is wrong;
S5: judging whether the current shelf commodity is the target commodity required by the user, and if not, returning to S3 to detect the next frame of image; if so, acquiring the depth value Z of the shelf commodity from the current depth image, converting the coordinates of the commodity in the image into the world coordinate system, and transmitting them to the mechanical arm of the composite robot, the two-finger gripper executing the target commodity grabbing task.
2. The method according to claim 1, wherein when the binocular structured-light infrared camera collects images, the camera is kept 30 cm from the commodity horizontally and 15 cm above the plane of the shelf layer currently photographed vertically.
3. The method according to claim 1 or 2, wherein in the step S3, the composite robot obtains in advance the type of commodity placed on each layer of each shelf; when it receives a user's commodity demand, it first determines the corresponding shelf and the layer where the commodity is located according to the type of the target commodity, moves to that shelf, moves its mechanical arm, and uses the binocular structured-light infrared camera to capture the commodity images in the shelf layer where the target commodity is located.
4. The method according to claim 1, wherein in the step S2, when training the target detection network, the loss function is set to be composed of three parts: a bounding box regression loss, a target confidence loss and a category loss; the bounding box regression loss CIoULoss is calculated with the CIoU Loss function, the target confidence loss BCELoss with the binary cross entropy loss function, and the category loss FocalLoss with the Focal Loss function; the total loss function Loss is: Loss = CIoULoss + BCELoss + FocalLoss.
5. The method according to claim 1 or 2, wherein in step S5, the internal parameters $f_x$, $f_y$, $c_x$, $c_y$ of the binocular structured-light infrared camera are obtained in advance; the coordinates (u, v) of the goods on the goods shelf in the image are then converted into the camera coordinate system to obtain the coordinates (X, Y, Z), wherein Z is derived from the acquired depth value, and the conversion matrix is as follows:

$$Z\begin{bmatrix}u\\v\\1\end{bmatrix}=\begin{bmatrix}f_x&0&c_x\\0&f_y&c_y\\0&0&1\end{bmatrix}\begin{bmatrix}X\\Y\\Z\end{bmatrix}$$

the coordinates (X, Y, Z) of the goods on the shelf under the camera coordinate system are converted to the world coordinate system and transmitted to the mechanical arm; the origin of the world coordinate system is set at the center of the loading plane of the mechanical arm, with the z-axis perpendicular to the loading plane pointing outwards and the y-axis perpendicular to the horizontal plane pointing upwards.
CN202310206998.XA 2023-03-07 2023-03-07 Method for identifying, positioning and grabbing goods on goods shelves by composite robot Active CN116061187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310206998.XA CN116061187B (en) 2023-03-07 2023-03-07 Method for identifying, positioning and grabbing goods on goods shelves by composite robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310206998.XA CN116061187B (en) 2023-03-07 2023-03-07 Method for identifying, positioning and grabbing goods on goods shelves by composite robot

Publications (2)

Publication Number Publication Date
CN116061187A CN116061187A (en) 2023-05-05
CN116061187B true CN116061187B (en) 2023-06-16

Family

ID=86176963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310206998.XA Active CN116061187B (en) 2023-03-07 2023-03-07 Method for identifying, positioning and grabbing goods on goods shelves by composite robot

Country Status (1)

Country Link
CN (1) CN116061187B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152540B (en) * 2023-10-27 2024-01-09 浙江由由科技有限公司 Intelligent pricing method for fresh goods considering display position, sales volume and classification precision

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
JP2018106268A (en) * 2016-12-22 2018-07-05 東芝テック株式会社 Image processing device and image processing method
CN109685141A (en) * 2018-12-25 2019-04-26 哈工大机器人(合肥)国际创新研究院 A kind of robotic article sorting visible detection method based on deep neural network
CN110026987A (en) * 2019-05-28 2019-07-19 广东工业大学 Generation method, device, equipment and the storage medium of a kind of mechanical arm crawl track
CN110377033A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of soccer robot identification based on RGBD information and tracking grasping means
CN111914921A (en) * 2020-07-24 2020-11-10 山东工商学院 Similarity image retrieval method and system based on multi-feature fusion
CN112170233A (en) * 2020-09-01 2021-01-05 燕山大学 Small part sorting method and system based on deep learning
CN112476434A (en) * 2020-11-24 2021-03-12 新拓三维技术(深圳)有限公司 Visual 3D pick-and-place method and system based on cooperative robot
AU2021101646A4 (en) * 2021-03-30 2021-05-20 Tianjin Sino-German University Of Applied Sciences Man-machine cooperative safe operation method based on cooperative trajectory evaluation
CN112927297A (en) * 2021-02-20 2021-06-08 华南理工大学 Target detection and visual positioning method based on YOLO series

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FI20195670A1 (en) * 2019-08-12 2021-02-13 3R Cycle Oy Method and device for disassembling electronics
US11854255B2 (en) * 2021-07-27 2023-12-26 Ubkang (Qingdao) Technology Co., Ltd. Human-object scene recognition method, device and computer-readable storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018106268A (en) * 2016-12-22 2018-07-05 東芝テック株式会社 Image processing device and image processing method
CN108171748A (en) * 2018-01-23 2018-06-15 哈工大机器人(合肥)国际创新研究院 A kind of visual identity of object manipulator intelligent grabbing application and localization method
CN109685141A (en) * 2018-12-25 2019-04-26 哈工大机器人(合肥)国际创新研究院 A kind of robotic article sorting visible detection method based on deep neural network
CN110026987A (en) * 2019-05-28 2019-07-19 广东工业大学 Generation method, device, equipment and the storage medium of a kind of mechanical arm crawl track
CN110377033A (en) * 2019-07-08 2019-10-25 浙江大学 A kind of soccer robot identification based on RGBD information and tracking grasping means
CN111914921A (en) * 2020-07-24 2020-11-10 山东工商学院 Similarity image retrieval method and system based on multi-feature fusion
CN112170233A (en) * 2020-09-01 2021-01-05 燕山大学 Small part sorting method and system based on deep learning
CN112476434A (en) * 2020-11-24 2021-03-12 新拓三维技术(深圳)有限公司 Visual 3D pick-and-place method and system based on cooperative robot
CN112927297A (en) * 2021-02-20 2021-06-08 华南理工大学 Target detection and visual positioning method based on YOLO series
AU2021101646A4 (en) * 2021-03-30 2021-05-20 Tianjin Sino-German University Of Applied Sciences Man-machine cooperative safe operation method based on cooperative trajectory evaluation

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Research on target positioning and robot planning system based on machine vision; Yang Sanyong; Zeng Bi; Computer Measurement & Control (12); full text *
Shelf commodity recognition method based on deep neural network; Liu Zhaobang; Yuan Minghui; Packaging Engineering (01); full text *
Autonomous grasping robot system for logistics sorting tasks; Ma Zhuoming; Zhu Xiaoxiao; Sun Mingjing; Cao Qixin; Machine Design & Research (No. 06); full text *

Also Published As

Publication number Publication date
CN116061187A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110314854B (en) Workpiece detecting and sorting device and method based on visual robot
WO2020177432A1 (en) Multi-tag object detection method and system based on target detection network, and apparatuses
CN110532920B (en) Face recognition method for small-quantity data set based on FaceNet method
CN106530297A (en) Object grabbing region positioning method based on point cloud registering
Kasaei et al. Interactive open-ended learning for 3d object recognition: An approach and experiments
US10713530B2 (en) Image processing apparatus, image processing method, and image processing program
CN110929795B (en) Method for quickly identifying and positioning welding spot of high-speed wire welding machine
CN110610210B (en) Multi-target detection method
CN116061187B (en) Method for identifying, positioning and grabbing goods on goods shelves by composite robot
CN111428731A (en) Multi-class target identification and positioning method, device and equipment based on machine vision
CN113222982A (en) Wafer surface defect detection method and system based on improved YOLO network
CN112801988A (en) Object grabbing pose detection method based on RGBD and deep neural network
Hu et al. Trajectory image based dynamic gesture recognition with convolutional neural networks
RU2361273C2 (en) Method and device for identifying object images
CN113505629A (en) Intelligent storage article recognition device based on light weight network
CN111598172B (en) Dynamic target grabbing gesture rapid detection method based on heterogeneous depth network fusion
CN111240195A (en) Automatic control model training and target object recycling method and device based on machine vision
Shi et al. A fast workpiece detection method based on multi-feature fused SSD
Schwan et al. A three-step model for the detection of stable grasp points with machine learning
WO2018135326A1 (en) Image processing device, image processing system, image processing program, and image processing method
CN115319739A (en) Workpiece grabbing method based on visual mechanical arm
Moreno et al. Learning to grasp from point clouds
CN110728222B (en) Pose estimation method for target object in mechanical arm grabbing system
Daqi et al. An industrial intelligent grasping system based on convolutional neural network
CN112200762A (en) Diode glass bulb defect detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant