CN113468969A - Aliasing electronic component space expression method based on improved monocular depth estimation - Google Patents

Aliasing electronic component space expression method based on improved monocular depth estimation

Info

Publication number
CN113468969A
Authority
CN
China
Prior art keywords
module
electronic component
rgb
aliasing
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110618580.0A
Other languages
Chinese (zh)
Other versions
CN113468969B (en)
Inventor
顾寄南
雷文桐
张可
高伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202110618580.0A priority Critical patent/CN113468969B/en
Publication of CN113468969A publication Critical patent/CN113468969A/en
Application granted granted Critical
Publication of CN113468969B publication Critical patent/CN113468969B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/241: Pattern recognition > Analysing > Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/253: Pattern recognition > Analysing > Fusion techniques of extracted features
    • G06N3/045: Computing arrangements based on biological models > Neural networks > Architecture, e.g. interconnection topology > Combinations of networks
    • G06N3/08: Computing arrangements based on biological models > Neural networks > Learning methods
    • G06T7/10: Image analysis > Segmentation; Edge detection
    • G06T2207/10024: Indexing scheme for image analysis or image enhancement > Image acquisition modality > Color image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an aliasing electronic component space expression method based on improved monocular depth estimation, which relates to the field of machine vision and comprises an image acquisition module, a target detection network module, a semantic segmentation network module, and an HSV (hue, saturation, value) and RGB (red, green, blue) module. The image acquisition module acquires RGB images of different types of aliasing (mutually overlapping) electronic components in a bin; the target detection network module processes the RGB image acquired by the image acquisition module to obtain a depth image A; the semantic segmentation network module segments the depth image A processed by the target detection network module to obtain rough depth information; and the HSV and RGB modules refine the rough depth information segmented by the semantic segmentation network module to obtain the detailed depth information of each electronic component. The invention effectively solves the problem of autonomous recognition in complex working scenes where electronic components alias with one another.

Description

Aliasing electronic component space expression method based on improved monocular depth estimation
Technical Field
The invention relates to the field of machine vision, in particular to an aliasing electronic component space expression method based on improved monocular depth estimation.
Background
The automatic identification of electronic components is the basis of visual control for intelligent assembly robots, and understanding of complex scenes is the fundamental support for such automatic identification. Whether electronic components can be identified accurately and autonomously directly determines the accuracy and efficiency of the intelligent assembly robot. In actual production, assisting the mechanical arm with machine vision to assemble electronic components alleviates the problems of low production efficiency, high labor input and heavy worker burden, and fundamentally supports the transformation from traditional line production to intelligent production.
Existing machine-vision-based electronic component identification methods mainly address scattered, uniformly distributed components; the problem of autonomous identification in complex working scenes where electronic components alias (overlap) with one another and with the background remains unsolved.
Disclosure of Invention
To address the above deficiencies in the prior art, the invention provides an aliasing electronic component space expression method based on improved monocular depth estimation, which effectively solves the problem of autonomous recognition in complex working scenes where electronic components alias with one another.
The present invention achieves the above-described object by the following technical means.
An aliasing electronic component space expression method based on improved monocular depth estimation comprises an image acquisition module, a target detection network module, a semantic segmentation network module, an HSV (hue, saturation, value) module and an RGB (red, green and blue) module;
the image acquisition module is used for acquiring RGB images of aliasing electronic components of different types in the bin;
the target detection network module is used for processing the RGB image acquired by the image acquisition module to obtain a depth image A;
the semantic segmentation network module is used for segmenting the depth image A processed by the target detection network module to obtain rough depth information;
and the HSV and RGB modules refine the rough depth information segmented by the semantic segmentation network module to obtain the detailed depth information of each electronic component.
Furthermore, the target detection network module comprises an input image module, a data enhancement module, a feature extraction network module, a feature fusion module, a down-sampling module, a fully connected layer module, a classifier and a prediction output module; specifically, the acquired RGB image passes sequentially through data enhancement, feature extraction, feature fusion, down-sampling, the fully connected layer, the classifier and the prediction output.
Further, the data enhancement module randomly scales the RGB image twice to obtain two images a and b, and randomly crops it twice to obtain two images c and d;
the feature extraction network module comprises a lightweight network and a deep convolutional network; features of images a and c are extracted with the lightweight network and features of images b and d with the deep convolutional network;
the feature fusion module performs three stages of hierarchical feature fusion: the shallow and deep features of images a and b are fused into a feature map x, the shallow and deep features of images c and d are fused into a feature map y, and the shallow and deep features of x and y are fused into a feature map z; the feature map z passes through the down-sampling module, the fully connected layer module and the classifier, and is then output by the prediction output module.
Further, the prediction output module predicts and outputs the depth image A, the electronic component position information, and the category and probability distribution of the electronic components.
Further, the depth image A is an RGB color image.
Further, the HSV and RGB modules comprise an HSV color model, an HSV cone model, an RGB three-dimensional coordinate model and an RGB value classifier;
firstly, the depth image A is segmented by the semantic segmentation network module to obtain rough depth information, which is input into the HSV color model to output the values of the three attributes H, S and V; secondly, the HSV cone model visualizes the H, S and V values on a color cone, which is converted into the RGB three-dimensional coordinate model to obtain the R, G and B values of the depth map; finally, the three ranges of R, G and B values are refined with the RGB value classifier.
Furthermore, the HSV color model determines color from the three attributes H, S and V, namely hue, saturation and value (brightness). Hue H is measured as an angle in the range 0°-360°, counted counterclockwise from red: red is 0°, green is 120° and blue is 240°. Saturation S represents how close the color is to a spectral color, usually ranging from 0% to 100%; the larger the value, the more saturated the color. Value V represents the brightness of the color; for a light source color it is related to the luminance of the illuminant, and for an object color it is related to the transmittance or reflectance of the object.
Furthermore, the method further comprises a manipulator control module, which performs positioning, grasping and assembly according to the electronic component position information, the category and probability distribution of the electronic components, and the refined depth information.
Compared with the prior art, the technical scheme of the invention has at least the following benefits:
1. The invention combines a lightweight network with a deep convolutional network, which both preserves comprehensive image features and detail information and improves the speed of model prediction, enabling real-time target detection on mobile and embedded devices.
2. The invention performs three stages of hierarchical feature fusion on the extracted image features, fusing low-level detail features with high-level semantic features, which greatly improves the detection performance of the network.
3. Compared with general target detection algorithms, the method adds the output of a depth image and separates the aliasing electronic components along the depth direction, realizing the spatial expression of aliasing electronic components and solving the problem that aliasing electronic components are difficult for a computer to understand.
Drawings
FIG. 1 is a schematic flow chart of an aliasing electronic component spatial representation based on improved monocular depth estimation according to an embodiment of the present invention;
FIG. 2 is a schematic flow diagram of the target detection module of FIG. 1 according to the present invention;
FIG. 3 is a schematic diagram illustrating a specific operation of the target detection module of FIG. 2 according to the present invention;
FIG. 4 is a schematic flow diagram of the HSV and RGB modules of FIG. 1 according to the present invention.
Reference numerals:
1-an image acquisition module; 2-target detection network module; 3-semantic segmentation network module; 4-HSV, RGB module; 5-a manipulator control module; 6-input image module; 7-a data enhancement module; 8-feature extraction network module; 9-a feature fusion module; 10-a down-sampling module; 11-full connectivity layer module; 12-a classifier; 13-a prediction output module; 14-HSV color model; 15-HSV conical model; 16-RGB three-dimensional coordinate model; 17-RGB value classifier.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
A camera acquires images of different types of aliasing electronic components in the bin to obtain RGB images of the components. Each picture is randomly scaled twice and randomly cropped twice. Feature extraction is performed on the processed images with a lightweight network and a deep convolutional network within the target detection stage, and the extracted shallow and deep features are fused. After down-sampling, the fully connected layer and the classifier, the depth image, the electronic component position information, and the class prediction and probability score of each component are obtained. The semantic segmentation module segments the depth image with color as the criterion, providing pixel-level image understanding; through the setting of parameters and hyperparameters and the training of the network, the color of the depth map and the distance between an electronic component and the lens are each constrained to a range. The segmented depth image is then combined with the HSV and RGB methods to refine the range and obtain the depth D of each electronic component. The position information (x, y, w, h) of each component is combined with its depth D, so that the complete three-dimensional position information of each component is expressed by five parameters (x, y, w, h, D). Finally, the camera coordinate system is converted into the manipulator coordinate system, providing the manipulator end effector with accurate position and depth information of the aliasing electronic components, which facilitates high-precision positioning, grasping and assembly by the manipulator.
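The flow above can be summarized, purely as an illustration, by the following Python sketch. Every callable passed in (detect, segment, refine_depth, cam_to_arm) is a hypothetical placeholder standing in for the corresponding module of this document, not code disclosed by the patent.

```python
def spatial_expression_pipeline(rgb_image, detect, segment, refine_depth, cam_to_arm):
    """Return one 5-parameter expression (x, y, w, h, D) per detected component.

    detect       : RGB image -> (depth_map, boxes, classes, scores)
    segment      : depth_map -> coarse upper/middle/bottom layer map
    refine_depth : (depth_map, layers, box) -> refined depth D of one component
    cam_to_arm   : (x, y) in the image frame -> (x, y) in the manipulator frame
    """
    depth_map, boxes, classes, scores = detect(rgb_image)   # target detection network
    layers = segment(depth_map)                             # semantic segmentation (coarse depth)
    results = []
    for (x, y, w, h), cls, score in zip(boxes, classes, scores):
        d = refine_depth(depth_map, layers, (x, y, w, h))   # HSV/RGB refinement
        ax, ay = cam_to_arm(x, y)                           # camera -> manipulator frame
        results.append({"class": cls, "score": score, "pose": (ax, ay, w, h, d)})
    return results
```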
The image acquisition module comprises a monocular color high-resolution CCD camera, an electronic component placement platform, a telescopic bracket and a light source. The CCD camera has 380,000 image pixels, a color resolution of 480 lines and a black-and-white resolution of 600 lines. The camera is mounted on a 15 cm telescopic bracket above the experimental platform, with the lens 10 cm from the platform surface.
The image acquisition objects of the invention are electronic components, specifically aliasing electronic components of different types. The components include resistors, capacitors and inductors, with shapes such as cylinders, cuboids, tubes and coils, and their number is controlled at 15-25. The components are placed in a bin on the experimental platform; the bin is 10 cm long, 10 cm wide and 5 cm high.
The invention performs two random resizes and two random crops on the collected electronic component images to augment the data, which improves model accuracy and enhances model stability.
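A minimal sketch of this augmentation step is given below, assuming torchvision is used; the scale range and crop size are illustrative assumptions, since the patent does not specify them.

```python
import random
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF

def random_resize(img: Image.Image, scale_range=(0.8, 1.2)) -> Image.Image:
    """Randomly rescale the whole image; the scale range is an assumed value."""
    s = random.uniform(*scale_range)
    w, h = img.size
    return TF.resize(img, [int(h * s), int(w * s)])

random_crop = transforms.RandomCrop(size=600, pad_if_needed=True)  # crop size assumed

def augment(img: Image.Image):
    """Two random resizes give views a and b; two random crops give views c and d."""
    a, b = random_resize(img), random_resize(img)
    c, d = random_crop(img), random_crop(img)
    return a, b, c, d
```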
The feature extraction used by the invention combines two kinds of networks, a deep convolutional network and a lightweight network. The deep convolutional network better extracts image features and detail information, including the color, shape, size, edge features and corner features of the electronic components. The lightweight network reduces network parameters without losing network performance, alleviates the model storage problem, improves the speed of model prediction, and enables real-time target detection on mobile and embedded devices.
The extracted features undergo three stages of hierarchical feature fusion. Low-level features have higher resolution and contain more position and detail information, while high-level features have lower resolution but richer semantic information. Fusing low-level detail features with high-level semantic features through the three-stage hierarchical fusion improves the detection performance of the network.
The target detection algorithm has three outputs: a depth image, the electronic component position information, and the category with its corresponding probability score. Compared with general target detection algorithms, it adds the output of the depth image and separates the aliasing electronic components along the depth direction, which effectively solves the problem that aliasing electronic components are difficult for a computer to understand.
With color as the criterion, the invention performs semantic segmentation on the obtained depth map and roughly divides the electronic components into upper-layer, middle-layer and bottom-layer components; the segmented depth image is then combined with the HSV and RGB models and refined again to obtain the depth of each electronic component, with the precision controlled to 0.1 mm.
The invention expresses the complete three-dimensional position information of each electronic component with five parameters (x, y, w, h, D), realizing the spatial expression of the aliasing electronic components and providing the manipulator end effector with accurate position and depth information, which facilitates subsequent high-precision positioning, grasping and assembly by the intelligent assembly robot.
Specifically, in the aliasing electronic component space expression method based on improved monocular depth estimation, an industrial CCD camera acquires images of different types of aliasing electronic components in the bin to obtain RGB images of the components; random scaling and random cropping are used for image augmentation; a lightweight network and a deep convolutional network in the target detection network extract features from the images; feature fusion in the target detection network fuses the shallow and deep features; down-sampling, the fully connected layer and the classifier yield the depth image A, the electronic component position information B, the class prediction C of the electronic component and the probability score P; the semantic segmentation network module segments the depth map to obtain rough depth information; the HSV and RGB modules refine the semantically segmented rough depth information to obtain the detailed depth information D of each electronic component; and the position information (x, y, w, h) of each component is combined with its detailed depth D so that the complete three-dimensional position information of each component is expressed by five parameters (x, y, w, h, D), facilitating high-precision positioning, grasping and assembly by the manipulator.
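As a simple illustration of this 5-parameter expression, a component pose could be held in a structure like the one below; the field names and the example values are assumptions made for this sketch only.

```python
from dataclasses import dataclass

@dataclass
class ComponentPose:
    """5-parameter spatial expression (x, y, w, h, D) of one electronic component.

    x, y -- center of the minimum bounding rectangle in image coordinates
    w, h -- width and height of that rectangle
    d    -- refined depth (distance from the camera), here in millimeters
    """
    x: float
    y: float
    w: float
    h: float
    d: float

# e.g. a capacitor centered at (312, 188) with a 46 x 23 px box, 78.4 mm from the lens
example = ComponentPose(x=312, y=188, w=46, h=23, d=78.4)
print(example)
```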
For feature extraction by the lightweight network and the deep convolutional network: the deep convolutional network better extracts image features and detail information, including the color, shape, size, edge features and corner features of the electronic components; the lightweight network reduces network parameters without losing network performance, alleviates the model storage problem, improves prediction speed, and enables real-time target detection on mobile and embedded devices.
Feature fusion: the extracted image features undergo three stages of hierarchical feature fusion. Low-level features have higher resolution and contain more position and detail information, while high-level features have lower resolution but richer semantic information; fusing low-level detail features with high-level semantic features improves the detection performance of the network.
The output after down-sampling, the fully connected layer and the classifier adds the depth image A compared with general target detection algorithms, separating the aliasing electronic components along the depth direction and solving the problem that aliasing electronic components are difficult for a computer to understand.
The semantic segmentation module performs image semantic segmentation on the obtained depth map with color as the criterion and roughly divides the electronic components into upper-layer, middle-layer and bottom-layer components;
the HSV and RGB modules input the rough depth information obtained from depth image A through the semantic segmentation network into the HSV color model and output the values of the three attributes H, S and V; the HSV cone model visualizes these H, S and V values on a color cone; the HSV cone model is converted into the RGB three-dimensional coordinate model to obtain the R, G and B values of the depth map; and the RGB classifier refines the three ranges from the semantic segmentation network with the distance precision controlled to 0.1 mm, thereby obtaining the detailed depth information D (i.e. the distance from the camera) of each electronic component.
For manipulator positioning, grasping and assembly, the complete three-dimensional position information of each electronic component is expressed by the five parameters (x, y, w, h, D); the camera coordinate system is converted into the manipulator coordinate system, providing the manipulator end effector with accurate position and depth information of the aliasing electronic components and facilitating high-precision positioning, grasping and assembly.
With reference to FIG. 1, an aliasing electronic component space expression method based on improved monocular depth estimation comprises an image acquisition module 1, a target detection network module 2, a semantic segmentation network module 3, an HSV and RGB module 4, and a manipulator control module 5.
The image acquisition module 1 acquires RGB images of different types of aliasing electronic components in the bin. The target detection network 2 applies data enhancement, feature extraction, feature fusion, down-sampling, the fully connected layer and the classifier to the acquired RGB image to obtain the depth image, the electronic component position information, and the class prediction and probability score of each component. The semantic segmentation network module 3 segments the depth image with color as the criterion and, through parameter setting and network training, roughly constrains the color of the depth image and the distance between an electronic component and the lens to a range. The HSV and RGB module 4 refines that range to obtain the depth D of each electronic component and combines the position information (x, y, w, h) of each component with its depth D, so that the complete three-dimensional position information of each component is expressed by five parameters (x, y, w, h, D). The manipulator control module 5 converts the camera coordinate system into the manipulator coordinate system and provides the manipulator end effector with accurate position and depth information of the aliasing electronic components, realizing high-precision positioning, grasping and assembly by the manipulator.
In specific implementation, the monocular camera is a color high-resolution CCD camera with 380,000 image pixels, a color resolution of 480 lines and a black-and-white resolution of 600 lines. The camera shoots from a fixed position: it is mounted on a 15 cm telescopic bracket above the experimental platform with the lens 10 cm from the platform surface, and is locked once positioned so that the monocular camera cannot move or slide during the experiment.
In specific implementation, the image acquisition objects of the invention are different types of aliasing electronic components, including resistors, capacitors and inductors with cylindrical, cuboid, tubular and coil shapes. The components are placed in a bin on the experimental platform; the bin is 10 cm long, 10 cm wide and 5 cm high and is fixed to the platform by a chute so that it cannot move or shake during the experiment. The number of components is controlled at 15-25 and can be adjusted according to component size, ensuring that the components alias and occlude one another and do not rise above the horizontal plane at the top of the bin.
In specific implementation, the semantic segmentation module adopts PSPNet. PSPNet extracts abstract features through the residual network ResNet; a pyramid pooling module with 4 pyramid levels aggregates context information to obtain information at 4 different scales; a convolution block (conv/BN/ReLU) reduces the number of channels of the 4 levels of feature maps to 512; and bilinear interpolation upsampling restores each feature map to the spatial size of the pyramid pooling module's input, i.e. the output of each level is restored to a 60 x 60 feature map with 512 channels.
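A minimal PyTorch sketch of such a pyramid pooling step is shown below. The pooling bin sizes (1, 2, 3, 6) and the 2048-channel ResNet input are the usual PSPNet defaults and are assumptions here rather than values stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """4-level pyramid pooling: pool, reduce channels to 512, upsample, concatenate.
    Bin sizes (1, 2, 3, 6) and the 2048-channel input are assumed PSPNet defaults."""
    def __init__(self, in_channels: int = 2048, out_channels: int = 512, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                              # pool to a b x b grid
                nn.Conv2d(in_channels, out_channels, 1, bias=False),  # reduce channels to 512
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[2:]
        # bilinearly upsample every level back to the input size, then concatenate
        pyramid = [F.interpolate(stage(x), size=(h, w), mode="bilinear", align_corners=False)
                   for stage in self.stages]
        return torch.cat([x] + pyramid, dim=1)

# e.g. a ResNet feature map of shape (1, 2048, 60, 60) -> (1, 2048 + 4 * 512, 60, 60)
features = torch.randn(1, 2048, 60, 60)
model = PyramidPooling().eval()        # eval mode so BatchNorm works with batch size 1
with torch.no_grad():
    print(model(features).shape)       # torch.Size([1, 4096, 60, 60])
```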
In specific implementation, semantic segmentation divides the colors of the depth map into 6 ranges, from shallow depth to deep: "blue" (0,0,119)-(0,0,255), "cyan" (0,0,255)-(0,119,119), "green" (0,119,119)-(119,255,0), "yellow" (255,199,0)-(199,255,0), "orange" (255,0,0)-(255,119,0), "red" (255,119,0)-(119,0,0). The segmentation result is divided into three parts: relatively close to the camera (shown blue to cyan), at a distance of 5-7 cm; a medium distance from the camera (shown green to yellow), at a distance of 7-9 cm; and relatively far from the camera (shown orange to red), at a distance of 9-10 cm. These correspond to three ranges of electronic components: upper-layer components (no occlusion), middle-layer components (partially occluded, less than half occluded or occluded by only one layer), and bottom-layer components (partially occluded, more than half occluded or occluded by multiple layers).
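For illustration only, the coarse three-layer split could be implemented as a nearest-reference-color lookup such as the sketch below; the anchor RGB values and the nearest-color rule are simplifying assumptions, not the exact range thresholds listed above.

```python
import numpy as np

# Nominal RGB anchors for the six color bands named above; the exact anchors and the
# nearest-color rule are illustrative assumptions, not the patent's range thresholds.
REFERENCE = {
    "blue":   (0, 0, 255),
    "cyan":   (0, 255, 255),
    "green":  (0, 255, 0),
    "yellow": (255, 255, 0),
    "orange": (255, 165, 0),
    "red":    (255, 0, 0),
}
# blue/cyan -> upper layer (5-7 cm), green/yellow -> middle (7-9 cm), orange/red -> bottom (9-10 cm)
LAYER = {"blue": "upper", "cyan": "upper", "green": "middle",
         "yellow": "middle", "orange": "bottom", "red": "bottom"}

def classify_layer(rgb_pixel) -> str:
    """Coarsely assign a depth-map pixel to the upper/middle/bottom layer by nearest reference color."""
    pixel = np.asarray(rgb_pixel, dtype=float)
    nearest = min(REFERENCE, key=lambda name: np.linalg.norm(pixel - np.asarray(REFERENCE[name])))
    return LAYER[nearest]

print(classify_layer((10, 20, 240)))   # -> "upper" (close to blue, i.e. nearest the camera)
```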
In specific implementation, both target detection and semantic segmentation are trained with transfer learning: all parameters are trained after loading pretrained weights, and the shallow parameters learned by a well-trained network are transferred to the new network, so that the new network also has the ability to recognize low-level general features.
In specific implementation, the transformation between the camera coordinate system and the manipulator coordinate system is as follows. Assume OXY is the manipulator coordinate system and O′X′Y′ is the camera coordinate system, and let theta be the angle between the two coordinate systems; the coordinate transformation is:
x = x′*r*cos(theta) - y′*r*sin(theta) + x0    (1)
y = x′*r*sin(theta) - y′*r*cos(theta) + y0    (2)
where r is the millimeter-to-pixel ratio (mm/pixel), i.e. the number of millimeters corresponding to one pixel, theta is the angle between the two coordinate systems, and (x0, y0) is the offset of the image coordinate origin from the manipulator coordinate origin.
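A direct transcription of equations (1) and (2) is sketched below, keeping the signs exactly as printed above; the example call and its numeric values are illustrative only.

```python
import math

def camera_to_manipulator(x_cam: float, y_cam: float, r: float,
                          theta_deg: float, x0: float, y0: float):
    """Map a camera-frame point (x', y') to the manipulator frame using equations
    (1) and (2) above, keeping the signs exactly as printed there.

    r         -- millimeter-to-pixel ratio (mm per pixel)
    theta_deg -- angle between the two coordinate systems, in degrees
    (x0, y0)  -- offset of the image origin from the manipulator origin
    """
    theta = math.radians(theta_deg)
    x = x_cam * r * math.cos(theta) - y_cam * r * math.sin(theta) + x0
    y = x_cam * r * math.sin(theta) - y_cam * r * math.cos(theta) + y0
    return x, y

# illustrative values only: a component center at pixel (312, 188), 0.1 mm/pixel,
# frames rotated by 30 degrees, origins offset by (50, 120) mm
print(camera_to_manipulator(312, 188, r=0.1, theta_deg=30.0, x0=50.0, y0=120.0))
```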
As shown in FIGS. 2 and 3, the target detection network module includes an input image module 6, a data enhancement module 7, a feature extraction network module 8, a feature fusion module 9, a down-sampling module 10, a fully connected layer module 11, a classifier 12, and a prediction output module 13.
The image input by the input image module 6 is an RGB image of different types of aliasing electronic components in the bin, collected by the image acquisition module 1. The data enhancement module 7 applies two random resizes to the preprocessed picture to obtain two images a and b, and two random crops to obtain two images c and d. The feature extraction network 8 comprises a lightweight network (MobileNetV3) and a deep convolutional network (DenseNet); features of images a and c are extracted with MobileNetV3 and features of images b and d with DenseNet, and the image sizes are unified to 900 x 900 after feature extraction. The feature fusion module 9 performs three stages of hierarchical feature fusion: the shallow and deep features of images a and b are fused into feature map x, the shallow and deep features of images c and d are fused into feature map y, and the shallow and deep features of x and y are fused into feature map z. Feature map z passes through the down-sampling module 10, the fully connected layer module 11 and the classifier 12 (softmax) to produce the prediction output 13, which includes the depth image A, the electronic component position information B, the electronic component category C and the probability score P.
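A minimal PyTorch sketch of this backbone arrangement is given below, using the torchvision MobileNetV3-Small and DenseNet-121 feature extractors. Fusing by resizing one map to the other's spatial size and concatenating along channels is an assumption of this sketch, as the patent does not state the exact fusion operator, and the 224 x 224 example input is likewise only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class FusionBackbone(nn.Module):
    """MobileNetV3 features for images a/c, DenseNet features for images b/d,
    followed by three hierarchical fusions (a+b -> x, c+d -> y, x+y -> z).
    Fusing by resize + channel concatenation is an assumption of this sketch."""
    def __init__(self):
        super().__init__()
        self.light = models.mobilenet_v3_small(weights=None).features  # lightweight branch
        self.deep = models.densenet121(weights=None).features          # deep branch

    @staticmethod
    def fuse(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # bring both maps to a common spatial size, then concatenate along channels
        f2 = F.interpolate(f2, size=f1.shape[2:], mode="bilinear", align_corners=False)
        return torch.cat([f1, f2], dim=1)

    def forward(self, a, b, c, d):
        x = self.fuse(self.light(a), self.deep(b))   # first fusion  -> feature map x
        y = self.fuse(self.light(c), self.deep(d))   # second fusion -> feature map y
        return self.fuse(x, y)                       # third fusion  -> feature map z

backbone = FusionBackbone().eval()
a, b, c, d = (torch.randn(1, 3, 224, 224) for _ in range(4))
with torch.no_grad():
    z = backbone(a, b, c, d)
print(z.shape)   # torch.Size([1, 3200, 7, 7]) for 224 x 224 inputs
```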
In a specific implementation, the depth image A is an RGB color image. In a depth map, the part closer to the camera is shown in blue to cyan (near to far), the part at a medium distance in cyan to green to yellow (near to far), and the part farther from the camera in yellow to orange to red (near to far).
In specific implementation, the electronic component position information B is expressed by coordinate values: the upper-left corner of the image is taken as the origin, denoted (0, 0); a minimum bounding rectangle is drawn around each electronic component, with center point (x, y), width w and height h. The position information of each component is therefore expressed by the four parameters (x, y, w, h).
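A small OpenCV sketch of extracting this (x, y, w, h) description is shown below; it assumes a binary mask per component is available (a hypothetical input) and uses an axis-aligned bounding rectangle.

```python
import cv2
import numpy as np

def component_box(mask: np.ndarray):
    """Return the (x, y, w, h) description used above: the center of the bounding
    rectangle of a component plus its width and height, from a binary component mask."""
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    left, top, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    x, y = left + w / 2.0, top + h / 2.0   # center point (x, y)
    return x, y, w, h

# toy example: a 20 x 10 pixel rectangle of ones inside a 100 x 100 mask
mask = np.zeros((100, 100), dtype=np.uint8)
mask[40:50, 30:50] = 1
print(component_box(mask))   # -> (40.0, 45.0, 20, 10)
```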
In specific implementation, the category C and the probability score P of each electronic component appear at the upper-right corner of its minimum bounding rectangle in the image, expressed respectively by Chinese characters and decimal numbers. The number of categories is C+1 (including one background category); in the present invention there are four: "resistor", "capacitor", "inductor" and "background". The probability score is the probability that the object framed by the minimum bounding rectangle belongs to that category; its value P lies between 0 and 1, with the precision controlled to 0.01.
Referring to FIG. 4, the HSV and RGB module includes an HSV color model 14, an HSV cone model 15, an RGB three-dimensional coordinate model 16 and an RGB value classifier 17. The rough depth information obtained from depth image A through the semantic segmentation network module 3 is input into the HSV color model 14, which outputs the values of the three attributes H, S and V; the HSV cone model 15 visualizes these values on a color cone; the HSV cone model 15 is converted into the RGB three-dimensional coordinate model 16 to obtain the R, G and B values of the depth map; and the RGB classifier 17 refines the three ranges of R, G and B values with the distance precision controlled to 0.1 mm, thereby obtaining the depth D of each electronic component (i.e. its distance from the camera).
In particular, the HSV color model 14 determines color from H, S and V, which are hue, saturation and value (brightness), respectively. Hue H is measured as an angle in the range 0°-360°, counted counterclockwise from red: red is 0°, green is 120° and blue is 240°; their complementary colors are yellow at 60°, cyan at 180° and magenta at 300°. Saturation S represents how close the color is to a spectral color, usually ranging from 0% to 100%; the larger the value, the more saturated the color. Value V represents the brightness of the color; for a light source color it is related to the luminance of the illuminant, and for an object color it is related to the transmittance or reflectance of the object, typically ranging from 0% (black) to 100% (white).
In particular, the HSV cone model visualizes the values of the three attributes H, S and V on an inverted color cone. At the apex (i.e. the origin) of the cone, V = 0 while H and S are undefined, representing black; at the center of the top surface of the cone, S = 0, V = 1 and H is undefined, representing white. The V axis of the HSV model corresponds to the main diagonal of the RGB color space.
In specific implementation, the X, Y and Z axes of the RGB three-dimensional coordinate model correspond to the R, G and B channels respectively, each ranging from 0 to 255. When 0° ≤ H < 360°, 0 ≤ S ≤ 1 and 0 ≤ V ≤ 1, the conversion from HSV to RGB is:
C = V × S    (3)
X = C × (1 - |(H/60°) mod 2 - 1|)    (4)
m = V - C    (5)
(R′, G′, B′) = (C, X, 0) for 0° ≤ H < 60°; (X, C, 0) for 60° ≤ H < 120°; (0, C, X) for 120° ≤ H < 180°; (0, X, C) for 180° ≤ H < 240°; (X, 0, C) for 240° ≤ H < 300°; (C, 0, X) for 300° ≤ H < 360°    (6)
(R, G, B) = ((R′ + m) × 255, (G′ + m) × 255, (B′ + m) × 255)    (7)
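The conversion can be written directly from equations (3)-(7); since equation (6) appears only as an image in the original, the sector table used below is the standard HSV-to-RGB form and is reconstructed here on that assumption.

```python
def hsv_to_rgb(h: float, s: float, v: float):
    """HSV -> RGB following equations (3)-(7) above.
    h in [0, 360), s and v in [0, 1]; returns 8-bit (R, G, B)."""
    c = v * s                                  # (3) chroma
    x = c * (1 - abs((h / 60.0) % 2 - 1))      # (4)
    m = v - c                                  # (5)
    # (6): choose (R', G', B') according to the 60-degree sector of H
    sectors = [(c, x, 0), (x, c, 0), (0, c, x), (0, x, c), (x, 0, c), (c, 0, x)]
    r1, g1, b1 = sectors[int(h // 60) % 6]
    # (7): shift by m and scale to the 0-255 range
    return tuple(round((u + m) * 255) for u in (r1, g1, b1))

print(hsv_to_rgb(240, 1.0, 1.0))   # pure blue  -> (0, 0, 255)
print(hsv_to_rgb(0, 1.0, 1.0))     # pure red   -> (255, 0, 0)
print(hsv_to_rgb(120, 1.0, 1.0))   # pure green -> (0, 255, 0)
```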
In specific implementation, the RGB classifier 17 refines the six color span ranges "blue" (0,0,119)-(0,0,255), "cyan" (0,0,255)-(0,119,119), "green" (0,119,119)-(119,255,0), "yellow" (255,199,0)-(199,255,0), "orange" (255,0,0)-(255,119,0) and "red" (255,119,0)-(119,0,0) down to individual channel values (R, G, B), and controls the precision of the camera-to-component distance to 0.1 mm, thereby obtaining the detailed depth information D (i.e. the distance from the camera) of each electronic component.
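The patent does not spell out the exact mapping from refined channel values to a distance, so the following is only a sketch under an explicit assumption: hue is taken to vary linearly from blue (nearest, 50 mm) to red (farthest, 100 mm), matching the 5-10 cm working range described above, and the result is rounded to 0.1 mm.

```python
import colorsys

def refine_depth(r: int, g: int, b: int) -> float:
    """Sketch: map a refined depth-map color to a camera distance in millimeters.

    Assumption (not stated in the patent): hue varies linearly from blue
    (240 degrees, nearest, 50 mm) to red (0 degrees, farthest, 100 mm),
    matching the 5-10 cm working range; the result is rounded to 0.1 mm.
    """
    h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    hue_deg = min(h * 360.0, 240.0)                    # clamp to the blue..red span
    return round(50.0 + (240.0 - hue_deg) / 240.0 * 50.0, 1)

print(refine_depth(0, 0, 255))    # blue -> 50.0 mm (closest to the camera)
print(refine_depth(0, 255, 255))  # cyan -> 62.5 mm
print(refine_depth(255, 0, 0))    # red  -> 100.0 mm (farthest)
```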
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made in the above embodiments by those of ordinary skill in the art without departing from the principle and spirit of the present invention.

Claims (8)

1. An aliasing electronic component space expression method based on improved monocular depth estimation is characterized by comprising an image acquisition module, a target detection network module, a semantic segmentation network module, an HSV (hue, saturation and value) module and an RGB (red, green and blue) module;
the image acquisition module is used for acquiring RGB images of aliasing electronic components of different types in the bin;
the target detection network module is used for processing the RGB image acquired by the image acquisition module to obtain a depth image A;
the semantic segmentation network module is used for segmenting the depth image A processed by the target detection network module to obtain rough depth information;
and the HSV and RGB modules refine the rough depth information segmented by the semantic segmentation network module to obtain the detailed depth information of each electronic component.
2. The aliasing electronic component space expression method based on improved monocular depth estimation according to claim 1, wherein the target detection network module comprises an input image module, a data enhancement module, a feature extraction network module, a feature fusion module, a down-sampling module, a fully connected layer module, a classifier and a prediction output module; specifically, the acquired RGB image passes sequentially through data enhancement, feature extraction, feature fusion, down-sampling, the fully connected layer, the classifier and the prediction output.
3. The aliasing electronic component spatial expression method based on improved monocular depth estimation according to claim 2,
the data enhancement module randomly scales the RGB image twice to obtain two images a and b, and randomly crops it twice to obtain two images c and d;
the feature extraction network module comprises a lightweight network and a deep convolutional network; features of images a and c are extracted with the lightweight network and features of images b and d with the deep convolutional network;
the feature fusion module performs three stages of hierarchical feature fusion: the shallow and deep features of images a and b are fused into a feature map x, the shallow and deep features of images c and d are fused into a feature map y, and the shallow and deep features of x and y are fused into a feature map z; the feature map z passes through the down-sampling module, the fully connected layer module and the classifier, and is then output by the prediction output module.
4. The aliasing electronic component space expression method based on improved monocular depth estimation according to claim 3, wherein the output of the prediction output module includes the depth image A, the electronic component position information, and the category and probability distribution of the electronic components.
5. The aliasing electronic component space expression method based on improved monocular depth estimation according to claim 1, wherein the depth image A is an RGB color image.
6. The aliasing electronic component space expression method based on improved monocular depth estimation according to claim 1, wherein the HSV and RGB modules comprise an HSV color model, an HSV cone model, an RGB three-dimensional coordinate model and an RGB value classifier;
firstly, the depth image A is segmented by the semantic segmentation network module to obtain rough depth information, which is input into the HSV color model to output the values of the three attributes H, S and V; secondly, the HSV cone model visualizes the H, S and V values on a color cone, which is converted into the RGB three-dimensional coordinate model to obtain the R, G and B values of the depth map; and the three ranges of R, G and B values are refined with the RGB value classifier, thereby refining the rough depth information obtained by the semantic segmentation module and obtaining the detailed depth information of the electronic components.
7. The aliasing electronic component space expression method based on improved monocular depth estimation according to claim 6, wherein the HSV color model determines color from the three attributes H, S and V, namely hue, saturation and value (brightness); hue H is measured as an angle in the range 0°-360°, counted counterclockwise from red: red is 0°, green is 120° and blue is 240°; saturation S represents how close the color is to a spectral color, usually ranging from 0% to 100%, with larger values indicating a more saturated color; and value V represents the brightness of the color, which for a light source color is related to the luminance of the illuminant and for an object color is related to the transmittance or reflectance of the object.
8. The aliasing electronic component space expression method based on improved monocular depth estimation according to claim 6, further comprising a manipulator control module, wherein the manipulator control module performs positioning, grasping and assembly according to the electronic component position information, the category and probability distribution of the electronic components, and the detailed depth information of the electronic components.
CN202110618580.0A 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation Active CN113468969B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110618580.0A CN113468969B (en) 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110618580.0A CN113468969B (en) 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation

Publications (2)

Publication Number Publication Date
CN113468969A true CN113468969A (en) 2021-10-01
CN113468969B CN113468969B (en) 2024-05-14

Family

ID=77872099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110618580.0A Active CN113468969B (en) 2021-06-03 2021-06-03 Aliased electronic component space expression method based on improved monocular depth estimation

Country Status (1)

Country Link
CN (1) CN113468969B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711413A (en) * 2018-12-30 2019-05-03 陕西师范大学 Image, semantic dividing method based on deep learning
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
KR102127153B1 (en) * 2020-04-09 2020-06-26 한밭대학교 산학협력단 Depth estimation method and system using cycle GAN and segmentation

Also Published As

Publication number Publication date
CN113468969B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
CN110728200A (en) Real-time pedestrian detection method and system based on deep learning
CN106156778B (en) The method of known object in the visual field of NI Vision Builder for Automated Inspection for identification
WO2023050589A1 (en) Intelligent cargo box loading method and system based on rgbd camera
CN114724120B (en) Vehicle target detection method and system based on radar vision semantic segmentation adaptive fusion
WO2022033076A1 (en) Target detection method and apparatus, device, storage medium, and program product
CN108229440A (en) One kind is based on Multi-sensor Fusion indoor human body gesture recognition method
Qiu-yu et al. Hand gesture segmentation method based on YCbCr color space and K-means clustering
CN103646249A (en) Greenhouse intelligent mobile robot vision navigation path identification method
WO2007002382A2 (en) Terrain map summary elements
TWI745204B (en) High-efficiency LiDAR object detection method based on deep learning
Ouyang et al. A cgans-based scene reconstruction model using lidar point cloud
Lacroix et al. Feature extraction using the constrained gradient
CN114120067A (en) Object identification method, device, equipment and medium
Rodriguez-Telles et al. A fast floor segmentation algorithm for visual-based robot navigation
CN116994135A (en) Ship target detection method based on vision and radar fusion
CN115641322A (en) Robot grabbing method and system based on 6D pose estimation
CN113681552B (en) Five-dimensional grabbing method for robot hybrid object based on cascade neural network
CN114639115A (en) 3D pedestrian detection method based on fusion of human body key points and laser radar
CN114332796A (en) Multi-sensor fusion voxel characteristic map generation method and system
CN113379684A (en) Container corner line positioning and automatic container landing method based on video
CN113468969B (en) Aliased electronic component space expression method based on improved monocular depth estimation
CN117011380A (en) 6D pose estimation method of target object
CN111784768A (en) Unmanned aerial vehicle attitude estimation method and system based on three-color four-lamp mark recognition
CN116188763A (en) Method for measuring carton identification positioning and placement angle based on YOLOv5

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant