CN115861145B - Image processing method based on machine vision - Google Patents

Image processing method based on machine vision

Info

Publication number
CN115861145B
CN115861145B (application CN202310066520.1A)
Authority
CN
China
Prior art keywords
depth
depth map
binocular
outlier
map
Prior art date
Legal status
Active
Application number
CN202310066520.1A
Other languages
Chinese (zh)
Other versions
CN115861145A (en)
Inventor
杨秋影
尹作重
孙洁香
司佳顺
杜已超
王凯
秦修功
薛靖婉
唐聪
Current Assignee
Beijing Research Institute of Auotomation for Machinery Industry Co Ltd
Original Assignee
Beijing Research Institute of Auotomation for Machinery Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Research Institute of Auotomation for Machinery Industry Co Ltd
Priority to CN202310066520.1A
Publication of CN115861145A
Application granted
Publication of CN115861145B

Abstract

The embodiment of the invention discloses an image processing method based on machine vision, which comprises the following steps: acquiring an RGB image and a binocular depth image of a shooting scene to form an RGBD image; obtaining a monocular depth map of the shooting scene through the RGBD map by using a deep learning method; searching each outlier in the binocular depth map; and modifying the depth value of each outlier in the binocular depth map according to the monocular depth map to obtain a final depth map. The embodiment rapidly realizes the fusion of the monocular depth map and the binocular depth map, and improves the quality of the depth image.

Description

Image processing method based on machine vision
Technical Field
The embodiment of the invention relates to the field of computer vision, in particular to an image processing method based on machine vision.
Background
Depth estimation means estimating the distance from each point in an image to the shooting source; an image whose pixel values are these distances is called a depth map. Depth information has very important applications in many fields, such as industrial robots, service robots, AR (Augmented Reality) and VR (Virtual Reality).
In the prior art, a depth image is mainly obtained by a depth camera using a binocular matching method. The quality of such a depth image depends on the feature differences between the left and right cameras, and point-cloud loss and reduced precision occur in black and weak-texture scenes. CN 104598744A (a depth estimation method based on a light field) and CN 110047144A (a real-time three-dimensional reconstruction method of a complete object based on Kinect v2) disclose two methods for depth information fusion, but both involve a huge amount of computation.
Disclosure of Invention
The embodiment of the invention provides an image processing method based on machine vision, which can quickly realize fusion of a monocular depth image and a binocular depth image and improve the quality of the depth image.
In a first aspect, an embodiment of the present invention provides an image processing method based on machine vision, including:
acquiring an RGB image and a binocular depth image of a shooting scene to form an RGBD image;
obtaining a monocular depth map of the shooting scene through the RGBD map by using a deep learning method;
searching each outlier in the binocular depth map;
and modifying the depth value of each outlier in the binocular depth map according to the monocular depth map to obtain a final depth map.
In a second aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the machine vision based image processing method described above.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described machine vision-based image processing method.
According to the embodiment of the invention, a binocular depth map is obtained by a structured-light binocular matching algorithm, and a monocular depth map is then obtained from the RGBD image, so that the binocular depth map and the monocular depth map are to some extent complementary; the depth values of the outliers in the binocular depth map are then corrected according to the depth values in the monocular depth map, compensating the errors of the binocular depth map in weak-texture and repetitive regions. The method rapidly combines the advantages of the accurate binocular point cloud and the complete monocular point cloud, alleviates the point-cloud loss of binocular depth data, and improves the quality and accuracy of the depth map.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a machine vision-based image processing method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of binocular depth estimation matching provided by an embodiment of the present invention;
FIG. 3 is a schematic illustration of an ideal speckle pattern provided by embodiments of the present invention;
FIG. 4 is two pictures with speckle added features provided by embodiments of the present invention;
FIG. 5 is a schematic diagram of a deep learning network provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a comparison of an RGB map and three depth maps provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the invention, are within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
A binocular structured light camera can perform binocular depth estimation of a shooting scene by the binocular matching method, but the quality of the resulting binocular depth map depends on the feature differences between the left and right cameras, and it is difficult to obtain good results in scenes with few features (such as black, reflective or repetitive-texture surfaces). Monocular depth estimation, in contrast, is based on information such as the texture, gradient and material of the object, and thus provides information that is completely different from the binocular estimate. Aiming at the point-cloud loss of a binocular structured light camera, this application provides an image processing method for depth images: a binocular depth map of the shooting scene is obtained by the binocular matching method, and the binocular depth map is then compensated by a monocular depth map based on deep learning, which effectively alleviates the point-cloud loss and precision reduction of the binocular structured light camera in black and weak-texture scenes.
Based on the above inventive concept, fig. 1 is a flowchart of an image processing method based on machine vision according to an embodiment of the present invention. The method is suitable for realizing monocular-binocular depth fusion through the RGBD map and is executed by an electronic device. As shown in fig. 1, the method specifically includes the following steps:
s110, acquiring an RGB image and a binocular depth image of a shooting scene to form an RGBD image.
This step may be accomplished by a binocular structured light camera that acquires RGB images and binocular depth images of the captured scene, together forming an RGBD map of the captured scene, as input to a subsequent image processing process.
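As a minimal illustration (not taken from the patent), the RGB image and the binocular depth map can be stacked into a four-channel RGBD array; the image size and data types below are assumptions:

```python
import numpy as np

# Assumed inputs: an H x W x 3 RGB image and an H x W binocular depth map.
rgb = np.zeros((480, 640, 3), dtype=np.float32)            # placeholder RGB image
depth_binocular = np.zeros((480, 640), dtype=np.float32)   # placeholder binocular depth map

# Stack into an H x W x 4 RGBD array: the depth map becomes the fourth channel.
rgbd = np.concatenate([rgb, depth_binocular[..., None]], axis=-1)
print(rgbd.shape)  # (480, 640, 4)
```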
The RGB image contains the color information of the shooting scene and has rich image features. The binocular depth map is a depth map obtained by the binocular matching method. In the binocular matching method, feature points are extracted from the left and right images, corresponding feature points are found in the two images under the epipolar constraint to obtain matching points, and depth values are finally calculated by the principle of triangulation, as shown in fig. 2.
It can be seen that the binocular matching method relies on the feature points of the left and right images. Therefore, for scenes with insufficient texture features, matching precision and accuracy are difficult to guarantee; the present application therefore uses structured light technology to solve the matching problem. In the structured light scheme, a special pattern is projected onto the surface of the object; the scene pattern and the projected pattern are combined, and the feature points corresponding to the left and right cameras are computed, which avoids the difficulty of matching low-texture surfaces. In this embodiment, an ideal speckle pattern is used as the projection scheme, and the process of obtaining the binocular depth map includes the following steps:
step one, a reference speckle pattern only comprising ideal speckle patterns is obtained. Ideal speckle pattern as shown in fig. 3, for an ideal speckle-structured light system, the speckle pattern is moving in its entirety over different distances, and the speckle size and distribution pattern do not change with distance. Based on this, the present embodiment acquires a reference speckle pattern at a certain distance using a curtain (or a white wall) as a reference for comparison of other distance images.
Step two, project the ideal speckle pattern onto the surface of an object in the shooting scene, and capture two object speckle patterns with the left and right cameras. The ideal speckle pattern increases the surface features of the object and is generally generated by a pseudo-random method (a generation sketch is given after step four). Fig. 4 shows the results after speckle features are added to two different pictures. Compared with the case without speckle features, projecting the speckle structured-light pattern greatly increases the object's texture features, laying a good foundation for binocular matching; in addition, before the speckle features are superimposed some features in the left and right images are easily confused, whereas after superposition they are quite distinct.
Step three, calculate the row-direction translation between each object speckle pattern and the reference speckle pattern. Because the ideal speckle pattern remains unchanged at different distances, the object speckle patterns captured by the left and right cameras are row-direction translations of the reference speckle pattern; the essence of the speckle structured-light algorithm is therefore to calculate the row-direction translation between each object speckle pattern and the reference speckle pattern.
Step four, obtain the binocular depth map of the shooting scene from the row-direction translation. From the row-direction translations between the two object speckle patterns and the reference speckle pattern, the disparity between the two object speckle patterns can be obtained, and the depth value of each point is then calculated, as illustrated in the sketch below.
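The following sketch illustrates the idea of steps two to four under simplifying assumptions that are not part of the patent: a pseudo-random binary speckle pattern, SAD block matching with integer shifts, no rectification or sub-pixel refinement, and an assumed focal length and baseline for the triangulation relation Z = f·B/d.

```python
import numpy as np

def make_speckle_pattern(height=768, width=1024, density=0.02, seed=0):
    """Step two (illustrative): generate a binary pseudo-random speckle pattern.
    Resolution and dot density are arbitrary choices."""
    rng = np.random.default_rng(seed)
    return (rng.random((height, width)) < density).astype(np.uint8) * 255

def row_shift_map(obj_img, ref_img, max_shift=64, win=7):
    """Step three (illustrative): estimate, per pixel, the row-direction shift
    between an object speckle image and the reference speckle image using
    sum-of-absolute-differences (SAD) block matching along the row."""
    h, w = obj_img.shape
    half = win // 2
    shifts = np.zeros((h, w), dtype=np.float32)
    for y in range(half, h - half):
        for x in range(half + max_shift, w - half):
            block = obj_img[y - half:y + half + 1, x - half:x + half + 1].astype(np.float32)
            best_s, best_cost = 0, np.inf
            for s in range(max_shift):  # search along the row direction only
                ref_block = ref_img[y - half:y + half + 1,
                                    x - s - half:x - s + half + 1].astype(np.float32)
                cost = np.abs(block - ref_block).sum()
                if cost < best_cost:
                    best_cost, best_s = cost, s
            shifts[y, x] = best_s
    return shifts

def depth_from_disparity(disparity, focal_px=600.0, baseline_m=0.05):
    """Step four (illustrative): triangulation Z = f * B / d with an assumed
    focal length (pixels) and baseline (metres)."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```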
S120, obtaining a monocular depth map of the shooting scene through the RGBD map by using a deep learning method.
As described above, the RGBD image includes abundant image features and binocular depth information, and the monocular depth information of the shooting scene is learned from the RGBD image by using the deep learning network, so that the image features such as texture, gradient and material can be fully utilized to learn complete depth information, and the binocular depth information can be used as an initial value of the deep learning, thereby greatly improving the learning speed and reliability.
Specifically, many deep learning network structures are possible. When the electronic device is applied to a structured-light 3D camera, the network must support a high frame rate while still compensating the binocular depth estimation. The overall network comprises an encoder layer and a decoder layer. The encoder layer adopts a MobileNet structure, which ensures low latency during encoding; the decoder layer includes five upsampling layers, each employing a 5×5 convolution that halves the number of channels, and a final 1×1 convolution layer that outputs the monocular depth map. Skip connections are arranged between the encoder layer and the decoder layer to restore image detail; the two feature maps joined by each skip connection are added element-wise, which reduces the subsequent computational load.
More specifically, suppose an image of H×W×C = 224×224×3 (H, W and C respectively denote the length, width and number of channels) is input to the encoder layer; encoding features of 7×7×1024 are obtained. The encoding features are input to upsampling layer 1, which upsamples and halves the channel count, giving a first feature of 14×14×512; the first feature is input to upsampling layer 2, giving a second feature of 28×28×256; the second feature is input to upsampling layer 3, giving a third feature of 56×56×128; the third feature is input to upsampling layer 4, giving a fourth feature of 112×112×64; the fourth feature is input to upsampling layer 5, giving a fifth feature of 224×224×32; and the fifth feature is input to the 1×1 convolution layer, yielding the single-channel monocular depth map.
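The feature sizes above can be reproduced with the following PyTorch sketch. It is only an approximation of the network described: the encoder is a placeholder stack of strided convolutions ending at 7×7×1024 (a real implementation would use a MobileNet backbone), the input channel count is left as a parameter (the text quotes a 3-channel example, while the claims feed the 4-channel RGBD map), and skip connections are added where the channel counts already match.

```python
import torch
import torch.nn as nn

class Up(nn.Module):
    """Upsampling block: 2x nearest-neighbour upsampling followed by a 5x5
    convolution that halves the channel count, as described above."""
    def __init__(self, in_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.conv = nn.Conv2d(in_ch, in_ch // 2, kernel_size=5, padding=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv(self.up(x)))

class MonoDepthNet(nn.Module):
    """Encoder-decoder sketch reproducing the feature sizes in the text.
    The encoder is a stand-in stack of strided 3x3 convolutions ending at
    7x7x1024; a real implementation would use a MobileNet backbone."""
    def __init__(self, in_ch=4):
        super().__init__()
        chs = [in_ch, 64, 128, 256, 512, 1024]          # 224->112->56->28->14->7
        self.encoder = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                          nn.ReLU(inplace=True))
            for c_in, c_out in zip(chs[:-1], chs[1:]))
        self.decoder = nn.ModuleList([Up(1024), Up(512), Up(256), Up(128), Up(64)])
        self.head = nn.Conv2d(32, 1, kernel_size=1)      # 1x1 conv -> single-channel depth

    def forward(self, x):
        skips = []
        for stage in self.encoder:
            x = stage(x)
            skips.append(x)
        for i, up in enumerate(self.decoder):
            x = up(x)
            skip = skips[len(skips) - 2 - i] if i < len(skips) - 1 else None
            if skip is not None and skip.shape[1] == x.shape[1]:
                x = x + skip                             # element-wise skip connection
        return self.head(x)

net = MonoDepthNet(in_ch=4)
out = net(torch.randn(1, 4, 224, 224))
print(out.shape)  # torch.Size([1, 1, 224, 224]) -- single-channel 224x224 depth prediction
```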
The principle by which the deep learning network obtains a depth map is completely different from that of the binocular matching method, so the monocular depth map and the binocular depth map are to some extent complementary. By contrast, if both the binocular and the monocular depth estimation relied on actively projected structured light with phase-unwrapping calibration, the two depth maps would show the same imaging characteristics on typical black, reflective or weak-texture objects, complementarity would be difficult to achieve, and the fused result could not be guaranteed.
S130, searching each outlier in the binocular depth map.
Outliers are how points that are missing from the point cloud of the binocular depth estimate appear in the binocular depth map. Optionally, if the depth value at a point in the binocular depth map is less than a set threshold, the point is an outlier. Assuming the threshold is set to T (e.g., T=1), each point in the binocular depth map D1 is traversed; if the depth value d1(x, y) at point (x, y) is smaller than T, the point (x, y) is considered an outlier or hole point.
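A minimal NumPy version of this outlier check might look as follows; the threshold value is an assumed parameter:

```python
import numpy as np

def find_outliers(d1, threshold=1.0):
    """Return the (row, col) coordinates and boolean mask of outliers / hole
    points, i.e. pixels of the binocular depth map below the threshold."""
    mask = d1 < threshold
    ys, xs = np.nonzero(mask)
    return list(zip(ys.tolist(), xs.tolist())), mask
```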
And S140, modifying the depth value of each outlier in the binocular depth map according to the monocular depth map to obtain a final depth map.
This step exploits the completeness of the monocular depth map and corrects the outliers in the binocular depth map according to the depth values in the monocular depth map. Depending on how the depth variation in the shooting scene is handled, there are the following three optional implementations:
in a first optional implementation manner, modifying the depth value of each outlier in the binocular depth map to be the depth value of each outlier at the corresponding position of the monocular depth map; and linearly filtering the modified binocular depth map to obtain a final depth map. The method is simple in calculation and operation, ignores the depth change condition in the shooting scene, and has high dependence on the precision of the monocular depth map. The higher the precision of the monocular depth map is, the better the fusion effect is.
Specifically, the coordinates of an outlier in the binocular depth map D1 are denoted (x, y), and the corresponding depth value d2(x, y) is looked up in the monocular depth map D2; the outlier depth value d1(x, y) in D1 is then modified to d2(x, y). This operation is repeated until the depth value of every outlier has been modified, giving the fused depth map D3. To avoid depth dislocation in D3 caused by outlier replacement, the whole image is then linearly filtered to obtain the final depth map D4. The linear filtering may be median filtering, computed as follows: for each depth value d3(x, y) in the D3 depth map, the median of the 9 depth values in the surrounding 3×3 range is taken as the new depth value d4(x, y), i.e. d4(x, y) = median[d3(x-1, y-1), d3(x-1, y), d3(x-1, y+1), d3(x, y-1), d3(x, y), d3(x, y+1), d3(x+1, y-1), d3(x+1, y), d3(x+1, y+1)].
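A compact sketch of this first scheme, assuming NumPy arrays D1 (binocular) and D2 (monocular) of the same shape and using SciPy's 3×3 median filter for the filtering step described above:

```python
import numpy as np
from scipy.ndimage import median_filter

def fuse_replace(d1, d2, threshold=1.0):
    """First scheme: replace outlier depth values of the binocular map d1 with
    the monocular map d2 at the same positions, then filter the result."""
    d3 = d1.copy()
    outliers = d1 < threshold          # outlier / hole mask
    d3[outliers] = d2[outliers]        # take monocular depth at those pixels
    d4 = median_filter(d3, size=3)     # 3x3 median (the filtering step in the text)
    return d4
```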
The second optional implementation is suitable for the situation in which the depth in the shooting scene changes smoothly: the depth value of each outlier in the binocular depth map is modified according to the linear relationship between the depth values of the outlier and its surrounding points in the monocular depth map and the depth values of those surrounding points in the binocular depth map; the modified binocular depth map is then linearly filtered to obtain the final depth map:
specifically, first, the operations of S21 to S23 are performed for each outlier, respectively:
s21, determining the linear relation between the outlier and the depth values of surrounding points in the monocular depth map. The linear relationship is used to characterize a relatively gradual change in depth values within the local area. The surrounding points refer to points around the outlier, and include 8 points within a range of 3' pixels around.
S22, substituting the depth values of the surrounding points in the binocular depth map into the linear relationship to obtain another depth value. Since the monocular depth map and the binocular depth map are depth estimates of the same shooting scene, the depth values of points in the same region of the binocular depth map D1 can be considered to satisfy the same linear relationship.
S23, modifying the depth value of the outlier in the binocular depth map into the other depth value.
For example, denote the coordinates of any outlier in the binocular depth map D1 as (x, y); the following operations are performed on this outlier:
s21, selecting two points (x ', y ') and (x ' ', y ' ') on any straight line passing through an outlier from 8 points in a range of 3 ' 3 around the outlier (x, y), wherein (x ', y ') and (x ' ', y ' ') are divided into (x-1, y-1) and (x+1, y+1), or (x-1, y) and (x+1, y), or (x, y-1) and (x, y+1), or (x-1, y+1) and (x+1, y-1); extracting depth values d2 (x, y), d2 (x ', y') and d2 (x ", y") of the outlier and the two points in the monocular depth map, determining a linear relationship between d2 (x, y), d2 (x ', y') and d2 (x ", y"). f (x, y) = [ d2 (x, y) -d2 (x ', y') ]/[ d2 (x ", y") -d2 (x ', y') ].
S22, the depth values d1(x, y), d1(x', y') and d1(x'', y'') in the binocular depth map D1 also satisfy the above linear relationship, i.e. f(x, y) = [d2(x, y) - d2(x', y')] / [d2(x'', y'') - d2(x', y')] = [d1(x, y) - d1(x', y')] / [d1(x'', y'') - d1(x', y')]. Substituting d1(x', y') and d1(x'', y'') into this formula and solving gives a new d1(x, y).
S23, modifying the depth value at (x, y) in the binocular depth map D1 to the new d1(x, y).
After S21-S23 have been executed for each outlier, the modified binocular depth map D3 is obtained; D3 is then linearly filtered to obtain the final depth map. The specific operation of the linear filtering is described in the first alternative embodiment and will not be repeated.
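A per-outlier sketch of this second scheme is given below; the horizontal neighbour pair is chosen arbitrarily among the four admissible line directions, NumPy d[row, col] indexing is used, and a small epsilon guards against division by zero (all of these are assumptions, not part of the patent):

```python
import numpy as np

def fix_outlier_linear(d1, d2, y, x, eps=1e-6):
    """Second scheme for one outlier: carry the linear relation observed in the
    monocular map d2 along one line through (x, y) over to the binocular map d1.
    Here the horizontal neighbour pair is used."""
    a2, b2, c2 = d2[y, x], d2[y, x - 1], d2[y, x + 1]   # outlier and two neighbours in d2
    f = (a2 - b2) / (c2 - b2 + eps)                     # f = [d2-d2']/[d2''-d2']
    b1, c1 = d1[y, x - 1], d1[y, x + 1]                 # same neighbours in d1
    d1[y, x] = b1 + f * (c1 - b1)                       # solve f = [d1-d1']/[d1''-d1'] for d1
    return d1[y, x]
```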
The third alternative embodiment is suitable for the situation in which the depth in the shooting scene changes sharply. In this case, the depth value of each outlier in the binocular depth map is modified according to the least squares method, so that the differences between the depth values of the outlier and its surrounding points in the modified binocular depth map are as close as possible to the corresponding differences in the monocular depth map; the modified binocular depth map is then linearly filtered to obtain the final depth map.
Specifically, first, the following operations are performed for each outlier, respectively:
s31, determining the difference between the depth values of the outlier and surrounding points in the monocular depth map. Since the depth change in the shooting scene is severe, the linear relation is not suitable for describing the depth value change characteristics in the local area. Therefore, the present embodiment directly uses the difference between the depth values between pixels to reflect the depth change of each pixel.
S32, modifying the depth value of the outlier in the binocular depth map according to the least squares method, so that the differences between the depth values of the outlier and its surrounding points in the modified binocular depth map are as close as possible to the corresponding differences in the monocular depth map. Since the monocular depth map and the binocular depth map are depth estimates of the same shooting scene, the depth differences between points within the same region of the binocular depth map D1 should be as close as possible to those in the monocular depth map, so that the rich depth variation within the region is restored.
For example, denote the coordinates of any outlier in the binocular depth map D1 as (x, y); the following operations are performed on this outlier:
s31, determining the difference d2 (x, y) -d2 (x-1, y-1), d2 (x, y) -d2 (x-1, y), d2 (x, y) -d2 (x-1, y+1), d2 (x, y) -d2 (x, y-1), d2 (x, y) -d2 (x, y+1), d2 (x, y) -d2 (x+1, y-1), d2 (x, y) -d2 (x+1, y), and d2 (x, y) -d2 (x+1, y+1).
S32, modifying the depth value of the outlier in the binocular depth map to d1'(x, y) according to the least squares method, so that the objective function L = [d1'(x,y) - d1(x-1,y-1) - d2(x,y) + d2(x-1,y-1)]² + [d1'(x,y) - d1(x-1,y) - d2(x,y) + d2(x-1,y)]² + [d1'(x,y) - d1(x-1,y+1) - d2(x,y) + d2(x-1,y+1)]² + [d1'(x,y) - d1(x,y-1) - d2(x,y) + d2(x,y-1)]² + [d1'(x,y) - d1(x,y+1) - d2(x,y) + d2(x,y+1)]² + [d1'(x,y) - d1(x+1,y-1) - d2(x,y) + d2(x+1,y-1)]² + [d1'(x,y) - d1(x+1,y) - d2(x,y) + d2(x+1,y)]² + [d1'(x,y) - d1(x+1,y+1) - d2(x,y) + d2(x+1,y+1)]² is minimized. Specifically, L is a quadratic function of the independent variable d1'(x, y), so the partial derivative of L with respect to d1'(x, y) is a linear function of d1'(x, y). Setting this partial derivative to 0 yields the value of d1'(x, y) that minimizes the objective function L.
After S31-S32 have been executed for each outlier, the modified binocular depth map D3 is obtained; D3 is then linearly filtered to obtain the final depth map. The specific operation of the linear filtering is described in the first alternative embodiment and will not be repeated.
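Because the objective function L is quadratic in d1'(x, y), setting its derivative to zero gives a closed-form update: d1'(x, y) equals the mean of the eight neighbouring binocular depths plus the difference between d2(x, y) and the mean of the eight neighbouring monocular depths. A sketch of that update, assuming NumPy indexing and an interior pixel:

```python
import numpy as np

def fix_outlier_least_squares(d1, d2, y, x):
    """Closed-form minimiser of the 8-neighbour least-squares objective:
    d1'(x, y) = mean(d1 neighbours) + d2(x, y) - mean(d2 neighbours)."""
    offs = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    n1 = np.mean([d1[y + dy, x + dx] for dy, dx in offs])   # binocular neighbourhood mean
    n2 = np.mean([d2[y + dy, x + dx] for dy, dx in offs])   # monocular neighbourhood mean
    d1[y, x] = n1 + d2[y, x] - n2
    return d1[y, x]
```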
It should be noted that the third embodiment is also applicable when the depth variation of the shooting scene is not considered and when the depth variation is smooth, but the first embodiment is simpler to compute when the depth variation is not considered, and the second embodiment is simpler to compute in the smooth case. In addition, the depth variation of the shooting scene can be assessed by analysing the depth values of the points in the monocular depth map and/or the binocular depth map; whether the depth variation is smooth or sharp can be judged by data-analysis means such as scatter-point distributions and box plots.
In this embodiment, a binocular depth map is obtained by a structured-light binocular matching algorithm, and a monocular depth map is then obtained from the RGBD image, so that the binocular depth map and the monocular depth map are to some extent complementary; the depth values of the outliers in the binocular depth map are then corrected according to the depth values in the monocular depth map, compensating the errors of the binocular depth map in weak-texture and repetitive regions. The method combines the advantages of the accurate binocular point cloud and the complete monocular point cloud, achieves rapid fusion of the binocular and monocular depth maps, alleviates the point-cloud loss of binocular depth data, and improves the quality and precision of the depth map.
In particular, the monocular depth map is obtained directly from the RGBD map by deep learning, so the RGB image features such as texture, gradient and material can be fully exploited to learn complete depth information, and the binocular depth channel D can serve as an initial value for the deep learning, greatly improving learning speed and reliability. The method requires no complex phase-unwrapping or calibration steps, and the resulting monocular and binocular depth maps are complementary in information, so a better depth estimation effect can be achieved. Moreover, the monocular-binocular fusion algorithm operates pixel by pixel on the depth map, without a complex point-cloud computation, so it executes faster.
Furthermore, the application provides three alternative embodiments for modifying the binocular depth map according to the monocular depth map starting from the situation of depth change in the photographed scene. The first alternative embodiment directly modifies the depth value of an outlier in the binocular depth map to a depth value in the monocular depth map at that point, irrespective of the depth variations in the scene. The second optional implementation manner is suitable for the condition that depth transformation in a shooting scene is relatively gentle, and the depth value of the outlier in the binocular depth map is modified according to the linear relation of the depth values of the local areas of the outlier in the monocular depth map. The third alternative embodiment is suitable for the situation that the depth change in the shooting scene is relatively severe, and the depth values of the outliers in the binocular depth map are fitted according to the difference of the depth values of the local areas of the outliers in the monocular depth map. The three modes are respectively stressed and can be flexibly selected according to actual requirements.
Based on the above embodiments, the method was verified on an Intel RealSense camera. During verification, RGB images and structured-light binocular depth images were obtained through the RealSense SDK, the monocular and binocular depth maps were fused using the first optional implementation, and the quality and accuracy of the binocular depth map, the monocular depth map and the fused depth map were compared. As shown in fig. 6 and Table 1, compared with the conventional depth estimation result, the method provided by this application effectively improves the quality and precision of the depth point cloud.
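For reference, a typical way to grab an aligned RGB image and structured-light depth map with the pyrealsense2 Python bindings is sketched below; the stream resolutions, formats and frame rate are assumptions, and this snippet is not part of the patent:

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)    # align the depth frame to the colour frame

try:
    frames = align.process(pipeline.wait_for_frames())
    depth = np.asanyarray(frames.get_depth_frame().get_data())   # binocular depth map D1
    color = np.asanyarray(frames.get_color_frame().get_data())   # RGB image
finally:
    pipeline.stop()
```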
TABLE 1 depth point cloud quality and precision comparison Table
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 7, the electronic device includes a processor 50, a memory 51, an input device 52 and an output device 53; the number of processors 50 in the electronic device may be one or more, one processor 50 being taken as an example in fig. 7; the processor 50, the memory 51, the input means 52 and the output means 53 in the electronic device may be connected by a bus or other means, in fig. 7 by way of example.
The memory 51 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the machine vision-based image processing method in the embodiment of the present invention. The processor 50 executes various functional applications of the apparatus and data processing, i.e., implements the above-described machine vision-based image processing method, by running software programs, instructions, and modules stored in the memory 51.
The memory 51 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the terminal, etc. In addition, memory 51 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 51 may further include memory located remotely from processor 50, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input means 52 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output means 53 may comprise a display device such as a display screen.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the machine vision based image processing method of any of the embodiments.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing. Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the essence of the corresponding technical solutions from the technical solutions of the embodiments of the present invention.

Claims (8)

1. A machine vision-based image processing method, comprising:
acquiring an RGB image and a binocular depth image of a shooting scene to form an RGBD image;
obtaining a monocular depth map of the shooting scene through the RGBD map by using a deep learning method; specifically, a deep learning network based on a MobileNet structure is constructed, wherein the deep learning network comprises an encoder layer and a decoder layer, the encoder layer adopts the MobileNet structure, the decoder layer comprises five upsampling layers and one convolution layer, each upsampling layer is also used for reducing the number of channels, and the last convolution layer is used for outputting a final depth map; a jump connection is arranged between the encoder layer and the decoder layer, and the two feature maps joined by the jump connection are added; inputting the RGBD image into the deep learning network, and learning a monocular depth map of the shooting scene;
searching each outlier in the binocular depth map;
according to the monocular depth map, modifying the depth value of each outlier in the binocular depth map to obtain a final depth map; specifically, modifying the depth value of each outlier in the binocular depth map to be the depth value of each outlier at the corresponding position of the monocular depth map; and linearly filtering the modified binocular depth map to obtain a final depth map.
2. The method of claim 1, wherein the finding of outliers in the binocular depth map comprises:
if the depth value at a point in the binocular depth map is less than a set threshold, the point is an outlier.
3. The method according to claim 1, wherein the "modifying the depth value of each outlier in the binocular depth map to the depth value of each outlier at the corresponding position of the monocular depth map" is replaced by:
under the condition that the depth change in the shooting scene is smooth, modifying the depth value of each outlier in the binocular depth map according to the linear relation between the depth values of each outlier and surrounding points in the monocular depth map and the depth values of the surrounding points in the binocular depth map.
4. A method according to claim 3, wherein said modifying the depth value of the outlier in the binocular depth map based on the linear relationship of the depth value of each outlier and surrounding points in the monocular depth map and the depth value of the surrounding points in the binocular depth map comprises:
selecting, from the 8 points in the 3×3 neighbourhood of any outlier (x, y), two points (x', y') and (x'', y'') lying on any straight line through the outlier, wherein (x', y') and (x'', y'') are respectively (x-1, y-1) and (x+1, y+1), or (x-1, y) and (x+1, y), or (x, y-1) and (x, y+1), or (x-1, y+1) and (x+1, y-1);
extracting the depth values d2(x, y), d2(x', y') and d2(x'', y'') of the outlier and the two points in the monocular depth map, and determining the linear relationship between them: f(x, y) = [d2(x, y) - d2(x', y')] / [d2(x'', y'') - d2(x', y')];
substituting d1(x', y') and d1(x'', y'') into f(x, y) = [d1(x, y) - d1(x', y')] / [d1(x'', y'') - d1(x', y')] to obtain a new d1(x, y);
in the binocular depth map, the depth value at (x, y) is modified to the new depth value d1 (x, y).
5. The method according to claim 1, wherein the "modifying the depth value of each outlier in the binocular depth map to the depth value of each outlier at the corresponding position of the monocular depth map" is replaced by:
under the condition that depth changes are severe in a shooting scene, modifying the depth value of each outlier in the binocular depth map according to a least square method, so that the difference between the depth value of each outlier and each surrounding point in the modified binocular depth map and the difference between the depth values in the monocular depth map are minimized.
6. The method of claim 5, wherein modifying the depth values of the outliers in the binocular depth map according to the least squares method to minimize differences between the depth values of the outliers and surrounding points in the modified binocular depth map and the depth values in the monocular depth map comprises:
determining, in the monocular depth map, the differences d2(x, y) - d2(x-1, y-1), d2(x, y) - d2(x-1, y), d2(x, y) - d2(x-1, y+1), d2(x, y) - d2(x, y-1), d2(x, y) - d2(x, y+1), d2(x, y) - d2(x+1, y-1), d2(x, y) - d2(x+1, y) and d2(x, y) - d2(x+1, y+1);
modifying the depth value of the outlier in the binocular depth map to d1'(x, y) according to the least squares method, so that the objective function L = [d1'(x,y) - d1(x-1,y-1) - d2(x,y) + d2(x-1,y-1)]² + [d1'(x,y) - d1(x-1,y) - d2(x,y) + d2(x-1,y)]² + [d1'(x,y) - d1(x-1,y+1) - d2(x,y) + d2(x-1,y+1)]² + [d1'(x,y) - d1(x,y-1) - d2(x,y) + d2(x,y-1)]² + [d1'(x,y) - d1(x,y+1) - d2(x,y) + d2(x,y+1)]² + [d1'(x,y) - d1(x+1,y-1) - d2(x,y) + d2(x+1,y-1)]² + [d1'(x,y) - d1(x+1,y) - d2(x,y) + d2(x+1,y)]² + [d1'(x,y) - d1(x+1,y+1) - d2(x,y) + d2(x+1,y+1)]² is minimized.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the machine vision based image processing method of any of claims 1-6.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the machine vision based image processing method according to any one of claims 1-6.
CN202310066520.1A 2023-02-06 2023-02-06 Image processing method based on machine vision Active CN115861145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310066520.1A CN115861145B (en) 2023-02-06 2023-02-06 Image processing method based on machine vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310066520.1A CN115861145B (en) 2023-02-06 2023-02-06 Image processing method based on machine vision

Publications (2)

Publication Number Publication Date
CN115861145A CN115861145A (en) 2023-03-28
CN115861145B true CN115861145B (en) 2023-05-09

Family

ID=85657625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310066520.1A Active CN115861145B (en) 2023-02-06 2023-02-06 Image processing method based on machine vision

Country Status (1)

Country Link
CN (1) CN115861145B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117560480A (en) * 2024-01-09 2024-02-13 荣耀终端有限公司 Image depth estimation method and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927279A (en) * 2021-02-24 2021-06-08 中国科学院微电子研究所 Image depth information generation method, device and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11314262B2 (en) * 2016-08-29 2022-04-26 Trifo, Inc. Autonomous platform guidance systems with task planning and obstacle avoidance
CN114255279A (en) * 2020-09-19 2022-03-29 重庆一极科技有限公司 Binocular vision three-dimensional reconstruction method based on high-precision positioning and deep learning
CN112543317B (en) * 2020-12-03 2022-07-12 东南大学 Method for converting high-resolution monocular 2D video into binocular 3D video

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112927279A (en) * 2021-02-24 2021-06-08 中国科学院微电子研究所 Image depth information generation method, device and storage medium

Also Published As

Publication number Publication date
CN115861145A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
US11410323B2 (en) Method for training convolutional neural network to reconstruct an image and system for depth map generation from an image
US10818029B2 (en) Multi-directional structured image array capture on a 2D graph
US10846913B2 (en) System and method for infinite synthetic image generation from multi-directional structured image array
US10540773B2 (en) System and method for infinite smoothing of image sequences
Lei et al. Depth map super-resolution considering view synthesis quality
US10789765B2 (en) Three-dimensional reconstruction method
US8355565B1 (en) Producing high quality depth maps
JP2011511532A (en) Method and system for converting 2D image data into stereoscopic image data
US9214025B2 (en) Depth estimation using normalized displacement of image pairs
US8994722B2 (en) Method for enhancing depth images of scenes using trellis structures
CN115861145B (en) Image processing method based on machine vision
Bleyer et al. Temporally consistent disparity maps from uncalibrated stereo videos
CN115035235A (en) Three-dimensional reconstruction method and device
US8340399B2 (en) Method for determining a depth map from images, device for determining a depth map
CN113643342A (en) Image processing method and device, electronic equipment and storage medium
US20120206442A1 (en) Method for Generating Virtual Images of Scenes Using Trellis Structures
CN116801115A (en) Sparse array camera deployment method
GB2585197A (en) Method and system for obtaining depth data
CN107547798A (en) Shooting focusing control method, device and terminal device
Zhou et al. New eye contact correction using radial basis function for wide baseline videoconference system
CN116258759B (en) Stereo matching method, device and equipment
KR101550665B1 (en) Methods and Systems of Optimized Hierarchical Block Matching, Methods of Image Registration and Video Compression Based on Optimized Hierarchical Block Matching
JP4219726B2 (en) Three-dimensional shape measuring method, three-dimensional shape measuring apparatus, program, and recording medium
JP2015005200A (en) Information processing apparatus, information processing system, information processing method, program, and memory medium
KR20230021904A (en) Apparatus, method and computer program for generating 3d model using 2d image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant