CN112862877B - Method and apparatus for training an image processing network and image processing - Google Patents


Info

Publication number
CN112862877B
Authority
CN
China
Prior art keywords
binocular
image
sample
depth
monocular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110381515.0A
Other languages
Chinese (zh)
Other versions
CN112862877A (en)
Inventor
叶晓青
谭啸
孙昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110381515.0A
Publication of CN112862877A
Application granted
Publication of CN112862877B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present disclosure provides methods, apparatus, electronic devices, storage media, and computer program products for training an image processing network and for image processing, relating to the field of artificial intelligence, in particular to computer vision and deep learning techniques. A specific implementation scheme is as follows: input the left and right images into a binocular deep learning network and output a first disparity map; convert the first disparity map into a first binocular predicted depth map; calculate the reliable region of the first binocular predicted depth map; input the left or right image of the selected first sample into a monocular depth estimation network to obtain a first monocular predicted depth map; taking the depth values of the reliable region as pseudo-supervision information for the monocular depth estimation network, calculate a first loss value over the region of the first monocular predicted depth map corresponding to the reliable region; and if the first loss value is less than a predetermined first threshold, training of the monocular depth estimation network is complete. This embodiment reduces the amount of manual labeling and improves the accuracy of both monocular and binocular depth estimation.

Description

Method and apparatus for training an image processing network and image processing
Technical Field
The present disclosure relates to the field of artificial intelligence, more particularly to the field of computer vision and deep learning techniques, and in particular to methods and apparatus for training an image processing network and for image processing.
Background
Currently, depth estimation can be divided by sensor into monocular depth estimation and binocular depth estimation. According to whether supervision is available, each can be further subdivided: monocular methods into monocular supervised and monocular unsupervised depth estimation, where monocular unsupervised methods typically require additional information such as pose or optical flow from adjacent frames of a video sequence; binocular methods likewise into binocular supervised and binocular unsupervised depth estimation.
Supervised methods incur high acquisition and labeling costs, such as using a lidar outdoors or a structured-light/ToF camera indoors, and additionally require calibration and registration between the lidar/depth camera and the RGB camera.
Monocular unsupervised approaches typically require the assistance of additional information, such as pose or optical flow between preceding and following frames of a video sequence.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, storage medium and computer program product for training an image processing network and image processing.
According to a first aspect of the present disclosure, there is provided a method for training an image processing network, comprising: acquiring a binocular deep learning network trained according to a preset loss function, and a sample set, wherein samples in the sample set comprise corrected left and right images; selecting a first sample from the sample set and performing the following first training step: inputting the left and right images of the selected first sample into the binocular deep learning network and outputting a first disparity map; converting the first disparity map into a first binocular predicted depth map; calculating a reliable region of the first binocular predicted depth map; inputting the left or right image of the selected first sample into a monocular depth estimation network to obtain a first monocular predicted depth map; taking the depth values of the reliable region as pseudo-supervision information for the monocular depth estimation network, calculating a first loss value over the region of the first monocular predicted depth map corresponding to the reliable region; and, if the first loss value is less than a predetermined first threshold, determining that training of the monocular depth estimation network is complete.
According to a second aspect of the present disclosure, there is provided an image processing method, comprising: acquiring an image to be identified; if the image is a pair of corrected left and right images, inputting it into a binocular depth estimation network trained according to the method of any one of the first aspect to obtain a binocular depth estimate; and, if the image is a single image, inputting it into a monocular depth estimation network trained according to the method of any one of the first aspect to obtain a monocular depth estimate.
According to a third aspect of the present disclosure, there is provided an apparatus for training an image processing network, comprising: an acquisition unit configured to acquire a binocular deep learning network trained according to a preset loss function, and a sample set, wherein samples in the sample set comprise corrected left and right images; and a first training unit configured to select a first sample from the sample set and perform the following first training step: inputting the left and right images of the selected first sample into the binocular deep learning network and outputting a first disparity map; converting the first disparity map into a first binocular predicted depth map; calculating a reliable region of the first binocular predicted depth map; inputting the left or right image of the selected first sample into a monocular depth estimation network to obtain a first monocular predicted depth map; taking the depth values of the reliable region as pseudo-supervision information for the monocular depth estimation network, calculating a first loss value over the region of the first monocular predicted depth map corresponding to the reliable region; and, if the first loss value is less than a predetermined first threshold, determining that training of the monocular depth estimation network is complete.
According to a fourth aspect of the present disclosure, there is provided an image processing apparatus, comprising: an acquisition unit configured to acquire an image to be identified; a first estimation unit configured to, if the image is a pair of corrected left and right images, input it into a binocular depth estimation network trained by the apparatus of any one of the third aspect to obtain a binocular depth estimate; and a second estimation unit configured to, if the image is a single image, input it into a monocular depth estimation network trained by the apparatus of any one of the third aspect to obtain a monocular depth estimate.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for training an image processing network according to any one of the first aspect or the image processing method according to the second aspect.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for training an image processing network according to any one of the first aspects or the image processing method according to the second aspect.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first and second aspects.
The method and apparatus for training an image processing network provided by embodiments of the present disclosure can generate a monocular depth estimation network and a binocular depth estimation network in an unsupervised manner, without resorting to additional information, which reduces the workload of manual annotation and improves the accuracy of the monocular depth estimation network.
The image processing method and apparatus provided by embodiments of the present disclosure can select either a monocular or a binocular depth estimation network for depth estimation in different scenarios, improving the accuracy of depth estimation.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which an embodiment of the present disclosure may be applied;
FIG. 2 is a flow chart of a first embodiment of a method for training an image processing network according to the present disclosure;
FIG. 3 is a flow chart of a second embodiment of a method for training an image processing network according to the present disclosure;
FIG. 4 is a flow chart of a third embodiment of a method for training an image processing network according to the present disclosure;
FIG. 5 is a schematic illustration of one application scenario of a method for training an image processing network according to the present disclosure;
FIG. 6 is a flow chart of one embodiment of an image processing method according to the present disclosure;
FIG. 7 is a schematic structural diagram of one embodiment of an apparatus for training an image processing network according to the present disclosure;
FIG. 8 is a schematic structural view of one embodiment of an image processing apparatus according to the present disclosure;
fig. 9 is a block diagram of an electronic device for implementing a method for training an image processing network in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 shows an exemplary system architecture 100 for a method for training an image processing network, an apparatus for training an image processing network, an image processing method or an image processing apparatus, to which embodiments of the application may be applied.
As shown in fig. 1, the system architecture 100 may include terminals 101, 102, a network 103, a database server 104, and a server 105. The network 103 serves as a medium for providing a communication link between the terminals 101, 102, the database server 104 and the server 105. The network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user 110 may interact with the server 105 via the network 103 using the terminals 101, 102 to receive or send messages or the like. The terminals 101, 102 may have various client applications installed thereon, such as model training class applications, depth detection class applications, shopping class applications, payment class applications, web browsers, instant messaging tools, and the like.
The terminals 101 and 102 may be hardware or software. When the terminals 101, 102 are hardware, they may be various electronic devices with display screens, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), laptop computers, desktop computers, and the like. When the terminals 101, 102 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
When the terminals 101, 102 are hardware, an image acquisition device may also be mounted thereon. The image capturing device may be various devices capable of implementing a function of capturing an image, such as a camera (may be a monocular camera or a binocular camera), a sensor, and the like. The user 110 may utilize an image acquisition device on the terminal 101, 102 to acquire a single image with a monocular camera or to acquire left and right images with a binocular camera.
Database server 104 may be a database server that provides various services. For example, a database server may have stored therein a sample set. The sample set contains a large number of samples. The sample may include corrected left and right images, and may not be labeled with information. Thus, the user 110 may also select samples from the sample set stored by the database server 104 for unsupervised training via the terminals 101, 102.
The server 105 may also be a server providing various services, such as a background server providing support for various applications displayed on the terminals 101, 102. The background server may train the initial model using samples in the sample set sent by the terminals 101, 102 and may send training results (e.g., the generated monocular depth estimation network and/or binocular depth estimation network) to the terminals 101, 102. In this way, the user may apply the generated monocular depth estimation network and binocular depth estimation network for depth estimation.
The database server 104 and the server 105 may be hardware or software. When they are hardware, they may be implemented as a distributed server cluster composed of a plurality of servers, or as a single server. When they are software, they may be implemented as a plurality of software or software modules (e.g., to provide distributed services), or as a single software or software module. The present invention is not particularly limited herein.
It should be noted that the method for training an image processing network or the image processing method provided by the embodiment of the present application is generally performed by the server 105. Accordingly, a device for training an image processing network or an image processing device is also generally provided in the server 105.
It should be noted that the database server 104 may not be provided in the system architecture 100 in cases where the server 105 may implement the relevant functions of the database server 104.
It should be understood that the number of terminals, networks, database servers, and servers in fig. 1 are merely illustrative. There may be any number of terminals, networks, database servers, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for training an image processing network in accordance with the present application is shown. The method for training an image processing network may comprise the steps of:
Step 201, a binocular deep learning network trained according to a preset loss function and a sample set are obtained.
In the present embodiment, an execution subject of a method for training an image processing network (e.g., the server 105 shown in fig. 1) can acquire a sample set and a binocular deep learning network in various ways. For example, the executing entity may obtain the existing sample set and the trained binocular deep learning network stored therein from a database server (e.g., database server 104 shown in fig. 1) through a wired connection or a wireless connection. As another example, a user may collect a sample through a terminal (e.g., terminals 101, 102 shown in fig. 1). In this way, the executing body may receive samples collected by the terminal and store the samples locally, thereby generating a sample set.
A binocular deep learning network is a neural network for generating disparity maps from left and right maps. The binocular deep learning network may be obtained through local training or may be obtained from a third party server.
Here, the sample set may include at least one sample, where a sample may include corrected left and right images. The present method performs unsupervised training without labeling samples, obtaining a monocular depth estimation network and a binocular depth estimation network. The left and right images are the left- and right-viewpoint images of the same scene captured by a binocular camera. The correction includes distortion correction and stereo rectification; correction methods are prior art and are therefore not described in detail.
In this embodiment, the left and right images may be color images (e.g., RGB (red, green, blue) photographs) and/or grayscale images. The format of the image is not limited in the present disclosure, as long as it can be read and identified by the execution subject, e.g., JPG (Joint Photographic Experts Group), BMP (Bitmap), or RAW (raw image format).
Step 202, a first sample is selected from a set of samples.
In this embodiment, the execution subject may select a sample from the sample set acquired in step 201 and execute the training steps of steps 203 to 209. The manner and number of samples selected are not limited in the present application. For example, at least one sample may be selected at random, or samples whose images have better definition (i.e., higher resolution) may be selected. Because three neural networks are to be trained from the same shared sample set, and the samples drawn during the training of each network are not necessarily the same, the samples used to train the monocular depth estimation network are named first samples for distinction, the samples used to train the binocular depth estimation network are named second samples, and the samples used to train the binocular deep learning network are named third samples.
Step 203, inputting the left and right images of the selected first sample into the binocular deep learning network, and outputting the first disparity map.
In this embodiment, after the stereo-rectified left and right images are obtained, matching points lie on the same row, and the disparity map may be calculated using the binocular deep learning network (classical counterparts are the BM and SGBM algorithms in OpenCV). Given the input corrected left and right images I_L and I_R, a predicted disparity map d_pred is output.
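For concreteness, a minimal sketch of classical disparity computation on a rectified pair follows, using the OpenCV SGBM matcher mentioned above. It stands in for, and is not, the patent's binocular deep learning network; the file names and matcher parameters are assumptions for illustration.

```python
import cv2
import numpy as np

# Illustrative only: classical semi-global block matching (SGBM) on a
# rectified stereo pair; the patent's binocular deep learning network
# would replace this step. File names and parameters are placeholders.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,   # search range; must be divisible by 16
    blockSize=5,
    P1=8 * 5 * 5,         # penalty for small disparity changes
    P2=32 * 5 * 5,        # penalty for large disparity changes
)
# compute() returns a fixed-point disparity map scaled by 16
d_pred = sgbm.compute(left, right).astype(np.float32) / 16.0
```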
Step 204, converting the first disparity map into a first binocular predicted depth map.
In this embodiment, let B be the baseline of the left and right cameras (the distance between their optical centers), f the focal length of the cameras, and d the disparity value; then for any d there is a corresponding depth value:

Depth = B · f / d
Disparity values are measured in pixels, while depth values are often expressed in millimeters (mm). The depth values of all pixels constitute the depth map.
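A minimal sketch of this conversion (step 204) follows; the baseline and focal length are assumed to come from camera calibration, and the handling of invalid disparities is an assumption.

```python
import numpy as np

def disparity_to_depth(disparity, baseline_mm, focal_px):
    """Convert a disparity map (pixels) to a depth map via Depth = B * f / d.
    baseline_mm and focal_px come from calibration (assumed available);
    non-positive disparities are treated as invalid and mapped to 0."""
    depth = np.zeros_like(disparity, dtype=np.float32)
    valid = disparity > 0
    depth[valid] = baseline_mm * focal_px / disparity[valid]
    return depth
```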
In step 205, a reliable region of the first binocular predicted depth map is calculated.
In this embodiment, the unreliable-disparity region between the left and right views can be calculated based on a left-right consistency check:

mask = |I_R(u + d_pred(u,v), v) - I_L(u,v)| > T

where u, v are the coordinates of a pixel and T is a hyperparameter threshold, which may be set to 2. If this difference exceeds T for a pixel, the mask of that pixel is 1 and the pixel belongs to the unreliable region (possibly caused by occlusion, mismatching, and the like). The mask takes the value 0 or 1: 0 indicates the pixel is reliable, 1 that it is unreliable.
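The check can be sketched as follows for grayscale images; rounding the shifted coordinate and clipping at the image border are assumptions not specified in the text.

```python
import numpy as np

def unreliable_mask(left, right, d_pred, t=2.0):
    """Left-right consistency check of step 205: mask is 1 where
    |I_R(u + d_pred(u,v), v) - I_L(u,v)| > T (unreliable), else 0.
    left/right are grayscale arrays of shape (h, w)."""
    h, w = left.shape
    v = np.arange(h)[:, None].repeat(w, axis=1)
    u = np.arange(w)[None, :].repeat(h, axis=0)
    # sample the right image at the disparity-shifted column (rounded, clipped)
    u_shift = np.clip(np.round(u + d_pred).astype(np.int64), 0, w - 1)
    diff = np.abs(right[v, u_shift].astype(np.float32) - left.astype(np.float32))
    return (diff > t).astype(np.float32)
```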
Step 206, inputting the left or right image of the selected first sample into the monocular depth estimation network to obtain a first monocular predicted depth image.
In this embodiment, the monocular depth estimation network is at this point an initial neural network for converting a single image into a depth map. The left or right image of the first sample is input into the monocular depth estimation network to obtain a first monocular predicted depth map, denoted D_mono. At this stage the accuracy of the first monocular predicted depth map is insufficient, and training of the monocular depth estimation network needs to be supervised by the depth map derived from the binocular deep learning network.
Step 207, using the depth value of the reliable region as pseudo-supervision information of the monocular depth estimation network, calculating a first loss value of the depth value of the region corresponding to the reliable region in the first monocular predicted depth map.
In this embodiment, the depth map Depth obtained in step 204 and the mask obtained in step 205 are used as pseudo-supervision information for the monocular depth estimation network, whose predicted monocular depth map is denoted D_mono. The loss function is designed as follows:

L_mono = (1 - mask) · |D_mono - Depth|

Here, 1 - mask selects the reliable region, because 1 - mask = 0 in the unreliable region, i.e., the loss there is 0, while the loss in the reliable region is |D_mono - Depth|, which characterizes the difference between the actual output D_mono of the monocular depth estimation network and the desired output Depth. To distinguish between different loss functions, the loss value of the monocular depth estimation network is named the first loss value.
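A sketch of this loss in PyTorch follows; reducing by the mean over reliable pixels is an assumption, since the text leaves the reduction unspecified.

```python
import torch

def first_loss(d_mono, depth_stereo, mask):
    """First loss value (step 207): the binocular depth map supervises the
    monocular prediction only where mask == 0 (reliable, so 1 - mask == 1).
    Mean L1 over reliable pixels; the reduction is an assumption."""
    reliable = 1.0 - mask
    l1 = torch.abs(d_mono - depth_stereo) * reliable
    return l1.sum() / reliable.sum().clamp(min=1.0)
```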
If the first loss value is less than the predetermined first threshold, it is determined that the monocular depth estimation network training is complete, step 208.
In this embodiment, the aim of training is to make the first loss value as small as possible, but to avoid endless iteration a termination condition may be set: the first loss value falls below a predetermined first threshold. Reaching this termination condition indicates that training of the monocular depth estimation network is complete, and the network can then be used to predict the depth of a single image.
Step 209, if the first loss value is not less than the predetermined first threshold, the relevant parameters of the monocular depth estimation network are adjusted, and steps 202-209 are continued.
In this embodiment, if the first loss value is not less than the predetermined first threshold, the loss still needs to be reduced; the relevant parameters of the monocular depth estimation network may be adjusted, for example by modifying the weights of its convolutional layers using back propagation, and steps 202-209 are continued. Training ends when the first loss value falls below the predetermined first threshold.
According to the method provided by this embodiment of the present disclosure, a disparity map is obtained through the binocular deep learning network and converted into a depth map; the depth values of the reliable region of this depth map are then used as the desired output of the monocular depth estimation network, and the loss value is calculated by comparing the network's actual output with this desired output. The parameters of the monocular depth estimation network are then adjusted according to the loss value. In this way, a highly accurate monocular depth estimation network can be trained without manual labeling.
With further reference to fig. 3, a flow 300 of yet another embodiment of a method for training an image processing network is shown. The flow 300 of the method for training an image processing network comprises the steps of:
Step 301, selecting a second sample from the set of samples.
In this embodiment, an execution subject of the method for training an image processing network (e.g., the server 105 shown in fig. 1) may select a second sample from the sample set obtained in step 201. The second sample selected may or may not be used in steps 202-209. To distinguish between the two training processes, the sample training the binocular depth estimation network is named the second sample. For the samples used in steps 202-209, some information may be retained, e.g. disparity maps, depth maps, unreliable regions, etc., so that no repetition of the process is required.
Step 302, a second binocular predicted depth map of the selected second sample is obtained and an unreliable region of the second binocular predicted depth map is determined.
In this embodiment, a second binocular predicted depth map for the second sample may be obtained according to the methods of steps 203-204. Unreliable regions of the second binocular predicted depth map may be obtained according to the method of step 205.
In some optional implementations of this embodiment, if the second sample belongs to the first samples, the first binocular predicted depth map is determined to be the second binocular predicted depth map, the reliable region of the first binocular predicted depth map is determined to be the reliable region of the second binocular predicted depth map, and the region of the second binocular predicted depth map other than the reliable region is determined to be its unreliable region. That is, if the second sample is a first sample already used in step 202, the depth map and unreliable region need not be recomputed; those of the first sample can be used directly.
In some optional implementations of this embodiment, if the second sample does not belong to the first sample, the left and right images of the selected second sample are input to the binocular deep learning network, and the second disparity map is output. The second disparity map is converted into a second binocular predicted depth map. A reliable region of the second binocular predicted depth map is calculated. An area of the second binocular predicted depth map other than the reliable area is determined as an unreliable area of the second binocular predicted depth map. The judgment process is added to avoid unnecessary repeated operation and improve the training speed.
Step 303, inputting the left or right image of the selected second sample into the monocular depth estimation network after training, to obtain a second monocular predicted depth image.
In this embodiment, the left or right image of the second sample is input into the trained monocular depth estimation network to obtain a second monocular predicted depth map, denoted D_mono. The name "second monocular predicted depth map" distinguishes it from the monocular predicted depth map in the first training step. Since flow 200 has already trained the monocular depth estimation network, its output is accurate and can serve as the desired output of the binocular depth estimation network.
And step 304, taking the depth value of the unreliable region as pseudo supervision information of the binocular depth estimation network, and calculating a binocular loss value of the depth value of the region corresponding to the unreliable region in the second binocular prediction depth map.
In this embodiment, when calculating the loss value of the binocular depth estimation network, since regions of the left and right images obtained by the binocular camera may be unreliable due to occlusion, the present disclosure focuses on the loss of the depth values in the unreliable region, named the binocular loss value L_stereo, as shown in the following formula:

L_stereo = mask · |D_stereo - D_mono|

where D_stereo is the second binocular predicted depth map, D_mono is the second monocular predicted depth map, and mask marks the unreliable region (mask = 1).
in step 305, a base loss value is calculated from the preset loss function.
In this embodiment, the loss values during training of the binocular depth estimation network may include a base loss value, such as a photometric reconstruction error loss value L photo, in addition to the loss values of the depth values of the unreliable region. The process of calculating the base loss value may refer to flow 400.
Step 306, determining the sum of the binocular loss value and the base loss value as a second loss value.
In this embodiment, the sum of the binocular loss value and the base loss value is the overall loss value of the binocular depth estimation network, as shown in the following formula; to distinguish it from the loss value of the monocular depth estimation network, it is named the second loss value:

L = L_stereo + L_photo
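A sketch of the second loss value follows, mirroring the first loss but masked to the unreliable region; as before, the mean reduction is an assumption.

```python
import torch

def second_loss(d_stereo, d_mono, mask, l_photo):
    """Second loss value (steps 304-306): the trained monocular prediction
    supervises the binocular prediction only where mask == 1 (unreliable),
    and the photometric base loss L_photo is added on top."""
    l_stereo = (torch.abs(d_stereo - d_mono) * mask).sum() / mask.sum().clamp(min=1.0)
    return l_stereo + l_photo
```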
If the second loss value is less than the predetermined second threshold, it is determined that the binocular depth estimation network training is completed in step 307.
In this embodiment, the aim of training is to make the second loss value as small as possible, but to avoid endless iteration a termination condition may be set: the second loss value falls below a predetermined second threshold. Reaching this termination condition indicates that training of the binocular depth estimation network is complete, and the network can then be used to predict depth from a pair of left and right images.
Step 308: if the second loss value is not less than the predetermined second threshold, adjust the relevant parameters of the binocular depth estimation network and continue with steps 301-308.
In this embodiment, if the second loss value is not less than the predetermined second threshold, the loss still needs to be reduced; the relevant parameters of the binocular depth estimation network may be adjusted, for example by modifying the weights of its convolutional layers using back propagation, and steps 301-308 are continued. Training ends when the second loss value falls below the predetermined second threshold.
The method provided by the above embodiments of the present disclosure takes the depth values estimated by the monocular depth estimation network, for the regions that are unreliable in binocular estimation, as the desired output of the binocular depth estimation network. The loss value is calculated by comparing the difference between the actual output and the desired output of the binocular depth estimation network, and the parameters of the binocular depth estimation network are then adjusted according to the loss value. In this way, a highly accurate binocular depth estimation network can be trained without manual labeling.
With further reference to fig. 4, a flow 400 of a third embodiment of a method for training an image processing network is shown. The flow 400 of the method for training an image processing network comprises the steps of:
In step 401, a sample set is acquired.
In the present embodiment, an execution subject of a method for training an image processing network (e.g., the server 105 shown in fig. 1) can acquire a sample set in various ways. Wherein the samples in the sample set comprise corrected left and right graphs. The acquisition process may refer to step 201, and will not be described in detail herein.
A third sample is selected from the set of samples, step 402.
In this embodiment, the execution subject may select a sample from the sample set acquired in step 401, and execute the training steps of steps 403 to 407. The sample for training the binocular deep learning network is named the third sample. The third sample may be the same as or different from the first sample. The third sample may be the same as the second sample or may be different.
Step 403, inputting the left and right images of the selected third sample into the initial binocular deep learning network, and outputting a third disparity map.
In this embodiment, the initial binocular deep learning network is a neural network for generating a disparity map from binocular images. The performance of the initial network is still poor and the accuracy of the output third disparity map is low, so further training is required.
Given the input corrected left and right images I_L and I_R, a predicted disparity map d_pred is output, where the disparity d_gt(u, v) at each pixel position of the theoretical ground-truth disparity map satisfies:

d_gt(u,v) = I_R(u + d(u,v), v) - I_L(u,v)

where d(u, v) is the horizontal (abscissa) offset between corresponding points in the right and left images.
Step 404: reconstruct the right image according to the third disparity map to obtain a reconstructed right image.
In this embodiment, based on I_L and the network-estimated d_pred, the reconstructed right image I_R′ can be solved inversely as I_R′ = d_pred + I_L.
At step 405, the photometric error between the reconstructed right image and the original right image is calculated as a base loss value.
In this embodiment, the photometric error is calculated from the reconstructed I_R′ and the original right image. The photometric reconstruction error loss can be expressed as:

L_photo = |I_R - I_R′|

L_photo serves as the base loss value. This base loss value is also used when training the binocular depth estimation network.
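A sketch of the base loss follows; reducing |I_R - I_R′| to a scalar by averaging over pixels is an assumption.

```python
import torch

def photometric_loss(right, right_recon):
    """Base loss value L_photo = |I_R - I_R'| (step 405), averaged over
    pixels (the reduction is an assumption)."""
    return torch.abs(right - right_recon).mean()
```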
And step 406, if the basic loss value is smaller than the preset third threshold value, determining that the training of the initial binocular deep learning network is completed, and obtaining the binocular deep learning network.
In this embodiment, the aim of training is to make the base loss value as small as possible, but to avoid endless iteration a termination condition may be set: the base loss value falls below a predetermined third threshold. Reaching this termination condition indicates that training of the initial binocular deep learning network is complete; the resulting binocular deep learning network can be used to predict the disparity of binocular images.
If the base loss value is not less than the predetermined third threshold, the relevant parameters of the initial binocular deep learning network are adjusted 407, and steps 402-407 are continued.
In this embodiment, if the base loss value is not less than the predetermined third threshold, the loss still needs to be reduced; the relevant parameters of the initial binocular deep learning network may be adjusted, for example by modifying the weights of its convolution layers using back propagation, and steps 402-407 are continued. Training ends when the base loss value falls below the predetermined third threshold.
The above embodiments of the present disclosure provide methods for training a binocular deep learning network by designing photometric reconstruction errors. The binocular deep learning network with high accuracy can be trained under the condition of no manual annotation, and the training cost is reduced. The binocular deep learning network can be used for assisting in training a monocular deep estimation network and a binocular deep estimation network.
With continued reference to fig. 5, fig. 5 is a schematic diagram of an application scenario of the method for training an image processing network according to this embodiment. In the application scenario of fig. 5, binocular depth estimation is performed on the left and right images to obtain an unreliable region (occlusion mask) and depth map 1. Depth map 1, restricted to the reliable region, is used as supervision information for the monocular depth estimation network, and the loss between the depth values the monocular network estimates for the left image and the supervision information is calculated; parameters of the monocular depth estimation network are adjusted in the direction that reduces this loss. The loss between the depth values the monocular network estimates for the right image and the supervision information is calculated likewise, and the parameters are again adjusted to reduce it. After training of the monocular depth estimation network is complete, the depth values estimated by the monocular network for the left image (depth map 2) can be used as supervision information to train the binocular depth estimation network.
The monocular depth estimation network and the binocular depth estimation network are trained alternately in an iterative manner, with the most recent prediction of one network used as pseudo-supervision information for the other at each iteration. This speeds up training and increases the accuracy of both the monocular and the binocular depth estimation network, and the reduced need for manual labeling lowers training cost.
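The alternation can be sketched as follows; train_mono and train_stereo are hypothetical helpers standing in for the first and second training steps (flows 200 and 300), and the number of rounds is an assumption.

```python
def alternate_training(sample_set, mono_net, stereo_net, rounds=3):
    """Sketch of the iterative scheme of Fig. 5: in each round, the most
    recent predictions of one network act as pseudo-supervision for the
    other. train_mono / train_stereo are hypothetical helpers."""
    for _ in range(rounds):
        train_mono(mono_net, stereo_net, sample_set)    # flow 200
        train_stereo(stereo_net, mono_net, sample_set)  # flow 300
```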
With further reference to fig. 6, a flow 600 of one embodiment of an image processing method is shown. The flow 600 of the image processing method includes the steps of:
in step 601, an image to be identified is acquired.
In the present embodiment, the execution subject of the image processing method (e.g., the server 105 shown in fig. 1) can acquire an image to be recognized in various ways. The existing image stored therein may be acquired from a database server (e.g., database server 104 shown in fig. 1), or the image to be identified may be acquired from a monocular or binocular camera. The image to be identified may be one or more.
Step 602, if the images are corrected left and right images, inputting the images into a binocular depth estimation network to obtain binocular depth estimation values.
In this embodiment, if the images to be identified are corrected left and right images, the binocular depth estimation network trained by the method of the process 300 may be used to perform depth estimation to obtain the binocular depth estimation value.
Step 603, if the image is a single image, inputting the image into a monocular depth estimation network to obtain a monocular depth estimation value.
In this embodiment, if the image to be identified is a single image, the monocular depth estimation network trained by the method of the process 200 may be used to perform depth estimation to obtain the monocular depth estimation value.
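The dispatch of flow 600 can be sketched as follows; the networks are assumed to be callables returning depth maps, and the input-type test is an illustrative convention.

```python
def estimate_depth(images, mono_net, stereo_net):
    """Dispatch of flow 600: a corrected left/right pair goes to the
    binocular depth estimation network, a single image to the monocular
    network. Both networks are assumed callables returning depth maps."""
    if isinstance(images, (tuple, list)) and len(images) == 2:
        left, right = images
        return stereo_net(left, right)  # binocular depth estimate
    return mono_net(images)             # monocular depth estimate
```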
As can be seen from fig. 6, the flow 600 of the image processing method in the present embodiment represents the step of estimating the depth of the image. Therefore, the scheme described in the embodiment can adopt the most suitable depth estimation network for different images, so that the accuracy of image depth estimation is improved.
With further reference to fig. 7, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an apparatus for training an image processing network, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 7, the apparatus 700 for training an image processing network of this embodiment includes: an acquisition unit 701 and a first training unit 702. The acquisition unit 701 is configured to acquire a binocular deep learning network trained according to a preset loss function, and a sample set, wherein samples in the sample set comprise corrected left and right images. The first training unit 702 is configured to select a first sample from the sample set and perform the following first training step: inputting the left and right images of the selected first sample into the binocular deep learning network and outputting a first disparity map; converting the first disparity map into a first binocular predicted depth map; calculating a reliable region of the first binocular predicted depth map; inputting the left or right image of the selected first sample into a monocular depth estimation network to obtain a first monocular predicted depth map; taking the depth values of the reliable region as pseudo-supervision information for the monocular depth estimation network, calculating a first loss value over the region of the first monocular predicted depth map corresponding to the reliable region; and, if the first loss value is less than a predetermined first threshold, determining that training of the monocular depth estimation network is complete.
In some optional implementations of the present embodiment, the apparatus 700 further comprises a first parameter tuning unit 703 configured to: and if the first loss value is not smaller than the preset first threshold value, adjusting the related parameters of the monocular depth estimation network, re-selecting a first sample from the sample set, and continuing to execute the first training step by using the adjusted monocular depth estimation network.
In some optional implementations of this embodiment, the apparatus 700 further comprises a second training unit 704 configured to select a second sample from the sample set and perform the following second training step: acquiring a second binocular predicted depth map of the selected second sample and determining an unreliable region of the second binocular predicted depth map; inputting the left or right image of the selected second sample into the trained monocular depth estimation network to obtain a second monocular predicted depth map; taking the depth values of the unreliable region as pseudo-supervision information for the binocular depth estimation network, calculating a binocular loss value over the region of the second binocular predicted depth map corresponding to the unreliable region; calculating a base loss value according to the preset loss function; determining the sum of the binocular loss value and the base loss value as a second loss value; and, if the second loss value is less than a predetermined second threshold, determining that training of the binocular depth estimation network is complete.
In some optional implementations of the present embodiment, the apparatus 700 further comprises a second parameter tuning unit 705 configured to: and if the second loss value is not smaller than the preset second threshold value, adjusting the related parameters of the binocular depth estimation network, re-selecting a second sample from the sample set, and continuing to execute the second training step by using the adjusted binocular depth estimation network.
In some optional implementations of this embodiment, the second training unit 704 is further configured to: if the second sample belongs to the first sample, the first binocular prediction depth map is determined to be the second binocular prediction depth map, the reliable region of the first binocular prediction depth map is determined to be the reliable region of the second binocular prediction depth map, and the region except the reliable region in the second binocular prediction depth map is determined to be the unreliable region of the second binocular prediction depth map.
In some optional implementations of this embodiment, the second training unit 704 is further configured to: if the second sample does not belong to the first sample, inputting the left image and the right image of the selected second sample into a binocular deep learning network, and outputting a second parallax image. The second disparity map is converted into a second binocular predicted depth map. A reliable region of the second binocular predicted depth map is calculated. An area of the second binocular predicted depth map other than the reliable area is determined as an unreliable area of the second binocular predicted depth map.
In some optional implementations of this embodiment, the acquisition unit 701 is further configured to acquire a sample set whose samples comprise corrected left and right images. The apparatus 700 further comprises a third training unit 706 configured to select a third sample from the sample set and perform the following third training step: inputting the left and right images of the selected third sample into an initial binocular deep learning network and outputting a third disparity map; reconstructing the right image according to the third disparity map to obtain a reconstructed right image; calculating the photometric error between the reconstructed right image and the original right image as a base loss value; and, if the base loss value is less than a predetermined third threshold, determining that training of the initial binocular deep learning network is complete, obtaining the binocular deep learning network.
In some optional implementations of the present embodiment, the apparatus 700 further comprises a third parameter tuning unit 707 configured to: and if the basic loss value is not smaller than a preset third threshold value, adjusting related parameters of the initial binocular deep learning network, re-selecting a third sample from the sample set, and continuing to execute a third training step by using the adjusted initial binocular deep learning network.
In some optional implementations of the present embodiment, the apparatus further comprises an iteration unit (not shown in the drawings) configured to: and (3) alternately training a monocular depth estimation network and a binocular depth estimation network in an iterative mode, and using a prediction result obtained last time as pseudo-supervision information when each iteration is performed.
With further reference to fig. 8, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of an image processing apparatus, which corresponds to the method embodiment shown in fig. 6, and which is particularly applicable to various electronic devices.
As shown in fig. 8, the image processing apparatus 800 of the present embodiment includes: an acquisition unit 801, a first estimation unit 802, a second estimation unit 803. Wherein the acquisition unit 801 is configured to acquire an image to be identified. The first estimation unit 802 is configured to input the image into a binocular depth estimation network trained by the apparatus 700, to obtain binocular depth estimation values, if the image is a corrected left and right image. The second estimating unit 803 is configured to input the image into a monocular depth estimation network trained by the apparatus 700, resulting in a monocular depth estimation value, if the image is a monocular image.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the various methods and processes described above, such as the method for training a depth estimation network. For example, in some embodiments, the method for training a depth estimation network may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method for training a depth estimation network described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method for training a depth estimation network by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a server of a distributed system or a server that incorporates a blockchain; it may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host incorporating artificial intelligence technology.
It should be appreciated that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (22)

1. A method for training an image processing network, comprising:
obtaining a binocular deep learning network trained according to a preset loss function, and a sample set, wherein each sample in the sample set comprises a rectified left image and a rectified right image;
selecting a first sample from the sample set and performing the following first training step: inputting the left and right images of the selected first sample into the binocular deep learning network and outputting a first disparity map; converting the first disparity map into a first binocular predicted depth map; calculating a reliable region of the first binocular predicted depth map; inputting the left image or the right image of the selected first sample into a monocular depth estimation network to obtain a first monocular predicted depth map; taking the depth values of the reliable region as pseudo-supervision information for the monocular depth estimation network, and calculating a first loss value over the depth values of the region of the first monocular predicted depth map corresponding to the reliable region; and if the first loss value is smaller than a preset first threshold, determining that training of the monocular depth estimation network is completed;
wherein the calculating of the reliable region of the first binocular predicted depth map comprises:
calculating, based on a left-right consistency check, a region of reliable disparity between the left and right views in the first disparity map.
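Purely as an illustrative aid, and not as part of the claim language, the first training step can be sketched in NumPy as follows. The function names, the tolerance tol, and the pinhole-stereo conversion depth = focal * baseline / disparity are assumptions of this sketch, not limitations of the claim:

    import numpy as np

    def disparity_to_depth(disp, focal, baseline, eps=1e-6):
        # Pinhole-stereo relation: depth = f * B / d; eps guards zero disparity.
        return focal * baseline / np.maximum(disp, eps)

    def reliable_mask(disp_left, disp_right, tol=1.0):
        # Left-right consistency check: look up the right-view disparity at the
        # position each left pixel maps to, and keep pixels where the two
        # estimates agree to within tol pixels.
        h, w = disp_left.shape
        ys, xs = np.mgrid[0:h, 0:w]
        xr = np.clip(np.rint(xs - disp_left).astype(int), 0, w - 1)
        return np.abs(disp_left - disp_right[ys, xr]) < tol

    def first_loss(mono_depth, binoc_depth, mask):
        # Reliable binocular depths serve as pseudo-supervision; the L1 error
        # is evaluated only over the reliable region.
        return np.abs(mono_depth - binoc_depth)[mask].mean()

On this reading, training of the monocular network stops once first_loss falls below the preset first threshold (claim 1) and otherwise continues with adjusted parameters (claim 2).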
2. The method of claim 1, wherein the method further comprises:
if the first loss value is not smaller than the preset first threshold, adjusting relevant parameters of the monocular depth estimation network, re-selecting a first sample from the sample set, and continuing to perform the first training step using the adjusted monocular depth estimation network.
3. The method of claim 1, wherein the method further comprises:
selecting a second sample from the sample set and performing the following second training step: acquiring a second binocular predicted depth map of the selected second sample and determining an unreliable region of the second binocular predicted depth map; inputting the left image or the right image of the selected second sample into the trained monocular depth estimation network to obtain a second monocular predicted depth map; taking the depth values of the second monocular predicted depth map in the unreliable region as pseudo-supervision information for a binocular depth estimation network, and calculating a binocular loss value over the depth values of the region of the second binocular predicted depth map corresponding to the unreliable region; calculating a base loss value according to the preset loss function; determining the sum of the binocular loss value and the base loss value as a second loss value; and if the second loss value is smaller than a preset second threshold, determining that training of the binocular depth estimation network is completed.
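Continuing the same illustrative sketch for claim 3, the second loss adds the preset base loss to a pseudo-supervised term restricted to the unreliable region; the helper below assumes the depth maps and mask from the claim-1 sketch:

    def second_loss(binoc_depth, mono_depth, reliable, base_loss):
        # The roles reverse: monocular depths now act as pseudo-supervision for
        # the binocular prediction, but only where the disparity was unreliable.
        unreliable = ~reliable
        binoc_term = np.abs(binoc_depth - mono_depth)[unreliable].mean()
        return binoc_term + base_loss  # compared against the second threshold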
4. A method according to claim 3, wherein the method further comprises:
if the second loss value is not smaller than the preset second threshold, adjusting relevant parameters of the binocular depth estimation network, re-selecting a second sample from the sample set, and continuing to perform the second training step using the adjusted binocular depth estimation network.
5. The method of claim 3, wherein the acquiring of the second binocular predicted depth map of the selected second sample and the determining of the unreliable region of the second binocular predicted depth map comprise:
if the second sample is also a first sample, determining the first binocular predicted depth map as the second binocular predicted depth map, determining the reliable region of the first binocular predicted depth map as the reliable region of the second binocular predicted depth map, and determining the region of the second binocular predicted depth map other than the reliable region as the unreliable region of the second binocular predicted depth map.
6. The method of claim 3, wherein the acquiring of the second binocular predicted depth map of the selected second sample and the determining of the unreliable region of the second binocular predicted depth map comprise:
if the second sample is not a first sample, inputting the left and right images of the selected second sample into the binocular deep learning network and outputting a second disparity map; converting the second disparity map into the second binocular predicted depth map; calculating a reliable region of the second binocular predicted depth map; and determining the region of the second binocular predicted depth map other than the reliable region as the unreliable region of the second binocular predicted depth map.
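In the sketch above, the unreliable region for a fresh second sample (claim 6) is simply the complement of the consistency mask, e.g. unreliable = ~reliable_mask(disp_left, disp_right).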
7. The method of claim 1, wherein the obtaining of the binocular deep learning network trained according to the preset loss function comprises:
obtaining the sample set, wherein each sample in the sample set comprises a rectified left image and a rectified right image;
selecting a third sample from the sample set and performing the following third training step: inputting the left and right images of the selected third sample into an initial binocular deep learning network and outputting a third disparity map; reconstructing the original right image according to the third disparity map to obtain a reconstructed right image; calculating a photometric error between the reconstructed right image and the original right image as a base loss value; and if the base loss value is smaller than a preset third threshold, determining that training of the initial binocular deep learning network is completed, thereby obtaining the binocular deep learning network.
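The base loss of claim 7 is a photometric reconstruction error. A minimal sketch, assuming rectified pairs (so correspondences lie on a common scanline) and nearest-neighbour resampling purely for brevity:

    def reconstruct_right(left, disp):
        # Synthesise the right view by sampling the left image at x + d(x).
        h, w = disp.shape
        ys, xs = np.mgrid[0:h, 0:w]
        src_x = np.clip(np.rint(xs + disp).astype(int), 0, w - 1)
        return left[ys, src_x]

    def base_loss(left, right, disp):
        # Mean absolute photometric error between the reconstructed right
        # image and the original right image.
        recon = reconstruct_right(left, disp)
        return np.abs(recon.astype(float) - right.astype(float)).mean()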
8. The method of claim 7, wherein the method further comprises:
if the base loss value is not smaller than the preset third threshold, adjusting relevant parameters of the initial binocular deep learning network, re-selecting a third sample from the sample set, and continuing to perform the third training step using the adjusted initial binocular deep learning network.
9. The method of any of claims 1-8, wherein the method further comprises:
alternately training the monocular depth estimation network and the binocular depth estimation network in an iterative manner, wherein at each iteration the most recently obtained prediction results are used as the pseudo-supervision information.
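The alternation of claim 9 can be pictured as the loop below, where train_monocular and train_binocular are hypothetical step functions that each consume the other network's latest predictions as pseudo-supervision and return their own updated predictions:

    def alternate_training(train_monocular, train_binocular, num_rounds):
        # At each round, the most recent predictions of one network become the
        # pseudo-supervision for the other.
        mono_preds, binoc_preds = None, None
        for _ in range(num_rounds):
            mono_preds = train_monocular(pseudo=binoc_preds)
            binoc_preds = train_binocular(pseudo=mono_preds)
        return mono_preds, binoc_preds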
10. An image processing method, comprising:
acquiring an image to be recognized;
if the image is a rectified left-and-right image pair, inputting the image into a binocular depth estimation network trained according to the method of any one of claims 3-6 to obtain a binocular depth estimate;
if the image is a single image, inputting the image into a monocular depth estimation network trained according to the method of any one of claims 1-9 to obtain a monocular depth estimate.
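At inference time (claim 10), the input simply routes to whichever trained network matches its form; binocular_net and monocular_net below are hypothetical callables wrapping the trained networks:

    def estimate_depth(images, binocular_net, monocular_net):
        # A rectified (left, right) pair goes to the binocular network;
        # a single image goes to the monocular network.
        if isinstance(images, tuple) and len(images) == 2:
            left, right = images
            return binocular_net(left, right)
        return monocular_net(images)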
11. An apparatus for training an image processing network, comprising:
an acquisition unit configured to acquire a binocular deep learning network trained according to a preset loss function, and a sample set, wherein each sample in the sample set comprises a rectified left image and a rectified right image;
a first training unit configured to select a first sample from the sample set and perform the following first training step: inputting the left and right images of the selected first sample into the binocular deep learning network and outputting a first disparity map; converting the first disparity map into a first binocular predicted depth map; calculating a reliable region of the first binocular predicted depth map; inputting the left image or the right image of the selected first sample into a monocular depth estimation network to obtain a first monocular predicted depth map; taking the depth values of the reliable region as pseudo-supervision information for the monocular depth estimation network, and calculating a first loss value over the depth values of the region of the first monocular predicted depth map corresponding to the reliable region; and if the first loss value is smaller than a preset first threshold, determining that training of the monocular depth estimation network is completed;
wherein the calculating of the reliable region of the first binocular predicted depth map comprises:
calculating, based on a left-right consistency check, a region of reliable disparity between the left and right views in the first disparity map.
12. The apparatus of claim 11, wherein the apparatus further comprises a first parameter tuning unit configured to:
if the first loss value is not smaller than the preset first threshold, adjust relevant parameters of the monocular depth estimation network, re-select a first sample from the sample set, and continue to perform the first training step using the adjusted monocular depth estimation network.
13. The apparatus of claim 11, wherein the apparatus further comprises a second training unit configured to:
select a second sample from the sample set and perform the following second training step: acquiring a second binocular predicted depth map of the selected second sample and determining an unreliable region of the second binocular predicted depth map; inputting the left image or the right image of the selected second sample into the trained monocular depth estimation network to obtain a second monocular predicted depth map; taking the depth values of the second monocular predicted depth map in the unreliable region as pseudo-supervision information for a binocular depth estimation network, and calculating a binocular loss value over the depth values of the region of the second binocular predicted depth map corresponding to the unreliable region; calculating a base loss value according to the preset loss function; determining the sum of the binocular loss value and the base loss value as a second loss value; and if the second loss value is smaller than a preset second threshold, determining that training of the binocular depth estimation network is completed.
14. The apparatus of claim 13, wherein the apparatus further comprises a second parameter tuning unit configured to:
if the second loss value is not smaller than the preset second threshold, adjust relevant parameters of the binocular depth estimation network, re-select a second sample from the sample set, and continue to perform the second training step using the adjusted binocular depth estimation network.
15. The apparatus of claim 13, wherein the second training unit is further configured to:
if the second sample is also a first sample, determine the first binocular predicted depth map as the second binocular predicted depth map, determine the reliable region of the first binocular predicted depth map as the reliable region of the second binocular predicted depth map, and determine the region of the second binocular predicted depth map other than the reliable region as the unreliable region of the second binocular predicted depth map.
16. The apparatus of claim 13, wherein the second training unit is further configured to:
if the second sample is not a first sample, input the left and right images of the selected second sample into the binocular deep learning network and output a second disparity map; convert the second disparity map into the second binocular predicted depth map; calculate a reliable region of the second binocular predicted depth map; and determine the region of the second binocular predicted depth map other than the reliable region as the unreliable region of the second binocular predicted depth map.
17. The apparatus of claim 11, wherein the acquisition unit is further configured to: obtain the sample set, wherein each sample in the sample set comprises a rectified left image and a rectified right image;
the apparatus further comprises a third training unit configured to:
select a third sample from the sample set and perform the following third training step: inputting the left and right images of the selected third sample into an initial binocular deep learning network and outputting a third disparity map; reconstructing the original right image according to the third disparity map to obtain a reconstructed right image; calculating a photometric error between the reconstructed right image and the original right image as a base loss value; and if the base loss value is smaller than a preset third threshold, determining that training of the initial binocular deep learning network is completed, thereby obtaining the binocular deep learning network.
18. The apparatus of claim 17, wherein the apparatus further comprises a third parameter tuning unit configured to:
if the base loss value is not smaller than the preset third threshold, adjust relevant parameters of the initial binocular deep learning network, re-select a third sample from the sample set, and continue to perform the third training step using the adjusted initial binocular deep learning network.
19. The apparatus according to any of claims 11-18, wherein the apparatus further comprises an iteration unit configured to:
alternately train the monocular depth estimation network and the binocular depth estimation network in an iterative manner, wherein at each iteration the most recently obtained prediction results are used as the pseudo-supervision information.
20. An image processing apparatus comprising:
an acquisition unit configured to acquire an image to be recognized;
a first estimation unit configured to, if the image is a rectified left-and-right image pair, input the image into a binocular depth estimation network trained by the apparatus of any one of claims 13-16 to obtain a binocular depth estimate;
a second estimation unit configured to, if the image is a single image, input the image into a monocular depth estimation network trained by the apparatus of any one of claims 11-19 to obtain a monocular depth estimate.
21. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
22. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
CN202110381515.0A 2021-04-09 2021-04-09 Method and apparatus for training an image processing network and image processing Active CN112862877B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110381515.0A CN112862877B (en) 2021-04-09 2021-04-09 Method and apparatus for training an image processing network and image processing

Publications (2)

Publication Number Publication Date
CN112862877A CN112862877A (en) 2021-05-28
CN112862877B (en) 2024-05-17

Family

ID=75992378

Country Status (1)

Country Link
CN (1) CN112862877B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378694B (en) * 2021-06-08 2023-04-07 北京百度网讯科技有限公司 Method and device for generating target detection and positioning system and target detection and positioning
CN113379813B (en) * 2021-06-08 2024-04-30 北京百度网讯科技有限公司 Training method and device of depth estimation model, electronic equipment and storage medium
CN113658277B (en) * 2021-08-25 2022-11-11 北京百度网讯科技有限公司 Stereo matching method, model training method, related device and electronic equipment
CN113870334B (en) * 2021-09-29 2022-09-02 北京百度网讯科技有限公司 Depth detection method, device, equipment and storage medium
CN114037087B (en) * 2021-10-29 2024-02-09 北京百度网讯科技有限公司 Model training method and device, depth prediction method and device, equipment and medium
CN114782290B (en) * 2022-06-23 2022-11-08 北京航空航天大学杭州创新研究院 Disparity map correction method, device, equipment and computer readable medium
CN115830408B (en) * 2022-10-22 2024-03-08 北京百度网讯科技有限公司 Pseudo tag generation method, pseudo tag generation device, pseudo tag generation equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019149206A1 (en) * 2018-02-01 2019-08-08 深圳市商汤科技有限公司 Depth estimation method and apparatus, electronic device, program, and medium
CN108961327A (en) * 2018-05-22 2018-12-07 深圳市商汤科技有限公司 A kind of monocular depth estimation method and its device, equipment and storage medium
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of advances in monocular depth estimation; Huang Jun; Wang Cong; Liu Yue; Bi Tianteng; Journal of Image and Graphics (No. 12); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant