CN110033481B - Method and apparatus for image processing - Google Patents

Method and apparatus for image processing

Info

Publication number
CN110033481B
CN110033481B CN201810024743.0A CN201810024743A
Authority
CN
China
Prior art keywords
target
depth
image
size
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810024743.0A
Other languages
Chinese (zh)
Other versions
CN110033481A (en)
Inventor
刘志花
马林
李源煕
安敏洙
高天豪
洪成勋
王淳
王光伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN201810024743.0A priority Critical patent/CN110033481B/en
Priority to KR1020180090827A priority patent/KR102661954B1/en
Priority to US16/237,952 priority patent/US11107229B2/en
Publication of CN110033481A publication Critical patent/CN110033481A/en
Application granted granted Critical
Publication of CN110033481B publication Critical patent/CN110033481B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for processing an image are disclosed. The method comprises the following steps: receiving an input image; and estimating the depth of the predetermined target according to the position, size and category of the predetermined target in the image.

Description

Method and apparatus for image processing
Technical Field
The present disclosure relates to methods and apparatus for image processing, and more particularly, to methods and apparatus for object detection, object classification, and object depth estimation of images.
Background
The depth estimation of the target object in the image may be applicable to various scenarios, in particular in automatic driving or assisted driving. Existing methods of depth estimation mainly include a stereoscopic vision-based method, a laser ranging-based method, a target size-based method, and the like.
Existing depth estimation methods fall mainly into two categories. The first obtains depth directly from hardware equipment, such as a Velodyne lidar device: such equipment provides high-precision depth estimation, but it is bulky and expensive, and the resulting depth map is sparse and of low resolution. The second obtains depth from low-cost vision sensors, for example from two vision sensors; this approach becomes very inaccurate when the target is far from the sensors, because the lines of sight are then almost parallel.
Capturing a monocular image from a single vision sensor (e.g., a camera) and estimating depth from the monocular image based on deep learning is becoming increasingly popular, but the major drawbacks of this approach include (1) heavy reliance on training data and (2) low accuracy.
Therefore, there is a need for a low cost and high accuracy method and apparatus for estimating the depth of a target in an image.
Disclosure of Invention
The present disclosure provides a method and apparatus for performing image processing. In particular, the present disclosure relates to a method and apparatus for estimating the depth of a target from a monocular image, based on the property that, at the same focal length, nearby objects appear large and distant objects appear small.
According to one aspect of the present disclosure, a method of processing an image is disclosed, the method comprising: receiving an input image; the depth of the predetermined target is estimated from the position, size and class of the predetermined target in the image.
In one embodiment of the present disclosure, estimating the depth of the predetermined target comprises: estimating the depth of a predetermined target through single-task network learning under the condition that the position, the size and the category of the predetermined target are known; and/or estimating the depth of the predetermined target by multi-tasking network learning in case the position, size and class of the predetermined target are unknown.
In one embodiment of the present disclosure, the position of the predetermined target is the coordinates of the predetermined target in the entire image.
In one embodiment of the present disclosure, the size of the predetermined target is a size of a detection frame surrounding the predetermined target.
In one embodiment of the present disclosure, before estimating the depth of the predetermined target, the method further comprises: the received image is preprocessed.
In one embodiment of the present disclosure, preprocessing a received image includes: and normalizing the image according to the focal length information and the standard focal length information of the image.
In one embodiment of the present disclosure, estimating the depth of the predetermined target through single-task network learning comprises: cutting out an image block of a predetermined size around the detection frame of the predetermined target in the image, and performing mask processing on the image block to obtain a mask image of the same size; stitching the image block and the mask image together by channel; inputting the stitched image into a single-task network; and outputting the depth of the predetermined target from the single-task network.
In one embodiment of the present disclosure, outputting the depth of the predetermined target from the single-tasking network comprises: determining the probability that the depth of the preset target belongs to each preset depth interval, obtaining the final result of the depth of the preset target in a probability weighting mode, and outputting the final result.
In one embodiment of the present disclosure, estimating the depth of the predetermined target through multitasking network learning comprises: a multitasking network is employed to estimate the location and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, a multitasking network includes a plurality of convolutional layers and corresponding pooling layers.
In one embodiment of the present disclosure, the multi-tasking network is a network based on faster-rcnn (Faster Regions with Convolutional Neural Network Features), and the loss function of the multi-tasking network is the loss function of faster-rcnn plus depth loss information.
In one embodiment of the present disclosure, estimating the depth of the predetermined target through multitasking network learning includes: performing target detection branch processing, target classification branch processing, and target depth estimation branch processing on the image: determining the position and size of the predetermined target through the target detection branch processing; determining the category of the predetermined target based on the position and size of the predetermined target through the target classification branch processing; and determining the depth of the predetermined target based on the position, size, and category of the predetermined target through the target depth estimation branch processing.
In one embodiment of the present disclosure, the multi-tasking network is a YOLO2-based network, and the loss function of the multi-tasking network is the loss function of YOLO2 plus depth loss information.
In one embodiment of the present disclosure, the position and size of the predetermined target, the category of the predetermined target, and the depth of the predetermined target are output via a last convolutional layer of the plurality of convolutional layers.
In one embodiment of the present disclosure, each grid cell in the last convolutional layer includes information of a plurality of anchors.
In one embodiment of the present disclosure, single-layer or multi-layer features are employed to estimate the location and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, the multi-layer features are obtained by different prediction layers or the same prediction layer.
In one embodiment of the present disclosure, the target depth estimation branch processing includes: determining the probability that the depth of the preset target belongs to each preset depth interval; and taking the depth interval with the maximum probability of the preset target as the depth of the preset target, or obtaining the depth of the preset target in a probability weighting mode.
In one embodiment of the present disclosure, the loss function includes at least one of a square error, a cross entropy, and a multinomial logistic regression (softmax) log loss.
In one embodiment of the present disclosure, the predetermined target includes at least one of a person, a vehicle, a traffic light, and a traffic sign in the image.
According to another aspect of the present disclosure, there is provided an apparatus for processing an image, including: a receiver configured to receive an input image; a processor; and a memory storing computer executable instructions that, when executed by the processor, cause the processor to: the depth of the predetermined target is estimated from the position, size and class of the predetermined target in the image.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: estimating the depth of a predetermined target through single-task network learning under the condition that the position, the size and the category of the predetermined target are known; and/or estimating the depth of the predetermined target by multi-tasking network learning in case the position, size and class of the predetermined target are unknown.
In one embodiment of the present disclosure, the position of the predetermined target is the coordinates of the predetermined target in the entire image.
In one embodiment of the present disclosure, the size of the predetermined target is a size of a detection frame surrounding the predetermined target.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: the received image is preprocessed before estimating the depth of the predetermined object.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: and normalizing the image according to the focal length information and the standard focal length information of the image.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: cutting out image blocks with preset sizes around a detection frame of the preset target in the image, and carrying out mask processing on the image blocks to obtain mask images with the same sizes; splicing the image blocks and the mask image together according to channels; inputting the spliced images into a single-task network; the depth of the predetermined target is output from the single-tasking network.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: determining the probability that the depth of the preset target belongs to each preset depth interval, obtaining the final result of the depth of the preset target in a probability weighting mode, and outputting the final result.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: a multitasking network is employed to estimate the location and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, a multitasking network includes a plurality of convolutional layers and corresponding pooling layers.
In one embodiment of the present disclosure, the multi-tasking network is a network based on faster-rcnn (Faster Regions with Convolutional Neural Network Features), and the loss function of the multi-tasking network is the loss function of faster-rcnn plus depth loss information.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: performing target detection branch processing, target classification branch processing and target depth estimation branch processing on the image: determining a position and a size of the predetermined target through the target detection branch processing, and determining a category of the predetermined target based on the position and the size of the predetermined target through the target classification branch processing; and determining, by the target depth estimation branch processing, a depth of the predetermined target based on a position, a size, and a category of the predetermined target.
In one embodiment of the present disclosure, the multi-tasking network is a YOLO2-based network, and the loss function of the multi-tasking network is the loss function of YOLO2 plus depth loss information.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: the position and size of the predetermined target, the category of the predetermined target, and the depth of the predetermined target are output via a last convolutional layer of the plurality of convolutional layers.
In one embodiment of the present disclosure, each grid cell in the last convolutional layer includes information of a plurality of anchors.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: single-layer or multi-layer features are employed to estimate the position and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, the multi-layer features are obtained by different prediction layers or the same prediction layer.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: determining the probability that the depth of the preset target belongs to each preset depth interval; and taking the depth interval with the maximum probability of the preset target as the depth of the preset target, or obtaining the depth of the preset target in a probability weighting mode.
In one embodiment of the present disclosure, the loss function includes at least one of a square error, a cross entropy, and a multinomial logistic regression (softmax) log loss.
In one embodiment of the present disclosure, the predetermined targets include, but are not limited to, persons, vehicles, traffic lights, and traffic signs in the image.
With the solution of the above-described embodiments of the present disclosure, it is possible to perform high-precision target depth estimation of an image using only one visual sensor.
Drawings
For a better understanding of the present invention, the present invention will be described in detail with reference to the following drawings:
FIG. 1 illustrates a method of processing an image according to an exemplary embodiment of the present disclosure;
FIG. 2A illustrates a method of target depth estimation of an image according to an exemplary embodiment of the present disclosure;
FIG. 2B illustrates a method of target depth estimation of an image according to another exemplary embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of depth estimation of a target with the state of the target known, according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of depth estimation of a target with unknown target state according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a scenario where multiple feature layers are connected to different prediction layers in the case of a YOLO 2-based network framework;
FIG. 6 illustrates a scenario where multiple feature layers are connected to the same prediction layer in the case of a YOLO 2-based network framework; and
Fig. 7 shows a schematic structural view of an apparatus according to an exemplary embodiment of the present disclosure.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical means and advantages of the present application more apparent. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present disclosure and are not to be construed as limiting the present disclosure.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs, unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Fig. 1 illustrates a method 100 of processing an image according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, in step S110, an input image is received. The input image may be a monocular image taken by a single vision sensor (e.g., a camera).
In step S120, the depth of the target is estimated from the position, size, and category of the target in the input image. The category of the object may be at least one of a person, a vehicle, a traffic light, and a traffic sign in the image. The position of the object is the coordinates of the object in the image. The size of the target may be the size of a detection frame (generally represented by a rectangular frame) surrounding the target. In particular, where the location, size, and class of a target are known, the depth of the target is estimated by single-task network learning, which refers to learning through a network with only one task (e.g., depth estimation). In the case where the position, size, and class of the target are unknown, the depth of the target is estimated by multitasking network learning, which refers to learning through a network having a plurality of tasks (e.g., three tasks including target detection, target identification, depth estimation). Hereinafter, estimating the depth of the target by the single-task network learning and estimating the depth of the target by the multi-task network learning will be described in detail with reference to fig. 2A and 2B, respectively.
According to one embodiment of the present disclosure, the method may further include a step of preprocessing the received image before step S120. The preprocessing may include normalizing the image according to the focal length information and standard focal length information of the input image. For example, since images may be taken by different cameras at different focal lengths, the same object may be displayed at different sizes in photographs taken at different focal lengths, resulting in a difference in the inferred depth of the object. According to an embodiment of the present disclosure, given a standard focal length f_0, for any image with width w, height h, and focal length f, the width and height of the image are normalized to w' = f_0·w/f and h' = f_0·h/f, where w' is the width of the normalized image and h' is the height of the normalized image. Each pixel is obtained by interpolation according to the ratio of w to w' (or h to h').
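As an illustration of the focal-length normalization just described, the following Python sketch rescales an image given its focal length f and the standard focal length f_0. The function name and the use of OpenCV bilinear interpolation are assumptions for illustration, not part of the disclosure.

```python
# Hypothetical sketch of the focal-length normalization step described above.
# The standard focal length f0 and the per-image focal length f are assumed to
# be known (e.g., from camera calibration or EXIF metadata).
import cv2
import numpy as np

def normalize_by_focal_length(image: np.ndarray, f: float, f0: float) -> np.ndarray:
    """Rescale an image so that objects appear at the size they would have
    at the standard focal length f0 (w' = f0*w/f, h' = f0*h/f)."""
    h, w = image.shape[:2]
    new_w = int(round(f0 * w / f))
    new_h = int(round(f0 * h / f))
    # Pixels are obtained by interpolation according to the ratio w/w' (or h/h').
    return cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
```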
Next, a method of estimating a target depth for an image according to an exemplary embodiment of the present disclosure will be described with reference to fig. 2A.
Fig. 2A illustrates a method 200a of estimating target depth for an image with the location, size, and class of the target known.
In the case where the position, size, and type of the object in the image are known, as shown in fig. 2A, in step S210a, an image block of a predetermined size is cut out around the object in the image, masking processing is performed on the image block to obtain a masked image of the same size, and the image block and the masked image are concatenated together per channel.
Thereafter, in step S220a, the stitched together images are input into a single-tasking network. In step S230a, the depth of the target is output from the single-tasking network.
A method of depth estimation of a target in the case where the state of the target is known will be described in detail with reference to fig. 3. Fig. 3 shows a schematic diagram of depth estimation of a target with the state of the target known according to an exemplary embodiment of the present disclosure.
In the field of autonomous driving and assisted driving, the KITTI dataset is typically used to test algorithms such as vehicle detection, vehicle tracking, and semantic segmentation in traffic scenarios. In the KITTI dataset, all depth data are obtained by lidar scanning. Analysis shows that the detection depth range of the radar is approximately 5-85 meters. For simplicity of description, the present disclosure divides this range equally into 8 intervals, i.e., 8 classes, such as [5, 15], [15, 25], ..., [75, 85]. The intervals may also be divided non-uniformly, for example denser at short range and sparser at long range, such as [5, 7], [8, 11], [11, 15], [16, 23], and so on. The specific interval ranges may be divided according to the distribution of the training samples.
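The following minimal Python sketch illustrates one way the 5-85 meter range could be discretized into the 8 depth classes described above; the helper names and the clamping of out-of-range depths are assumptions for illustration.

```python
# Minimal sketch of discretizing a ground-truth depth (in meters) into one of
# the 8 uniform intervals over the 5-85 m radar range described above.
import numpy as np

D_MIN, D_MAX, NUM_BINS = 5.0, 85.0, 8    # 8 intervals of 10 m each
BIN_WIDTH = (D_MAX - D_MIN) / NUM_BINS   # = 10 m

def depth_to_class(depth_m: float) -> int:
    """Map a depth in [5, 85] m to a class label k in {0, ..., 7}."""
    k = int((depth_m - D_MIN) // BIN_WIDTH)
    return int(np.clip(k, 0, NUM_BINS - 1))

def class_to_mean_depth(k: int) -> float:
    """Average depth of the k-th interval, d_k = (k + 1) * 10 m."""
    return (k + 1) * 10.0
```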
In fig. 3, the image normalized from the KITTI dataset has a size of 1242×375 pixels (hereinafter "pixels" are omitted for brevity), as shown at 310. According to an embodiment of the present invention, a 321×181 RGB image block is cut out centered on the target area, and a binary mask image of the same size as the image block is obtained (i.e., the mask image is also 321×181), as shown at 320. Here, the target image block size of 321×181 is chosen mainly according to the size of the target area: statistically, the target size is typically about 100×60, and the target is better identified using background information when the ratio of the target area to the background area is about 1:3. In the 321×181 mask image, elements inside the target rectangular frame are set to 1, and all other elements are set to 0. To include background information, the rectangular frame used may be larger than the actual rectangular frame of the target, and the degree of enlargement may be set as appropriate; here the ratio of the actual frame size to the rectangular frame size used is set to 1:3. In some cases, when the target is relatively large, the rectangular frame may exceed the 321×181 range, and the exceeding portion is simply truncated. The invention stitches the 321×181 image block and its corresponding mask image together as the input to the single-task network: the mask indicates the region of the target in the RGB image, so the RGB image needs to be stitched together with the mask image as input. The input image passes through feature extraction and prediction in the single-task network, as shown at 330. The final depth of the target is then obtained by weighting the depth probabilities output by the single-task network, as shown at 340.
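A sketch of the input preparation described above (cropping a 321×181 patch centered on the target, building the binary mask, and stitching them channel-wise) might look as follows; it is illustrative only, and the handling of patches near the image border is an assumption not specified in the disclosure.

```python
# Illustrative sketch (not the authors' exact code) of preparing the single-task
# network input: crop a fixed-size RGB patch centered on the target box, build a
# same-size binary mask, and stack them channel-wise into a 4-channel input.
import numpy as np

PATCH_W, PATCH_H = 321, 181

def make_patch_and_mask(image: np.ndarray, box: tuple) -> np.ndarray:
    """image: HxWx3 RGB array; box: (x1, y1, x2, y2) integer detection frame.
    Returns a PATCH_H x PATCH_W x 4 array (RGB + mask)."""
    h, w = image.shape[:2]
    cx, cy = (box[0] + box[2]) // 2, (box[1] + box[3]) // 2
    x1 = int(np.clip(cx - PATCH_W // 2, 0, w - PATCH_W))
    y1 = int(np.clip(cy - PATCH_H // 2, 0, h - PATCH_H))
    patch = image[y1:y1 + PATCH_H, x1:x1 + PATCH_W].astype(np.float32)

    mask = np.zeros((PATCH_H, PATCH_W, 1), dtype=np.float32)
    # Elements inside the target rectangle are set to 1; the portion of the
    # rectangle that falls outside the patch is simply truncated.
    mx1, my1 = max(box[0] - x1, 0), max(box[1] - y1, 0)
    mx2, my2 = min(box[2] - x1, PATCH_W), min(box[3] - y1, PATCH_H)
    mask[my1:my2, mx1:mx2] = 1.0

    return np.concatenate([patch, mask], axis=2)   # channel-wise stitching
```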
According to an embodiment, the single-task network may be a convolutional neural network (CNN) based network. The present invention employs an improved VGG16 network architecture, the specific structure of which is shown in Table 1 below.
TABLE 1
Network layer              | Conv1  | Conv2   | Conv3   | Conv4   | Fc1  | Fc2  | Fc3
Conventional VGG16 network | 3×3×64 | 3×3×128 | 3×3×256 | 3×3×512 | 4096 | 4096 | 1000
Improved VGG16 network     | 3×3×32 | 3×3×32  | 3×3×64  | 3×3×64  | 128  | 64   | 8
In table 1 above, conv represents a convolution layer and Fc represents a full-link layer. Further, in a parameter such as "3×3×64", 3×3 denotes a core size, 64 denotes the number of channels, and so on.
The probability that the depth of a target output from the single-task network (i.e., the improved VGG16 network) belongs to class k is defined as p_k, k = 0, 1, ..., 7. The single-task network may be trained by the SGD (stochastic gradient descent) algorithm. Define d_k = (k+1)×10, where d_k represents the average depth of the k-th depth interval. The depth d of the target can then be obtained in a probability-weighted manner, i.e.:

d = \sum_{k=0}^{7} p_k d_k    (1)
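A small sketch of the probability-weighted depth of equation (1), applied to the 8 logits produced by the network; the softmax step is an assumption about how the probabilities p_k are obtained from the network output.

```python
# Sketch of the probability-weighted depth of equation (1): the softmax output
# p_k over the 8 depth classes is combined with the interval means d_k = (k+1)*10.
import numpy as np

def weighted_depth(logits: np.ndarray) -> float:
    """logits: length-8 vector output by the single-task network."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # softmax probabilities p_k
    d_k = (np.arange(len(p)) + 1) * 10.0          # average depth of each interval
    return float(np.sum(p * d_k))                 # d = sum_k p_k * d_k
```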
Next, a method of estimating a target depth for an image according to another exemplary embodiment of the present disclosure will be described with reference to fig. 2B.
Fig. 2B illustrates a method 200B of estimating target depth for an image with unknown position, size, and class of targets.
In the case where the position, size, and class of the object in the image are unknown, as shown in fig. 2B, in step S210b the image is input into a multi-tasking network to estimate the position, size, and class of the object and the depth of the object using the multi-tasking network. The position of the object is the coordinates of the object in the image. The size of the target may be the size of a detection frame (generally represented by a rectangular frame) surrounding the target. Then, in step S220b, the position, size, and class of the object and the depth of the object are output from the multi-tasking network. Here, one example of a multi-tasking network is a network architecture based on faster-rcnn (Faster R-CNN, Faster Regions with Convolutional Neural Network Features). The operation of depth estimation of a target based on a multi-tasking network will be described in detail below with reference to fig. 4.
Fig. 4 shows a schematic diagram of depth estimation of a target with unknown target state according to an exemplary embodiment of the present disclosure.
In the case that the position, size, and class of the object in an image are unknown, the image is input into the multi-tasking network to output the position and size of the object, the class of the object, and the depth estimation result of the object. As shown in fig. 4, when an image is input, several layers of convolution operations and corresponding pooling operations are performed on the image to obtain shared features. Thereafter, the processing is divided into three branches, that is, target detection branch processing, target classification branch processing, and target depth estimation branch processing are performed on the input image. Through the target detection branch processing, the position and size of the target (for example, the size of a detection frame surrounding the target) are determined. The target position and size are input into the target classification branch, i.e., the class of the target is determined based on the position and size of the target through the target classification branch processing. Thereafter, the target position, size, and class are input into the target depth estimation branch, i.e., the depth of the target is determined based on the position, size, and class of the target through the target depth estimation branch processing. In this way, the invention can provide the required target area and category information from the first two branches when performing depth estimation. A sliding window or region proposal mechanism is used to provide candidate boxes. Similar to faster-rcnn, a plurality of anchors may be defined at each location, and the result corresponding to the most appropriate anchor is selected for output.
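The three-branch structure described above could be sketched as a detection head with an added depth branch, as below; the feature dimension, class count, and simple linear-layer heads are illustrative assumptions rather than the patent's exact architecture.

```python
# Conceptual PyTorch sketch of the three-branch head: per-ROI features feed a
# classification branch, a box-regression branch, and a depth branch.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, in_features: int = 1024, num_classes: int = 5, num_depth_bins: int = 8):
        super().__init__()
        self.cls_branch = nn.Linear(in_features, num_classes)        # target classification
        self.box_branch = nn.Linear(in_features, 4 * num_classes)    # detection box (4 params)
        self.depth_branch = nn.Linear(in_features, num_depth_bins)   # depth interval probabilities

    def forward(self, roi_feats: torch.Tensor):
        """roi_feats: (num_rois, in_features) pooled features from the shared backbone."""
        cls_logits = self.cls_branch(roi_feats)
        box_deltas = self.box_branch(roi_feats)
        depth_logits = self.depth_branch(roi_feats)
        return cls_logits, box_deltas, depth_logits
```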
The loss function of the multi-tasking network can be obtained by adding the loss information of depth to the loss function of faster-rcnn. The loss function of the multi-tasking network is defined as follows:

L(\{p_i\}, \{t_i\}, \{d_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda_1 \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) + \lambda_2 \frac{1}{N_{depth}} \sum_i p_i^* L_{depth}(d_i, d_i^*)    (2)

wherein:
i is the index of an anchor in the mini-batch,
p_i is the predicted class of the i-th anchor,
t_i is the detection box with 4 parameters,
d_i is the predicted depth,
L_cls and L_depth are both multinomial logistic regression (softmax) loss functions,
L_reg is the smooth L1 loss function,
p_i^* indicates, according to the GT (GT refers to ground truth, manually labeled), whether the current anchor is a positive anchor,
t_i^* is the GT detection box,
d_i^* is the GT depth,
N_cls, N_reg, and N_depth are normalization terms, and
λ_1 and λ_2 are loss weight terms.
The network may be trained by SGD algorithms.
Specific loss functions can be seen in faster-rcnn (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015).
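A hedged PyTorch sketch of a loss in the spirit of equation (2) is shown below; the choice of normalizing the regression and depth terms by the number of positive anchors, and the default weight values, are assumptions for illustration.

```python
# Sketch of a multi-task loss following equation (2): softmax losses for class
# and depth, smooth L1 for box regression, with the regression and depth terms
# applied only to positive anchors.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, box_pred, depth_logits,
                   cls_gt, box_gt, depth_gt, is_positive,
                   lambda1: float = 1.0, lambda2: float = 1.0) -> torch.Tensor:
    n_cls = cls_logits.shape[0]
    n_pos = is_positive.sum().clamp(min=1)

    l_cls = F.cross_entropy(cls_logits, cls_gt, reduction="sum") / n_cls
    l_reg = F.smooth_l1_loss(box_pred[is_positive], box_gt[is_positive],
                             reduction="sum") / n_pos
    l_depth = F.cross_entropy(depth_logits[is_positive], depth_gt[is_positive],
                              reduction="sum") / n_pos
    return l_cls + lambda1 * l_reg + lambda2 * l_depth
```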
The network outputs the location and size of the target, the class of the target, and the depth information. As in faster-rcnn, a number of candidate boxes are obtained, and for each candidate box the classification confidence, the detection box, and the depth can be output simultaneously through one forward pass of the network. Boxes not belonging to a target are filtered out based on a classification confidence threshold and non-maximum suppression; for each remaining box, the corresponding category, detection box, and depth information can be output directly. The depth information of the target may be the optimal (i.e., maximum-probability) depth interval of the target, or may be a probability-weighted depth obtained according to equation (1) above.
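The candidate-box filtering described above might be implemented along the following lines; the torchvision NMS call and the threshold values are assumptions for illustration.

```python
# Illustrative post-processing sketch: keep candidate boxes above a confidence
# threshold, apply non-maximum suppression, and report class, box, and the
# probability-weighted depth of each surviving box.
import torch
from torchvision.ops import nms

def postprocess(boxes, cls_probs, depth_probs, score_thr: float = 0.5, iou_thr: float = 0.45):
    scores, labels = cls_probs.max(dim=1)
    keep = scores > score_thr                     # filter by classification confidence
    boxes, scores, labels, depth_probs = boxes[keep], scores[keep], labels[keep], depth_probs[keep]

    keep = nms(boxes, scores, iou_thr)            # non-maximum suppression
    d_k = (torch.arange(depth_probs.shape[1], dtype=torch.float32) + 1) * 10.0
    weighted_depth = (depth_probs[keep] * d_k).sum(dim=1)
    return boxes[keep], labels[keep], weighted_depth
```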
Another example of a multi-tasking network is a YOLO2 (You Only Look Once, version 2) based network architecture. The network architecture is shown in Table 2 below.
TABLE 2
The anchor concept is also employed in YOLO2. When an image is input, the invention performs convolution and pooling operations on the image to finally obtain a convolutional layer. The dimension of the last convolutional layer is w×h×s, where w and h represent the width and height, respectively, of the reduced image, and s is the length of the vector at each location; this corresponds to dividing the image into a number of grid cells. Each grid cell in the last convolutional layer includes information for a plurality of anchors. Defining R_i as the detection box of the i-th anchor, P_i as the class probabilities of the i-th anchor, and D_i as the depth of the i-th anchor, the vector for each grid cell can be expressed as [R_1, ..., R_K, P_1, ..., P_K, D_1, ..., D_K].
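Decoding the per-cell vector [R_1, ..., R_K, P_1, ..., P_K, D_1, ..., D_K] could be sketched as follows; the exact memory layout of the output tensor is an assumption consistent with the description above.

```python
# Sketch of decoding one grid-cell vector of the last w x h x s convolutional
# layer: each of the K anchors contributes a 4-parameter box R, a class
# probability vector P, and a depth D.
import numpy as np

def decode_cell(vec: np.ndarray, num_anchors: int, num_classes: int):
    """vec has length K*4 + K*num_classes + K (boxes, class probs, depths)."""
    K = num_anchors
    boxes = vec[:4 * K].reshape(K, 4)                                   # R_1..R_K
    probs = vec[4 * K:4 * K + K * num_classes].reshape(K, num_classes)  # P_1..P_K
    depths = vec[4 * K + K * num_classes:]                              # D_1..D_K
    return boxes, probs, depths
```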
The loss function of the multi-tasking network can be obtained by adding the loss information of depth on the basis of the loss function of YOLO2. The loss function of the multi-tasking network can be expressed as:

L = \lambda_{coord} \sum_{i=1}^{N} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_{ij} - \hat{x}_i)^2 + (y_{ij} - \hat{y}_i)^2 + (w_{ij} - \hat{w}_i)^2 + (h_{ij} - \hat{h}_i)^2 \right] + \sum_{i=1}^{N} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \sum_{c \in classes} (P_{ij}(c) - \hat{P}_{ij}(c))^2 + \lambda_{noobj} \sum_{i=1}^{N} \sum_{j=1}^{B} (1 - \mathbb{1}_{ij}^{obj}) \sum_{c \in classes} (P_{ij}(c) - \hat{P}_{ij}(c))^2 + \sum_{i=1}^{N} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} (D_{ij} - \hat{D}_{ij})^2    (3)

wherein:
λ_coord and λ_noobj are the weights of the coordinate term and the no-object term, respectively,
N is the number of cells of the last convolutional layer, i.e., width × height,
B is the number of anchors,
\mathbb{1}_{ij}^{obj} marks whether the j-th anchor of the i-th cell (pixel) contains an object, i.e., it is 1 if there is an object and 0 otherwise,
\hat{x}_i, \hat{y}_i, \hat{w}_i, and \hat{h}_i denote the x-coordinate, y-coordinate, width w, and height h of the GT,
(x_{ij}, y_{ij}, w_{ij}, h_{ij}) is the detection box actually produced for the current anchor,
P_{ij}(c) is the probability that the current anchor belongs to class c,
D_{ij} is the depth of the target to which the current anchor corresponds,
\hat{P}_{ij}(c) is the GT value of the probability that the j-th anchor of the i-th cell (pixel) contains an object of class c,
\hat{D}_{ij} is the GT value of the depth of the object of the j-th anchor of the i-th cell (pixel), and
classes is the set of categories; \sum_{c \in classes} denotes summing the corresponding term over each category.
Specific loss function parameters can be found in YOLO (You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016).
When a target exists in a certain grid cell, the loss function drives the detected rectangular box to be as close as possible to the actual box, the overlap rate between the detected box and the GT box to be as high as possible, and the estimated target depth to be as close as possible to the actual depth. When a grid cell does not contain an object, the loss function drives the probability of that cell detecting an object to be as small as possible. The network may be trained by the SGD algorithm. The individual loss terms listed in equations (2) and (3) above are not limited to the forms shown there, and may be at least one of square error, cross entropy, and multinomial logistic regression (softmax) log loss.
In the case of a single input image, objects in the image may be detected, classified, and depth-estimated based on single-layer features. Once the last convolutional layer is obtained, whether a target exists in a grid cell and which category it belongs to can be judged from the obtained class probabilities. When a grid cell is judged to contain an object, the detection box of the object can be obtained from the corresponding anchor, together with the depth corresponding to that box. The final depth information of the object may be the optimal (i.e., maximum-probability) depth interval of the object, or may be a probability-weighted depth obtained according to equation (1) above.
According to another embodiment of the present disclosure, in the case of processing multiple scales (i.e., sampling one image to obtain multiple images of different sizes), objects in the image may be detected, classified, and depth-estimated from multi-layer features, similar to SSD (Single Shot MultiBox Detector). These feature layers of different scales may be connected to different prediction layers or to the same prediction layer. Figs. 5 and 6 show the cases where multiple feature layers are connected to different prediction layers and to the same prediction layer, respectively, for a YOLO2-based network framework.
In fig. 5, the feature layers connected to different prediction layers are classified, detected, and depth-estimated separately. In fig. 6, two feature layers are connected to the same prediction layer, that is, the parameters of the prediction layer are shared, but the features of the different layers are predicted separately to obtain results for targets of different scales. The final result is obtained from the detection boxes produced by the different feature layers according to the class confidence values and non-maximum suppression.
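A minimal sketch of the shared prediction layer of fig. 6 is given below; the 1×1 convolutional head and the channel counts are assumptions, and only the idea of applying one set of prediction parameters to feature maps of different scales follows the description.

```python
# Minimal sketch of a shared prediction head applied to feature maps of
# different scales; the per-scale outputs are later merged by confidence
# filtering and non-maximum suppression.
import torch
import torch.nn as nn

class SharedPredictionHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 5,
                 num_classes: int = 5, num_depth_bins: int = 8):
        super().__init__()
        out_channels = num_anchors * (4 + num_classes + num_depth_bins)
        self.pred = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feature_maps):
        # The same parameters are shared, but each feature layer is predicted
        # separately, yielding results for targets of different scales.
        return [self.pred(f) for f in feature_maps]
```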
A schematic structure of an apparatus according to an exemplary embodiment of the present disclosure will be described below with reference to fig. 7. Fig. 7 shows a schematic structural diagram of an apparatus 700 according to an exemplary embodiment of the present disclosure. The apparatus 700 may be used to perform the method 100 described with reference to fig. 1. For brevity, the schematic structure of the apparatus according to an exemplary embodiment of the present disclosure is described herein, and details that have been detailed in the method as described previously with reference to fig. 1 are omitted.
As shown in fig. 7, the apparatus 700 may include a receiver 701 for receiving an input image; a processing unit or processor 703, which processor 703 may be a single unit or a combination of units for performing the different steps of the method; memory 705 having computer executable instructions stored therein. Here, the input image may be a monocular image photographed by a single vision sensor (e.g., camera).
According to an embodiment of the present disclosure, the instructions, when executed by the processor 703, cause the processor 703 to estimate the depth of the target according to the position, size and class of the target in the input image (as described in step S120 of fig. 1, which is not repeated here). Specifically, in the case where the position, size, and category of the target are known, the depth of the target is estimated through single-task network learning (as described in steps S210a to S230a of fig. 2A, which are not repeated here); in the case where the position, size, and class of the target are unknown, the depth of the target is estimated through the multi-tasking network learning (as described in steps S210B to S220B of fig. 2B, which will not be repeated here).
With the above technical scheme, the target depth in an image can be estimated with high precision using a single camera. Experimental results show that the method of the invention reduces the error by a factor of about 1.4 compared with the current best monocular depth estimation method; the RMSE (root mean square error) is reduced from about 4.1 meters to about 2.9 meters. That is, with the target depth estimation method of the present invention, estimation accuracy can be improved while cost is reduced, which is particularly advantageous in the field of automated driving or assisted driving.
As will be appreciated by those skilled in the art, the program running on the apparatus according to the present disclosure may be a program that causes a computer to realize the functions of the embodiments of the present disclosure by controlling a central processing unit (CPU). The program, or the information processed by the program, may be temporarily stored in volatile memory such as random access memory (RAM), or stored in a hard disk drive (HDD), nonvolatile memory such as flash memory, or another storage system.
A program for realizing the functions of the embodiments of the present disclosure may be recorded on a computer-readable recording medium. The corresponding functions can be realized by causing a computer system to read programs recorded on the recording medium and execute the programs. The term "computer system" as used herein may be a computer system embedded in the device and may include an operating system or hardware (e.g., peripheral devices). The "computer-readable recording medium" may be a semiconductor recording medium, an optical recording medium, a magnetic recording medium, a recording medium in which a program is stored dynamically at a short time, or any other recording medium readable by a computer.
The various features or functional modules of the apparatus used in the embodiments described above may be implemented or performed by circuitry (e.g., single-chip or multi-chip integrated circuits). Circuits designed to perform the functions described herein may include a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. The circuit may be a digital circuit or an analog circuit. Where new integrated circuit technologies are presented as an alternative to existing integrated circuits due to advances in semiconductor technology, one or more embodiments of the present disclosure may also be implemented using these new integrated circuit technologies.
As above, the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings. The specific structure is not limited to the above-described embodiments, but the present disclosure also includes any design modifications without departing from the gist of the present disclosure. In addition, various modifications can be made to the present disclosure within the scope of the claims, and embodiments obtained by appropriately combining the technical means disclosed in the different embodiments are also included in the technical scope of the present disclosure. Further, the components having the same effects described in the above embodiments may be replaced with each other.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (19)

1. A method of processing an image, comprising:
Receiving an input image; and
Estimating the depth of the object from the position, size and class of the object in the image,
Wherein estimating the depth of the target comprises:
Estimating the depth of a target through single-task network learning under the condition that the position, the size and the category of the target are known;
In the case that the position, size and category of the target are unknown, the depth of the target is estimated through multitasking network learning.
2. The method of claim 1, wherein the size of the target is a size of a detection frame surrounding the target.
3. The method of claim 1, prior to estimating the depth of the target, the method further comprising: the received image is preprocessed.
4. A method according to claim 3, wherein preprocessing the received image comprises: and normalizing the image according to the focal length information and the standard focal length information of the image.
5. The method of claim 1, wherein estimating the depth of the target through single-task web learning comprises:
cutting out image blocks around the target in the image, and carrying out mask processing on the image blocks to obtain mask images with the same size;
splicing the image blocks and the mask image together according to channels;
inputting the spliced images into a single-task network;
the depth of the target is output from the single-tasking network.
6. The method of claim 5, wherein outputting the depth of the target from a single-tasking network comprises:
Determining the probability that the depth of the target belongs to each preset depth interval, obtaining the final result of the depth of the target in a probability weighting mode, and outputting the final result.
7. The method of claim 1, wherein estimating the depth of the target by multitasking network learning comprises: a multi-tasking network is employed to estimate the location, size, class, and depth of the target.
8. The method of claim 1, wherein the multitasking network comprises a plurality of convolutional layers and corresponding pooling layers.
9. The method of claim 1, wherein the multitasking network is a network based on faster-rcnn (Faster Regions with Convolutional Neural Network Features), and the loss function of the multitasking network is a loss function obtained by adding depth loss information to the loss function of faster-rcnn.
10. The method of any of claims 1-9, wherein estimating the depth of the target by multitasking network learning comprises:
Performing target detection branch processing, target classification branch processing and target depth estimation branch processing on the image:
determining the position and size of the object through the object detection branch processing,
Determining, by the object classification branch processing, a class of the object based on a position and a size of the object; and
And determining the depth of the target based on the position, the size and the category of the target through the target depth estimation branch processing.
11. The method according to any of claims 1-9, wherein the multitasking network is a "YOLO2" based network and the loss function of the network is a loss function adding depth loss information on the basis of the loss function of YOLO 2.
12. The method of claim 8, wherein the location and size of the target, the class of the target, and the depth of the target are output via a last convolutional layer of the plurality of convolutional layers.
13. The method of claim 12, wherein each grid cell in the last convolutional layer includes information for a plurality of anchors.
14. The method of claim 11, wherein single-layer features or multi-layer features are employed to estimate the position and size of the target, the class of the target, and the depth of the target.
15. The method of claim 14, wherein the multi-layer features are obtained by different prediction layers or the same prediction layer.
16. The method of claim 10, wherein the target depth estimation branch processing comprises:
determining the probability that the depth of the target belongs to each preset depth interval; and
And taking the depth interval with the maximum probability of the target as the depth of the target, or obtaining the depth of the target by using a probability weighting mode.
17. The method of claim 9, wherein the loss function comprises at least one of a square error, a cross entropy, and a multinomial logistic regression (softmax) log loss.
18. The method of claim 1, wherein the target comprises at least one of a person, a vehicle, a traffic light, and a traffic sign in the image.
19. An apparatus for processing an image, comprising:
A receiver configured to receive an input image;
A processor; and
Memory storing computer executable instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1-18.
CN201810024743.0A 2018-01-10 2018-01-10 Method and apparatus for image processing Active CN110033481B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201810024743.0A CN110033481B (en) 2018-01-10 2018-01-10 Method and apparatus for image processing
KR1020180090827A KR102661954B1 (en) 2018-01-10 2018-08-03 A method of processing an image, and apparatuses performing the same
US16/237,952 US11107229B2 (en) 2018-01-10 2019-01-02 Image processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810024743.0A CN110033481B (en) 2018-01-10 2018-01-10 Method and apparatus for image processing

Publications (2)

Publication Number Publication Date
CN110033481A CN110033481A (en) 2019-07-19
CN110033481B true CN110033481B (en) 2024-07-02

Family

ID=67234124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810024743.0A Active CN110033481B (en) 2018-01-10 2018-01-10 Method and apparatus for image processing

Country Status (2)

Country Link
KR (1) KR102661954B1 (en)
CN (1) CN110033481B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796103A (en) * 2019-11-01 2020-02-14 邵阳学院 Target based on fast-RCNN and distance detection method thereof
CN112926370A (en) * 2019-12-06 2021-06-08 纳恩博(北京)科技有限公司 Method and device for determining perception parameters, storage medium and electronic device
CN111179253B (en) * 2019-12-30 2023-11-24 歌尔股份有限公司 Product defect detection method, device and system
CN111191621B (en) * 2020-01-03 2024-06-28 北京同方软件有限公司 Rapid and accurate identification method for multi-scale target in large-focal-length monitoring scene
KR102538848B1 (en) * 2020-05-12 2023-05-31 부산대학교병원 Deep learning architecture system for real time quality interpretation of fundus image
KR20220013875A (en) * 2020-07-27 2022-02-04 옴니어스 주식회사 Method, system and non-transitory computer-readable recording medium for providing information regarding products based on trends
WO2022025568A1 (en) * 2020-07-27 2022-02-03 옴니어스 주식회사 Method, system, and non-transitory computer-readable recording medium for recognizing attribute of product by using multi task learning
CN112686887A (en) * 2021-01-27 2021-04-20 上海电气集团股份有限公司 Method, system, equipment and medium for detecting concrete surface cracks
KR102378887B1 (en) * 2021-02-15 2022-03-25 인하대학교 산학협력단 Method and Apparatus of Bounding Box Regression by a Perimeter-based IoU Loss Function in Object Detection
CN112967187B (en) * 2021-02-25 2024-05-31 深圳海翼智新科技有限公司 Method and apparatus for target detection
KR102437760B1 (en) 2021-05-27 2022-08-29 이충열 Method for processing sounds by computing apparatus, method for processing images and sounds thereby, and systems using the same
KR20230137814A (en) 2022-03-22 2023-10-05 이충열 Method for processing images obtained from shooting device operatively connected to computing apparatus and system using the same
CN116385984B (en) * 2023-06-05 2023-09-01 武汉理工大学 Automatic detection method and device for ship draft
CN116883990A (en) * 2023-07-07 2023-10-13 中国科学技术大学 Target detection method for stereoscopic vision depth perception learning
CN117333487B (en) * 2023-12-01 2024-03-29 深圳市宗匠科技有限公司 Acne classification method, device, equipment and storage medium

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337076B (en) * 2013-06-26 2016-09-21 深圳市智美达科技股份有限公司 There is range determining method and device in video monitor object
JP6351238B2 (en) * 2013-11-15 2018-07-04 キヤノン株式会社 Image processing apparatus, imaging apparatus, and distance correction method
US9495606B2 (en) * 2014-02-28 2016-11-15 Ricoh Co., Ltd. Method for product recognition from multiple images
US20150381972A1 (en) * 2014-06-30 2015-12-31 Microsoft Corporation Depth estimation using multi-view stereo and a calibrated projector
US9272417B2 (en) * 2014-07-16 2016-03-01 Google Inc. Real-time determination of object metrics for trajectory planning
JP2016035623A (en) * 2014-08-01 2016-03-17 キヤノン株式会社 Information processing apparatus and information processing method
CN104279960B (en) * 2014-10-14 2017-01-25 安徽大学 Method for measuring size of object by mobile equipment
AU2015203666A1 (en) * 2015-06-30 2017-01-19 Canon Kabushiki Kaisha Methods and systems for controlling a camera to perform a task
CN105260356B (en) * 2015-10-10 2018-02-06 西安交通大学 Chinese interaction text emotion and topic detection method based on multi-task learning
US10049267B2 (en) * 2016-02-29 2018-08-14 Toyota Jidosha Kabushiki Kaisha Autonomous human-centric place recognition
US10032067B2 (en) * 2016-05-28 2018-07-24 Samsung Electronics Co., Ltd. System and method for a unified architecture multi-task deep learning machine for object recognition
CN106529402B (en) * 2016-09-27 2019-05-28 中国科学院自动化研究所 The face character analysis method of convolutional neural networks based on multi-task learning
US11094208B2 (en) 2016-09-30 2021-08-17 The Boeing Company Stereo camera system for collision avoidance during aircraft surface operations
CN106780303A (en) * 2016-12-02 2017-05-31 上海大学 A kind of image split-joint method based on local registration
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method
CN107492115B (en) * 2017-08-30 2021-01-01 北京小米移动软件有限公司 Target object detection method and device

Also Published As

Publication number Publication date
KR20190085464A (en) 2019-07-18
KR102661954B1 (en) 2024-04-29
CN110033481A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
CN110033481B (en) Method and apparatus for image processing
CN109034078B (en) Training method of age identification model, age identification method and related equipment
CN110826379B (en) Target detection method based on feature multiplexing and YOLOv3
KR102337367B1 (en) Learning method and learning device for object detector with hardware optimization based on cnn for detection at distance or military purpose using image concatenation, and testing method and testing device using the same
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN110796048A (en) Ship target real-time detection method based on deep neural network
CN113420682B (en) Target detection method and device in vehicle-road cooperation and road side equipment
CN112926461B (en) Neural network training and driving control method and device
EP3495989A1 (en) Best image crop selection
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
CN115170792B (en) Infrared image processing method, device and equipment and storage medium
CN113627229A (en) Object detection method, system, device and computer storage medium
CN108229473A (en) Vehicle annual inspection label detection method and device
CN111967396A (en) Processing method, device and equipment for obstacle detection and storage medium
CN112101114B (en) Video target detection method, device, equipment and storage medium
CN112699711A (en) Lane line detection method, lane line detection device, storage medium, and electronic apparatus
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN112396594B (en) Method and device for acquiring change detection model, change detection method, computer equipment and readable storage medium
CN111178181B (en) Traffic scene segmentation method and related device
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
CN111553474A (en) Ship detection model training method and ship tracking method based on unmanned aerial vehicle video
US20240221426A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN116363628A (en) Mark detection method and device, nonvolatile storage medium and computer equipment
CN115937596A (en) Target detection method, training method and device of model thereof, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant