CN110033481B - Method and apparatus for image processing - Google Patents

Method and apparatus for image processing

Info

Publication number
CN110033481B
CN110033481B CN201810024743.0A CN201810024743A
Authority
CN
China
Prior art keywords
target
depth
image
size
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810024743.0A
Other languages
Chinese (zh)
Other versions
CN110033481A (en)
Inventor
刘志花
马林
李源煕
安敏洙
高天豪
洪成勋
王淳
王光伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to CN201810024743.0A priority Critical patent/CN110033481B/en
Priority to KR1020180090827A priority patent/KR102661954B1/en
Priority to US16/237,952 priority patent/US11107229B2/en
Publication of CN110033481A publication Critical patent/CN110033481A/en
Application granted granted Critical
Publication of CN110033481B publication Critical patent/CN110033481B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/20Image enhancement or restoration using local operators
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method and apparatus for processing an image are disclosed. The method comprises the following steps: receiving an input image; and estimating the depth of the predetermined target according to the position, size and category of the predetermined target in the image.

Description

Method and apparatus for image processing
Technical Field
The present disclosure relates to methods and apparatus for image processing, and more particularly, to methods and apparatus for object detection, object classification, and object depth estimation of images.
Background
The depth estimation of the target object in the image may be applicable to various scenarios, in particular in automatic driving or assisted driving. Existing methods of depth estimation mainly include a stereoscopic vision-based method, a laser ranging-based method, a target size-based method, and the like.
Existing depth estimation methods fall mainly into two categories. The first obtains depth directly from hardware equipment, such as a Velodyne lidar device: such equipment provides high-precision depth estimation, but it is bulky and expensive, and the resulting depth map is sparse and of low resolution. The second obtains depth from low-cost vision sensors, for example from two vision sensors; this approach becomes very inaccurate when the target is far from the sensors, because the lines of sight are then almost parallel.
Capturing a monocular image from a single vision sensor (e.g., a camera) and estimating depth from the monocular image based on deep learning is becoming increasingly popular, but the major drawbacks of this approach include (1) heavy reliance on training data and (2) low accuracy.
Therefore, there is a need for a low cost and high accuracy method and apparatus for estimating the depth of a target in an image.
Disclosure of Invention
The present disclosure provides a method and apparatus for performing image processing. In particular, the present disclosure relates to a method and apparatus for estimating the depth of a target from a monocular image, based on the property that, at the same focal length, nearby objects appear large and distant objects appear small.
According to one aspect of the present disclosure, a method of processing an image is disclosed, the method comprising: receiving an input image; the depth of the predetermined target is estimated from the position, size and class of the predetermined target in the image.
In one embodiment of the present disclosure, estimating the depth of the predetermined target comprises: estimating the depth of a predetermined target through single-task network learning under the condition that the position, the size and the category of the predetermined target are known; and/or estimating the depth of the predetermined target by multi-tasking network learning in case the position, size and class of the predetermined target are unknown.
In one embodiment of the present disclosure, the position of the predetermined target is the coordinates of the predetermined target in the entire image.
In one embodiment of the present disclosure, the size of the predetermined target is a size of a detection frame surrounding the predetermined target.
In one embodiment of the present disclosure, before estimating the depth of the predetermined target, the method further comprises: the received image is preprocessed.
In one embodiment of the present disclosure, preprocessing a received image includes: and normalizing the image according to the focal length information and the standard focal length information of the image.
In one embodiment of the present disclosure, estimating the depth of the predetermined target through single-task network learning comprises: cutting out an image block of a predetermined size around the detection frame of the predetermined target in the image, and performing mask processing on the image block to obtain a mask image of the same size; stitching the image block and the mask image together by channel; inputting the stitched image into a single-task network; and outputting the depth of the predetermined target from the single-task network.
In one embodiment of the present disclosure, outputting the depth of the predetermined target from the single-tasking network comprises: determining the probability that the depth of the preset target belongs to each preset depth interval, obtaining the final result of the depth of the preset target in a probability weighting mode, and outputting the final result.
In one embodiment of the present disclosure, estimating the depth of the predetermined target through multitasking network learning comprises: a multitasking network is employed to estimate the location and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, a multitasking network includes a plurality of convolutional layers and corresponding pooling layers.
In one embodiment of the present disclosure, the multi-tasking network is a network based on faster-rcnn (Faster Regions with Convolutional Neural Network Features), and the loss function of the multi-tasking network is the loss function of faster-rcnn plus depth loss information.
In one embodiment of the present disclosure, estimating the depth of the predetermined target through multitasking network learning includes: performing target detection branch processing, target classification branch processing, and target depth estimation branch processing on the image: determining the position and size of the predetermined target through the target detection branch processing; determining the category of the predetermined target based on the position and size of the predetermined target through the target classification branch processing; and determining the depth of the predetermined target based on the position, size, and category of the predetermined target through the target depth estimation branch processing.
In one embodiment of the present disclosure, the multi-tasking network is a YOLO2-based network, and the loss function of the multi-tasking network is the loss function of YOLO2 plus depth loss information.
In one embodiment of the present disclosure, the position and size of the predetermined target, the category of the predetermined target, and the depth of the predetermined target are output via a last convolutional layer of the plurality of convolutional layers.
In one embodiment of the present disclosure, each grid cell in the last convolutional layer includes information of a plurality of anchors.
In one embodiment of the present disclosure, single-layer or multi-layer features are employed to estimate the location and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, the multi-layer features are obtained by different prediction layers or the same prediction layer.
In one embodiment of the present disclosure, the target depth estimation branch processing includes: determining the probability that the depth of the preset target belongs to each preset depth interval; and taking the depth interval with the maximum probability of the preset target as the depth of the preset target, or obtaining the depth of the preset target in a probability weighting mode.
In one embodiment of the present disclosure, the loss function includes at least one of a square error, a cross entropy, and a multinomial logistic regression (softmax) log loss.
In one embodiment of the present disclosure, the predetermined target includes at least one of a person, a vehicle, a traffic light, and a traffic sign in the image.
According to another aspect of the present disclosure, there is provided an apparatus for processing an image, including: a receiver configured to receive an input image; a processor; and a memory storing computer executable instructions that, when executed by the processor, cause the processor to: the depth of the predetermined target is estimated from the position, size and class of the predetermined target in the image.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: estimating the depth of a predetermined target through single-task network learning under the condition that the position, the size and the category of the predetermined target are known; and/or estimating the depth of the predetermined target by multi-tasking network learning in case the position, size and class of the predetermined target are unknown.
In one embodiment of the present disclosure, the position of the predetermined target is the coordinates of the predetermined target in the entire image.
In one embodiment of the present disclosure, the size of the predetermined target is a size of a detection frame surrounding the predetermined target.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: the received image is preprocessed before estimating the depth of the predetermined object.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: and normalizing the image according to the focal length information and the standard focal length information of the image.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: cutting out image blocks with preset sizes around a detection frame of the preset target in the image, and carrying out mask processing on the image blocks to obtain mask images with the same sizes; splicing the image blocks and the mask image together according to channels; inputting the spliced images into a single-task network; the depth of the predetermined target is output from the single-tasking network.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: determining the probability that the depth of the preset target belongs to each preset depth interval, obtaining the final result of the depth of the preset target in a probability weighting mode, and outputting the final result.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: a multitasking network is employed to estimate the location and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, a multitasking network includes a plurality of convolutional layers and corresponding pooling layers.
In one embodiment of the present disclosure, the multi-tasking network is a network based on faster-rcnn (Faster Regions with Convolutional Neural Network Features), and the loss function of the multi-tasking network is the loss function of faster-rcnn plus depth loss information.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: performing target detection branch processing, target classification branch processing and target depth estimation branch processing on the image: determining a position and a size of the predetermined target through the target detection branch processing, and determining a category of the predetermined target based on the position and the size of the predetermined target through the target classification branch processing; and determining, by the target depth estimation branch processing, a depth of the predetermined target based on a position, a size, and a category of the predetermined target.
In one embodiment of the present disclosure, the multi-tasking network is a YOLO2-based network, and the loss function of the multi-tasking network is the loss function of YOLO2 plus depth loss information.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: the position and size of the predetermined target, the category of the predetermined target, and the depth of the predetermined target are output via a last convolutional layer of the plurality of convolutional layers.
In one embodiment of the present disclosure, each grid cell in the last convolutional layer includes information of a plurality of anchors.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: single-layer or multi-layer features are employed to estimate the position and size of the predetermined target, the class of the predetermined target, and the depth of the predetermined target.
In one embodiment of the present disclosure, the multi-layer features are obtained by different prediction layers or the same prediction layer.
In one embodiment of the disclosure, the instructions, when executed by the processor, cause the processor to: determining the probability that the depth of the preset target belongs to each preset depth interval; and taking the depth interval with the maximum probability of the preset target as the depth of the preset target, or obtaining the depth of the preset target in a probability weighting mode.
In one embodiment of the present disclosure, the loss function includes at least one of a square error, a cross entropy, and a multinomial logistic regression (softmax) log loss.
In one embodiment of the present disclosure, the predetermined targets include, but are not limited to, persons, vehicles, traffic lights, and traffic signs in the image.
With the solution of the above-described embodiments of the present disclosure, it is possible to perform high-precision target depth estimation of an image using only one visual sensor.
Drawings
For a better understanding of the present invention, the present invention will be described in detail with reference to the following drawings:
FIG. 1 illustrates a method of processing an image according to an exemplary embodiment of the present disclosure;
FIG. 2A illustrates a method of target depth estimation of an image according to an exemplary embodiment of the present disclosure;
FIG. 2B illustrates a method of target depth estimation of an image according to another exemplary embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of depth estimation of a target with the state of the target known, according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of depth estimation of a target with unknown target state according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a scenario where multiple feature layers are connected to different prediction layers in the case of a YOLO 2-based network framework;
FIG. 6 illustrates a scenario where multiple feature layers are connected to the same prediction layer in the case of a YOLO 2-based network framework; and
Fig. 7 shows a schematic structural view of an apparatus according to an exemplary embodiment of the present disclosure.
Detailed Description
The present application will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical means and advantages of the present application more apparent. The embodiments described below by referring to the drawings are exemplary only for the purpose of illustrating the present disclosure and are not to be construed as limiting the present disclosure.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs, unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Fig. 1 illustrates a method 100 of processing an image according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, in step S110, an input image is received. The input image may be a monocular image taken by a single vision sensor (e.g., a camera).
In step S120, the depth of the target is estimated from the position, size, and category of the target in the input image. The category of the object may be at least one of a person, a vehicle, a traffic light, and a traffic sign in the image. The position of the object is the coordinates of the object in the image. The size of the target may be the size of a detection frame (generally represented by a rectangular frame) surrounding the target. In particular, where the location, size, and class of a target are known, the depth of the target is estimated by single-task network learning, which refers to learning through a network with only one task (e.g., depth estimation). In the case where the position, size, and class of the target are unknown, the depth of the target is estimated by multitasking network learning, which refers to learning through a network having a plurality of tasks (e.g., three tasks including target detection, target identification, depth estimation). Hereinafter, estimating the depth of the target by the single-task network learning and estimating the depth of the target by the multi-task network learning will be described in detail with reference to fig. 2A and 2B, respectively.
According to one embodiment of the present disclosure, the method may further include a step of preprocessing the received image before step S120. The preprocessing may include normalizing the image according to the focal length information and standard focal length information of the input image. For example, since images may be taken by different cameras at different focal lengths, the same object may be displayed at different sizes in photographs taken at different focal lengths, resulting in a difference in the inferred depth of the object. According to an embodiment of the present disclosure, given a standard focal length f_0, for any image with width w, height h, and focal length f, the width and height of the image are normalized to w' = f_0·w/f and h' = f_0·h/f, where w' is the width of the normalized image and h' is the height of the normalized image. Each pixel is obtained by interpolation according to the ratio of w to w' (or h to h').
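As an illustration of the focal-length normalization just described, the following Python sketch rescales an image given its focal length f and the standard focal length f_0. The function name and the use of OpenCV bilinear interpolation are assumptions for illustration, not part of the disclosure.

```python
# Hypothetical sketch of the focal-length normalization step described above.
# The standard focal length f0 and the per-image focal length f are assumed to
# be known (e.g., from camera calibration or EXIF metadata).
import cv2
import numpy as np

def normalize_by_focal_length(image: np.ndarray, f: float, f0: float) -> np.ndarray:
    """Rescale an image so that objects appear at the size they would have
    at the standard focal length f0 (w' = f0*w/f, h' = f0*h/f)."""
    h, w = image.shape[:2]
    new_w = int(round(f0 * w / f))
    new_h = int(round(f0 * h / f))
    # Pixels are obtained by interpolation according to the ratio w/w' (or h/h').
    return cv2.resize(image, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
```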
Next, a method of estimating a target depth for an image according to an exemplary embodiment of the present disclosure will be described with reference to fig. 2A.
Fig. 2A illustrates a method 200a of estimating target depth for an image with the location, size, and class of the target known.
In the case where the position, size, and type of the object in the image are known, as shown in fig. 2A, in step S210a, an image block of a predetermined size is cut out around the object in the image, masking processing is performed on the image block to obtain a masked image of the same size, and the image block and the masked image are concatenated together per channel.
Thereafter, in step S220a, the stitched together images are input into a single-tasking network. In step S230a, the depth of the target is output from the single-tasking network.
A method of depth estimation of a target in the case where the state of the target is known will be described in detail with reference to fig. 3. Fig. 3 shows a schematic diagram of depth estimation of a target with the state of the target known according to an exemplary embodiment of the present disclosure.
In the field of autonomous driving and assisted driving, the KITTI dataset is typically used to test algorithms such as vehicle detection, vehicle tracking, and semantic segmentation in traffic scenarios. In the KITTI dataset, all depth data are obtained by lidar scanning. Analysis shows that the detection depth range of the radar is approximately 5-85 meters. For simplicity of description, the present disclosure divides this range equally into 8 intervals, i.e., 8 classes, such as [5, 15], [15, 25], ..., [75, 85]. The intervals may also be divided non-uniformly, for example denser at short range and sparser at long range, such as [5, 7], [8, 11], [11, 15], [16, 23], and so on. The specific interval ranges may be divided according to the distribution of the training samples.
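The following minimal Python sketch illustrates one way the 5-85 meter range could be discretized into the 8 depth classes described above; the helper names and the clamping of out-of-range depths are assumptions for illustration.

```python
# Minimal sketch of discretizing a ground-truth depth (in meters) into one of
# the 8 uniform intervals over the 5-85 m radar range described above.
import numpy as np

D_MIN, D_MAX, NUM_BINS = 5.0, 85.0, 8    # 8 intervals of 10 m each
BIN_WIDTH = (D_MAX - D_MIN) / NUM_BINS   # = 10 m

def depth_to_class(depth_m: float) -> int:
    """Map a depth in [5, 85] m to a class label k in {0, ..., 7}."""
    k = int((depth_m - D_MIN) // BIN_WIDTH)
    return int(np.clip(k, 0, NUM_BINS - 1))

def class_to_mean_depth(k: int) -> float:
    """Average depth of the k-th interval, d_k = (k + 1) * 10 m."""
    return (k + 1) * 10.0
```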
In fig. 3, the image normalized from the KITTI dataset has a size of 1242×375 pixels (hereinafter "pixels" are omitted for brevity), as shown at 310. According to an embodiment of the present invention, a 321×181 RGB image block is cut out centered on the target area, and a binary mask image of the same size as the image block is obtained (i.e., the mask image is also 321×181), as shown at 320. Here, the target image block size of 321×181 is chosen mainly according to the size of the target area: statistically, the target size is typically about 100×60, and the target is better identified using background information when the ratio of the target area to the background area is about 1:3. In the 321×181 mask image, elements inside the target rectangular frame are set to 1, and all other elements are set to 0. To include background information, the rectangular frame used may be larger than the actual rectangular frame of the target, and the degree of enlargement may be set as appropriate; here the ratio of the actual frame size to the rectangular frame size used is set to 1:3. In some cases, when the target is relatively large, the rectangular frame may exceed the 321×181 range, and the exceeding portion is simply truncated. The invention stitches the 321×181 image block and its corresponding mask image together as the input to the single-task network: the mask indicates the region of the target in the RGB image, so the RGB image needs to be stitched together with the mask image as input. The input image passes through feature extraction and prediction in the single-task network, as shown at 330. The final depth of the target is then obtained by weighting the depth probabilities output by the single-task network, as shown at 340.
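A sketch of the input preparation described above (cropping a 321×181 patch centered on the target, building the binary mask, and stitching them channel-wise) might look as follows; it is illustrative only, and the handling of patches near the image border is an assumption not specified in the disclosure.

```python
# Illustrative sketch (not the authors' exact code) of preparing the single-task
# network input: crop a fixed-size RGB patch centered on the target box, build a
# same-size binary mask, and stack them channel-wise into a 4-channel input.
import numpy as np

PATCH_W, PATCH_H = 321, 181

def make_patch_and_mask(image: np.ndarray, box: tuple) -> np.ndarray:
    """image: HxWx3 RGB array; box: (x1, y1, x2, y2) integer detection frame.
    Returns a PATCH_H x PATCH_W x 4 array (RGB + mask)."""
    h, w = image.shape[:2]
    cx, cy = (box[0] + box[2]) // 2, (box[1] + box[3]) // 2
    x1 = int(np.clip(cx - PATCH_W // 2, 0, w - PATCH_W))
    y1 = int(np.clip(cy - PATCH_H // 2, 0, h - PATCH_H))
    patch = image[y1:y1 + PATCH_H, x1:x1 + PATCH_W].astype(np.float32)

    mask = np.zeros((PATCH_H, PATCH_W, 1), dtype=np.float32)
    # Elements inside the target rectangle are set to 1; the portion of the
    # rectangle that falls outside the patch is simply truncated.
    mx1, my1 = max(box[0] - x1, 0), max(box[1] - y1, 0)
    mx2, my2 = min(box[2] - x1, PATCH_W), min(box[3] - y1, PATCH_H)
    mask[my1:my2, mx1:mx2] = 1.0

    return np.concatenate([patch, mask], axis=2)   # channel-wise stitching
```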
According to an embodiment, the single-task network may be a convolutional neural network (CNN) based network. The present invention employs an improved VGG16 network architecture, the specific structure of which is shown in Table 1 below.
TABLE 1
Network layer              | Conv1  | Conv2   | Conv3   | Conv4   | Fc1  | Fc2  | Fc3
Conventional VGG16 network | 3×3×64 | 3×3×128 | 3×3×256 | 3×3×512 | 4096 | 4096 | 1000
Improved VGG16 network     | 3×3×32 | 3×3×32  | 3×3×64  | 3×3×64  | 128  | 64   | 8
In table 1 above, conv represents a convolution layer and Fc represents a full-link layer. Further, in a parameter such as "3×3×64", 3×3 denotes a core size, 64 denotes the number of channels, and so on.
The probability that the depth of a target output from the single-task network (i.e., the improved VGG16 network) belongs to class k is defined as p_k, k = 0, 1, ..., 7. The single-task network may be trained by the SGD (stochastic gradient descent) algorithm. Define d_k = (k+1)×10, where d_k represents the average depth of the k-th depth interval. The depth d of the target can then be obtained in a probability-weighted manner, i.e.:

d = \sum_{k=0}^{7} p_k d_k    (1)
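A small sketch of the probability-weighted depth of equation (1), applied to the 8 logits produced by the network; the softmax step is an assumption about how the probabilities p_k are obtained from the network output.

```python
# Sketch of the probability-weighted depth of equation (1): the softmax output
# p_k over the 8 depth classes is combined with the interval means d_k = (k+1)*10.
import numpy as np

def weighted_depth(logits: np.ndarray) -> float:
    """logits: length-8 vector output by the single-task network."""
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # softmax probabilities p_k
    d_k = (np.arange(len(p)) + 1) * 10.0          # average depth of each interval
    return float(np.sum(p * d_k))                 # d = sum_k p_k * d_k
```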
Next, a method of estimating a target depth for an image according to another exemplary embodiment of the present disclosure will be described with reference to fig. 2B.
Fig. 2B illustrates a method 200B of estimating target depth for an image with unknown position, size, and class of targets.
In the case where the position, size, and class of the object in the image are unknown, as shown in fig. 2B, in step S210b the image is input into a multi-tasking network to estimate the position, size, and class of the object and the depth of the object using the multi-tasking network. The position of the object is the coordinates of the object in the image. The size of the target may be the size of a detection frame (generally represented by a rectangular frame) surrounding the target. Then, in step S220b, the position, size, and class of the object and the depth of the object are output from the multi-tasking network. Here, one example of a multi-tasking network is a network architecture based on faster-rcnn (Faster R-CNN, Faster Regions with Convolutional Neural Network Features). The operation of depth estimation of a target based on a multi-tasking network will be described in detail below with reference to fig. 4.
Fig. 4 shows a schematic diagram of depth estimation of a target with unknown target state according to an exemplary embodiment of the present disclosure.
In the case that the position, size, and class of the object in an image are unknown, the image is input into the multi-tasking network to output the position and size of the object, the class of the object, and the depth estimation result of the object. As shown in fig. 4, when an image is input, several layers of convolution operations and corresponding pooling operations are performed on the image to obtain shared features. Thereafter, the processing is divided into three branches, that is, target detection branch processing, target classification branch processing, and target depth estimation branch processing are performed on the input image. Through the target detection branch processing, the position and size of the target (for example, the size of a detection frame surrounding the target) are determined. The target position and size are input into the target classification branch, i.e., the class of the target is determined based on the position and size of the target through the target classification branch processing. Thereafter, the target position, size, and class are input into the target depth estimation branch, i.e., the depth of the target is determined based on the position, size, and class of the target through the target depth estimation branch processing. In this way, the invention can provide the required target area and category information from the first two branches when performing depth estimation. A sliding window or region proposal mechanism is used to provide candidate boxes. Similar to faster-rcnn, a plurality of anchors may be defined at each location, and the result corresponding to the most appropriate anchor is selected for output.
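The three-branch structure described above could be sketched as a detection head with an added depth branch, as below; the feature dimension, class count, and simple linear-layer heads are illustrative assumptions rather than the patent's exact architecture.

```python
# Conceptual PyTorch sketch of the three-branch head: per-ROI features feed a
# classification branch, a box-regression branch, and a depth branch.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, in_features: int = 1024, num_classes: int = 5, num_depth_bins: int = 8):
        super().__init__()
        self.cls_branch = nn.Linear(in_features, num_classes)        # target classification
        self.box_branch = nn.Linear(in_features, 4 * num_classes)    # detection box (4 params)
        self.depth_branch = nn.Linear(in_features, num_depth_bins)   # depth interval probabilities

    def forward(self, roi_feats: torch.Tensor):
        """roi_feats: (num_rois, in_features) pooled features from the shared backbone."""
        cls_logits = self.cls_branch(roi_feats)
        box_deltas = self.box_branch(roi_feats)
        depth_logits = self.depth_branch(roi_feats)
        return cls_logits, box_deltas, depth_logits
```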
The loss function of the multi-tasking network can be obtained by adding the loss information of depth to the loss function of faster-rcnn. The loss function of the multi-tasking network is defined as follows:

L(\{p_i\}, \{t_i\}, \{d_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda_1 \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) + \lambda_2 \frac{1}{N_{depth}} \sum_i p_i^* L_{depth}(d_i, d_i^*)    (2)

wherein:
i is the index of an anchor in the mini-batch,
p_i is the predicted class of the i-th anchor,
t_i is the detection box with 4 parameters,
d_i is the predicted depth,
L_cls and L_depth are both multinomial logistic regression (softmax) loss functions,
L_reg is the smooth L1 loss function,
p_i^* indicates, according to the GT (GT refers to ground truth, manually labeled), whether the current anchor is a positive anchor,
t_i^* is the GT detection box,
d_i^* is the GT depth,
N_cls, N_reg, and N_depth are normalization terms, and
λ_1 and λ_2 are loss weight terms.
The network may be trained by SGD algorithms.
Specific loss functions can be seen in faster-rcnn (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015).
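A hedged PyTorch sketch of a loss in the spirit of equation (2) is shown below; the choice of normalizing the regression and depth terms by the number of positive anchors, and the default weight values, are assumptions for illustration.

```python
# Sketch of a multi-task loss following equation (2): softmax losses for class
# and depth, smooth L1 for box regression, with the regression and depth terms
# applied only to positive anchors.
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, box_pred, depth_logits,
                   cls_gt, box_gt, depth_gt, is_positive,
                   lambda1: float = 1.0, lambda2: float = 1.0) -> torch.Tensor:
    n_cls = cls_logits.shape[0]
    n_pos = is_positive.sum().clamp(min=1)

    l_cls = F.cross_entropy(cls_logits, cls_gt, reduction="sum") / n_cls
    l_reg = F.smooth_l1_loss(box_pred[is_positive], box_gt[is_positive],
                             reduction="sum") / n_pos
    l_depth = F.cross_entropy(depth_logits[is_positive], depth_gt[is_positive],
                              reduction="sum") / n_pos
    return l_cls + lambda1 * l_reg + lambda2 * l_depth
```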
The network outputs the location and size of the target, the class of the target, and the depth information. As in faster-rcnn, a number of candidate boxes are obtained, and for each candidate box the classification confidence, the detection box, and the depth can be output simultaneously through one forward pass of the network. Boxes not belonging to a target are filtered out based on a classification confidence threshold and non-maximum suppression; for each remaining box, the corresponding category, detection box, and depth information can be output directly. The depth information of the target may be the optimal (i.e., maximum-probability) depth interval of the target, or may be a probability-weighted depth obtained according to equation (1) above.
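The candidate-box filtering described above might be implemented along the following lines; the torchvision NMS call and the threshold values are assumptions for illustration.

```python
# Illustrative post-processing sketch: keep candidate boxes above a confidence
# threshold, apply non-maximum suppression, and report class, box, and the
# probability-weighted depth of each surviving box.
import torch
from torchvision.ops import nms

def postprocess(boxes, cls_probs, depth_probs, score_thr: float = 0.5, iou_thr: float = 0.45):
    scores, labels = cls_probs.max(dim=1)
    keep = scores > score_thr                     # filter by classification confidence
    boxes, scores, labels, depth_probs = boxes[keep], scores[keep], labels[keep], depth_probs[keep]

    keep = nms(boxes, scores, iou_thr)            # non-maximum suppression
    d_k = (torch.arange(depth_probs.shape[1], dtype=torch.float32) + 1) * 10.0
    weighted_depth = (depth_probs[keep] * d_k).sum(dim=1)
    return boxes[keep], labels[keep], weighted_depth
```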
Another example of a multi-tasking network is a YOLO2 (You Only Look Once, version 2) based network architecture. The network architecture is shown in Table 2 below.
TABLE 2
The anchor concept is also employed in YOLO2. When an image is input, the invention performs convolution and pooling operations on the image to finally obtain a convolutional layer. The dimension of the last convolutional layer is w×h×s, where w and h represent the width and height, respectively, of the reduced image, and s is the length of the vector at each location; this corresponds to dividing the image into a number of grid cells. Each grid cell in the last convolutional layer includes information for a plurality of anchors. Defining R_i as the detection box of the i-th anchor, P_i as the class probabilities of the i-th anchor, and D_i as the depth of the i-th anchor, the vector for each grid cell can be expressed as [R_1, ..., R_K, P_1, ..., P_K, D_1, ..., D_K].
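Decoding the per-cell vector [R_1, ..., R_K, P_1, ..., P_K, D_1, ..., D_K] could be sketched as follows; the exact memory layout of the output tensor is an assumption consistent with the description above.

```python
# Sketch of decoding one grid-cell vector of the last w x h x s convolutional
# layer: each of the K anchors contributes a 4-parameter box R, a class
# probability vector P, and a depth D.
import numpy as np

def decode_cell(vec: np.ndarray, num_anchors: int, num_classes: int):
    """vec has length K*4 + K*num_classes + K (boxes, class probs, depths)."""
    K = num_anchors
    boxes = vec[:4 * K].reshape(K, 4)                                   # R_1..R_K
    probs = vec[4 * K:4 * K + K * num_classes].reshape(K, num_classes)  # P_1..P_K
    depths = vec[4 * K + K * num_classes:]                              # D_1..D_K
    return boxes, probs, depths
```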
The loss function of the multi-tasking network can be obtained by adding the loss information of depth on the basis of the loss function of YOLO2. The loss function of the multi-tasking network can be expressed as:

L = \lambda_{coord} \sum_{i=1}^{N} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_{ij} - \hat{x}_i)^2 + (y_{ij} - \hat{y}_i)^2 + (w_{ij} - \hat{w}_i)^2 + (h_{ij} - \hat{h}_i)^2 \right] + \sum_{i=1}^{N} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} \sum_{c \in classes} (P_{ij}(c) - \hat{P}_{ij}(c))^2 + \lambda_{noobj} \sum_{i=1}^{N} \sum_{j=1}^{B} (1 - \mathbb{1}_{ij}^{obj}) \sum_{c \in classes} (P_{ij}(c) - \hat{P}_{ij}(c))^2 + \sum_{i=1}^{N} \sum_{j=1}^{B} \mathbb{1}_{ij}^{obj} (D_{ij} - \hat{D}_{ij})^2    (3)

wherein:
λ_coord and λ_noobj are the weights of the coordinate term and the no-object term, respectively,
N is the number of cells of the last convolutional layer, i.e., width × height,
B is the number of anchors,
\mathbb{1}_{ij}^{obj} marks whether the j-th anchor of the i-th cell (pixel) contains an object, i.e., it is 1 if there is an object and 0 otherwise,
\hat{x}_i, \hat{y}_i, \hat{w}_i, and \hat{h}_i denote the x-coordinate, y-coordinate, width w, and height h of the GT,
(x_{ij}, y_{ij}, w_{ij}, h_{ij}) is the detection box actually produced for the current anchor,
P_{ij}(c) is the probability that the current anchor belongs to class c,
D_{ij} is the depth of the target to which the current anchor corresponds,
\hat{P}_{ij}(c) is the GT value of the probability that the j-th anchor of the i-th cell (pixel) contains an object of class c,
\hat{D}_{ij} is the GT value of the depth of the object of the j-th anchor of the i-th cell (pixel), and
classes is the set of categories; \sum_{c \in classes} denotes summing the corresponding term over each category.
Specific loss function parameters can be found in YOLO (You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016).
When a target exists in a certain grid cell, the loss function drives the detected rectangular box to be as close as possible to the actual box, the overlap rate between the detected box and the GT box to be as high as possible, and the estimated target depth to be as close as possible to the actual depth. When a grid cell does not contain an object, the loss function drives the probability of that cell detecting an object to be as small as possible. The network may be trained by the SGD algorithm. The individual loss terms listed in equations (2) and (3) above are not limited to the forms shown there, and may be at least one of square error, cross entropy, and multinomial logistic regression (softmax) log loss.
In the case of a single input image, objects in the image may be detected, classified, and depth-estimated based on single-layer features. Once the last convolutional layer is obtained, whether a target exists in a grid cell and which category it belongs to can be judged from the obtained class probabilities. When a grid cell is judged to contain an object, the detection box of the object can be obtained from the corresponding anchor, together with the depth corresponding to that box. The final depth information of the object may be the optimal (i.e., maximum-probability) depth interval of the object, or may be a probability-weighted depth obtained according to equation (1) above.
According to another embodiment of the present disclosure, in the case of processing multiple scales (i.e., sampling one image to obtain multiple images of different sizes), objects in the image may be detected, classified, and depth-estimated from multi-layer features, similar to SSD (Single Shot MultiBox Detector). These feature layers of different scales may be connected to different prediction layers or to the same prediction layer. Figs. 5 and 6 show the cases where multiple feature layers are connected to different prediction layers and to the same prediction layer, respectively, for a YOLO2-based network framework.
In fig. 5, the feature layers connected to different prediction layers are classified, detected, and depth-estimated separately. In fig. 6, two feature layers are connected to the same prediction layer, that is, the parameters of the prediction layer are shared, but the features of the different layers are predicted separately to obtain results for targets of different scales. The final result is obtained from the detection boxes produced by the different feature layers according to the class confidence values and non-maximum suppression.
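A minimal sketch of the shared prediction layer of fig. 6 is given below; the 1×1 convolutional head and the channel counts are assumptions, and only the idea of applying one set of prediction parameters to feature maps of different scales follows the description.

```python
# Minimal sketch of a shared prediction head applied to feature maps of
# different scales; the per-scale outputs are later merged by confidence
# filtering and non-maximum suppression.
import torch
import torch.nn as nn

class SharedPredictionHead(nn.Module):
    def __init__(self, in_channels: int = 256, num_anchors: int = 5,
                 num_classes: int = 5, num_depth_bins: int = 8):
        super().__init__()
        out_channels = num_anchors * (4 + num_classes + num_depth_bins)
        self.pred = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, feature_maps):
        # The same parameters are shared, but each feature layer is predicted
        # separately, yielding results for targets of different scales.
        return [self.pred(f) for f in feature_maps]
```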
A schematic structure of an apparatus according to an exemplary embodiment of the present disclosure will be described below with reference to fig. 7. Fig. 7 shows a schematic structural diagram of an apparatus 700 according to an exemplary embodiment of the present disclosure. The apparatus 700 may be used to perform the method 100 described with reference to fig. 1. For brevity, the schematic structure of the apparatus according to an exemplary embodiment of the present disclosure is described herein, and details that have been detailed in the method as described previously with reference to fig. 1 are omitted.
As shown in fig. 7, the apparatus 700 may include a receiver 701 for receiving an input image; a processing unit or processor 703, which processor 703 may be a single unit or a combination of units for performing the different steps of the method; memory 705 having computer executable instructions stored therein. Here, the input image may be a monocular image photographed by a single vision sensor (e.g., camera).
According to an embodiment of the present disclosure, the instructions, when executed by the processor 703, cause the processor 703 to estimate the depth of the target according to the position, size and class of the target in the input image (as described in step S120 of fig. 1, which is not repeated here). Specifically, in the case where the position, size, and category of the target are known, the depth of the target is estimated through single-task network learning (as described in steps S210a to S230a of fig. 2A, which are not repeated here); in the case where the position, size, and class of the target are unknown, the depth of the target is estimated through the multi-tasking network learning (as described in steps S210B to S220B of fig. 2B, which will not be repeated here).
With the above technical scheme, the target depth in an image can be estimated with high precision using a single camera. Experimental results show that the method of the invention reduces the error by a factor of about 1.4 compared with the current best monocular depth estimation method; the RMSE (root mean square error) is reduced from about 4.1 meters to about 2.9 meters. That is, with the target depth estimation method of the present invention, estimation accuracy can be improved while cost is reduced, which is particularly advantageous in the field of automated driving or assisted driving.
As will be appreciated by those skilled in the art, the program running on the apparatus according to the present disclosure may be a program that causes a computer to realize the functions of the embodiments of the present disclosure by controlling a central processing unit (CPU). The program, or the information processed by the program, may be temporarily stored in volatile memory such as random access memory (RAM), or stored in a hard disk drive (HDD), nonvolatile memory such as flash memory, or another storage system.
A program for realizing the functions of the embodiments of the present disclosure may be recorded on a computer-readable recording medium. The corresponding functions can be realized by causing a computer system to read programs recorded on the recording medium and execute the programs. The term "computer system" as used herein may be a computer system embedded in the device and may include an operating system or hardware (e.g., peripheral devices). The "computer-readable recording medium" may be a semiconductor recording medium, an optical recording medium, a magnetic recording medium, a recording medium in which a program is stored dynamically at a short time, or any other recording medium readable by a computer.
The various features or functional modules of the apparatus used in the embodiments described above may be implemented or performed by circuitry (e.g., single-chip or multi-chip integrated circuits). Circuits designed to perform the functions described herein may include a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. The circuit may be a digital circuit or an analog circuit. Where new integrated circuit technologies are presented as an alternative to existing integrated circuits due to advances in semiconductor technology, one or more embodiments of the present disclosure may also be implemented using these new integrated circuit technologies.
As above, the embodiments of the present disclosure have been described in detail with reference to the accompanying drawings. The specific structure is not limited to the above-described embodiments, but the present disclosure also includes any design modifications without departing from the gist of the present disclosure. In addition, various modifications can be made to the present disclosure within the scope of the claims, and embodiments obtained by appropriately combining the technical means disclosed in the different embodiments are also included in the technical scope of the present disclosure. Further, the components having the same effects described in the above embodiments may be replaced with each other.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, solutions in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (19)

1. A method of processing an image, comprising:
Receiving an input image; and
Estimating the depth of the object from the position, size and class of the object in the image,
Wherein estimating the depth of the target comprises:
Estimating the depth of a target through single-task network learning under the condition that the position, the size and the category of the target are known;
In the case that the position, size and category of the target are unknown, the depth of the target is estimated through multitasking network learning.
2. The method of claim 1, wherein the size of the target is a size of a detection frame surrounding the target.
3. The method of claim 1, prior to estimating the depth of the target, the method further comprising: the received image is preprocessed.
4. A method according to claim 3, wherein preprocessing the received image comprises: and normalizing the image according to the focal length information and the standard focal length information of the image.
5. The method of claim 1, wherein estimating the depth of the target through single-task web learning comprises:
cutting out image blocks around the target in the image, and carrying out mask processing on the image blocks to obtain mask images with the same size;
splicing the image blocks and the mask image together according to channels;
inputting the spliced images into a single-task network;
the depth of the target is output from the single-tasking network.
6. The method of claim 5, wherein outputting the depth of the target from a single-tasking network comprises:
Determining the probability that the depth of the target belongs to each preset depth interval, obtaining the final result of the depth of the target in a probability weighting mode, and outputting the final result.
7. The method of claim 1, wherein estimating the depth of the target by multitasking network learning comprises: a multi-tasking network is employed to estimate the location, size, class, and depth of the target.
8. The method of claim 1, wherein the multitasking network comprises a plurality of convolutional layers and corresponding pooling layers.
9. The method of claim 1, wherein the multitasking network is a network based on faster-rcnn (Faster Regions with Convolutional Neural Network Features), and the loss function of the multitasking network is a loss function obtained by adding depth loss information to the loss function of faster-rcnn.
10. The method of any of claims 1-9, wherein estimating the depth of the target by multitasking network learning comprises:
Performing target detection branch processing, target classification branch processing and target depth estimation branch processing on the image:
determining the position and size of the object through the object detection branch processing,
Determining, by the object classification branch processing, a class of the object based on a position and a size of the object; and
And determining the depth of the target based on the position, the size and the category of the target through the target depth estimation branch processing.
11. The method according to any of claims 1-9, wherein the multitasking network is a "YOLO2" based network and the loss function of the network is a loss function adding depth loss information on the basis of the loss function of YOLO 2.
12. The method of claim 8, wherein the location and size of the target, the class of the target, and the depth of the target are output via a last convolutional layer of the plurality of convolutional layers.
13. The method of claim 12, wherein each grid cell in the last convolutional layer includes information for a plurality of anchors.
14. The method of claim 11, wherein single-layer features or multi-layer features are employed to estimate the position and size of the target, the class of the target, and the depth of the target.
15. The method of claim 14, wherein the multi-layer features are obtained by different prediction layers or the same prediction layer.
16. The method of claim 10, wherein the target depth estimation branch processing comprises:
determining the probability that the depth of the target belongs to each preset depth interval; and
And taking the depth interval with the maximum probability of the target as the depth of the target, or obtaining the depth of the target by using a probability weighting mode.
17. The method of claim 9, wherein the loss function comprises at least one of a square error, a cross entropy, and a multinomial logistic regression (softmax) log loss.
18. The method of claim 1, wherein the target comprises at least one of a person, a vehicle, a traffic light, and a traffic sign in the image.
19. An apparatus for processing an image, comprising:
A receiver configured to receive an input image;
A processor; and
Memory storing computer executable instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1-18.
CN201810024743.0A 2018-01-10 2018-01-10 Method and apparatus for image processing Active CN110033481B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201810024743.0A CN110033481B (en) 2018-01-10 2018-01-10 Method and apparatus for image processing
KR1020180090827A KR102661954B1 (en) 2018-01-10 2018-08-03 A method of processing an image, and apparatuses performing the same
US16/237,952 US11107229B2 (en) 2018-01-10 2019-01-02 Image processing method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810024743.0A CN110033481B (en) 2018-01-10 2018-01-10 Method and apparatus for image processing

Publications (2)

Publication Number Publication Date
CN110033481A CN110033481A (en) 2019-07-19
CN110033481B true CN110033481B (en) 2024-07-02

Family

ID=67234124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810024743.0A Active CN110033481B (en) 2018-01-10 2018-01-10 Method and apparatus for image processing

Country Status (2)

Country Link
KR (1) KR102661954B1 (en)
CN (1) CN110033481B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796103A (en) * 2019-11-01 2020-02-14 邵阳学院 Target based on fast-RCNN and distance detection method thereof
CN112926370A (en) * 2019-12-06 2021-06-08 纳恩博(北京)科技有限公司 Method and device for determining perception parameters, storage medium and electronic device
CN111179253B (en) * 2019-12-30 2023-11-24 歌尔股份有限公司 Product defect detection method, device and system
CN111191621B (en) * 2020-01-03 2024-06-28 北京同方软件有限公司 Rapid and accurate identification method for multi-scale target in large-focal-length monitoring scene
KR102538848B1 (en) * 2020-05-12 2023-05-31 부산대학교병원 Deep learning architecture system for real time quality interpretation of fundus image
KR20220013875A (en) * 2020-07-27 2022-02-04 옴니어스 주식회사 Method, system and non-transitory computer-readable recording medium for providing information regarding products based on trends
WO2022025568A1 (en) * 2020-07-27 2022-02-03 옴니어스 주식회사 Method, system, and non-transitory computer-readable recording medium for recognizing attribute of product by using multi task learning
CN112686887A (en) * 2021-01-27 2021-04-20 上海电气集团股份有限公司 Method, system, equipment and medium for detecting concrete surface cracks
KR102378887B1 (en) * 2021-02-15 2022-03-25 인하대학교 산학협력단 Method and Apparatus of Bounding Box Regression by a Perimeter-based IoU Loss Function in Object Detection
CN112967187B (en) * 2021-02-25 2024-05-31 深圳海翼智新科技有限公司 Method and apparatus for target detection
KR102437760B1 (en) 2021-05-27 2022-08-29 이충열 Method for processing sounds by computing apparatus, method for processing images and sounds thereby, and systems using the same
KR20230137814A (en) 2022-03-22 2023-10-05 이충열 Method for processing images obtained from shooting device operatively connected to computing apparatus and system using the same
CN116385984B (en) * 2023-06-05 2023-09-01 武汉理工大学 Automatic detection method and device for ship draft
CN116883990A (en) * 2023-07-07 2023-10-13 中国科学技术大学 Target detection method for stereoscopic vision depth perception learning
CN117333487B (en) * 2023-12-01 2024-03-29 深圳市宗匠科技有限公司 Acne classification method, device, equipment and storage medium

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103337076B (en) * 2013-06-26 2016-09-21 深圳市智美达科技股份有限公司 There is range determining method and device in video monitor object
JP6351238B2 (en) * 2013-11-15 2018-07-04 キヤノン株式会社 Image processing apparatus, imaging apparatus, and distance correction method
US9495606B2 (en) * 2014-02-28 2016-11-15 Ricoh Co., Ltd. Method for product recognition from multiple images
US20150381972A1 (en) * 2014-06-30 2015-12-31 Microsoft Corporation Depth estimation using multi-view stereo and a calibrated projector
US9272417B2 (en) * 2014-07-16 2016-03-01 Google Inc. Real-time determination of object metrics for trajectory planning
JP2016035623A (en) * 2014-08-01 2016-03-17 キヤノン株式会社 Information processing apparatus and information processing method
CN104279960B (en) * 2014-10-14 2017-01-25 安徽大学 Method for measuring size of object by mobile equipment
AU2015203666A1 (en) * 2015-06-30 2017-01-19 Canon Kabushiki Kaisha Methods and systems for controlling a camera to perform a task
CN105260356B (en) * 2015-10-10 2018-02-06 西安交通大学 Chinese interaction text emotion and topic detection method based on multi-task learning
US10049267B2 (en) * 2016-02-29 2018-08-14 Toyota Jidosha Kabushiki Kaisha Autonomous human-centric place recognition
US10032067B2 (en) * 2016-05-28 2018-07-24 Samsung Electronics Co., Ltd. System and method for a unified architecture multi-task deep learning machine for object recognition
CN106529402B (en) * 2016-09-27 2019-05-28 中国科学院自动化研究所 The face character analysis method of convolutional neural networks based on multi-task learning
US11094208B2 (en) 2016-09-30 2021-08-17 The Boeing Company Stereo camera system for collision avoidance during aircraft surface operations
CN106780303A (en) * 2016-12-02 2017-05-31 上海大学 A kind of image split-joint method based on local registration
CN106780536A (en) * 2017-01-13 2017-05-31 深圳市唯特视科技有限公司 A kind of shape based on object mask network perceives example dividing method
CN107492115B (en) * 2017-08-30 2021-01-01 北京小米移动软件有限公司 Target object detection method and device

Also Published As

Publication number Publication date
KR20190085464A (en) 2019-07-18
KR102661954B1 (en) 2024-04-29
CN110033481A (en) 2019-07-19

Similar Documents

Publication Publication Date Title
CN110033481B (en) Method and apparatus for image processing
CN109034078B (en) Training method of age identification model, age identification method and related equipment
CN110826379B (en) Target detection method based on feature multiplexing and YOLOv3
KR102337367B1 (en) Learning method and learning device for object detector with hardware optimization based on cnn for detection at distance or military purpose using image concatenation, and testing method and testing device using the same
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN110796048A (en) Ship target real-time detection method based on deep neural network
CN113420682B (en) Target detection method and device in vehicle-road cooperation and road side equipment
CN112926461B (en) Neural network training and driving control method and device
EP3495989A1 (en) Best image crop selection
CN113129335B (en) Visual tracking algorithm and multi-template updating strategy based on twin network
CN115170792B (en) Infrared image processing method, device and equipment and storage medium
CN113627229A (en) Object detection method, system, device and computer storage medium
CN108229473A (en) Vehicle annual inspection label detection method and device
CN111967396A (en) Processing method, device and equipment for obstacle detection and storage medium
CN112101114B (en) Video target detection method, device, equipment and storage medium
CN112699711A (en) Lane line detection method, lane line detection device, storage medium, and electronic apparatus
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN112396594B (en) Method and device for acquiring change detection model, change detection method, computer equipment and readable storage medium
CN111178181B (en) Traffic scene segmentation method and related device
CN116935356A (en) Weak supervision-based automatic driving multi-mode picture and point cloud instance segmentation method
CN116363532A (en) Unmanned aerial vehicle image traffic target detection method based on attention mechanism and re-parameterization
CN111553474A (en) Ship detection model training method and ship tracking method based on unmanned aerial vehicle video
US20240221426A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN116363628A (en) Mark detection method and device, nonvolatile storage medium and computer equipment
CN115937596A (en) Target detection method, training method and device of model thereof, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant