CN116188598A - Target detection model training method and target detection method combining camera parameters - Google Patents

Target detection model training method and target detection method combining camera parameters

Info

Publication number
CN116188598A
Authority
CN
China
Prior art keywords
target
detection
pixel depth
image
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310130057.2A
Other languages
Chinese (zh)
Inventor
刁文辉
林向阳
李学学
曲小飞
李俊希
申志平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Information Research Institute of CAS filed Critical Aerospace Information Research Institute of CAS
Priority to CN202310130057.2A priority Critical patent/CN116188598A/en
Publication of CN116188598A publication Critical patent/CN116188598A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a target detection model training method combined with camera parameters and a corresponding target detection method, which can be applied to the field of computer vision. The training method comprises the following steps: acquiring a target detection image, wherein the target detection image is an image whose collection inclination angle is greater than a preset angle threshold and carries target annotation information; inputting the target detection image into an initial target detection network and outputting an initial target detection result for the target detection image; determining a pixel depth detection image with pixel depth information according to the target detection image and camera parameter information; inputting the pixel depth detection image into a pixel depth detection network and outputting a pixel depth detection result for the pixel depth detection image; and training the initial target detection network and the pixel depth detection network based on the target annotation information, the pixel depth information, the initial target detection result and the pixel depth detection result to obtain a trained target detection model.

Description

Target detection model training method and target detection method combining camera parameters
Technical Field
The present disclosure relates to the field of computer vision, and more particularly, to a target detection model training method and a target detection method that incorporate camera parameters.
Background
In the related art, emerging platforms such as optical satellites, aerostats and unmanned aerial vehicles all use optical cameras to collect images of the ground. Because the images are collected at a certain inclination angle, the target objects in the images differ considerably from their actual appearance, which makes tasks such as image classification and detection difficult.
In the process of implementing the disclosed concept, the inventors found that the related art has at least the following problem: the information contained in the target detection image is not fully utilized, resulting in low target detection accuracy.
Disclosure of Invention
In view of this, the present disclosure provides a target detection model training method combined with camera parameters, a target detection method, a training apparatus, and an electronic device.
An aspect of the present disclosure provides a method for training a target detection model in combination with camera parameters, including:
acquiring a target detection image, wherein the target detection image is an image with a collection inclination angle larger than a preset angle threshold value, and the target detection image is provided with target annotation information;
inputting the target detection image into an initial target detection network, and outputting an initial target detection result of the target detection image;
determining a pixel depth detection image with pixel depth information according to the target detection image and the camera parameter information;
Inputting the pixel depth detection image into a pixel depth detection network, and outputting a pixel depth detection result of the pixel depth detection image;
and training an initial target detection network and a pixel depth detection network based on the target labeling information, the pixel depth information, the initial target detection result and the pixel depth detection result to obtain a trained target detection model.
According to an embodiment of the present disclosure, wherein inputting an object detection image to an initial object detection network, outputting an initial object detection result for the object detection image, includes:
inputting the target detection image into a first downsampling layer of an initial target detection network, and outputting a first feature map of a plurality of scales;
inputting the first feature images with multiple scales into a first fusion feature layer of an initial detection network, and outputting a second feature image with multiple scales;
inputting the second feature images with the multiple scales into an interested region extraction layer of an initial detection network, and outputting the interested region of each second feature image in the second feature images with the multiple scales;
and inputting the multiple regions of interest into a prediction layer of an initial detection network, and outputting an initial target detection result of the target detection image.
According to an embodiment of the present disclosure, wherein inputting a pixel depth detection image into a pixel depth detection network, outputting a pixel depth detection result for the pixel depth detection image, includes:
inputting the pixel depth detection image into a second downsampling layer of a pixel depth detection network, and outputting a third feature map with multiple scales;
inputting the third feature images with multiple scales into a second fusion feature layer of the pixel depth detection network, and outputting a fourth feature image with multiple scales;
and respectively outputting the fourth feature images with multiple scales to a three-dimensional depth sensing layer of the pixel depth detection network, and outputting a pixel depth detection result of the pixel depth detection image.
According to an embodiment of the present disclosure, training an initial target detection network and a pixel depth detection network based on target labeling information, pixel depth information, an initial target detection result and a pixel depth detection result, to obtain a trained target detection model includes:
determining a target detection loss value based on the initial target detection result and the target labeling information;
determining a pixel depth detection loss value based on the pixel depth detection result and the pixel depth information;
and training an initial target detection network and a pixel depth detection network based on the target detection loss value and the pixel depth detection loss value to obtain a trained target detection model.
According to an embodiment of the present disclosure, wherein the target annotation information includes target coordinate information and target class information, determining a target detection loss value based on an initial target detection result and the target annotation information includes:
determining a target coordinate detection loss value based on the initial target detection result and the target coordinate information;
determining a target class detection loss value based on the initial target detection result and the target class information;
the target detection loss value is determined based on the target coordinate detection loss value and the target class detection loss value.
According to an embodiment of the present disclosure, wherein determining a pixel depth detection image with pixel depth information from an object detection image and camera parameter information comprises:
determining pixel depth information of a target detection image according to the target annotation information and the camera parameter information to obtain a first sub-pixel depth detection image;
determining a plurality of clustering areas of the first sub-pixel depth detection image by using a K-means clustering algorithm according to a preset pixel threshold value;
determining a second sub-pixel depth detection image from the plurality of cluster areas according to the pixel depth values in the plurality of cluster areas;
a pixel depth detection image is determined based on the first sub-pixel depth detection image and the second sub-pixel depth detection image.
According to an embodiment of the present disclosure, the target detection loss value, the target coordinate detection loss value, and the target class detection loss value are respectively expressed by the following formulas:
L_detection = L_loc + L_cls (1)
Formulas (2) and (3), which define the target coordinate detection loss L_loc and the target class detection loss L_cls respectively, are rendered as images (Figure BDA0004083505800000031 and Figure BDA0004083505800000032) in the original publication.
wherein L_detection represents the target detection loss value, L_loc represents the target coordinate detection loss value, L_cls represents the target class detection loss value, x represents the x-axis coordinate of the upper-left corner of the target position frame in the target coordinate information, y represents the y-axis coordinate of the upper-left corner of the target position frame, w represents the width of the target position frame, h represents the height of the target position frame, t_i represents the target coordinate information, μ_i represents the target coordinate detection result in the initial target detection result, T represents the number of target categories in the target detection image, y_i represents the target class information, and S_j represents the target class detection result in the initial target detection result.
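For readability, a plausible rendering of formulas (2) and (3) is sketched below in LaTeX, assuming that L_loc is a smooth-L1 regression loss over the box parameters and that L_cls is a cross-entropy classification loss; the exact forms cannot be recovered from the published text, so this reconstruction is an assumption rather than the filed formulas:
L_{loc} = \sum_{i \in \{x,\, y,\, w,\, h\}} \operatorname{smooth}_{L_1}\!\left(t_i - \mu_i\right) \quad (2)
L_{cls} = -\sum_{j=1}^{T} y_j \log S_j \quad (3)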
Another aspect of the present disclosure provides a target detection method in combination with camera parameter information, including:
inputting the target image to be detected into the target detection model trained by the target detection model training method combined with the camera parameter information, and outputting an initial target detection result and a pixel depth detection result;
And determining a target detection result by using a non-maximum suppression method based on the initial target detection result and the pixel depth detection result.
Another aspect of the present disclosure provides an object detection model training apparatus in combination with camera parameter information, including:
the first acquisition module is used for acquiring a target detection image, wherein the target detection image is provided with target annotation information;
the first output module is used for inputting the target detection image into an initial target detection network and outputting an initial target detection result of the target detection image;
a first determining module, configured to determine a pixel depth detection image with pixel depth information according to the target detection image and the camera parameter information;
the second output module is used for inputting the pixel depth detection image into the pixel depth detection network and outputting a pixel depth detection result of the pixel depth detection image;
the obtaining module is used for training the initial target detection network and the pixel depth detection network based on the initial target detection result and the pixel depth detection result to obtain a trained target detection model.
Another aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the object detection model training method in combination with camera parameter information.
According to the embodiments of the disclosure, the target detection image carrying target annotation information is input into the initial target detection network to output an initial target detection result; the pixel depth detection image with pixel depth information, determined from the target detection image and the camera parameters, is input into the pixel depth detection network to output a pixel depth detection result; and the initial target detection network and the pixel depth detection network are trained based on the target annotation information, the pixel depth information, the initial target detection result and the pixel depth detection result to obtain the target detection model. By combining the target information in the target detection image with the pixel depth, the target in the target detection image can be determined more accurately, which at least partially solves the technical problem in the related art that the information of the target detection image is not fully utilized and the target detection accuracy is therefore low.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates an exemplary system architecture to which an object detection model training method incorporating camera parameters may be applied, in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of training a target detection model in combination with camera parameters, in accordance with an embodiment of the disclosure;
FIG. 3 schematically illustrates a block diagram of an object detection model according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a block diagram of a three-dimensional depth perception layer according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a method of object detection in combination with camera parameter information, in accordance with an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of an object detection model training apparatus incorporating camera parameter information, according to an embodiment of the present disclosure;
FIG. 7 schematically illustrates a block diagram of an object detection apparatus incorporating camera parameter information, according to an embodiment of the present disclosure; and
fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement the above-described object detection model training method in combination with camera parameters, according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted in accordance with its ordinary meaning as understood by those skilled in the art (e.g., "a system having at least one of A, B and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together). Where an expression like "at least one of A, B or C" is used, it should likewise be interpreted in accordance with that ordinary understanding (e.g., "a system having at least one of A, B or C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B and C together).
In the technical scheme of the disclosure, the acquisition, storage, application and the like of the related personal information of the user all conform to the regulations of related laws and regulations, necessary security measures are taken, and the public order harmony is not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
In the related art, owing to the diversity and flexibility of platforms such as optical satellites, aerostats and unmanned aerial vehicles, the conditions under which their cameras capture ground images are complex and variable. In ground-imaging scenarios with varying carrier heights and camera angles, image-domain differences arise, such as large variations in the target characteristics of the resulting large-inclination-angle images. This poses significant challenges to downstream computer vision tasks such as image classification, segmentation and detection.
With the development of deep learning, target detection techniques based on convolutional neural networks have achieved good detection results. Existing target detection methods can be divided into two main categories according to whether candidate region extraction is adopted: one-stage methods, such as YOLO (You Only Look Once, a target detector) and RetinaNet (a target detection network), require no candidate region extraction and directly regress the target classification prediction and the target position information, giving a high detection speed; two-stage methods first extract candidate regions and then perform classification prediction and position-coordinate refinement, giving higher detection accuracy.
However, most object detection methods in the related art rely on extracting and post-processing two-dimensional image features. Because they do not perceive or extract the three-dimensional depth information contained in a two-dimensional image, detector performance degrades significantly on images in which the target depth varies over a large range.
In view of the above, embodiments of the present disclosure provide a method for training an object detection model in combination with camera parameters. The method comprises the steps of obtaining a target detection image, wherein the target detection image is an image with a collection inclination angle larger than a preset angle threshold value, and the target detection image is provided with target annotation information; inputting the target detection image into an initial target detection network, and outputting an initial target detection result of the target detection image; determining a pixel depth detection image with pixel depth information according to the target detection image and the camera parameter information; inputting the pixel depth detection image into a pixel depth detection network, and outputting a pixel depth detection result of the pixel depth detection image; and training an initial target detection network and a pixel depth detection network based on the target labeling information, the pixel depth information, the initial target detection result and the pixel depth detection result to obtain a trained target detection model.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which an object detection model training method incorporating camera parameters may be applied, according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as shopping class applications, web browser applications, search class applications, instant messaging tools, mailbox clients and/or social platform software, to name a few.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the target detection model training method combined with camera parameters provided in the embodiments of the present disclosure may generally be performed by the server 105. Accordingly, the target detection model training apparatus combined with camera parameters provided in the embodiments of the present disclosure may generally be provided in the server 105. The target detection model training method combined with camera parameters provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the target detection model training apparatus combined with camera parameters provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the target detection model training method combined with camera parameters provided by the embodiments of the present disclosure may be performed by the terminal device 101, 102, or 103, or by another terminal device different from the terminal device 101, 102, or 103. Accordingly, the target detection model training apparatus combined with camera parameters provided in the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103, or in another terminal device different from the terminal device 101, 102, or 103.
For example, the target detection image may be originally stored in any one of the terminal devices 101, 102, or 103 (for example, but not limited to, the terminal device 101), or stored on an external storage device and may be imported into the terminal device 101. Then, the terminal device 101 may locally perform the camera parameter-combined object detection model training method provided by the embodiment of the present disclosure, or transmit the object detection image to other terminal devices, servers, or server clusters, and perform the camera parameter-combined object detection model training method provided by the embodiment of the present disclosure by other terminal devices, servers, or server clusters that receive the object detection image.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flow chart of a method of training an object detection model in combination with camera parameters according to an embodiment of the disclosure.
As shown in fig. 2, the method includes operations S201 to S205.
In operation S201, a target detection image is acquired, where the target detection image is an image with a collection inclination angle greater than a preset angle threshold, and the target detection image has target annotation information.
According to an embodiment of the present disclosure, the target detection image may be a ground image of a scene captured obliquely by a platform such as a satellite, an aerostat or an unmanned aerial vehicle, and the collection inclination angle of the target detection image is greater than a preset angle threshold.
According to an embodiment of the present disclosure, the target labeling information of the target detection image may be information that labels the target in the target detection image.
According to embodiments of the present disclosure, there may be multiple target detection images so that the target detection model has sufficient training data.
In operation S202, an object detection image is input to an initial object detection network, and an initial object detection result for the object detection image is output.
According to an embodiment of the present disclosure, the initial target detection network may be a network that performs rough detection on the target detection image, thereby obtaining an initial target detection result.
In operation S203, a pixel depth detection image with pixel depth information is determined according to the target detection image and the camera parameter information.
According to an embodiment of the present disclosure, the camera parameter information may be determined according to a photographing apparatus of the object detection image, and may include a camera height, a diaphragm size, an exposure degree, and the like.
According to the embodiment of the disclosure, the pixel points of the target detection image and the camera parameter information may be combined to determine pixel depth information and thereby determine the pixel depth detection image, where the pixel depth information may be the actual distance between the image acquisition point and the imaged object.
In operation S204, a pixel depth detection image is input into a pixel depth detection network, and a pixel depth detection result for the pixel depth detection image is output.
According to an embodiment of the present disclosure, the pixel depth detection network may be a network that detects a pixel depth in an image, and the pixel depth detection result for the pixel depth detection image may be output by inputting the pixel depth detection image into the pixel depth detection network.
In operation S205, the initial target detection network and the pixel depth detection network are trained based on the target labeling information, the pixel depth information, the initial target detection result, and the pixel depth detection result, and a trained target detection model is obtained.
According to the embodiment of the disclosure, the deviation value of the initial target detection result and the target annotation information can be determined based on the target annotation information and the initial target detection result, the deviation value of the pixel depth detection result and the pixel depth information is determined based on the pixel depth information and the pixel depth detection result, and the initial target detection network and the pixel depth detection network are trained according to the two deviation values, so that the target detection model is obtained.
According to the embodiments of the disclosure, the target detection image carrying target annotation information is input into the initial target detection network to output an initial target detection result; the pixel depth detection image with pixel depth information, determined from the target detection image and the camera parameters, is input into the pixel depth detection network to output a pixel depth detection result; and the initial target detection network and the pixel depth detection network are trained based on the target annotation information, the pixel depth information, the initial target detection result and the pixel depth detection result to obtain the target detection model. By combining the target information in the target detection image with the pixel depth, the target in the target detection image can be determined more accurately, which at least partially solves the technical problem in the related art that the information of the target detection image is not fully utilized and the target detection accuracy is therefore low.
According to an embodiment of the present disclosure, wherein inputting an object detection image to an initial object detection network, outputting an initial object detection result for the object detection image, includes:
inputting the target detection image into a first downsampling layer of an initial target detection network, and outputting a first feature map of a plurality of scales;
Inputting the first feature images with multiple scales into a first fusion feature layer of an initial detection network, and outputting a second feature image with multiple scales;
inputting the second feature images with the multiple scales into an interested region extraction layer of an initial detection network, and outputting the interested region of each second feature image in the second feature images with the multiple scales;
and inputting the multiple regions of interest into a prediction layer of an initial detection network, and outputting an initial target detection result of the target detection image.
According to an embodiment of the present disclosure, the target detection image may be represented as a tensor of size bs×3×512×512, where bs is the number of target detection images in each batch input to the initial target detection network (for example, bs = 8), 3 denotes the three RGB channels of the target detection image, and 512×512 denotes the resolution of the target detection image.
According to embodiments of the present disclosure, the initial target detection network may be built using the MMDetection platform.
According to an embodiment of the present disclosure, the first downsampling layer may be constructed based on a backbone feature extraction network (Backbone); specifically, a ResNet50 network may be used. The target detection image is input into the first downsampling layer, and first feature maps of multiple scales are output. Each scale may correspond to one first downsampling layer, so there may be several first downsampling layers; for example, four first downsampling layers may be used, namely a 4-times, an 8-times, a 16-times and a 32-times downsampling layer, outputting first feature maps of four scales, and other downsampling factors may also be used.
According to the embodiment of the disclosure, the first fused feature layer can fuse the information in the first feature maps of multiple scales and optimize them to obtain the second feature maps of multiple scales. The first fused feature layer may be constructed based on a multi-level feature fusion network (Feature Pyramid Network, FPN).
According to an embodiment of the present disclosure, the region-of-interest extraction layer may extract regions in which a target may exist from the second feature map of each scale and output the regions of interest of each second feature map. The region-of-interest extraction layer may be constructed based on a Region Proposal Network (RPN); there may be multiple regions of interest, and their number may be specified, for example 1000.
According to the embodiment of the disclosure, the prediction layer may use cascaded heads to perform several rounds of category classification and position-coordinate refinement on the regions of interest output by the region-of-interest extraction layer, and thus output the initial target detection result for the target detection image. Specifically, three heads may be cascaded, with IoU thresholds of 0.5, 0.6 and 0.7 in turn.
According to the embodiment of the disclosure, the initial target detection network is utilized to detect the target in the target detection image, so that the position of the target can be initially determined.
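The sketch below shows an abbreviated, MMDetection-style configuration for the initial target detection network described above (ResNet50 backbone, FPN fusion layer, RPN region-of-interest extraction, three cascaded heads with IoU thresholds 0.5/0.6/0.7, 1000 retained proposals). The field names follow the publicly documented MMDetection config schema; all values not stated in the text (channel widths, output indices, anchor settings, and the omitted head fields) are illustrative assumptions, and a working config would require additional fields.

```python
# Abbreviated, hypothetical MMDetection-style config sketch for the initial target
# detection network. Values not stated in the patent text are assumptions.
model = dict(
    type='CascadeRCNN',
    backbone=dict(type='ResNet', depth=50, num_stages=4,
                  out_indices=(0, 1, 2, 3)),             # 4x/8x/16x/32x first feature maps
    neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],
              out_channels=256, num_outs=4),             # first fused feature layer
    rpn_head=dict(type='RPNHead', in_channels=256),      # region-of-interest extraction layer
    roi_head=dict(type='CascadeRoIHead', num_stages=3),  # prediction layer: 3 cascaded heads
    train_cfg=dict(rcnn=[dict(assigner=dict(pos_iou_thr=t))
                         for t in (0.5, 0.6, 0.7)]),      # cascaded IoU thresholds
    test_cfg=dict(rpn=dict(max_per_img=1000)),            # keep 1000 regions of interest
)
```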
According to an embodiment of the present disclosure, wherein inputting a pixel depth detection image into a pixel depth detection network, outputting a pixel depth detection result for the pixel depth detection image, includes:
inputting the pixel depth detection image into a second downsampling layer of a pixel depth detection network, and outputting a third feature map with multiple scales;
inputting the third feature images with multiple scales into a second fusion feature layer of the pixel depth detection network, and outputting a fourth feature image with multiple scales;
and respectively outputting the fourth feature images with multiple scales to a three-dimensional depth sensing layer of the pixel depth detection network, and outputting a pixel depth detection result of the pixel depth detection image.
According to an embodiment of the present disclosure, the second downsampling layer may be the same network layer as the first downsampling layer, and the second fusion feature layer may be the same network layer as the first fusion feature layer, which is not described herein.
According to the embodiment of the disclosure, the fourth feature map of each scale may correspond to one three-dimensional depth perception layer, and feature information extracted from each fourth feature map is fused, so as to output a pixel depth detection result of the pixel depth detection image.
According to the embodiment of the disclosure, a depth residual convolution layer is connected in series after the three-dimensional depth perception layer; the depth residual convolution layer comprises a convolution layer, a BN layer, a ReLU layer and a residual connection. Taking 4-times, 8-times and 16-times downsampled feature maps as an example, the 16-times downsampled feature map is upsampled and concatenated with the 8-times downsampled feature map, and after convolutional dimension reduction it serves as the input of the depth residual convolution layer at the 8-times scale; similarly, the 8-times downsampled feature map is upsampled and concatenated with the 4-times downsampled feature map to serve as the input of the corresponding depth residual convolution layer; and the output of the depth residual convolution layer at the 4-times scale is upsampled by a factor of 2 to output the pixel depth detection result.
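A minimal PyTorch sketch of the depth residual convolution layer and of the coarse-to-fine fusion just described (16x -> 8x -> 4x, then a final 2x upsampling) is given below. Channel counts, the use of 1x1 convolutions for dimension reduction, and the bilinear interpolation mode are illustrative assumptions.

```python
# Sketch only: depth residual convolution layer and cross-scale depth fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthResidualConv(nn.Module):
    """Convolution + BN + ReLU with a residual connection, as described above."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x))) + x  # residual connection

def fuse_depth_features(f4, f8, f16, reduce8, reduce4, drc8, drc4, depth_head):
    """f4 / f8 / f16: fourth feature maps at 4x / 8x / 16x downsampling.
    reduce8 / reduce4: 1x1 convs that reduce channels after concatenation (assumed).
    drc8 / drc4: DepthResidualConv blocks; depth_head: 1-channel output conv."""
    x8 = torch.cat([F.interpolate(f16, scale_factor=2, mode='bilinear',
                                  align_corners=False), f8], dim=1)
    x8 = drc8(reduce8(x8))                                 # 8x-scale depth features
    x4 = torch.cat([F.interpolate(x8, scale_factor=2, mode='bilinear',
                                  align_corners=False), f4], dim=1)
    x4 = drc4(reduce4(x4))                                 # 4x-scale depth features
    return F.interpolate(depth_head(x4), scale_factor=2,   # final 2x upsampling
                         mode='bilinear', align_corners=False)
```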
Fig. 3 schematically illustrates a block diagram of an object detection model according to an embodiment of the present disclosure.
As shown in fig. 3, the object detection model 300 includes an initial object detection network 310 and a pixel depth detection network 320, the initial object detection network 310 includes a first downsampling layer, a first fused feature layer, a region of interest extraction layer, and a prediction layer, and the pixel depth detection network 320 includes a second downsampling layer, a second fused feature layer, and a three-dimensional depth perception layer.
Fig. 4 schematically illustrates a block diagram of a three-dimensional depth perception layer according to an embodiment of the present disclosure.
As shown in fig. 4, taking the 4-times, 8-times and 16-times downsampled feature maps as examples of the third feature maps of multiple scales, a three-dimensional depth perception layer is connected after each third feature map, namely a 3×3 convolution layer, a normalization layer, a ReLU activation layer, another 3×3 convolution layer, a normalization layer, a ReLU activation layer and a depth residual convolution layer; the three-dimensional depth perception layers connected to the 4-times and 8-times downsampled feature maps are similar to the one connected to the 16-times downsampled feature map, which is not repeated here.
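A sketch of this per-scale stack is shown below, reusing the DepthResidualConv block from the previous sketch; the channel width is an assumption and not given in the text.

```python
# Sketch of the per-scale three-dimensional depth perception layer shown in Fig. 4:
# two (3x3 conv + BN + ReLU) stages followed by the depth residual convolution layer.
# DepthResidualConv is the block defined in the sketch above.
import torch.nn as nn

class ThreeDDepthPerceptionLayer(nn.Module):
    def __init__(self, in_channels: int = 256, channels: int = 256):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Conv2d(in_channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            DepthResidualConv(channels),   # depth residual convolution layer
        )

    def forward(self, x):
        return self.stack(x)
```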
According to an embodiment of the present disclosure, training an initial target detection network and a pixel depth detection network based on target labeling information, pixel depth information, an initial target detection result and a pixel depth detection result, to obtain a trained target detection model includes:
determining a target detection loss value based on the initial target detection result and the target labeling information;
determining a pixel depth detection loss value based on the pixel depth detection result and the pixel depth information;
And training an initial target detection network and a pixel depth detection network based on the target detection loss value and the pixel depth detection loss value to obtain a trained target detection model.
According to an embodiment of the present disclosure, a target detection loss value in an initial target detection network may be determined based on an initial target detection result and target annotation information, and a pixel depth detection loss value in a pixel depth detection network may be determined based on a pixel depth detection result and pixel depth information.
According to the embodiment of the disclosure, all parameters in the initial target detection network and the pixel depth detection network can be updated by using a gradient descent method until the network converges, so that a trained target detection model is obtained.
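A minimal joint training-step sketch for the two networks is given below. It assumes PyTorch-style modules det_net and depth_net, loss functions matching formulas (1) and (4), and that the two loss values are simply summed before the gradient step; how the two losses are weighted or combined is not specified in the text, so the summation and all names are assumptions.

```python
# Sketch of one gradient-descent training step over both networks (names are illustrative).
def train_step(det_net, depth_net, optimizer,
               detection_image, annotations, depth_image, depth_gt,
               detection_loss_fn, depth_loss_fn):
    optimizer.zero_grad()
    det_out = det_net(detection_image)                 # initial target detection result
    depth_out = depth_net(depth_image)                 # pixel depth detection result
    l_det = detection_loss_fn(det_out, annotations)    # target detection loss, formula (1)
    l_depth = depth_loss_fn(depth_out, depth_gt)       # pixel depth detection loss, formula (4)
    loss = l_det + l_depth                             # assumed joint objective
    loss.backward()                                    # gradient descent over all parameters
    optimizer.step()
    return loss.item()
```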
According to an embodiment of the present disclosure, wherein the target annotation information includes target coordinate information and target class information, determining a target detection loss value based on an initial target detection result and the target annotation information includes:
determining a target coordinate detection loss value based on the initial target detection result and the target coordinate information;
determining a target class detection loss value based on the initial target detection result and the target class information;
the target detection loss value is determined based on the target coordinate detection loss value and the target class detection loss value.
According to an embodiment of the present disclosure, the target coordinate information may be coordinates of an upper left corner, a lower left corner, an upper right corner, and a lower right corner of the target position frame, and a width and a height of the target position frame, and the target category information may be category information of the target, which may be, for example, a person, a vehicle, an animal, or the like.
According to an embodiment of the present disclosure, the initial target detection result may include a target coordinate detection result and a target class detection result, the target coordinate detection loss value may be determined based on the target coordinate detection result and the target coordinate information, and the target class detection loss value may be determined based on the target class detection result and the target class information.
According to an embodiment of the present disclosure, the target coordinate detection loss value and the target class detection loss value may be added to determine a target detection loss value.
According to an embodiment of the present disclosure, wherein determining a pixel depth detection image with pixel depth information from an object detection image and camera parameter information comprises:
determining pixel depth information of a target detection image according to the target annotation information and the camera parameter information to obtain a first sub-pixel depth detection image;
determining a plurality of clustering areas of the first sub-pixel depth detection image by using a K-means clustering algorithm according to a preset pixel threshold value;
Determining a second sub-pixel depth detection image from the plurality of cluster areas according to the pixel depth values in the plurality of cluster areas;
a pixel depth detection image is determined based on the first sub-pixel depth detection image and the second sub-pixel depth detection image.
According to the embodiment of the disclosure, the coarse depth of each pixel can be calculated from the target annotation information and the camera parameter information on the premise that all pixels in the target detection image lie on the same plane; relative depth information within the same target detection image is generated from the actual size of each target and from the longest and shortest sides of the minimum bounding rectangle given by the target coordinate information in the target annotation information; a relatively accurate primary pixel depth map is then calculated by combining the coarse depth with the relative depth information, and the primary pixel depth map is refined into the first sub-pixel depth detection image by using a conditional radiance field.
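The exact geometric model for the coarse per-pixel depth is not given in the text. As one possible reading of the flat-ground assumption, the sketch below computes a per-row slant range from a pinhole camera at a known height and tilt angle; the formulation, the parameters and their names are all assumptions for illustration only.

```python
# Sketch of a coarse depth map under a flat-ground, pinhole-camera assumption.
import numpy as np

def coarse_ground_depth(height_px, width_px, cam_height_m, tilt_rad, focal_px, cy_px):
    """Approximate distance from the camera to the ground point seen by each pixel row."""
    rows = np.arange(height_px).reshape(-1, 1)
    # angle of each pixel ray below the optical axis, then below the horizontal
    ray_angle = tilt_rad + np.arctan((rows - cy_px) / focal_px)
    ray_angle = np.clip(ray_angle, 1e-3, np.pi / 2)      # ignore rays above the horizon
    depth_rows = cam_height_m / np.sin(ray_angle)        # slant range to the ground plane
    return np.repeat(depth_rows, width_px, axis=1)       # same depth for every column in a row
```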
According to the embodiment of the disclosure, the first sub-pixel depth detection image can be converted into a 0-1 mask image by using the preset pixel threshold; specifically, the preset pixel threshold may be set to 0.65, and the mask image is clustered by a K-means clustering algorithm to obtain a plurality of clustering regions, where each clustering region may correspond to an image.
According to the embodiment of the disclosure, the pixel depth values differ between the clustering regions, and according to these differences, the images corresponding to the two clustering regions with the larger depth values are retained as the second sub-pixel depth detection images.
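A sketch of this step is shown below: threshold the first sub-pixel depth detection image into a 0-1 mask, cluster the masked pixels with K-means, and keep the two clusters with the larger mean depth. The number of clusters and the choice of clustering features (pixel position plus depth) are assumptions not stated in the text.

```python
# Sketch: threshold + K-means clustering + selection of the two deepest regions.
import numpy as np
from sklearn.cluster import KMeans

def select_deep_regions(depth_map, pixel_threshold=0.65, n_clusters=4):
    mask = (depth_map > pixel_threshold).astype(np.uint8)        # 0-1 mask image
    ys, xs = np.nonzero(mask)
    features = np.stack([ys, xs, depth_map[ys, xs]], axis=1)     # position + depth (assumed)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    # rank clusters by mean pixel depth and keep the two deepest ones
    mean_depth = [depth_map[ys[labels == k], xs[labels == k]].mean()
                  for k in range(n_clusters)]
    keep = np.argsort(mean_depth)[-2:]
    region_images = []
    for k in keep:
        region = np.zeros_like(depth_map)
        sel = labels == k
        region[ys[sel], xs[sel]] = depth_map[ys[sel], xs[sel]]
        region_images.append(region)                             # second sub-pixel depth detection images
    return region_images
```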
According to an embodiment of the present disclosure, the first sub-pixel depth detection image and the second sub-pixel depth detection image are taken as pixel depth detection images of a training pixel depth detection network.
According to the embodiment of the disclosure, since the target in the second sub-pixel depth detection image is more accurate, the second sub-pixel depth detection image is used as a training sample of the pixel depth detection network, so that the pixel depth detection network can more quickly determine the pixel depth of the image.
According to an embodiment of the present disclosure, the target detection loss value, the target coordinate detection loss value, and the target class detection loss value are respectively expressed by the following formulas:
L_detection = L_loc + L_cls (1)
Formulas (2) and (3), which define the target coordinate detection loss L_loc and the target class detection loss L_cls respectively, are rendered as images (Figure BDA0004083505800000151 and Figure BDA0004083505800000152) in the original publication.
wherein L_detection represents the target detection loss value, L_loc represents the target coordinate detection loss value, L_cls represents the target class detection loss value, x represents the x-axis coordinate of the upper-left corner of the target position frame in the target coordinate information, y represents the y-axis coordinate of the upper-left corner of the target position frame, w represents the width of the target position frame, h represents the height of the target position frame, t_i represents the target coordinate information, μ_i represents the target coordinate detection result in the initial target detection result, T represents the number of target categories in the target detection image, y_i represents the target class information, and S_j represents the target class detection result in the initial target detection result.
According to an embodiment of the present disclosure, the pixel depth detection loss value may be expressed by the following formula (4):
Formula (4) is rendered as an image (Figure BDA0004083505800000161) in the original publication.
wherein L_depth represents the pixel depth detection loss value, y_p represents the pixel depth detection result, the symbol rendered as Figure BDA0004083505800000162 represents the pixel depth information (the ground-truth pixel depth), p indexes the p-th pixel depth detection image, and n is the number of pixel depth detection images.
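A plausible LaTeX rendering of formula (4), assuming a mean absolute error between predicted and ground-truth pixel depths averaged over the n pixel depth detection images (an assumption, since the published form is only available as an image), is:
L_{depth} = \frac{1}{n} \sum_{p=1}^{n} \left| y_p - \hat{y}_p \right| \quad (4)
where \hat{y}_p here stands for the pixel depth information (ground truth) of the p-th pixel depth detection image.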
Fig. 5 schematically illustrates a flowchart of a method of object detection in combination with camera parameter information according to an embodiment of the present disclosure.
As shown in fig. 5, the method includes operations S501 to S502.
In operation S501, inputting the target image to be detected into the target detection model trained by the target detection model training method combined with the camera parameter information, and outputting an initial target detection result and a pixel depth detection result;
in operation S502, a target detection result is determined using a non-maximum suppression method based on the initial target detection result and the pixel depth detection result.
According to the embodiment of the disclosure, the initial target detection result and the first sub-pixel depth detection image of the target image to be detected are calculated by the target detection model trained as described above; the first sub-pixel depth detection image is clustered to obtain two second sub-pixel depth detection images, the second sub-pixel depth detection images and the pixel depth detection result are fused by a non-maximum suppression method, and the target detection result is determined.
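How the depth information is folded into the detection scores before suppression is not spelled out in the text, so the sketch below only shows the standard IoU-based non-maximum suppression step over the candidate boxes, assuming their scores already reflect the depth fusion; the thresholds and score convention are assumptions.

```python
# Standard non-maximum suppression sketch over candidate detections (numpy arrays).
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_threshold]   # drop overlapping lower-scored boxes
    return keep
```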
Fig. 6 schematically illustrates a block diagram of an object detection model training apparatus incorporating camera parameter information according to an embodiment of the present disclosure.
As shown in fig. 6, the object detection model training apparatus 600 in combination with camera parameter information includes a first acquisition module 610, a first output module 620, a first determination module 630, a second output module 640, and an obtaining module 650.
A first obtaining module 610, configured to obtain a target detection image, where the target detection image has target labeling information;
a first output module 620, configured to input the target detection image to an initial target detection network, and output an initial target detection result for the target detection image;
a first determining module 630, configured to determine a pixel depth detection image with pixel depth information according to the target detection image and the camera parameter information;
A second output module 640, configured to input the pixel depth detection image into a pixel depth detection network, and output a pixel depth detection result of the pixel depth detection image;
the obtaining module 650 is configured to train the initial target detection network and the pixel depth detection network based on the initial target detection result and the pixel depth detection result, and obtain a trained target detection model.
According to an embodiment of the present disclosure, wherein the first output module 620 for inputting the object detection image to the initial object detection network and outputting the initial object detection result of the object detection image includes:
the first output unit is used for inputting the target detection image into a first downsampling layer of an initial target detection network and outputting a first characteristic map with multiple scales;
the second output unit is used for inputting the first feature images with a plurality of scales to a first fusion feature layer of the initial detection network and outputting a second feature image with a plurality of scales;
the third output unit is used for inputting the second feature images with a plurality of scales to the region-of-interest extraction layer of the initial detection network and outputting the region-of-interest of each second feature image in the second feature images with a plurality of scales;
And the fourth output unit is used for inputting the multiple regions of interest into a prediction layer of the initial detection network and outputting an initial target detection result of the target detection image.
According to an embodiment of the present disclosure, wherein the second output module 640 for inputting the pixel depth detection image into the pixel depth detection network and outputting the pixel depth detection result of the pixel depth detection image includes:
a fifth output unit, configured to input the pixel depth detection image to a second downsampling layer of the pixel depth detection network, and output a third feature map of multiple scales;
a sixth output unit, configured to input a third feature map of multiple scales to a second fused feature layer of the pixel depth detection network, and output a fourth feature map of multiple scales;
and the seventh output unit is used for respectively outputting the fourth feature images with multiple scales to the three-dimensional depth perception layer of the pixel depth detection network and outputting a pixel depth detection result of the pixel depth detection image.
According to an embodiment of the present disclosure, the obtaining module 650 for training the initial target detection network and the pixel depth detection network based on the target labeling information, the pixel depth information, the initial target detection result and the pixel depth detection result, to obtain a trained target detection model includes:
a first obtaining unit, configured to determine a target detection loss value based on the initial target detection result and the target labeling information;
a second obtaining unit, configured to determine a pixel depth detection loss value based on the pixel depth detection result and the pixel depth information;
and the third obtaining unit is used for training the initial target detection network and the pixel depth detection network based on the target detection loss value and the pixel depth detection loss value to obtain a trained target detection model.
According to an embodiment of the present disclosure, wherein the target annotation information includes target coordinate information and target class information, and the first obtaining unit for determining the target detection loss value based on the initial target detection result and the target annotation information includes:
the first obtaining subunit is used for determining a target coordinate detection loss value based on the initial target detection result and the target coordinate information;
the second obtaining subunit is used for determining a target class detection loss value based on the initial target detection result and the target class information;
and a third obtaining subunit, configured to determine a target detection loss value based on the target coordinate detection loss value and the target class detection loss value.
According to an embodiment of the present disclosure, the first determining module 630 for determining a pixel depth detection image with pixel depth information according to the target detection image and the camera parameter information includes:
The first determining unit is used for determining the pixel depth information of the target detection image according to the target annotation information and the camera parameter information to obtain a first sub-pixel depth detection image;
the second determining unit is used for determining a plurality of clustering areas of the first sub-pixel depth detection image by using a K-means clustering algorithm according to a preset pixel threshold value;
a third determining unit configured to determine a second sub-pixel depth detection image from the plurality of cluster areas according to the pixel depth values in the plurality of cluster areas;
and a fourth determination unit configured to determine a pixel depth detection image based on the first sub-pixel depth detection image and the second sub-pixel depth detection image.
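To make the clustering step concrete, the sketch below runs K-means over the pixel depth values of a first sub-pixel depth detection image and selects clusters by their mean depth. The number of clusters, the depth-based selection rule, and the two-channel combination of the sub-images are assumptions; the patent states only that a preset pixel threshold and the pixel depth values in the cluster areas are used.

```python
# K-means sketch over pixel depth values. The cluster count, the selection of the second
# sub-image by mean cluster depth, and the two-channel combination are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_pixel_depth_image(first_sub: np.ndarray, k: int = 4,
                            depth_threshold: float = 50.0) -> np.ndarray:
    h, w = first_sub.shape
    # Cluster every pixel of the first sub-pixel depth detection image by its depth value.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(first_sub.reshape(-1, 1))
    labels = labels.reshape(h, w)
    # Second sub-pixel depth detection image: keep only clusters whose mean depth is
    # below the preset threshold (assumed selection rule).
    second_sub = np.zeros_like(first_sub)
    for c in range(k):
        mask = labels == c
        if first_sub[mask].mean() < depth_threshold:
            second_sub[mask] = first_sub[mask]
    # Pixel depth detection image from both sub-images (assumed two-channel stacking).
    return np.stack([first_sub, second_sub], axis=0)

# Usage with a synthetic 64x64 depth map (values in metres).
demo = build_pixel_depth_image(np.random.uniform(1.0, 100.0, size=(64, 64)))
```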
According to an embodiment of the present disclosure, the target detection loss value, the target coordinate detection loss value, and the target class detection loss value are respectively expressed by the following formulas:
L_detection = L_loc + L_cls    (1)

[Equation (2), which defines the target coordinate detection loss value L_loc, appears only as an image in the original publication and is not reproduced here.]

[Equation (3), which defines the target class detection loss value L_cls, appears only as an image in the original publication and is not reproduced here.]

wherein L_detection represents the target detection loss value, L_loc represents the target coordinate detection loss value, L_cls represents the target class detection loss value, x represents the upper-left x-axis coordinate value of the target position frame in the target coordinate information, y represents the upper-left y-axis coordinate value of the target position frame, w represents the width value of the target position frame, h represents the height value of the target position frame, t_i represents the target coordinate information, μ_i represents the target coordinate detection result in the initial target detection result, T represents the number of target categories in the target detection image, y_i represents the target class information, and S_j represents the target class detection result in the initial target detection results.
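The following Python sketch mirrors the decomposition in equation (1). Because equations (2) and (3) are available only as images in the original filing, the smooth-L1 box term and the cross-entropy class term used below are standard stand-ins inferred from the variable definitions (t_i and μ_i over x, y, w, h; y_i and S_j over T categories), not a transcription of the exact patented formulas.

```python
# Sketch of the loss decomposition in equation (1). The concrete forms of equations (2)
# and (3) are assumed (smooth L1 and cross entropy), since the originals are images.
import torch
import torch.nn.functional as F

def detection_loss(pred_boxes: torch.Tensor,   # mu: predicted (x, y, w, h) per target
                   gt_boxes: torch.Tensor,     # t: annotated (x, y, w, h) per target
                   pred_scores: torch.Tensor,  # S: class scores, shape (N, T)
                   gt_labels: torch.Tensor):   # y: class indices, shape (N,)
    l_loc = F.smooth_l1_loss(pred_boxes, gt_boxes)    # stand-in for equation (2)
    l_cls = F.cross_entropy(pred_scores, gt_labels)   # stand-in for equation (3)
    l_detection = l_loc + l_cls                       # equation (1)
    return l_detection, l_loc, l_cls

# Usage with two dummy targets and T = 5 categories.
loss, _, _ = detection_loss(torch.rand(2, 4), torch.rand(2, 4),
                            torch.randn(2, 5), torch.tensor([1, 3]))
```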
Fig. 7 schematically illustrates a block diagram of an object detection apparatus incorporating camera parameter information according to an embodiment of the present disclosure.
As shown in fig. 7, the object detection device 700 in combination with camera parameter information includes a third output module 710 and a second determination module 720.
The third output module 710 is configured to input the target image to be detected into the target detection model trained by the target detection model training method combined with the camera parameter information, and output an initial target detection result and a pixel depth detection result;
the second determining module 720 is configured to determine the target detection result by using a non-maximum suppression method based on the initial target detection result and the pixel depth detection result.
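A minimal sketch of this inference step is shown below: standard non-maximum suppression is applied to the initial detection boxes, and the mean of the pixel depth detection result inside each surviving box is attached to the output. Exactly how the pixel depth result participates in the suppression is not detailed in this section, so the per-box depth aggregation here is an assumption.

```python
# NMS sketch for module 720. The per-box mean-depth aggregation is an assumption made
# for illustration; the patent only states that NMS uses both detection outputs.
import torch
import torchvision.ops as ops

def detect_with_depth(boxes: torch.Tensor,      # (N, 4) in (x1, y1, x2, y2)
                      scores: torch.Tensor,     # (N,) confidence scores
                      depth_map: torch.Tensor,  # (H, W) pixel depth detection result
                      iou_threshold: float = 0.5):
    keep = ops.nms(boxes, scores, iou_threshold)   # indices of boxes kept after NMS
    results = []
    for i in keep:
        x1, y1, x2, y2 = boxes[i].round().long().tolist()
        box_depth = depth_map[y1:y2, x1:x2].mean() # mean predicted depth inside the box
        results.append((boxes[i], scores[i], box_depth))
    return results

# Usage with two overlapping dummy boxes on a 256x256 depth map.
out = detect_with_depth(torch.tensor([[10., 10., 60., 60.], [12., 12., 62., 62.]]),
                        torch.tensor([0.9, 0.8]),
                        torch.rand(256, 256))
```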
Any number of the modules, sub-modules, units, or sub-units according to embodiments of the present disclosure, or at least part of the functionality of any number of them, may be implemented in one module. Any one or more of the modules, sub-modules, units, or sub-units according to embodiments of the present disclosure may be split into multiple modules for implementation. Any one or more of the modules, sub-modules, units, or sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, an Application Specific Integrated Circuit (ASIC), or by hardware or firmware in any other reasonable manner that integrates or encapsulates the circuit, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, one or more of the modules, sub-modules, units, or sub-units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, perform the corresponding functions.
For example, any of the first acquisition module 610, the first output module 620, the first determination module 630, the second output module 640, and the obtaining module 650 may be combined into one module/unit/sub-unit, or any one of them may be split into multiple modules/units/sub-units. Alternatively, at least part of the functionality of one or more of these modules/units/sub-units may be combined with at least part of the functionality of other modules/units/sub-units and implemented in one module/unit/sub-unit. According to embodiments of the present disclosure, at least one of the first acquisition module 610, the first output module 620, the first determination module 630, the second output module 640, and the obtaining module 650 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, an Application Specific Integrated Circuit (ASIC), or by hardware or firmware in any other reasonable manner that integrates or encapsulates the circuit, or in any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the first acquisition module 610, the first output module 620, the first determination module 630, the second output module 640, and the obtaining module 650 may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.
It should be noted that, in the embodiments of the present disclosure, the object detection model training device combining camera parameters corresponds to the object detection model training method combining camera parameters; for the specific description of the device, reference may be made to the description of the method, which is not repeated here.
Fig. 8 schematically illustrates a block diagram of an electronic device adapted to implement the above-described object detection model training method in combination with camera parameters, according to an embodiment of the disclosure. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, an electronic device 800 according to an embodiment of the present disclosure includes a processor 801 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage section 808 into a Random Access Memory (RAM) 803. The processor 801 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 801 may also include on-board memory for caching purposes. The processor 801 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 803, various programs and data required for the operation of the electronic device 800 are stored. The processor 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. The processor 801 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 802 and/or the RAM 803. Note that the programs may also be stored in one or more memories other than the ROM 802 and the RAM 803. The processor 801 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 800 may also include an input/output (I/O) interface 805, which is also connected to the bus 804. The electronic device 800 may also include one or more of the following components connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read therefrom is installed into the storage section 808 as needed.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and recombined in various ways, even if such combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and recombined in various ways without departing from the spirit and teachings of the present disclosure. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A method of training a target detection model in combination with camera parameters, comprising:
acquiring a target detection image, wherein the target detection image is an image with a collection inclination angle larger than a preset angle threshold value, and the target detection image is provided with target annotation information;
inputting the target detection image into an initial target detection network, and outputting an initial target detection result of the target detection image;
determining a pixel depth detection image with pixel depth information according to the target detection image and the camera parameter information;
inputting the pixel depth detection image into a pixel depth detection network, and outputting a pixel depth detection result of the pixel depth detection image;
Training the initial target detection network and the pixel depth detection network based on the target labeling information, the pixel depth information, the initial target detection result and the pixel depth detection result to obtain a trained target detection model.
2. The method of claim 1, wherein the inputting the object detection image into an initial object detection network, outputting an initial object detection result for the object detection image, comprises:
inputting the target detection image into a first downsampling layer of the initial target detection network, and outputting a first feature map with multiple scales;
inputting the first feature images with multiple scales into a first fusion feature layer of the initial detection network, and outputting second feature images with multiple scales;
inputting the second feature images with multiple scales into a region-of-interest extraction layer of the initial detection network, and outputting the region-of-interest of each of the second feature images with multiple scales;
and inputting a plurality of regions of interest into a prediction layer of the initial detection network, and outputting the initial target detection result of the target detection image.
3. The method of claim 1, wherein the inputting the pixel depth detection image into a pixel depth detection network, outputting a pixel depth detection result for the pixel depth detection image, comprises:
inputting the pixel depth detection image to a second downsampling layer of the pixel depth detection network, and outputting a third feature map with multiple scales;
inputting the third feature images with multiple scales into a second fusion feature layer of the pixel depth detection network, and outputting a fourth feature image with multiple scales;
and respectively outputting the fourth feature images with multiple scales to a three-dimensional depth perception layer of the pixel depth detection network, and outputting the pixel depth detection result of the pixel depth detection image.
4. The method of claim 1, wherein the training the initial target detection network and the pixel depth detection network based on the target labeling information, the pixel depth information, the initial target detection result, and the pixel depth detection result, results in a trained target detection model, comprising:
determining a target detection loss value based on the initial target detection result and the target labeling information;
Determining a pixel depth detection loss value based on the pixel depth detection result and the pixel depth information;
and training the initial target detection network and the pixel depth detection network based on the target detection loss value and the pixel depth detection loss value to obtain a trained target detection model.
5. The method of claim 4, wherein the target annotation information comprises target coordinate information and target category information, the determining a target detection loss value based on the initial target detection result and the target annotation information comprising:
determining a target coordinate detection loss value based on the initial target detection result and the target coordinate information;
determining a target class detection loss value based on the initial target detection result and the target class information;
and determining the target detection loss value based on the target coordinate detection loss value and the target class detection loss value.
6. The method of claim 1, wherein the determining a pixel depth detection image with pixel depth information from the target detection image and camera parameter information comprises:
determining the pixel depth information of the target detection image according to the target annotation information and the camera parameter information to obtain a first sub-pixel depth detection image;
Determining a plurality of clustering areas of the first sub-pixel depth detection image by using a K-means clustering algorithm according to a preset pixel threshold;
determining a second sub-pixel depth detection image from the plurality of cluster regions according to pixel depth values in the plurality of cluster regions;
the pixel depth detection image is determined based on the first sub-pixel depth detection image and the second sub-pixel depth detection image.
7. The method of claim 5, wherein the target detection loss value, the target coordinate detection loss value, and the target class detection loss value are respectively expressed by the following formulas:
L_detection = L_loc + L_cls    (1)

[Equation (2), defining the target coordinate detection loss value L_loc, is presented as an image in the original claim and is not reproduced here.]

[Equation (3), defining the target class detection loss value L_cls, is presented as an image in the original claim and is not reproduced here.]

wherein L_detection represents the target detection loss value, L_loc represents the target coordinate detection loss value, L_cls represents the target class detection loss value, x represents the upper-left x-axis coordinate value of the target position frame in the target coordinate information, y represents the upper-left y-axis coordinate value of the target position frame in the target coordinate information, w represents the width value of the target position frame in the target coordinate information, h represents the height value of the target position frame in the target coordinate information, t_i represents the target coordinate information, μ_i represents the target coordinate detection result in the initial target detection result, T represents the number of target categories in the target detection image, y_i represents the target class information, and S_j represents the target class detection result in the initial target detection results.
8. A target detection method in combination with camera parameter information, comprising:
inputting an image of a target to be detected into a target detection model trained by the method according to any one of claims 1 to 7, and outputting an initial target detection result and a pixel depth detection result;
and determining a target detection result by using a non-maximum value inhibition method based on the initial target detection result and the pixel depth detection result.
9. An object detection model training device combined with camera parameter information, comprising:
the first acquisition module is used for acquiring a target detection image, wherein the target detection image is provided with target annotation information;
the first output module is used for inputting the target detection image into an initial target detection network and outputting an initial target detection result of the target detection image;
a first determining module, configured to determine a pixel depth detection image with pixel depth information according to the target detection image and camera parameter information;
the second output module is used for inputting the pixel depth detection image into a pixel depth detection network and outputting a pixel depth detection result of the pixel depth detection image;
The obtaining module is used for training the initial target detection network and the pixel depth detection network based on the initial target detection result and the pixel depth detection result to obtain a trained target detection model.
10. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 8.
CN202310130057.2A 2023-02-08 2023-02-08 Target detection model training method and target detection method combining camera parameters Pending CN116188598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310130057.2A CN116188598A (en) 2023-02-08 2023-02-08 Target detection model training method and target detection method combining camera parameters

Publications (1)

Publication Number Publication Date
CN116188598A true CN116188598A (en) 2023-05-30

Family

ID=86434127


Country Status (1)

Country Link
CN (1) CN116188598A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Diao Wenhui

Inventor after: Li Xuexue

Inventor after: Li Junxi

Inventor after: Shen Zhiping

Inventor before: Diao Wenhui

Inventor before: Lin Xiangyang

Inventor before: Li Xuexue

Inventor before: Qu Xiaofei

Inventor before: Li Junxi

Inventor before: Shen Zhiping
