CN114612872A - Target detection method, target detection device, electronic equipment and computer-readable storage medium - Google Patents

Target detection method, target detection device, electronic equipment and computer-readable storage medium

Info

Publication number
CN114612872A
Authority
CN
China
Prior art keywords
sub
image
target
detection result
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111562162.0A
Other languages
Chinese (zh)
Inventor
蒋乐
陈健
黄雨安
唐迪锋
宋勇
欧阳晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Yaxin Technology Co ltd
Original Assignee
Guangzhou Yaxin Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yaxin Technology Co ltd filed Critical Guangzhou Yaxin Technology Co ltd
Priority to CN202111562162.0A priority Critical patent/CN114612872A/en
Publication of CN114612872A publication Critical patent/CN114612872A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/24 Classification techniques
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a target detection method, a target detection device, electronic equipment and a computer-readable storage medium, and the method comprises the following steps: obtaining an oversized pixel image, and reducing the oversized pixel image by at least one preset multiple to obtain a corresponding scale image; performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-images, and inputting each sub-image into a trained target detection model to obtain a detection result of the sub-image; fusing the detection results of the sub-images to obtain a detection result corresponding to the scale image; and fusing the detection results of the images of all scales to obtain the detection result of the oversized pixel image. According to the scheme, a target detection model is adopted to detect the sub-images obtained by reducing and segmenting the super-large pixel image to obtain the detection result of each sub-image, and the detection result of the super-large pixel image is then obtained through multiple rounds of fusion based on the detection results of the sub-images, so that the targets in the super-large pixel image are detected.

Description

Target detection method, target detection device, electronic equipment and computer-readable storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target detection method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
Target detection technology is widely applied in fields such as intelligent monitoring, automatic driving and smart cities, and the detection process is shown in FIG. 1 (in the figure, "Person" indicates that the object type in the corresponding object frame is a "pedestrian", and "Vehicle" indicates that the object type in the corresponding object frame is a "vehicle"). Currently, mainstream target detection algorithms are mainly divided into two major classes: two-stage detection algorithms represented by Faster R-CNN, and one-stage detection algorithms represented by SSD (Single Shot MultiBox Detector) and the YOLO (You Only Look Once) series.
The super-large pixel scene generally refers to scenes with wide coverage, such as crossroads, stations, large-scale event sites, commercial squares and their surrounding areas. Images of such scenes contain hundreds of millions of pixels, the global view covers natural scenes of several square kilometers and may contain thousands of people, the scale of a single target can vary by a factor of hundreds, and the appearance features of a local target at maximum magnification remain clearly distinguishable. An image in a super-large pixel scene is shown in FIG. 2.
At present, the technology for detecting pedestrian and vehicle targets in conventional images, which cover small-range scenes at low video image resolution, is relatively mature. To better guarantee people's safe travel and intelligent interactive entertainment, the demand for monitoring and analyzing pedestrian and vehicle targets in large scenes has increased sharply; however, the characteristics and difficulties of images in such scenes, which differ from those of conventional images, make it impossible for traditional target detection algorithms to directly perform effective image analysis on these scenes.
Disclosure of Invention
The purpose of this application is to solve at least one of the above technical defects, and the technical solution provided by this application embodiment is as follows:
in a first aspect, an embodiment of the present application provides a target detection method, including:
obtaining an oversized pixel image, and reducing the oversized pixel image by at least one preset multiple to obtain a corresponding scale image;
for each scale image, performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-images, and inputting each sub-image into a trained target detection model to obtain a detection result of the sub-image, wherein the trained target detection model is obtained by training a sub-image sample marked with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient;
fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image;
and fusing the detection results of the images of all scales corresponding to the oversized pixel image to obtain the detection result of the oversized pixel image.
In an optional embodiment of the present application, for each scale image, performing sliding segmentation on the scale image by using a sliding window of a first preset size to obtain a plurality of corresponding sub-images, including:
sliding the sliding window on the scale image according to a preset step length, wherein the corresponding area of the sliding window on the scale image is a sub-image after each sliding, and the ratio of the preset step length to the width of the sliding window is in a preset proportional range;
and if the sliding window exceeds the boundary of the scale image after the sliding once, translating the sliding window into the scale image to obtain a corresponding sub-image.
In an optional embodiment of the present application, the target detection model includes a Backbone module, an intermediate Neck module, and an output module;
inputting each sub-graph into a trained target detection model to obtain a detection result of the sub-graph, wherein the detection result comprises the following steps:
respectively extracting target features of the sub-graph through a Transformer layer and a deformable convolution DCN layer in the Backbone module; fusing the target feature information through the Neck module to obtain corresponding fusion features; and finally, respectively outputting corresponding initial detection results through the output module based on the fusion features corresponding to at least two network layers in the Neck module, and obtaining the detection result corresponding to the sub-graph based on the plurality of initial detection results.
In an optional embodiment of the present application, obtaining a detection result corresponding to the sub-graph based on a plurality of initial detection results includes:
projecting all target frames corresponding to a plurality of initial detection results into the subgraph, and acquiring at least one group of target frames with first intersection ratio IOU not less than a first preset threshold value;
for each group of target frames with the first IOU not smaller than the first preset threshold, acquiring corresponding fusion target frames by utilizing a weighted frame fusion WBF algorithm based on the group of target frames with the first IOU not smaller than the first preset threshold;
and obtaining a detection result of the subgraph based on the fusion target frame corresponding to each group of target frames with the first IOU not less than the first preset threshold value and other target frames except the target frames with the first IOU not less than the first preset threshold value.
In an alternative embodiment of the present application, the trained target detection model is obtained by training as follows:
reducing at least one oversized pixel image sample marked with a detection result according to at least one preset multiple to obtain a corresponding scale image sample, and performing sliding segmentation on each scale image sample by using a sliding window with a first preset size to obtain a preset number of sub-image samples;
performing joint data enhancement on each sub-image sample to obtain a corresponding sub-image sample after data enhancement, and training an initial target detection model by using a preset amount of sub-image samples after data enhancement until a loss function meets a preset condition to obtain a trained target detection model;
wherein the loss function comprises a target frame coordinate loss sub-function and a quality focus loss QFL sub-function derived from the target classification loss sub-function and the target frame confidence loss sub-function.
In an optional embodiment of the present application, performing joint data enhancement on each sub-graph sample to obtain a corresponding data-enhanced sub-graph sample includes:
obtaining a detection result that the target frame in each sub-image sample is not larger than a second preset size;
and copying the target frame corresponding to each detection result, performing translation and rotation on the target frame by a preset angle, and pasting the target frame to a non-target area of the sub-image to obtain a corresponding sub-image sample with enhanced data.
In an optional embodiment of the present application, the fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image includes:
splicing sub-images with detection results based on the segmentation mode of each scale image;
for a front sub-image in two adjacent front and rear sub-images, if a target frame corresponding to the detection result of the front sub-image is positioned on the left side of a central line of an overlapping region of the two sub-images or is intersected with the central line, retaining the detection result, and if the target frame corresponding to the detection result of the front sub-image is positioned on the right side of the central line, discarding the detection result; and for the subsequent subgraph in the two adjacent front and rear subgraphs, if the target frame corresponding to the detection result of the subsequent subgraph is positioned on the right side of the central line, retaining the detection result, and if the target frame corresponding to the detection result of the subsequent subgraph is positioned on the left side of the central line or is intersected with the central line, discarding the detection result.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
the scale image acquisition module is used for acquiring the super-large pixel image and reducing the super-large pixel image according to at least one preset multiple to obtain a corresponding scale image;
the subgraph obtaining and detecting module is used for carrying out sliding segmentation on each scale image by using a sliding window with a first preset size to obtain a plurality of corresponding subgraphs, inputting each subgraph into a trained target detection model to obtain a detection result of the subgraph, wherein the trained target detection model is obtained by training a subgraph sample marked with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient;
the first detection result fusion module is used for fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image;
and the second detection result fusion module is used for fusing the detection results of the images of all scales corresponding to the super-large pixel image to obtain the detection result of the super-large pixel image.
In an optional embodiment of the present application, the sub-graph obtaining and detecting module is specifically configured to:
sliding the sliding window on the scale image according to a preset step length, wherein the corresponding area of the sliding window on the scale image is a sub-image after each sliding, and the ratio of the preset step length to the width of the sliding window is in a preset proportional range;
and if the sliding window exceeds the boundary of the scale image after the sliding once, translating the sliding window into the scale image to obtain a corresponding sub-image.
In an optional embodiment of the present application, the target detection model includes a Backbone module, an intermediate Neck module, and an output module; the subgraph acquisition and detection module is specifically configured to:
respectively extracting target features of the sub-graph through a Transformer layer and a deformable convolution DCN layer in the Backbone module; fusing the target feature information through the Neck module to obtain corresponding fusion features; and finally, respectively outputting corresponding initial detection results through the output module based on the fusion features corresponding to at least two network layers in the Neck module, and obtaining the detection result corresponding to the sub-graph based on the plurality of initial detection results.
In an optional embodiment of the present application, the sub-graph obtaining and detecting module is further configured to:
projecting all target frames corresponding to a plurality of initial detection results into the subgraph, and acquiring at least one group of target frames with first intersection ratio IOU not less than a first preset threshold value;
for each group of target frames with the first IOU not smaller than the first preset threshold, acquiring corresponding fusion target frames by utilizing a weighted frame fusion WBF algorithm based on the group of target frames with the first IOU not smaller than the first preset threshold;
and obtaining a detection result of the subgraph based on the fusion target frame corresponding to each group of target frames with the first IOU not less than the first preset threshold value and other target frames except the target frames with the first IOU not less than the first preset threshold value.
In an optional embodiment of the present application, the apparatus further comprises a training module for:
reducing at least one oversized pixel image sample marked with a detection result according to at least one preset multiple to obtain a corresponding scale image sample, and performing sliding segmentation on each scale image sample by using a sliding window with a first preset size to obtain a preset number of sub-image samples;
performing joint data enhancement on each sub-image sample to obtain a corresponding sub-image sample after data enhancement, and training an initial target detection model by using a preset amount of sub-image samples after data enhancement until a loss function meets a preset condition to obtain a trained target detection model;
wherein the loss function comprises a target frame coordinate loss sub-function and a quality focus loss QFL sub-function derived from the target classification loss sub-function and the target frame confidence loss sub-function.
In an optional embodiment of the present application, the training module is specifically configured to:
obtaining a detection result that the target frame in each sub-image sample is not larger than a second preset size;
and copying the target frame corresponding to each detection result, performing translation and rotation on the target frame by a preset angle, and pasting the target frame to a non-target area of the sub-image to obtain a corresponding sub-image sample with enhanced data.
In an optional embodiment of the present application, the first detection result fusion module is specifically configured to:
splicing sub-images with detection results based on the segmentation mode of each scale image;
for a front sub-image in two adjacent front and rear sub-images, if a target frame corresponding to the detection result of the front sub-image is positioned on the left side of a central line of an overlapping region of the two sub-images or is intersected with the central line, retaining the detection result, and if the target frame corresponding to the detection result of the front sub-image is positioned on the right side of the central line, discarding the detection result; and for the subsequent subgraph in the two adjacent front and rear subgraphs, if the target frame corresponding to the detection result of the subsequent subgraph is positioned on the right side of the central line, retaining the detection result, and if the target frame corresponding to the detection result of the subsequent subgraph is positioned on the left side of the central line or is intersected with the central line, discarding the detection result.
In a third aspect, an embodiment of the present application provides an electronic device, including a memory and a processor;
the memory has a computer program stored therein;
a processor configured to execute a computer program to implement the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
In a fourth aspect, this application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method provided in the embodiments of the first aspect or any optional embodiment of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product or a computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device implements the method provided in the embodiment of the first aspect or any optional embodiment of the first aspect.
The technical scheme provided by the application brings the beneficial effects that:
the method comprises the steps of obtaining a plurality of scale images by reducing an oversized pixel image, obtaining a plurality of sub-images by sliding and segmenting each scale image through a sliding window, inputting each sub-image into a target detection model to obtain a detection result of each sub-image, fusing the detection results of all sub-images of each scale image to obtain a detection result of the scale image, and finally fusing the detection results of all scale images of the oversized pixel image to obtain a detection result of the oversized pixel image. According to the scheme, the target detection model is adopted to detect the subgraphs after the super-large pixel image is reduced and segmented to obtain the detection result of each subgraph, and then the detection results of the super-large pixel image are obtained after multiple times of fusion based on the detection results of the subgraphs, so that the detection of the target in the super-large pixel image is realized.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a prior art target detection process;
FIG. 2 is an exemplary diagram of an image in a very large pixel scene;
fig. 3 is a schematic flowchart of a target detection method according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an example of a Resize operation and a sliding window split operation on a super-large pixel image according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a structure of an object detection model in an example of an embodiment of the present application;
FIG. 6 is a diagram illustrating a structure of a Transformer in a target detection model according to an embodiment of the present application;
FIG. 7 is a diagram illustrating copy-based joint data enhancement of a sub-graph sample in an example of an embodiment of the present application;
fig. 8 is a schematic diagram of a detection result fusion process of two adjacent front and back subgraphs in an example of the embodiment of the application;
FIG. 9 is a flowchart illustrating an overall process for implementing object detection in an example of an embodiment of the present application;
fig. 10 is a block diagram illustrating a structure of an object detection apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 3 is a schematic flowchart of a target detection method provided in an embodiment of the present application, and as shown in fig. 3, the method may include:
step S301, obtaining an oversized pixel image, and reducing the oversized pixel image according to at least one preset multiple to obtain a corresponding scale image.
The super-large pixel image, that is, the image corresponding to the super-large pixel scene, needs to detect and acquire the target in the super-large pixel image, such as a pedestrian, a vehicle, and the like.
Specifically, after the super-large pixel image is obtained, it may be reduced in size for convenience of subsequent processing; in the embodiment of the application, the super-large pixel image may be reduced according to one or more preset multiples. The reduction process may also be referred to as performing "Resize" processing on the super-large pixel image, where the preset multiple used in the "Resize" processing may be 0.2, 0.4 or 0.6; in other words, the super-large pixel image may be scaled to 0.2, 0.4 and 0.6 times its original size, respectively, so as to obtain three corresponding scale images.
It can be understood that the preset multiple may be selected according to actual requirements, and the embodiment of the present application is not limited.
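For illustration only, the multi-scale reduction step described above can be sketched as follows; the scale factors 0.2, 0.4 and 0.6, the use of OpenCV's resize function and the function name build_scale_images are assumptions of this example, not requirements of the scheme:

```python
import cv2

def build_scale_images(huge_image, scale_factors=(0.2, 0.4, 0.6)):
    """Reduce the super-large pixel image by each preset multiple and return
    the corresponding scale images together with their scale factors."""
    scale_images = []
    for s in scale_factors:
        h, w = huge_image.shape[:2]
        resized = cv2.resize(huge_image, (int(w * s), int(h * s)),
                             interpolation=cv2.INTER_AREA)
        scale_images.append((s, resized))
    return scale_images
```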
Step S302, for each scale image, performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-images, and inputting each sub-image into a trained target detection model to obtain a detection result of the sub-image, wherein the trained target detection model is obtained by training a sub-image sample labeled with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient.
Specifically, each scale image obtained after the super-large pixel image is reduced is segmented; that is, a sliding window with a first preset size is adopted to segment each scale image in a sliding manner, dividing the scale image into a plurality of sub-images. Then, each sub-image obtained by segmenting each scale image is input into the trained target detection model to obtain the detection result of each sub-image, namely the coordinates of the target frames in each sub-image, the target category of the target in each target frame, and the confidence of each target frame.
It can be understood that the subgraph obtained by segmenting the images of all scales is equivalent to an image under a small scene, or is equivalent to a conventional image, and the subgraph can be directly input into a target detection model for processing, and then the detection results of all subgraphs are fused to obtain the detection result of the image of the whole scale.
It should be noted that the first preset size may be selected according to actual requirements, or may be determined by referring to the labeled information in the training process, which will be described in detail later.
And step S303, fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image.
Specifically, after the detection results of the sub-images of each scale image have been obtained in the previous step, the detection results of all sub-images corresponding to the scale image are fused, so that the detection result of the scale image can be obtained.
And S304, fusing the detection results of the images of all scales corresponding to the oversized pixel image to obtain the detection result of the oversized pixel image.
Specifically, the detection results of each scale image corresponding to the super large pixel image are obtained in the previous step, and then the detection results of all the scale images corresponding to the super large pixel image are fused, so that the detection result corresponding to the super large pixel image can be obtained, and the target detection of the super large pixel image is realized.
According to the scheme provided by the application, a plurality of scale images are obtained by reducing the super-large pixel image, a sliding window is adopted to perform sliding segmentation on each scale image to obtain a plurality of sub-images, each sub-image is input into a target detection model to obtain the detection result of each sub-image, then the detection results of all sub-images of each scale image are fused to obtain the detection result of the scale image, and finally the detection results of all scale images of the super-large pixel image are fused to obtain the detection result of the super-large pixel image. In this scheme, a target detection model is adopted to detect the sub-images obtained by reducing and segmenting the super-large pixel image to obtain the detection result of each sub-image, and the detection result of the super-large pixel image is then obtained through multiple rounds of fusion based on the detection results of the sub-images, so that the targets in the super-large pixel image are detected.
In an optional embodiment of the present application, for each scale image, performing sliding segmentation on the scale image by using a sliding window of a first preset size to obtain a plurality of corresponding sub-images, including:
sliding the sliding window on the scale image according to a preset step length, wherein the corresponding area of the sliding window on the scale image is a sub-image after each sliding, and the ratio of the preset step length to the width of the sliding window is in a preset proportional range;
and if the sliding window exceeds the boundary of the scale image after the sliding once, translating the sliding window into the scale image to obtain a corresponding sub-image.
Specifically, when each scale image is subjected to sliding segmentation, the size of a sliding window is determined first, and then each scale image is subjected to sliding segmentation by using the sliding window. Specifically, a sliding window is slid on each scale image according to a preset step length, and an area on the scale image covered by the sliding window is used as a sub-image when the sliding window is slid once. The subgraph generated by two sliding operations before and after the sliding process can have an overlapping area (overlap), and the size of the overlapping area is controlled by a preset step length.
As shown in fig. 4, a Resize operation is performed on the super-large pixel image to reduce it to 0.2, 0.4 and 0.6 times its original size (i.e., three preset multiples are used), so as to obtain three corresponding scale images. The preset step length is then chosen such that the overlap proportion between two adjacent sub-images ranges from 0.1 to 0.5, and three batches of sub-images corresponding to the three scale images are finally obtained.
Further, the sliding sequence of the sliding window on the scale image may be from left to right and from top to bottom; if the sliding window exceeds the scale image, generally to the right or at the bottom, the sliding window may be moved leftwards or upwards, respectively, back into the scale image for processing, so as to obtain a sub-image of the complete size.
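A minimal sketch of the sliding segmentation just described, assuming a square sliding window and integer pixel coordinates; the step length would in practice be chosen so that its ratio to the window width lies within the preset proportional range, and the function and variable names here are illustrative only:

```python
def slide_crop(scale_image, win_size, step):
    """Slide a win_size x win_size window over the scale image with the given
    step; if the window would cross the right/bottom boundary, translate it
    back inside the image so that every sub-image keeps the full window size."""
    h, w = scale_image.shape[:2]
    ys = list(range(0, max(h - win_size, 0) + 1, step))
    xs = list(range(0, max(w - win_size, 0) + 1, step))
    # Translate the last window back into the image if it would exceed the boundary.
    if ys[-1] + win_size < h:
        ys.append(h - win_size)
    if xs[-1] + win_size < w:
        xs.append(w - win_size)
    sub_images = []
    for y in ys:
        for x in xs:
            sub_images.append(((x, y), scale_image[y:y + win_size, x:x + win_size]))
    return sub_images  # list of ((x_offset, y_offset), sub_image)
```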
In an optional embodiment of the present application, the target detection model includes a Backbone module, an intermediate Neck module, and an output module;
inputting each sub-graph into a trained target detection model to obtain a detection result of the sub-graph, wherein the detection result comprises the following steps:
respectively extracting target features of the sub-graph through a Transformer layer and a deformable convolution DCN layer in the Backbone module; fusing the target feature information through the Neck module to obtain corresponding fusion features; and finally, respectively outputting corresponding initial detection results through the output module based on the fusion features corresponding to at least two network layers in the Neck module, and obtaining the detection result corresponding to the sub-graph based on the plurality of initial detection results.
The target detection model comprises a Backbone module, an intermediate Neck module and an output module, and after each sub-graph is input into the target detection model, the detection result of each sub-graph is output through the output module after the characteristics of the Backbone module and the Neck module are extracted in sequence.
For example, as shown in fig. 5, the function of the Backbone module (layers 1-12 in the figure) is to extract the features of pedestrian and vehicle targets in the input sub-graph (assuming the sub-graph size is img_size × img_size), and it mainly comprises Focus, C3 and SPP structures. The Backbone of YOLOv5x is modified in this example: a. the C3 structure in the 12th layer is optimized into a Transformer structure for extracting global information of the image; b. the traditional fixed-shape convolution in the C3 structures of the 3rd, 5th, 7th and 9th layers is replaced by Deformable Convolution (DCN); DCN adds learnable position offset parameters to the fixed-shape convolution region so that the sampling points of the convolution kernel spread into a non-grid shape, which better extracts target features with large differences in appearance and shape and with severe occlusion.
The function of the Neck module (layers 13-33 in the figure) is to fuse the target features extracted from each convolution layer in the Backbone module; in this example, four feature layers output from the 5th, 7th, 9th and 12th layers of the Backbone module (with feature map sizes img_size/8, img_size/16, img_size/32 and img_size/64, respectively) are selected for feature fusion according to the PANet network structure.
The function of the output module is to predict the class and coordinates of the targets in the feature maps. In this example, the class and coordinates of the targets in the output feature maps of the 24th, 27th, 30th and 33rd layers of the Neck module are predicted to obtain a plurality of target frames (corresponding to a plurality of initial detection results), and the target frames are then screened by a Weighted Box Fusion (WBF) method to obtain more accurate target classes and target frame coordinates (corresponding to the final detection result of the sub-graph).
Further, as shown in fig. 6, Flatten in the Transformer structure is a feature-map flattening operation, Linear Projection and MLP (Multilayer Perceptron) are implemented by fully connected layers, MultiHead Attention is the multi-head attention function implemented in the PyTorch deep learning framework (a Torch-based open-source Python machine learning library), and the addition node denotes a feature-map addition operation. The input of the Transformer is the feature map output by the 11th layer of the Backbone module; the input values q, k and v of the MultiHead Attention function are obtained after Flatten and Linear Projection, the output of MultiHead Attention is a weighted feature map, which is then summed with the flattened vector, and finally the Output is reshaped to obtain a feature map whose depth and width are consistent with the original input feature map.
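The following PyTorch sketch only illustrates the data flow just described (flatten, linear projection to q/k/v, multi-head attention, addition with the flattened input, MLP, reshape back); the class name, the number of heads and the MLP width are assumptions of the example and not the exact structure used in the model:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Sketch of the Transformer layer described above. `channels` must be
    divisible by `num_heads`."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=num_heads)
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                 nn.Linear(channels, channels))

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).permute(2, 0, 1)   # Flatten -> (H*W, B, C)
        attn_out, _ = self.attn(self.q(tokens), self.k(tokens), self.v(tokens))
        tokens = tokens + attn_out               # sum with the flattened input
        tokens = tokens + self.mlp(tokens)       # MLP with residual connection
        return tokens.permute(1, 2, 0).reshape(b, c, h, w)  # reshape back to (C, H, W)
```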
In an optional embodiment of the present application, the obtaining, based on the multiple initial detection results, a detection result corresponding to the sub-graph includes:
projecting all target frames corresponding to a plurality of initial detection results into the subgraph, and acquiring at least one group of target frames with first intersection ratio IOU not less than a first preset threshold value;
for each group of target frames with the first IOU not smaller than the first preset threshold, acquiring corresponding fusion target frames by utilizing a weighted frame fusion WBF algorithm based on the group of target frames with the first IOU not smaller than the first preset threshold;
and obtaining a detection result of the subgraph based on the fusion target frame corresponding to each group of target frames with the first IOU not less than the first preset threshold value and other target frames except the target frames with the first IOU not less than the first preset threshold value.
Specifically, all target frames corresponding to the plurality of initial detection results are projected into the same sub-image region, the first IOU of each group of overlapping target frames is then calculated, and each group of target frames whose first IOU is not less than the first preset threshold (the first preset threshold may be set to 0.7) is found. Then each such group of target frames is processed with the WBF algorithm to obtain a corresponding fused target frame. Finally, the fused target frames together with the target frames not processed by the WBF algorithm are the final target frames of the sub-graph, and the target frame coordinates, target frame confidences and target categories corresponding to these final target frames constitute the detection result of the sub-graph.
The WBF method performs weighted fusion on the confidences and coordinates of each group of target frames whose first IOU is not less than the first preset threshold (the first preset threshold may be set to 0.7) to obtain the final target frame confidence and coordinates, wherein the fusion weights are determined by the target frame confidences in the initial detection results.
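A simplified sketch of the grouping and weighted fusion described above. The full WBF algorithm processes boxes in order of confidence and also rescales the fused confidence, so this is only an illustration of the confidence-weighted averaging idea, with illustrative helper names and boxes assumed to be (x1, y1, x2, y2) tuples:

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def fuse_group(boxes, scores):
    """Confidence-weighted fusion of one group of overlapping boxes."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    w = scores / scores.sum()
    return (boxes * w[:, None]).sum(axis=0), scores.mean()

def group_and_fuse(boxes, scores, iou_thr=0.7):
    """Group boxes whose IOU with the group seed is >= iou_thr, fuse each group
    by weighted averaging, and keep boxes that belong to no group unchanged."""
    order = np.argsort(scores)[::-1]
    used, results = set(), []
    for i in order:
        if i in used:
            continue
        group = [i] + [j for j in order
                       if j not in used and j != i and iou(boxes[i], boxes[j]) >= iou_thr]
        used.update(group)
        if len(group) == 1:
            results.append((boxes[i], scores[i]))       # not processed by WBF
        else:
            results.append(fuse_group([boxes[j] for j in group],
                                      [scores[j] for j in group]))
    return results
```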
In an alternative embodiment of the present application, the trained target detection model is obtained by training as follows:
reducing at least one oversized pixel image sample marked with a detection result according to at least one preset multiple to obtain a corresponding scale image sample, and performing sliding segmentation on each scale image sample by using a sliding window with a first preset size to obtain a preset number of sub-image samples;
performing joint data enhancement on each sub-image sample to obtain a corresponding sub-image sample after data enhancement, and training an initial target detection model by using a preset amount of sub-image samples after data enhancement until a loss function meets a preset condition to obtain a trained target detection model;
wherein the loss function comprises a target frame coordinate loss sub-function and a quality focus loss QFL sub-function derived from the target classification loss sub-function and the target frame confidence loss sub-function.
Specifically, an initial target detection model is trained with sub-graph samples to obtain the trained target detection model. The sub-graph sample acquisition process comprises the following steps: firstly, labeling the super-large pixel image samples, namely labeling the target categories, target frame coordinates and target frame confidences. Then, the super-large pixel image samples are reduced according to a plurality of preset multiples to obtain a plurality of scale image samples. Finally, sliding segmentation is performed on the image samples of all scales with a sliding window to obtain a plurality of batches of sub-graph samples.
It can be understood that the reduction or sliding segmentation processing on the oversized pixel image sample in the sub-image sample acquisition process is the same as the reduction or sliding segmentation processing principle on the oversized pixel image to be detected in the model application process, and the adopted preset multiple and the size of the sliding window are also the same.
It should be noted that, if a target frame of a sub-graph sample obtained in the above manner is incomplete, a second IOU between the incomplete target frame on the sub-graph sample and the original target frame in the corresponding scale image sample is calculated, and target frames whose second IOU is not less than a second preset threshold (the second preset threshold may be set to 0.5) are retained. Target frames whose second IOU is smaller than the second preset threshold are masked; in other words, the image gray values in the target frame region are replaced with a uniform gray value (such as [0, 0, 0]), which is equivalent to erasing the target from the sub-graph sample.
It will be appreciated that masking is only required during the model training phase and is not required during the model application phase.
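A possible sketch of the IOU check and mask processing described above, reusing the iou helper from the fusion sketch above; it assumes integer pixel box coordinates, that both box lists are expressed in the sub-image coordinate system, and that the i-th truncated frame corresponds to the i-th projected original frame:

```python
def mask_incomplete_targets(sub_img, sub_boxes, original_boxes, iou_thr=0.5,
                            fill_value=(0, 0, 0)):
    """Keep a truncated target frame if its IOU with the projected original
    frame is >= iou_thr; otherwise erase the target region with a uniform value."""
    kept = []
    for box, orig in zip(sub_boxes, original_boxes):
        if iou(box, orig) >= iou_thr:
            kept.append(box)
        else:
            x1, y1, x2, y2 = box
            sub_img[y1:y2, x1:x2] = fill_value  # mask (erase) the target
    return sub_img, kept
```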
In addition, the sizes of the target frames marked in the oversized pixel image samples are counted, and the maximum sizes (max_w, max_h) of the target frames are obtained, so that the range of the first preset size of the sliding window can be set to (0.2-0.6) × (max_w, max_h), and the first preset size can be selected from this range.
Further, after multiple batches of sub-graph samples are obtained, the preset number of sub-graph samples are randomly split at a ratio of 3:1 to obtain training data and test data; joint data enhancement is performed on the training data; a pedestrian and vehicle target detection model (namely the initial target detection model) is trained using the official open-source YOLOv5x model and its training hyperparameters as the pre-training model and hyperparameters, with a random multi-scale input strategy, where the goal of model training is to minimize the calculated value of the loss function, and the model is tested on the test set after every certain number of iterations; the model with the maximum mAP (Mean Average Precision) value on the test set is selected as the optimal model.
Specifically, the total loss in model training consists of the target frame coordinate loss L_box, the target frame confidence loss L_obj and the target classification loss L_cls: Loss = L_box + L_obj + L_cls, where L_box is calculated using the CIOU loss, and this application optimizes L_obj and L_cls into the Quality Focal Loss (QFL), which is calculated as follows:
L_obj(σ), L_cls(σ) = -|y - σ|^β · ((1 - y)·log(1 - σ) + y·log(σ))
in the formula, y is an IOU value (value range is 0-1) of the predicted target frame and the real target frame, sigma is a confidence coefficient or a target classification probability of the predicted target frame, and a beta value can be 2. QFL support joint representation of target frame coordinate quality and target classification probability or target frame confidence while having Focal local balance positive and negative, difficult sample characteristics; the method avoids a true negative sample with lower classification probability, and is superior to the situation of a true sample with lower classification probability but lower position score in the candidate frame screening because an incredible and extremely high position score is predicted.
In an optional embodiment of the present application, performing joint data enhancement on each sub-graph sample to obtain a corresponding data-enhanced sub-graph sample includes:
obtaining a detection result that the target frame in each sub-image sample is not larger than a second preset size;
and copying a target frame corresponding to each detection result, performing translation and rotation on the target frame by a preset angle, and then pasting the target frame to a non-target area of the sub-image to obtain a corresponding sub-image sample with enhanced data.
Specifically, for each sub-graph sample, the detection results whose target frames are not larger than the second preset size are obtained, the target frames corresponding to these detection results are copied, translated and rotated by a preset angle, and then pasted to non-target areas of the sub-graph, so as to obtain the corresponding data-enhanced sub-graph sample, in which the target information is richer.
It should be noted that, in addition to the above operations, the joint data enhancement for the sub-graph may also include the data enhancement methods built into the conventional open-source YOLOv5 algorithm, such as image perturbation, changes to brightness, contrast, saturation and hue, noise addition, random scaling, random cropping, flipping, rotation and random erasing. In actual processing, these conventional data enhancement methods may be applied to the sub-graph sample first, and then the target frames not larger than the second preset size in each sub-graph sample are copied, translated and rotated, and pasted to non-target areas of the sub-graph.
For example, as shown in fig. 7, joint data enhancement is performed on the pedestrian and vehicle target frames smaller than 15 × 15 pixels (i.e., the second preset size) in the sub-graph sample, together with their label information. Specifically, as shown in the left image of fig. 7, the vehicle and pedestrian targets are copied, translated and rotated by a certain angle, and then pasted onto non-target areas of the original image, so as to obtain the enhanced sub-graph sample shown in the right image of fig. 7.
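A rough sketch of this copy-translate-rotate-paste enhancement, assuming integer pixel box coordinates and OpenCV for the rotation; the shift, angle and function name are illustrative, and the check that the paste location is actually a non-target area is omitted for brevity:

```python
import cv2

def copy_paste_small_targets(img, boxes, max_size=15, shift=(40, 40), angle=15):
    """Copy target patches no larger than max_size x max_size, rotate them by a
    preset angle, translate them by `shift`, and paste them back onto the image.
    Returns the enhanced image and the original plus new target frames."""
    h, w = img.shape[:2]
    out = img.copy()
    new_boxes = []
    for (x1, y1, x2, y2) in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw > max_size or bh > max_size:
            continue
        patch = img[y1:y2, x1:x2]
        rot = cv2.getRotationMatrix2D((bw / 2.0, bh / 2.0), angle, 1.0)
        patch = cv2.warpAffine(patch, rot, (bw, bh))   # rotate the copied target
        nx, ny = x1 + shift[0], y1 + shift[1]          # translate the paste location
        if nx + bw <= w and ny + bh <= h:
            out[ny:ny + bh, nx:nx + bw] = patch
            new_boxes.append((nx, ny, nx + bw, ny + bh))
    return out, boxes + new_boxes
```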
In an optional embodiment of the present application, the fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image includes:
splicing sub-images with detection results based on the segmentation mode of each scale image;
for a front sub-image in two adjacent front and rear sub-images, if a target frame corresponding to the detection result of the front sub-image is positioned on the left side of a central line of an overlapping region of the two sub-images or is intersected with the central line, retaining the detection result, and if the target frame corresponding to the detection result of the front sub-image is positioned on the right side of the central line, discarding the detection result; and for the subsequent subgraph in the two adjacent front and rear subgraphs, if the target frame corresponding to the detection result of the subsequent subgraph is positioned on the right side of the central line, retaining the detection result, and if the target frame corresponding to the detection result of the subsequent subgraph is positioned on the left side of the central line or is intersected with the central line, discarding the detection result.
Specifically, for each scale image, after the detection results of all sub-images corresponding to the scale image are obtained, the sub-images with the detection results are spliced, and target frames in two adjacent sub-images before and after the sub-images are fused in the splicing process.
And the sequence and the position of sub-image splicing correspond to the sliding segmentation of the scale image.
For a front sub-image in two adjacent front and rear sub-images, if a target frame corresponding to the detection result of the front sub-image is located on the left side of the centerline of the overlapping region of the two sub-images or intersects the centerline, the detection result is retained, and if the target frame is located on the right side of the centerline, the detection result is discarded; for the rear sub-image of the two adjacent sub-images, if the target frame corresponding to its detection result is located on the right side of the centerline, the detection result is retained, and if it is located on the left side of the centerline or intersects the centerline, the detection result is discarded. For example, as shown in fig. 8, when sub-image A and sub-image B are spliced (assuming that both sub-images contain target frames, shown as dashed lines, in the overlapping region), sub-image A retains the detection results of target frames (i) and (ii) and discards the detection result of target frame (iii), while sub-image B retains the detection result of target frame (iii) and discards the detection results of target frames (i) and (ii); the fusion manner is the same for the other pairs of adjacent sub-images.
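For two horizontally adjacent sub-images whose detection results have already been mapped into the stitched scale-image coordinate system, the centerline rule above can be sketched as follows; boxes are assumed to be (x1, y1, x2, y2) tuples and the function name is illustrative:

```python
def fuse_adjacent(front_boxes, rear_boxes, overlap_center_x):
    """Centerline rule for two horizontally adjacent sub-images.
    Front sub-image: keep boxes left of, or crossing, the centerline (x1 <= center).
    Rear sub-image: keep only boxes entirely right of the centerline (x1 > center)."""
    kept = [b for b in front_boxes if b[0] <= overlap_center_x]
    kept += [b for b in rear_boxes if b[0] > overlap_center_x]
    return kept
```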
It should be noted that after the detection result of each scale image of the super-large pixel image is obtained, each scale image and the target frames therein are enlarged in proportion to the size of the super-large pixel image, then all the target frames of each scale image are projected into the same super-large pixel image region, and the third IOU corresponding to each group of mutually overlapping target frames is obtained. For each group of target frames whose third IOU is greater than a third preset threshold (the third preset threshold may be set to 0.7), the target frame with the maximum target frame confidence is retained and the other target frames are deleted. In addition, all target frames that do not belong to any group whose third IOU is greater than the third preset threshold are retained. The retained target frame coordinates, target frame confidences and target categories constitute the detection result of the super-large pixel image.
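A sketch of this cross-scale fusion, reusing the iou helper defined in the earlier fusion sketch; the list of (scale factor, boxes, confidences) triples, the 0.7 threshold and the function name are assumptions of the example:

```python
def fuse_scales(scale_results, iou_thr=0.7):
    """Scale each detection back to super-large-pixel-image coordinates, then for
    every group of overlapping boxes (IOU > iou_thr) keep only the box with the
    highest confidence. scale_results: list of (scale_factor, boxes, confidences)."""
    all_boxes, all_scores = [], []
    for s, boxes, scores in scale_results:
        for (x1, y1, x2, y2), conf in zip(boxes, scores):
            all_boxes.append((x1 / s, y1 / s, x2 / s, y2 / s))  # enlarge by 1/s
            all_scores.append(conf)
    order = sorted(range(len(all_boxes)), key=lambda i: all_scores[i], reverse=True)
    kept, used = [], set()
    for i in order:
        if i in used:
            continue
        kept.append((all_boxes[i], all_scores[i]))  # highest-confidence box of its group
        for j in order:
            if j not in used and j != i and iou(all_boxes[i], all_boxes[j]) > iou_thr:
                used.add(j)                          # delete the other boxes in the group
        used.add(i)
    return kept
```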
The solution of the embodiment of the present application is further described below by an example, as shown in fig. 9, an implementation process of the solution may be divided into an S1 data preprocessing module, an S2 model training module, and an S3 model inference module, where an execution process of each module may include:
s1 data preprocessing module:
s1.1, acquiring marked large scene image data, namely an oversized pixel image with marking information, in different scenes;
s1.2, carrying out statistical analysis on the labeled information in the data to obtain scale information of the pedestrian and vehicle targets;
and S1.3, performing parallel multi-scale sliding window cutting on the oversized pixel image data by taking the scale information of the target as a reference to obtain a multi-scale sub-image.
The S2 model training module:
s2.1, randomly segmenting the multi-scale subgraph according to the ratio of 3:1 to obtain training data and test data;
s2.2, performing combined data enhancement on the training data;
s2.3, training a pedestrian and vehicle target detection model using the official open-source YOLOv5x model and its training hyper-parameters as the pre-training model and hyper-parameters, with a random multi-scale input strategy, wherein the goal of model training is to minimize the calculated value of the loss function, and the model is tested on the test set after every certain number of iterations;
s2.4, selecting the model with the maximum mAP (Mean Average Precision) value on the test set as the optimal model to obtain the trained target detection model;
the S3 model inference module:
s3.1, obtaining a picture in a super-large pixel scene, namely obtaining the super-large pixel image to be detected;
S3.2, performing on-line multi-scale sliding window cutting on the super-large pixel picture in the manner described in S1.3 to obtain multi-scale sub-pictures;
s3.3, performing parallel inference on the multi-scale sub-pictures using the optimal model obtained in S2.4 to obtain candidate frames (namely target frames) of the targets in each sub-picture;
s3.4, screening the candidate frames by the WBF (Weighted Box Fusion) method to obtain the detection result of each sub-picture;
and S3.5, performing boundary fusion on the detection result of each sub-picture to obtain the detection result of the super-large pixel image.
Fig. 10 is a block diagram of a target detection apparatus according to an embodiment of the present application, and as shown in fig. 10, the apparatus 1000 may include: a scale image acquisition module 1001, a sub-image acquisition and detection module 1002, a first detection result fusion module 1003, and a second detection result fusion module 1004, wherein:
the scale image obtaining module 1001 is configured to obtain an oversized pixel image, and reduce the oversized pixel image by at least one preset multiple to obtain a corresponding scale image;
the sub-graph obtaining and detecting module 1002 is configured to perform sliding segmentation on each scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-graphs, and input each sub-graph into a trained target detection model to obtain a detection result of the sub-graph, where the trained target detection model is obtained by training a sub-graph sample labeled with a detection result, and the detection result includes a target frame coordinate, a target category, and a target frame confidence;
the first detection result fusion module 1003 is configured to fuse the detection results of the sub-images corresponding to each scale image to obtain a detection result corresponding to the scale image;
the second detection result fusion module 1004 is configured to fuse detection results of the images of each scale corresponding to the super-large pixel image to obtain a detection result of the super-large pixel image.
According to the scheme provided by the application, a plurality of scale images are obtained by reducing the super-large pixel image, a sliding window is adopted to perform sliding segmentation on each scale image to obtain a plurality of sub-images, each sub-image is input into a target detection model to obtain the detection result of each sub-image, then the detection results of all sub-images of each scale image are fused to obtain the detection result of the scale image, and finally the detection results of all scale images of the super-large pixel image are fused to obtain the detection result of the super-large pixel image. According to the scheme, the target detection model is adopted to detect the subgraphs after the super-large pixel image is reduced and segmented to obtain the detection result of each subgraph, and then the detection results of the super-large pixel image are obtained after multiple times of fusion based on the detection results of the subgraphs, so that the detection of the target in the super-large pixel image is realized.
In an optional embodiment of the present application, the sub-graph obtaining and detecting module is specifically configured to:
sliding the sliding window on the scale image according to a preset step length, wherein the corresponding area of the sliding window on the scale image is a sub-image after each sliding, and the ratio of the preset step length to the width of the sliding window is in a preset proportional range;
and if the sliding window exceeds the boundary of the scale image after the sliding once, translating the sliding window into the scale image to obtain a corresponding sub-image.
In an optional embodiment of the present application, the target detection model comprises a Backbone module, an intermediate Neck module and an output module; the sub-graph obtaining and detecting module is specifically configured to:
extracting target features of the sub-graph through a Transformer layer and a deformable convolution (DCN) layer in the Backbone module, respectively; fusing the extracted target features through the Neck module to obtain corresponding fusion features; outputting corresponding initial detection results through the output module based on fusion features corresponding to at least two network layers in the Neck module; and acquiring the detection result corresponding to the sub-graph based on the plurality of initial detection results.
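As an illustration of the deformable convolution part only, not of the full Backbone/Neck/output network described above, the following PyTorch sketch shows one common way a DCN block is wired: a plain convolution predicts the sampling offsets consumed by torchvision's DeformConv2d. The channel sizes, zero-initialised offsets and SiLU activation are assumptions, not details taken from this application.

```python
import torch
from torch import nn
from torchvision.ops import DeformConv2d

class DCNBlock(nn.Module):
    """Deformable-convolution block: a regular conv predicts (dx, dy) offsets for every
    kernel position, and DeformConv2d samples the input at those learned offsets
    before the usual normalisation and activation."""
    def __init__(self, in_ch, out_ch, k=3, stride=1, pad=1):
        super().__init__()
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, stride, pad)  # 2 offsets per kernel tap
        nn.init.zeros_(self.offset.weight)                         # start out as a plain conv
        nn.init.zeros_(self.offset.bias)
        self.dcn = DeformConv2d(in_ch, out_ch, k, stride, pad)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.dcn(x, self.offset(x))))

# Example: refine a 256-channel feature map from an earlier Backbone stage.
feat = torch.randn(1, 256, 64, 64)
out = DCNBlock(256, 256)(feat)   # shape preserved: (1, 256, 64, 64)
```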
In an optional embodiment of the present application, the sub-graph obtaining and detecting module is further configured to:
projecting all target frames corresponding to the plurality of initial detection results into the subgraph, and acquiring at least one group of target frames whose first intersection over union (IOU) is not less than a first preset threshold;
for each group of target frames whose first IOU is not less than the first preset threshold, acquiring a corresponding fusion target frame by using the weighted boxes fusion (WBF) algorithm based on that group of target frames;
and obtaining the detection result of the subgraph based on the fusion target frame corresponding to each such group and the other target frames that do not belong to any such group.
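A NumPy sketch of this grouping-and-fusion step is given below. The greedy same-class grouping, the 0.6 IOU threshold, and the (N, 6) box layout [x1, y1, x2, y2, confidence, class] are illustrative assumptions; the confidence-weighted averaging of coordinates is the core idea of the weighted boxes fusion algorithm.

```python
import numpy as np

def iou(a, b):
    """IOU of two boxes in [x1, y1, x2, y2] form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def wbf_merge(group):
    """Weighted boxes fusion of one group: confidence-weighted average of the
    coordinates and the averaged confidence."""
    group = np.asarray(group, dtype=float)
    w = group[:, 4:5]
    coords = (group[:, :4] * w).sum(0) / w.sum()
    return np.concatenate([coords, [w.mean()], [group[0, 5]]])

def fuse_initial_results(boxes, iou_thr=0.6):
    """Group same-class boxes from the detector's multiple output heads whose IOU is at
    least `iou_thr`, fuse each group with WBF, and keep the ungrouped boxes as they are."""
    boxes = [np.asarray(b, dtype=float) for b in boxes]
    fused, used = [], [False] * len(boxes)
    for i, bi in enumerate(boxes):
        if used[i]:
            continue
        group, used[i] = [bi], True
        for j in range(i + 1, len(boxes)):
            if not used[j] and bi[5] == boxes[j][5] and iou(bi, boxes[j]) >= iou_thr:
                group.append(boxes[j])
                used[j] = True
        fused.append(wbf_merge(group) if len(group) > 1 else bi)
    return np.vstack(fused) if fused else np.zeros((0, 6))
```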
In an optional embodiment of the present application, the apparatus further comprises a training module for:
reducing at least one oversized pixel image sample marked with a detection result according to at least one preset multiple to obtain a corresponding scale image sample, and performing sliding segmentation on each scale image sample by using a sliding window with a first preset size to obtain a preset number of sub-image samples;
performing joint data enhancement on each sub-image sample to obtain a corresponding sub-image sample after data enhancement, and training an initial target detection model by using a preset amount of sub-image samples after data enhancement until a loss function meets a preset condition to obtain a trained target detection model;
wherein the loss function comprises a target frame coordinate loss sub-function and a quality focal loss (QFL) sub-function, the latter being derived from the target classification loss sub-function and the target frame confidence loss sub-function.
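The quality focal loss mentioned above replaces the separate classification and box-confidence targets with a single soft, IoU-quality label. The PyTorch sketch below follows the published QFL formulation; the sum reduction and beta = 2 are assumptions, and how this application weights it against the target frame coordinate loss sub-function is not reproduced here.

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(pred_logits, quality_targets, beta=2.0):
    """Quality focal loss: binary cross-entropy against a soft quality target in [0, 1],
    scaled by |target - sigmoid(pred)|**beta so that well-predicted examples are
    down-weighted and hard examples dominate the gradient."""
    sigma = pred_logits.sigmoid()
    bce = F.binary_cross_entropy_with_logits(pred_logits, quality_targets, reduction="none")
    return ((quality_targets - sigma).abs().pow(beta) * bce).sum()

# Example: logits for 3 anchors x 2 classes; positives carry their IoU quality as target.
logits = torch.randn(3, 2)
targets = torch.tensor([[0.0, 0.83], [0.0, 0.0], [0.41, 0.0]])
loss = quality_focal_loss(logits, targets)
```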
In an optional embodiment of the present application, the training module is specifically configured to:
obtaining, for each sub-image sample, the detection results whose target frames are not larger than a second preset size;
and copying the target frame corresponding to each such detection result, translating it and rotating it by a preset angle, and pasting it into a non-target area of the sub-image to obtain the corresponding data-enhanced sub-image sample.
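A sketch of this copy-translate-rotate-paste enhancement for one sub-image sample follows. The size threshold, the fixed translation, the rotation angle, and the simple box-overlap test used to approximate a "non-target area" are illustrative assumptions.

```python
import cv2

def copy_paste_small_objects(img, boxes, max_size=32, shift=(60, 40), angle=15):
    """Copy each labelled patch whose box is no larger than `max_size` pixels, rotate it
    by a preset angle, and paste it at a translated position that stays inside the image
    and does not overlap any existing box. `boxes` holds (x1, y1, x2, y2, cls) tuples."""
    out = img.copy()
    h, w = img.shape[:2]
    new_boxes = []
    for (x1, y1, x2, y2, cls) in boxes:
        bw, bh = x2 - x1, y2 - y1
        if bw > max_size or bh > max_size:
            continue                                     # only small targets are duplicated
        patch = img[y1:y2, x1:x2]
        M = cv2.getRotationMatrix2D((bw / 2, bh / 2), angle, 1.0)
        patch = cv2.warpAffine(patch, M, (bw, bh))       # rotate the copied target patch
        nx, ny = x1 + shift[0], y1 + shift[1]            # preset translation
        if nx + bw > w or ny + bh > h:
            continue                                     # paste position must stay in the image
        if any(nx < bx2 and nx + bw > bx1 and ny < by2 and ny + bh > by1
               for (bx1, by1, bx2, by2, _) in boxes):
            continue                                     # keep pastes off existing targets
        out[ny:ny + bh, nx:nx + bw] = patch
        new_boxes.append((nx, ny, nx + bw, ny + bh, cls))
    return out, list(boxes) + new_boxes
```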
In an optional embodiment of the present application, the first detection result fusion module is specifically configured to:
splicing the sub-images carrying detection results according to the segmentation mode of each scale image;
for the front sub-image of two adjacent sub-images, retaining a detection result if its target frame lies on the left side of the central line of the overlapping region of the two sub-images or intersects the central line, and discarding the detection result if its target frame lies on the right side of the central line; and for the rear sub-image of the two adjacent sub-images, retaining a detection result if its target frame lies on the right side of the central line, and discarding the detection result if its target frame lies on the left side of the central line or intersects it.
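The retain/discard rule around the central line can be stated compactly in code. The sketch below assumes two horizontally adjacent sub-images whose detection boxes have already been mapped into scale-image coordinates, and handles only the horizontal direction; the vertical case is symmetric.

```python
def fuse_adjacent(front_boxes, rear_boxes, front_x0, rear_x0, win):
    """Merge detections of two horizontally adjacent sub-images. `front_x0`/`rear_x0`
    are the left edges of the two windows and `win` their width, so the overlap region
    spans [rear_x0, front_x0 + win] and its central line sits at the midpoint.
    The front sub-image keeps boxes left of the line or crossing it; the rear
    sub-image keeps boxes strictly right of the line. Boxes are [x1, y1, x2, y2, ...]."""
    centre = (rear_x0 + front_x0 + win) / 2.0
    kept = [b for b in front_boxes if b[0] <= centre]   # left of the line, or crossing it
    kept += [b for b in rear_boxes if b[0] > centre]    # strictly right of the line
    return kept
```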
Reference is now made to fig. 11, which is a block diagram illustrating an electronic device (e.g., a terminal device or a server performing the method illustrated in fig. 3) 1100 adapted to implement an embodiment of the present application. The electronic device in the embodiments of the present application may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), a wearable device, and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
The electronic device includes a memory and a processor: the memory stores a program for executing the methods of the above method embodiments, and the processor is configured to execute the program stored in the memory. The processor here may be referred to as the processing device 1101 described below, and the memory may include at least one of a Read Only Memory (ROM) 1102, a Random Access Memory (RAM) 1103, and a storage device 1108 described below, as follows:
As shown in fig. 11, the electronic device 1100 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 1101 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage means 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the electronic device 1100 are also stored. The processing device 1101, the ROM 1102, and the RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
Generally, the following devices may be connected to the I/O interface 1105: input devices 1106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 1107 including, for example, Liquid Crystal Displays (LCDs), speakers, vibrators, and the like; storage devices 1108, including, for example, magnetic tape, hard disk, etc.; and a communication device 1109. The communication means 1109 may allow the electronic device 1100 to communicate wirelessly or wiredly with other devices to exchange data. While fig. 11 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 1109, or installed from the storage device 1108, or installed from the ROM 1102. The computer program, when executed by the processing device 1101, performs the above-described functions defined in the methods of the embodiments of the present application.
It should be noted that the computer readable medium mentioned in the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
obtaining an oversized pixel image, and reducing the oversized pixel image by at least one preset multiple to obtain a corresponding scale image; for each scale image, performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-images, and inputting each sub-image into a trained target detection model to obtain a detection result of the sub-image, wherein the trained target detection model is obtained by training a sub-image sample marked with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient; fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image; and fusing the detection results of the images of all scales corresponding to the oversized pixel image to obtain the detection result of the oversized pixel image.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including but not limited to object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules or units described in the embodiments of the present application may be implemented by software or hardware. The name of a module or a unit does not in some cases constitute a limitation of the unit itself; for example, the scale image acquisition module may also be described as a "module that acquires a scale image".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific method implemented by the computer-readable medium described above when executed by the electronic device may refer to the corresponding process in the foregoing method embodiments, and will not be described herein again.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device performs the following:
obtaining an oversized pixel image, and reducing the oversized pixel image by at least one preset multiple to obtain a corresponding scale image; for each scale image, performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-images, and inputting each sub-image into a trained target detection model to obtain a detection result of the sub-image, wherein the trained target detection model is obtained by training a sub-image sample marked with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient; fusing the detection results of each subgraph corresponding to each scale image to obtain the detection result corresponding to the scale image; and fusing the detection results of the images of all scales corresponding to the oversized pixel image to obtain the detection result of the oversized pixel image.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A method of object detection, comprising:
obtaining an oversized pixel image, and reducing the oversized pixel image by at least one preset multiple to obtain a corresponding scale image;
for each scale image, performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-images, and inputting each sub-image into a trained target detection model to obtain a detection result of the sub-image, wherein the trained target detection model is obtained by training a sub-image sample marked with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient;
fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image;
and fusing the detection results of the images of all scales corresponding to the oversized pixel image to obtain the detection result of the oversized pixel image.
2. The method according to claim 1, wherein for each scale image, performing sliding segmentation on the scale image by using a sliding window with a first preset size to obtain a plurality of corresponding sub-graphs, includes:
sliding the sliding window on the scale image according to a preset step length, wherein the corresponding area of the sliding window on the scale image is a sub-image after each sliding, and the ratio of the preset step length to the width of the sliding window is in a preset proportion range;
and if the sliding window exceeds the boundary of the scale image after one sliding, translating the sliding window into the scale image to obtain a corresponding sub-image.
3. The method of claim 1, wherein the target detection model comprises a Backbone module, an intermediate Neck module, and an output module;
inputting each sub-graph into a trained target detection model to obtain a detection result of the sub-graph, wherein the detection result comprises the following steps:
extracting target features of the subgraph through a Transformer layer and a deformable convolution DCN layer in the Backbone module, respectively; fusing the extracted target features through the Neck module to obtain corresponding fusion features; and outputting corresponding initial detection results through the output module based on fusion features corresponding to at least two network layers in the Neck module, and acquiring the detection result corresponding to the subgraph based on the plurality of initial detection results.
4. The method of claim 3, wherein the obtaining the corresponding detection result of the sub-graph based on the plurality of initial detection results comprises:
projecting all target frames corresponding to the plurality of initial detection results into the subgraph, and acquiring at least one group of target frames whose first intersection over union (IOU) is not less than a first preset threshold;
for each group of target frames whose first IOU is not less than the first preset threshold, acquiring a corresponding fusion target frame by using a weighted boxes fusion (WBF) algorithm based on that group of target frames;
and obtaining the detection result of the subgraph based on the fusion target frame corresponding to each such group and the other target frames that do not belong to any such group.
5. The method of claim 1, wherein the trained object detection model is trained by:
reducing at least one oversized pixel image sample marked with a detection result according to at least one preset multiple to obtain a corresponding scale image sample, and performing sliding segmentation on each scale image sample by using a sliding window with a first preset size to obtain a preset number of sub-image samples;
performing joint data enhancement on each sub-image sample to obtain a corresponding data-enhanced sub-image sample, and training an initial target detection model by using the preset number of data-enhanced sub-image samples until a loss function meets a preset condition to obtain the trained target detection model;
wherein the loss function comprises a target frame coordinate loss sub-function and a quality focal loss (QFL) sub-function, the QFL sub-function being derived from the target classification loss sub-function and the target frame confidence loss sub-function.
6. The method of claim 5, wherein jointly data enhancing each sub-graph sample to obtain a corresponding data enhanced sub-graph sample comprises:
obtaining a detection result that the target frame in each sub-image sample is not larger than a second preset size;
and copying a target frame corresponding to each detection result, performing translation and rotation on the target frame by a preset angle, and then pasting the target frame to a non-target area of the sub-image to obtain a corresponding sub-image sample with enhanced data.
7. The method of claim 1, wherein the fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image comprises:
splicing sub-images with detection results based on the segmentation mode of each scale image;
for a front sub-image in two adjacent front and rear sub-images, if a target frame corresponding to the detection result of the front sub-image is positioned on the left side of a central line of an overlapping region of the two sub-images or is intersected with the central line, retaining the detection result, and if the target frame corresponding to the detection result of the front sub-image is positioned on the right side of the central line, discarding the detection result; and for a subsequent sub-image in the adjacent two sub-images, if the target frame corresponding to the detection result of the subsequent sub-image is positioned on the right side of the central line, retaining the detection result, and if the target frame corresponding to the detection result of the subsequent sub-image is positioned on the left side of the central line or is intersected with the central line, discarding the detection result.
8. An object detection device, comprising:
the scale image acquisition module is used for acquiring an oversized pixel image and reducing the oversized pixel image by at least one preset multiple to obtain a corresponding scale image;
the subgraph acquisition and detection module is used for performing sliding segmentation on each scale image by using a sliding window with a first preset size to obtain a plurality of corresponding subgraphs, inputting each subgraph into a trained target detection model to obtain a detection result of the subgraph, wherein the trained target detection model is obtained by training a subgraph sample marked with the detection result, and the detection result comprises a target frame coordinate, a target category and a target frame confidence coefficient;
the first detection result fusion module is used for fusing the detection results of the sub-images corresponding to each scale image to obtain the detection result corresponding to the scale image;
and the second detection result fusion module is used for fusing the detection results of the images of all scales corresponding to the oversized pixel image to obtain the detection result of the oversized pixel image.
9. An electronic device comprising a memory and a processor;
the memory has stored therein a computer program;
the processor for executing the computer program to implement the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202111562162.0A 2021-12-17 2021-12-17 Target detection method, target detection device, electronic equipment and computer-readable storage medium Pending CN114612872A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111562162.0A CN114612872A (en) 2021-12-17 2021-12-17 Target detection method, target detection device, electronic equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111562162.0A CN114612872A (en) 2021-12-17 2021-12-17 Target detection method, target detection device, electronic equipment and computer-readable storage medium

Publications (1)

Publication Number Publication Date
CN114612872A true CN114612872A (en) 2022-06-10

Family

ID=81857535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111562162.0A Pending CN114612872A (en) 2021-12-17 2021-12-17 Target detection method, target detection device, electronic equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN114612872A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821484A (en) * 2022-06-27 2022-07-29 广州辰创科技发展有限公司 Airport runway FOD image detection method, system and storage medium
CN115035119A (en) * 2022-08-12 2022-09-09 山东省计算中心(国家超级计算济南中心) Glass bottle bottom flaw image detection and removal device, system and method
CN115170536A (en) * 2022-07-22 2022-10-11 北京百度网讯科技有限公司 Image detection method, model training method and device
CN115546601A (en) * 2022-11-29 2022-12-30 城云科技(中国)有限公司 Multi-target recognition model and construction method, device and application thereof



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination