CN117079217A - Target detection method and target detection device for wide-field image

Info

Publication number: CN117079217A
Application number: CN202311120122.XA
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Inventor: 王震 (Wang Zhen)
Current and original assignee: BOE Technology Group Co Ltd
Application filed by BOE Technology Group Co Ltd; priority to CN202311120122.XA

Classifications

    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion (under G06V 20/52, surveillance or monitoring of activities)
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Neural network learning methods
    • G06V 10/764: Image or video recognition using classification, e.g. of video objects
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition using neural networks
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands


Abstract

The invention discloses a target detection method and a target detection device for a wide-field image. The target detection method of an embodiment comprises the following steps: segmenting the wide-field image to be detected with sliding windows of at least two sizes to obtain sub-images of different sizes; inputting each sub-image of the different sizes into a preset target detection model and obtaining a detection result corresponding to each sub-image, wherein the detection result of a sub-image comprises a target frame for each detected target; and selecting a unique target frame for each target according to the detection results of the sub-images, then performing global fusion to generate the detection result of the wide-field image to be detected. By using sliding windows of several sizes together with target-frame de-duplication and global fusion, the method avoids the problems of incompletely cut targets and undetectable targets that arise when a fixed-size sliding window is used to cut the wide-field image, and effectively improves the accuracy of target detection in wide-field ultra-high-resolution images.

Description

Target detection method and target detection device for wide-field image
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection method and a target detection device for a wide-field image.
Background
Target detection is a basic task in computer vision. With the development of imaging technology, target detection in wide-field images has gradually become a technical hot spot, with important applications in large-range, long-distance multi-target visual analysis of stations, squares, schools and the like. Wide-field images, also known as wide-field ultra-high-resolution images, are typically acquired with a gigapixel camera, which simultaneously guarantees a wide field of view (covering up to 1 square kilometre of a natural scene) and ultra-high resolution (nearly a billion pixels per video frame). At the same time, the very large resolution, pronounced target scale variation and complex occlusion make target detection considerably more difficult. Traditional algorithms simply cut the large-resolution image and merge the detection results, so their detection accuracy is not high.
Disclosure of Invention
In order to solve at least one of the above problems, a first embodiment of the present invention provides a target detection method for a wide-field image, comprising:
segmenting the wide-field image to be detected with sliding windows of at least two sizes, respectively, to obtain sub-images of different sizes;
inputting each sub-image of the different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image, wherein the detection result of each sub-image comprises a target frame for each detected target;
and selecting a unique target frame for each target according to the detection results of the sub-images, and performing global fusion to generate the detection result of the wide-field image to be detected.
For example, in the target detection method provided by some embodiments of the present application, segmenting the wide-field image to be detected with sliding windows of at least two sizes to obtain sub-images of different sizes further comprises:
sliding each window size across the wide-field image to be detected with a step size corresponding one-to-one to that window, the area covered by the window at each position forming one sub-image, so that sub-images of different sizes are obtained.
For example, in the target detection method provided by some embodiments of the present application, the target detection model includes a multi-class detection sub-model and at least one trained single-class classification sub-model, and inputting each sub-image of the different sizes into the preset target detection model and obtaining the sub-image detection result corresponding to each sub-image further comprises:
performing target detection on each sub-image of the different sizes with the multi-class detection sub-model and outputting the corresponding multi-class detection results;
and performing target detection on each sub-image of the different sizes with the classification sub-model and outputting the corresponding single-class detection results.
For example, in the target detection method provided in some embodiments of the present application, before inputting each sub-image of the different sizes into the preset target detection model and obtaining the sub-image detection result corresponding to each sub-image, the target detection method further comprises:
training the multi-class detection sub-model with each class of target frame in the training set data to obtain the different classification sub-models, wherein the training set data comprises target frames annotated with their classes.
For example, in the target detection method provided in some embodiments of the present application, training the multi-class detection sub-model with each class of target frame in the training set data to obtain the different classification sub-models further comprises:
segmenting the annotated wide-field images in the training set data with a sliding window of a preset size to obtain a plurality of annotated sub-images for each annotated wide-field image;
obtaining, for each target frame located at an edge position of an annotated sub-image, the area intersection ratio between that frame and the corresponding target frame in the annotated wide-field image, and filtering out target frames whose area intersection ratio is smaller than a preset area intersection threshold;
and training the multi-class detection sub-model with the target frames of a single class from each annotated wide-field image to obtain the different classification sub-models.
For example, in the target detection method provided by some embodiments of the present application, the detection result further includes the coordinates of each target frame, and selecting a unique target frame for each target according to the detection results of the sub-images and performing global fusion to generate the detection result of the wide-field image to be detected further comprises:
restoring the coordinates of the target frames of each sub-image to the image coordinates of the wide-field image to be detected;
selecting, according to the image coordinates, the best-quality target frame of each target among the detection results of all sub-images as the unique target frame;
and performing global fusion with the unique target frames of the targets to generate the detection result of the wide-field image to be detected.
For example, in the target detection method provided by some embodiments of the present application, selecting, according to the image coordinates, the best-quality target frame of each target among the detection results of the sub-images as the unique target frame further comprises:
removing overlapping target frames with a non-maximum suppression algorithm;
and filtering partially overlapping target frames according to the ratio of the overlapping area of two target frames to the area of the smaller of the two frames.
For example, in the target detection method provided in some embodiments of the present application, filtering partially overlapping target frames according to the ratio of the overlapping area of two target frames to the area of the smaller frame further comprises:
judging, according to a preset length threshold and a preset distance threshold, whether two partially overlapping target frames belong to the same target based on the lengths of and distance between at least one pair of corresponding edges of the two frames.
For example, in the target detection method provided by some embodiments of the present application, generating the detection result of the wide-field image to be detected by global fusion with the unique target frame of each target further comprises:
performing global fusion with the unique target frame of each target to generate a global detection result;
and dividing the wide-field image to be detected into regions according to the global detection result, filtering part of the target frames according to the divided regions, and generating the detection result of the wide-field image to be detected.
For example, in the target detection method provided by some embodiments of the present application, the detection result further includes a confidence for each target frame. After inputting each sub-image of the different sizes into the preset target detection model and obtaining the sub-image detection results, and before selecting a unique target frame for each target and performing global fusion to generate the detection result of the wide-field image to be detected, the target detection method further comprises:
filtering, according to a preset edge threshold corresponding to the size of each sub-image, the target frames within the edge-threshold range of that sub-image whose confidence is lower than a preset confidence threshold.
For example, in the target detection method provided by some embodiments of the present application, the detection result further includes a confidence for each target frame. After inputting each sub-image of the different sizes into the preset target detection model and obtaining the sub-image detection results, and before selecting a unique target frame for each target and performing global fusion to generate the detection result of the wide-field image to be detected, the target detection method further comprises:
determining the region type of each sub-image according to a preset target-count threshold corresponding to the size of that sub-image;
and filtering the target frames in each sub-image according to a preset confidence detection threshold corresponding to the region type.
A second embodiment of the present invention provides a target detection apparatus for a wide-field image, comprising:
an image segmentation unit for segmenting the wide-field image to be detected with sliding windows of at least two sizes and obtaining sub-images of different sizes;
a target detection unit for inputting each sub-image of the different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image, the detection result of each sub-image comprising a target frame for each detected target;
and a global fusion unit for selecting a unique target frame for each target according to the detection results of the sub-images and performing global fusion to generate the detection result of the wide-field image to be detected.
A third embodiment of the invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in the first embodiment.
A fourth embodiment of the invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first embodiment when executing the program.
The beneficial effects of the invention are as follows:
Aiming at the existing problems, the invention establishes a target detection method and a target detection device for a wide-field image. The wide-field image to be detected is segmented with sliding windows of at least two sizes to obtain sub-images of different sizes; after each sub-image is detected with a preset target detection model and the corresponding sub-image detection result output, repeated target frames across the sub-image detection results are de-duplicated and the sub-images of different sizes are globally fused to obtain the detection result of the wide-field image to be detected. This avoids the problems in the related art of large targets being cut incompletely and small targets being undetectable when a fixed-size sliding window is used for image segmentation. Meanwhile, repeated target frames caused by segmentation at different sizes and by overlapping sliding windows are de-duplicated, and the unique target frame of each target of the wide-field image to be detected is used for global fusion to form its detection result, so the accuracy of target detection in wide-field ultra-high-resolution images is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. It is apparent that the drawings described below show only some embodiments of the present invention, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 shows a flow chart of a target detection method according to an embodiment of the invention;
FIG. 2 shows a schematic diagram of a city square scene according to an embodiment of the invention;
FIGS. 3a-3c show schematic diagrams of cropping with sliding windows of different sizes according to an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a sub-image containing a building according to an embodiment of the present invention;
FIG. 5 shows a schematic diagram of a sub-image containing a dense crowd according to an embodiment of the present invention;
FIG. 6 shows a schematic diagram of scene detection and region division according to an embodiment of the invention;
FIG. 7 shows a block diagram of the structure of the target detection apparatus according to an embodiment of the present invention;
FIG. 8 shows a schematic structural diagram of a computer device according to another embodiment of the present invention.
Detailed Description
In order to more clearly illustrate the present invention, the present invention will be further described with reference to preferred embodiments and the accompanying drawings. Like parts in the drawings are denoted by the same reference numerals. It is to be understood by persons skilled in the art that the following detailed description is illustrative and not restrictive, and that this invention is not limited to the details given herein.
In the related art, existing target detection techniques cannot be applied directly to ultra-high-resolution images: a wide-field image captures a large-scale scene that may contain thousands of people against a complex background; the scale of similar targets between the foreground and the background of such a multi-scale image can vary by a factor of more than 100; and even a distant target that can still be clearly distinguished may be more than 100 pixels tall in the ultra-high-resolution image.
In the related art, hardware limitations make it difficult to run inference directly on an ultra-high-resolution image, so a sliding window is generally used to split it into a number of smaller sub-images. Because a wide-field image contains targets of different sizes, a fixed-size sliding window causes the following problems: incomplete targets readily appear at cutting boundaries; a small sliding window cuts large targets incompletely and detects them inaccurately; and a large sliding window leaves small targets undetectable. These problems seriously affect the accuracy of target detection in the wide-field image. In other related art, the ultra-high-resolution image is first reduced at different reduction ratios and then cut with a sliding window; however, the reduction itself lowers the resolution of the image and thus the accuracy of target detection.
In view of the above, as shown in fig. 1, the inventors propose, through extensive study and experimentation, a target detection method for a wide-field image, comprising:
segmenting the wide-field image to be detected with sliding windows of at least two sizes, respectively, to obtain sub-images of different sizes;
inputting each sub-image of the different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image, wherein the detection result of each sub-image comprises a target frame for each detected target;
and selecting a unique target frame for each target according to the detection results of the sub-images, and performing global fusion to generate the detection result of the wide-field image to be detected.
In this embodiment, aiming at the problems in the related art, and in particular at targets of different scales in a wide-field image, sliding segmentation is performed with sliding windows of different sizes; this avoids the problems in the related art of large targets being segmented incompletely and small targets being undetectable when a fixed-size sliding window is used. Meanwhile, repeated target frames caused by segmentation at different sizes and by overlapping sliding windows are de-duplicated so that every target of the wide-field image to be detected has a unique target frame, and all target frames of all sub-images are then globally fused to form the detection result of the wide-field image to be detected, which effectively improves the accuracy of target detection in wide-field ultra-high-resolution images.
To further illustrate the embodiment of the present application, target detection on the wide-field image acquired in the square shown in fig. 2 is taken as an example. The method specifically comprises the following steps.
In a first step, the wide-field image to be detected is segmented with sliding windows of at least two different sizes, and sub-images of different sizes are obtained.
In this embodiment, sliding segmentation with windows of different sizes avoids the problems of large targets being cut incompletely and small targets being undetectable that arise when a fixed-size sliding window is used for image segmentation in the related art.
Specifically, in an alternative embodiment, sliding windows of different sizes are slid across the wide-field image to be detected with step sizes corresponding one-to-one to the windows, and the area covered by the window at each position forms one sub-image, so that sub-images of different sizes are formed.
In this embodiment, the scale of targets in the wide-field image to be detected varies widely; for example, the scale of similar targets in the foreground and background of the original image can differ by a factor of more than 100. With a small sliding window, background targets are detected accurately but large foreground targets cannot be detected; with a large sliding window, all targets can be detected, but the detection of the small, distant targets is poor. In this embodiment, the original image is therefore cropped with windows of three scales, 2048x2048 pixels, 4096x4096 pixels and 6146x6146 pixels, forming a plurality of sub-images at each of the three scales.
When cutting, an incomplete target can appear at a cutting boundary. To handle this, the sliding window moves from left to right and top to bottom with a fixed step size so that two adjacent sub-images partially overlap; a target that is incomplete at one cutting boundary then appears complete in the next sub-image, ensuring that every target of the wide-field image to be detected exists completely in at least one sub-image. Specifically, the step size of each sliding window and the corresponding window size satisfy a preset proportional relationship. The sliding windows of this embodiment are square, and the step size is half the window side length: the 2048x2048-pixel window uses a step of 1024 pixels, the 4096x4096-pixel window a step of 2048 pixels, and the 6146x6146-pixel window a step of 3073 pixels, so that two horizontally adjacent sub-images overlap by 50% and two vertically adjacent sub-images overlap by 50%. This solves the common problem of targets lying at, and being detected incompletely at, a cutting edge. As shown in figs. 3a-3c, which show images cropped at the three sliding-window scales, the proportion of the same target 10 relative to the whole sub-image changes significantly: in fig. 3a the sliding window is smallest and the target 10 occupies the largest proportion of the sub-image; in fig. 3b the window size is in the middle of the three sizes and the proportion of the target 10 is likewise intermediate; in fig. 3c the window is largest and the target 10 occupies the smallest proportion. By segmenting the wide-field image to be detected with sliding windows of different sizes, this embodiment gives the target detection model more recognition opportunities: targets of different sizes can be detected compatibly through sub-images cut at different window sizes, small targets being detected in sub-images cut with small windows and large targets in sub-images cut with large windows. This avoids the related-art problems of large targets being cut incompletely and small targets being undetectable with a fixed-size sliding window, and effectively improves the accuracy of target detection in the wide-field image to be detected.
It should be noted that the present application does not specifically limit the sliding-window sizes and step sizes; a person skilled in the art should select suitable window sizes and corresponding step sizes according to practical application requirements, which are not described herein.
Considering that the size of the wide-field image to be detected is generally not an integer multiple of the sliding-window size, in an alternative embodiment, when the sliding-window area near the edge of the wide-field image to be detected would exceed its boundary, the window is translated back inside the image. Specifically, when a 2048x2048-pixel window cuts the wide-field image to be detected from left to right and top to bottom with a translation distance of 1024 pixels, the top-left coordinates of each sub-image relative to the original image are recorded so that the sub-images can later be stitched globally. When the last column or row is reached and the remaining part of the original image is narrower than 2048 pixels on one side, a window measured 2048 pixels back from the edge of the original image is cropped instead. That is, in each row, the overlap between the sub-image of the last column and its neighbour is greater than or equal to the overlap between other adjacent sub-images; similarly, in each column, the overlap between the sub-image of the last row and its neighbour is greater than or equal to that of other adjacent sub-images. In this way all edge sub-images are complete images without padded pixel content.
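As a concrete illustration, the multi-scale cropping just described can be sketched in Python as follows. The function name and NumPy-based representation are illustrative assumptions, not part of the patent; the window sizes, half-side steps and border clamping follow the embodiment above.

```python
import numpy as np

# Window sizes from the embodiment above; the step is half the side length,
# giving 50% overlap between adjacent sub-images.
WINDOW_SIZES = (2048, 4096, 6146)

def sliding_crops(image: np.ndarray, win: int):
    """Yield (x0, y0, sub_image) crops of size win x win with 50% overlap.

    Near the right/bottom border the last window is clamped back inside the
    image so that edge sub-images are complete, with no padded pixels.
    """
    h, w = image.shape[:2]
    step = win // 2
    xs = list(range(0, max(w - win, 0) + 1, step))
    ys = list(range(0, max(h - win, 0) + 1, step))
    if w > win and xs[-1] + win < w:
        xs.append(w - win)  # final column flush with the right edge
    if h > win and ys[-1] + win < h:
        ys.append(h - win)  # final row flush with the bottom edge
    for y0 in ys:
        for x0 in xs:
            # (x0, y0) is the sub-image's top-left corner in the original
            # image; it is recorded for the later global-fusion step.
            yield x0, y0, image[y0:y0 + win, x0:x0 + win]

# Usage: sub-images at all three scales.
# crops = {win: list(sliding_crops(full_image, win)) for win in WINDOW_SIZES}
```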
In a second step, the sub-images of the different sizes are input into a preset target detection model, and sub-image detection results corresponding one-to-one to the sub-images are obtained, the detection result of each sub-image comprising a target frame for each detected target.
In this embodiment, each obtained sub-image of 2048x2048, 4096x4096 and 6146x6146 pixels is subjected to target detection with the target detection model, and the detection result of each sub-image is output, forming three sets of detection results for the three sub-image sizes. A sub-image detection result comprises the target frame of every target in the sub-image together with the coordinates, category and confidence of each frame: the coordinates are those of the target relative to the sub-image coordinate system, the category is the identified class of the target frame, and the confidence measures how reliable the identification is, a higher confidence meaning a more trustworthy detection.
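For illustration, one entry of such a detection result could be represented as in the following sketch; this container is a hypothetical convenience for the later examples, not a data structure defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    x1: float   # box corners in the sub-image coordinate system (pixels)
    y1: float
    x2: float
    y2: float
    cls: str    # identified category of the target frame, e.g. "person"
    conf: float # confidence in [0, 1]; higher means a more reliable detection
```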
To further improve the accuracy of target detection, an alternative embodiment specifically includes: first, performing target detection on each sub-image of the different sizes with the multi-class detection sub-model and outputting the corresponding multi-class detection results; second, performing target detection on each sub-image of the different sizes with each classification sub-model and outputting the corresponding single-class detection results.
In this embodiment, the multi-class detection sub-model is a general target detection model; for example, Yolov7 is an open-source target detection model that can classify several tens of target categories. In this embodiment Yolov7 is used as the multi-class detection sub-model to detect the targets of every sub-image and output a number of target frames, together with the coordinates, category and confidence of each frame. Meanwhile, to further improve accuracy, class-specific training is applied to the multi-class detection sub-model Yolov7; for example, the gigapixel-level video dataset PANDA released by Tsinghua University is used to train Yolov7 into several single-class classification sub-models: the target frames whose class is person are used to train a person classification sub-model specialised in identifying people, the target frames whose class is car are used to train a car classification sub-model specialised in identifying cars, and so on, forming classification sub-models specialised in various specific targets. The present application first performs target detection on every sub-image with the multi-class detection sub-model and then with every classification sub-model; that is, for every target of the wide-field image to be detected, the multi-class detection sub-model and at least one classification sub-model each perform detection and generate multiple target frames, which further improves the accuracy of target detection.
To improve the accuracy of target detection in the wide-field image, in an alternative embodiment, the class-specific training of open-source Yolov7 with target frames of different classes specifically includes:
training the multi-class detection sub-model with each class of target frame in the training set data to obtain the different classification sub-models, wherein the training set data comprises target frames annotated with their classes.
In this embodiment, Yolov7 is trained with target frames of different classes to obtain more targeted classification sub-models; for example, Yolov7 is trained with the target frames classified as person in the PANDA dataset. The training specifically comprises the following steps.
First, a sliding window of a preset size is used to segment the annotated wide-field images in the training set data, obtaining a plurality of annotated sub-images for each annotated wide-field image.
In this embodiment, because the pixel count of a wide-field image is far too large to feed into the model directly, the wide-field image is segmented at a fixed size to generate annotated sub-images the model can accept. For example, a 4096x4096-pixel sliding window with a step of 2048 pixels moves from left to right and top to bottom over the frame-annotated wide-field images of the PANDA dataset; when the window would move outside the image at the last column or row, a sub-image of the window size is cut flush with the image edge. This yields a plurality of annotated sub-images suitable for input to and training of the model.
Second, for each target frame located at an edge position of an annotated sub-image, the area intersection ratio between that frame and the corresponding target frame in the annotated wide-field image is obtained, and target frames whose area intersection ratio is smaller than a preset area intersection threshold are filtered out.
In this embodiment, since every target of an annotated wide-field image already has an annotated target frame, whether a target frame in a sub-image obtained by sliding segmentation can be used for training is decided by comparing the two frames of the same target. In other words, when preparing the annotated sub-images used for training, to guarantee the integrity of the training target frames, each target frame at an edge position of a sub-image is checked: when the area intersection ratio between the frame in the sub-image and the corresponding frame in the annotated wide-field image is greater than a predetermined threshold, for example greater than 80%, the frame is considered usable for training the target detection model, which improves the detection accuracy of the classification sub-model; otherwise the frame is filtered out of the annotated sub-image.
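A minimal sketch of this training-label filter, assuming axis-aligned (x1, y1, x2, y2) boxes; the function name is illustrative, and the 0.8 default follows the 80% example above.

```python
def keep_clipped_box(orig_box, crop_window, thresh=0.8):
    """Decide whether a ground-truth box clipped by a training crop is kept.

    orig_box:    (x1, y1, x2, y2) in full-image coordinates.
    crop_window: (wx1, wy1, wx2, wy2) crop rectangle in the same coordinates.
    A box is kept when its visible (clipped) area covers at least `thresh`
    of the original box area.
    """
    x1, y1, x2, y2 = orig_box
    wx1, wy1, wx2, wy2 = crop_window
    # Intersection of the box with the crop window = visible part of the box.
    ix1, iy1 = max(x1, wx1), max(y1, wy1)
    ix2, iy2 = min(x2, wx2), min(y2, wy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    orig = (x2 - x1) * (y2 - y1)
    return orig > 0 and inter / orig >= thresh
```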
Finally, the multi-class detection sub-model is trained with the target frames of one class from each annotated wide-field image, obtaining the different classification sub-models.
In this embodiment, Yolov7 is trained with target frames of the same class from the annotated wide-field images to improve the detection accuracy of the classification sub-model: for example, Yolov7 is trained with all target frames whose class is person to obtain a person classification sub-model specialised in identifying people, and then with all target frames whose class is car to form a car classification sub-model specialised in identifying cars.
Specifically, this embodiment uses GIoU (Generalized Intersection over Union), which generalises the intersection-over-union concept, for the loss function of Yolov7:

GIoU = IoU - |C \ (A ∪ B)| / |C|,  with IoU = |A ∩ B| / |A ∪ B|

wherein C is the minimum circumscribed rectangle of the target frame and the prediction frame, and A and B represent the target frame and the prediction frame. The corresponding GIoU loss function is

GIoU Loss = 1 - GIoU ∈ [0, 2]

GIoU loss accounts for the vanishing gradient caused by a non-overlapping region between the target frame and the prediction frame, and can yield prediction frames of higher precision than the MSE and IoU loss functions.
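The GIoU loss above can be computed for a pair of axis-aligned boxes as in the following sketch (a plain-Python illustration, not the Yolov7 training code):

```python
def giou_loss(a, b):
    """GIoU loss between two boxes (x1, y1, x2, y2).

    GIoU = IoU - |C \\ (A U B)| / |C|, where C is the smallest rectangle
    enclosing both boxes; the loss 1 - GIoU lies in [0, 2].
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0
    # Smallest enclosing rectangle C.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (c_area - union) / c_area if c_area > 0 else iou
    return 1.0 - giou
```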
In this embodiment, target frames of the same class from the plurality of fixed-size annotated sub-images are used as training set data to train Yolov7; training is adjusted, and its end determined, according to the GIoU loss, yielding the classification sub-models obtained by class-specific training of the multi-class detection sub-model Yolov7.
Considering that, when target detection is performed on the wide-field image to be detected, every sub-image produced by segmentation can contain target frames at edge positions, the target detection method further includes the following step to improve accuracy.
In a third step, target detection optimisation is performed on each sub-image.
In this embodiment, after each sub-image has been detected with the preset target detection model and its detection result output, and before the detection result of the wide-field image to be detected is generated by global fusion, each sub-image is further optimised to improve the target recognition accuracy for the wide-field image to be detected.
Specifically, according to a preset edge threshold corresponding to the size of each sub-image, the target frames within the edge-threshold range of that sub-image whose confidence is lower than a preset confidence threshold are filtered out.
In this embodiment, an edge threshold is set, meaning that target frames within a band near the edge of the sub-image are filtered. The edge threshold can be a fixed value or set in proportion to the sliding-window size. The confidence threshold is used to filter target frames: when the confidence of a frame in the detection result of a sub-image is greater than the confidence threshold, its integrity and reliability are considered acceptable; otherwise the frame is filtered out. For example, when a target at a sliding-window cutting edge includes only half of its image, the confidence of the frame identified by the target detection model is small; if it is below the confidence threshold, the frame is considered unreliable and discarded. In this way, frames that lie within the edge band of a sub-image and whose confidence is below the threshold receive no further processing, effectively preventing low-confidence frames from degrading detection accuracy.
For example, this embodiment uses a band of 50 pixels: target frames within 50 pixels of the top, bottom, left or right edge of a sub-image are examined, and a frame in this band whose confidence is below 0.8 is considered not to meet the recognition requirement and is discarded without further processing, which effectively improves the accuracy of target detection in the wide-field image to be detected.
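A minimal sketch of this edge filtering, assuming detections as (x1, y1, x2, y2, conf) tuples in sub-image coordinates; the 50-pixel band and 0.8 threshold are the example values quoted above.

```python
def filter_edge_boxes(dets, sub_w, sub_h, edge=50, conf_thresh=0.8):
    """Drop low-confidence boxes within `edge` pixels of a sub-image border."""
    kept = []
    for x1, y1, x2, y2, conf in dets:
        near_edge = (x1 < edge or y1 < edge
                     or x2 > sub_w - edge or y2 > sub_h - edge)
        if near_edge and conf < conf_thresh:
            # Likely a truncated target; thanks to the 50% overlap, a
            # complete copy of it exists in a neighbouring sub-image.
            continue
        kept.append((x1, y1, x2, y2, conf))
    return kept
```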
It should be noted that the present application does not specifically fix the edge threshold and the confidence threshold; a person skilled in the art should select suitable values according to the practical application so as to satisfy the design criterion of filtering edge-position targets in each sub-image, which is not described herein.
Considering that false detections occur, in an alternative embodiment, the region type of each sub-image is determined according to a preset target-count threshold corresponding to the size of that sub-image, and the target frames in each sub-image are then filtered according to a preset confidence detection threshold corresponding to the region type.
In this embodiment, a dynamic-threshold filtering method is used against false detections, which alleviates target misidentification, especially false detection at small scales. Specifically, take a sliding window of 2048x2048 pixels: some sub-images, for example in near regions of the wide-field image to be detected, contain a large number of buildings and no people, or only incomplete people, while other positions, for example in distant regions, contain dense crowds. Detecting people far away in the whole image works best at the 2048x2048-pixel scale, but at this scale many false positives arise because textures on nearby buildings are also falsely detected as people by the target detection model.
For this false-detection situation, illustrated in fig. 4 by a sub-image of the wide-field image that mainly contains a building with a large amount of building texture, this embodiment first counts the targets in the sub-image, for example the number of detected persons, determines the region type of the sub-image according to whether this count exceeds a threshold, then chooses a confidence threshold according to the region type and filters every detected target frame with it. For example, in a 2048x2048-pixel sub-image, when the target count is less than or equal to 100 the region is considered a non-crowd-dense region and the confidence threshold is set high, for example 0.8; a target frame is accepted as class person only when its confidence exceeds 0.8, otherwise it is discarded, effectively avoiding false detections. Conversely, when the target count is greater than 100 the region is considered crowd-dense and the confidence threshold is set low, for example 0.6; a frame is accepted as person only when its confidence exceeds 0.6, and otherwise discarded. The dynamic-threshold method of this embodiment thus determines region types from the number of target frames and adjusts the confidence threshold accordingly, filtering frames with different thresholds for different region types; this effectively removes false detections caused by building textures and the like in the sub-images, especially small-scale sub-images, and effectively improves the accuracy of target detection.
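The dynamic-threshold filtering for one sub-image can be sketched as follows; the count threshold of 100 and the 0.8/0.6 confidence thresholds are the example values above, and the function shape is an illustrative assumption.

```python
def dynamic_conf_filter(dets, count_thresh=100, hi=0.8, lo=0.6):
    """Dynamic-threshold filtering for one sub-image.

    dets: list of (box, conf) person detections. A sub-image with few
    detections is treated as a non-crowd-dense region (e.g. building
    facades), where a strict threshold suppresses texture false positives;
    a crowd-dense region uses a looser threshold.
    """
    thresh = hi if len(dets) <= count_thresh else lo
    return [(box, conf) for box, conf in dets if conf > thresh]
```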
In a fourth step, a unique target frame is selected for each target according to the detection results of the sub-images, and global fusion is performed to generate the detection result of the wide-field image to be detected.
In this embodiment, based on the obtained sub-image detection results, the multiple target frames of each target in the wide-field image to be detected are screened down to a unique frame, and global fusion is performed with the unique frames to generate the detection result of the wide-field image to be detected. This specifically comprises the following steps.
firstly, the coordinates of the target frame of each sub-image are restored to the image coordinates of the wide-field image to be detected.
In this embodiment, the restored coordinates of each target frame in the wide-field image to be detected are obtained according to the coordinates of the upper left corner of each sub-image, that is, the coordinates of each sub-image corresponding to the wide-field image to be detected, and the coordinates of each target frame in the sub-image relative to the sub-image. In other words, the coordinates of each target frame in the sub-image are converted into the restored coordinates in the wide-field-of-view image to be measured, that is, each target frame in each sub-image is restored into the wide-field-of-view image to be measured.
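This restoration amounts to translating each box by the sub-image's recorded top-left corner, as in the following sketch (the tuple layout is assumed for illustration):

```python
def to_global_coords(dets, crop_x0, crop_y0):
    """Map sub-image box coordinates back into the full wide-field image.

    dets: list of (x1, y1, x2, y2, cls, conf) in sub-image coordinates;
    (crop_x0, crop_y0) is the sub-image's top-left corner recorded when the
    sliding window was cropped.
    """
    return [(x1 + crop_x0, y1 + crop_y0, x2 + crop_x0, y2 + crop_y0, cls, conf)
            for x1, y1, x2, y2, cls, conf in dets]
```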
Second, according to the image coordinates, the best-quality target frame of each target among the detection results of all sub-images is selected as the unique target frame.
In this embodiment, according to the restored coordinates obtained in the previous step, every target in the wide-field image to be detected is covered by multiple target frames; the best-quality frame in the detection results is screened from these frames, and global fusion is performed with the best-quality frames in the following step, effectively improving the accuracy of target detection.
In an alternative embodiment, this selection further comprises the following.
First, overlapping target frames are removed with a non-maximum suppression (NMS) algorithm.
In this embodiment, conventional NMS removes most of the overlapping frames. However, in the application scenario of wide-field images, that is, gigapixel crowd-dense scenes as shown in fig. 5, the overlap between the frames of densely packed people is high, so the NMS threshold cannot be set very small; consequently, even slightly differing frames of the same target all survive, and in fig. 5 one target retains three frames 21, 22 and 23.
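For reference, conventional NMS over (x1, y1, x2, y2, conf) boxes can be sketched as follows; the threshold value is illustrative.

```python
def nms(dets, iou_thresh=0.5):
    """Conventional non-maximum suppression.

    Boxes are visited in descending confidence; a box is kept unless it
    overlaps an already-kept box above iou_thresh. As noted above, the
    threshold cannot be set very low in dense crowds.
    """
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    kept = []
    for d in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(d, k) <= iou_thresh for k in kept):
            kept.append(d)
    return kept
```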
Second, partially overlapping target frames are filtered according to the ratio (newIOU) of the overlapping area of two target frames to the area of the smaller of the two frames.
In this embodiment, after de-duplication with conventional NMS, a variant of NMS is used to further filter the partially overlapping frames. Specifically, the IoU value of conventional NMS is the ratio of the intersection area of two target frames to their union:

IoU = |A ∩ B| / |A ∪ B|

where A and B are two target frames. Computed with this conventional IoU, the pairwise intersection ratios of the three repeated frames 21, 22 and 23 shown in fig. 5 are small, so the chosen NMS threshold cannot filter the three repeated frames down to one.
The IoU value used in this embodiment is instead the ratio of the overlapping area of the two frames to the area of the smaller frame, which overcomes the failure of conventional NMS to filter the multiple frames produced by different target detection models. The newIOU used in this embodiment is

newIOU = |A ∩ B| / min(|A|, |B|)

wherein A and B are two target frames.
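A sketch of the newIOU computation, assuming (x1, y1, x2, y2) boxes:

```python
def new_iou(a, b):
    """Overlap normalised by the smaller box: |A ∩ B| / min(|A|, |B|).

    Unlike conventional IoU, this ratio is large whenever one box is mostly
    contained in the other, so near-duplicate frames produced by different
    sub-models and scales are caught.
    """
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    smaller = min((a[2] - a[0]) * (a[3] - a[1]),
                  (b[2] - b[0]) * (b[3] - b[1]))
    return inter / smaller if smaller > 0 else 0.0
```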
The newIOU adopted in this embodiment is able to filter out two of the three duplicate frames 21, 22 and 23 shown in fig. 5; however, of the two target frames 31 and 32 in the middle of fig. 5, which belong to an adult holding a child, one is then also incorrectly filtered out.
In view of the foregoing, in an optional embodiment, filtering partially overlapping target frames according to the ratio of their overlapping area to the area of the smaller frame further comprises: judging, on the basis of the intersection ratio of the two frames and according to a preset length threshold and a preset distance threshold, whether two partially overlapping frames belong to the same target from the length of and distance between at least one pair of corresponding edges.
In this embodiment, as shown at 31 and 32 in fig. 5, a constraint on the length of and distance between the "upper edges" of the two frames is added on top of the use of newIOU. Specifically, when newIOU > 0.8 is satisfied and the upper edges of the two frames have similar lengths, for example differing by less than 20 pixels, and lie close together, for example less than 10 pixels apart, the frame whose confidence is below the confidence threshold is filtered out; otherwise both frames are retained. Similarly, for the two frames detected for target 10 in fig. 3b, the lengths of and distance between their "lower edges" are examined to decide whether to filter or retain them.
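The upper-edge constraint can be sketched as follows; treating the edge distance as the vertical offset between the two upper edges is an assumption made for illustration, and the 0.8, 20-pixel and 10-pixel values are the examples quoted above.

```python
def same_target_by_top_edge(a, b, new_iou_val,
                            len_tol=20, dist_tol=10, iou_thresh=0.8):
    """Decide whether two heavily overlapping boxes are the same target.

    a, b: boxes (x1, y1, x2, y2). Requires newIOU > 0.8, similar top-edge
    lengths (difference < 20 px) and nearby top edges (offset < 10 px).
    """
    if new_iou_val <= iou_thresh:
        return False
    len_a, len_b = a[2] - a[0], b[2] - b[0]   # top-edge lengths
    if abs(len_a - len_b) >= len_tol:
        return False
    return abs(a[1] - b[1]) < dist_tol        # top-edge vertical offset

# Usage: if same_target_by_top_edge(a, b, new_iou(a, b)) holds, drop the
# lower-confidence frame of the pair; otherwise keep both.
```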
It should be noted that the present application does not specifically limit the edge-length and distance thresholds; a person skilled in the art should set them according to the practical application so as to satisfy the design criterion of accurately identifying partially overlapping target frames, which is not described herein.
Finally, global fusion is performed with the unique target frames of the targets to generate the detection result of the wide-field image to be detected.
In this embodiment, the unique frame screened out for each target in the wide-field image to be detected is used for global fusion; that is, the wide-field result is assembled from the unique frames, realising target detection of every target in the wide-field image.
To further improve the accuracy of target detection, in an alternative embodiment, global fusion with the unique target frame of each target first generates a global detection result; the wide-field image to be detected is then divided into regions according to the global detection result, part of the target frames are filtered according to the divided regions, and the detection result of the wide-field image to be detected is generated.
In this embodiment, the detection result is further optimised while generating the result of the wide-field image after global fusion with the unique frames. Specifically, fig. 2 shows an urban square scene whose rear part contains a large number of buildings, with further buildings, sky and the like in the distance. After global fusion, scene detection is run on the fused result and regions are divided; as shown in fig. 6, a building region and a sky region 40 are delimited. Considering that no person can appear in these two regions, the target frames classified as person inside the building region and the sky region 40 are filtered out of the global detection result generated by global fusion, and the filtered global result is taken as the detection result of the wide-field image to be detected. Falsely detected frames are thereby filtered out effectively, improving the accuracy of target detection in the wide-field image.
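A minimal sketch of this region-based filtering, assuming the divided regions are available as a boolean exclusion mask over the full image (the mask representation is an assumption, not defined by the patent):

```python
import numpy as np

def filter_by_region_mask(dets, mask):
    """Drop detections whose box centre falls in an excluded region.

    mask: boolean array over the full image, True where the class cannot
    appear (e.g. the building/sky regions from scene detection above).
    dets: list of (x1, y1, x2, y2, cls, conf) in full-image coordinates.
    """
    kept = []
    for x1, y1, x2, y2, cls, conf in dets:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        if not mask[cy, cx]:
            kept.append((x1, y1, x2, y2, cls, conf))
    return kept
```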
It should be noted that the scene detection performed on the global detection result after global fusion is not specifically limited; a person skilled in the art will understand that filtering follows the regions divided by scene detection, for example target frames classified as car would likewise be filtered in the building region and sky region 40 divided in the above embodiment. That is, target-frame filtering should be carried out according to the practical application so as to satisfy the design criterion of improving wide-field-image detection accuracy through scene-detection region division, which is not described herein.
This completes the detection result of target detection on the wide-field image acquired in the urban square scene. In this embodiment, the wide-field image to be detected is segmented with sliding windows of at least two sizes to obtain sub-images of different sizes; after each sub-image is detected with a preset target detection model and the corresponding sub-image detection result output, repeated target frames across the sub-image detection results are de-duplicated and all sub-images of different sizes are globally fused to obtain the detection result of the wide-field image to be detected. This avoids the related-art problems of large targets being cut incompletely and small targets being undetectable with a fixed-size sliding window. Meanwhile, repeated frames caused by segmentation at different sizes and by overlapping sliding windows are de-duplicated, and the unique frame of each target of the wide-field image to be detected is used in global fusion to form its detection result, so the accuracy of target detection in wide-field ultra-high-resolution images is effectively improved.
Corresponding to the target detection method provided in the foregoing embodiment, an embodiment of the present application further provides a target detection apparatus applying the target detection method, as shown in fig. 7, including: the image segmentation unit is used for respectively segmenting the wide-view-field image to be detected by using sliding windows with at least two sizes and obtaining each sub-image with different sizes; the target detection unit is used for inputting each sub-graph with different sizes into a preset target detection model respectively and obtaining sub-graph detection results corresponding to each sub-graph one by one, and the detection result of each sub-graph comprises a target frame for detecting each target; and the global fusion unit is used for selecting a unique target frame of each target according to the detection result of each sub-image and performing global fusion to generate the detection result of the wide-field image to be detected.
In this embodiment, to address the problems in the related art, and in particular targets of different scales in a wide-field image, the image segmentation unit performs sliding segmentation with sliding windows of different sizes; this avoids the problems of large targets being incompletely cut and small targets going undetected when a fixed-size sliding window is used for image segmentation. The target detection unit detects the sub-images cut at each size and generates sub-image detection results comprising the target frames. Meanwhile, for the repeated target frames in the sub-image detection results caused by segmentation at different sizes and by overlapping sliding windows, the global fusion unit performs selection, de-duplication, and global fusion, ensuring that each target of the wide-field image to be detected has a unique target frame; global fusion then forms the detection result of the wide-field image to be detected, which effectively improves the accuracy of target detection for wide-field, ultra-high-resolution images. Since the target detection device provided in this embodiment corresponds to the target detection method provided in the foregoing embodiments, those embodiments also apply to the device and are not repeated here.
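The de-duplication performed by the global fusion unit can be sketched as follows, combining standard non-maximum suppression with an intersection-over-minimum-area test for partially overlapping frames, in the spirit of claim 7. The function names and threshold values are illustrative assumptions.

def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def inter(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def dedup(boxes, scores, iou_thr=0.5, iom_thr=0.8):
    """Return indices of the unique target frames, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        duplicate = False
        for j in kept:
            ov = inter(boxes[i], boxes[j])
            iou = ov / (area(boxes[i]) + area(boxes[j]) - ov + 1e-9)
            iom = ov / (min(area(boxes[i]), area(boxes[j])) + 1e-9)
            # Same target seen in two sub-images: high IoU, or one frame
            # mostly contained in the other (a truncated edge detection).
            if iou > iou_thr or iom > iom_thr:
                duplicate = True
                break
        if not duplicate:
            kept.append(i)
    return kept

# Two frames for the same person from overlapping sub-images:
boxes = [(100, 100, 180, 300), (105, 100, 180, 240)]
scores = [0.92, 0.61]
print(dedup(boxes, scores))  # -> [0]: the higher-quality frame is kept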
Another embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements: segmenting the wide-field image to be detected with sliding windows of at least two sizes to obtain sub-images of different sizes; inputting each sub-image of different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image one by one, the detection result of each sub-image comprising a target frame for each detected target; and selecting a unique target frame for each target according to the detection results of the sub-images and performing global fusion to generate the detection result of the wide-field image to be detected.
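Purely as an illustration of how these three steps chain together, the hypothetical helpers sketched above (split_with_sliding_windows and dedup) can be combined as follows; the detection model is stubbed out, and the restoration of sub-image coordinates to wide-field image coordinates before de-duplication mirrors the step the method recites.

def run_pipeline(image_w, image_h, window_specs, detect_fn):
    """Split, detect per sub-image, restore coordinates, de-duplicate."""
    all_boxes, all_scores = [], []
    for (x1, y1, x2, y2) in split_with_sliding_windows(image_w, image_h, window_specs):
        for (bx1, by1, bx2, by2), score in detect_fn((x1, y1, x2, y2)):
            # Restore sub-image coordinates to wide-field image coordinates.
            all_boxes.append((bx1 + x1, by1 + y1, bx2 + x1, by2 + y1))
            all_scores.append(score)
    keep = dedup(all_boxes, all_scores)  # one unique frame per target
    return [(all_boxes[i], all_scores[i]) for i in keep]

def fake_detector(crop):
    # Stand-in for the preset target detection model: one dummy target
    # frame per sub-image, given in sub-image coordinates.
    return [((10, 10, 60, 120), 0.9)]

specs = [((1280, 1280), (960, 960)), ((640, 640), (480, 480))]
fused = run_pipeline(3840, 2160, specs, fake_detector)
print(len(fused))  # frames from sub-images sharing an origin are merged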
In practical applications, the computer-readable storage medium may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Fig. 8 is a schematic structural diagram of a computer device provided by another embodiment of the present invention. The computer device 12 shown in fig. 8 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 8, the computer device 12 takes the form of a general-purpose computing device. Components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 that connects the various system components (including the system memory 28 and the processing units 16).
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a networking environment. The program modules 42 generally carry out the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown in fig. 8, the network adapter 20 communicates with other modules of the computer device 12 via the bus 18. It should be appreciated that although not shown in fig. 8, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, thereby implementing, for example, the target detection method provided by an embodiment of the present invention.
It should be understood that the foregoing examples of the present invention are provided merely for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention, and that various other changes and modifications may be made therein by one skilled in the art without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (14)

1. A method for detecting targets in a wide-field image, comprising:
segmenting the wide-field image to be detected by using sliding windows of at least two sizes to obtain sub-images of different sizes;
inputting each sub-image of different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image one by one, wherein the detection result of each sub-image comprises a target frame for each detected target; and
selecting a unique target frame for each target according to the detection results of the sub-images, and performing global fusion to generate a detection result of the wide-field image to be detected.
2. The method according to claim 1, wherein the segmenting the wide-field image to be detected by using sliding windows of at least two sizes to obtain sub-images of different sizes further comprises:
sliding each sliding window of a different size over the wide-field image to be detected in accordance with a step size corresponding one-to-one to that sliding window to form the sub-images of different sizes, wherein each area swept by a sliding window in the wide-field image to be detected forms one sub-image.
3. The target detection method according to claim 1, wherein the target detection model comprises multiple kinds of detection sub-models and at least one trained classification sub-model, and the inputting each sub-image of different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image one by one further comprises:
performing target detection on each sub-image of different sizes by using the multiple kinds of detection sub-models and outputting corresponding multiple kinds of sub-image detection results; and
performing target detection on each sub-image of different sizes by using the classification sub-model and outputting a corresponding classification sub-image detection result.
4. The target detection method according to claim 3, wherein, before each sub-image of different sizes is input into the preset target detection model and the sub-image detection results corresponding one by one to the sub-images are obtained, the target detection method further comprises:
training the multiple kinds of detection sub-models by using each kind of target frame in training set data to obtain different classification sub-models, wherein the training set data comprises target frames annotated with their kinds.
5. The method according to claim 4, wherein the training the multiple kinds of detection sub-models by using each kind of target frame in the training set data to obtain different classification sub-models further comprises:
segmenting the annotated wide-field images in the training set data by using a sliding window of a preset size to obtain a plurality of annotated sub-images for each annotated wide-field image;
obtaining, for each annotated sub-image, the area intersection ratio between each target frame located at an edge position and the corresponding target frame in the corresponding annotated wide-field image, and filtering out target frames whose area intersection ratio is smaller than a preset area intersection threshold; and
training the multiple kinds of detection sub-models by using target frames of the same kind in each annotated wide-field image, respectively, to obtain the different classification sub-models.
6. The method according to claim 4, wherein the detection result further comprises coordinates of the target frame, and the selecting a unique target frame for each target according to the detection results of the sub-images and performing global fusion to generate the detection result of the wide-field image to be detected further comprises:
restoring the coordinates of the target frames of each sub-image to the image coordinates of the wide-field image to be detected;
selecting, according to the image coordinates, the target frame of the best quality for each target among the detection results of the sub-images as the unique target frame; and
performing global fusion by using the unique target frames of the targets to generate the detection result of the wide-field image to be detected.
7. The method according to claim 6, wherein the selecting, according to the image coordinates, the target frame of the best quality for each target among the detection results of the sub-images as the unique target frame further comprises:
removing overlapping target frames by using a non-maximum suppression algorithm; and
filtering partially overlapping target frames according to the ratio of the overlapping area of two target frames to the area of the smaller of the two target frames.
8. The method according to claim 6, wherein the filtering partially overlapping target frames according to the ratio of the overlapping area of two target frames to the area of the smaller of the two target frames further comprises:
judging whether two partially overlapping target frames exist according to the side length of at least one side of the two target frames and the distance between them, against a preset length threshold and a preset width threshold.
9. The target detection method according to claim 6, wherein the performing global fusion by using the unique target frame of each target to generate the detection result of the wide-field image to be detected further comprises:
performing global fusion by using the unique target frame of each target to generate a global detection result; and
carrying out region division on the wide-field image to be detected according to the global detection result, filtering part of the target frames according to the divided regions, and generating the detection result of the wide-field image to be detected.
10. The target detection method according to claim 1, wherein the detection result further comprises a confidence of the target frame, and, after the inputting each sub-image of different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image one by one, and before the selecting a unique target frame for each target according to the detection results of the sub-images and performing global fusion to generate the detection result of the wide-field image to be detected, the target detection method further comprises:
filtering, according to a preset edge threshold corresponding to the size of each sub-image, target frames in each sub-image whose confidence within the edge-threshold range is lower than a preset confidence threshold.
11. The target detection method according to claim 1, wherein the detection result further comprises a confidence of the target frame, and, after the inputting each sub-image of different sizes into a preset target detection model and obtaining a sub-image detection result corresponding to each sub-image one by one, and before the selecting a unique target frame for each target according to the detection results of the sub-images and performing global fusion to generate the detection result of the wide-field image to be detected, the target detection method further comprises:
determining a region type of each sub-image according to a preset target-quantity threshold corresponding to the size of each sub-image; and
filtering target frames in each sub-image according to a preset confidence detection threshold corresponding to the region type.
12. A target detection device for a wide-field image, comprising:
an image segmentation unit, configured to segment the wide-field image to be detected by using sliding windows of at least two sizes to obtain sub-images of different sizes;
a target detection unit, configured to input each sub-image of different sizes into a preset target detection model and obtain a sub-image detection result corresponding to each sub-image one by one, wherein the detection result of each sub-image comprises a target frame for each detected target; and
a global fusion unit, configured to select a unique target frame for each target according to the detection results of the sub-images and perform global fusion to generate the detection result of the wide-field image to be detected.
13. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-11.
14. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the method according to any one of claims 1-11.
CN202311120122.XA, filed 2023-08-31 (priority date 2023-08-31): Target detection method and target detection device for wide-field image, status Pending, publication CN117079217A (en)

Priority Applications (1)

Application Number: CN202311120122.XA
Priority Date: 2023-08-31
Filing Date: 2023-08-31
Title: Target detection method and target detection device for wide-field image

Publications (1)

Publication Number: CN117079217A

Family ID: 88705880

Country Status (1)

Country: CN
Link: CN117079217A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination