
Large-size image target detection method and device and computer readable storage medium

Info

Publication number
CN111242066A
CN111242066A
Authority
CN
China
Prior art keywords
image
small
candidate
original image
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010053891.2A
Other languages
Chinese (zh)
Other versions
CN111242066B (en)
Inventor
李荣春
周鑫
窦勇
姜晶菲
牛新
苏华友
乔鹏
潘衡岳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010053891.2A
Publication of CN111242066A
Application granted
Publication of CN111242066B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/20Scenes; Scene-specific elements in augmented reality scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a large-size image target detection method and device and a computer-readable storage medium. The method comprises the following steps: preprocessing an original image to obtain a plurality of groups of small images; inputting the groups of small images into a target detection network for batch target detection to obtain a detection result for each small image; screening the detection results of the small images; and mapping the screened detection results back to the original image to obtain a target detection result of the original image. By dividing the image into groups of small images, detecting the small images group by group through the target detection network, and mapping the detection results of the small images back to the original image to obtain the target detection result of the original image, the method achieves both high detection speed and high detection accuracy, and is particularly suitable for target detection in large-size images.

Description

Large-size image target detection method and device and computer readable storage medium
Technical Field
The application relates to the technical field of computer vision, in particular to a large-size image target detection method and device and a computer readable storage medium.
Background
Object detection is one of the main applications of computer vision: it accurately identifies targets of interest in an image. Target detection based on deep neural networks is fast, accurate, and efficient, and is widely applied in fields such as intelligent security, satellite remote sensing, and military reconnaissance.
In practical application scenarios of object detection, it is often necessary to search for objects of interest in a large-size image. A large-size image refers to an image containing more than one million pixels, where the pixel count is the product of width and height; for example, a 1000 × 1000 image contains exactly one million pixels. Target detection tasks on large-size pictures generally have the following characteristics:
(1) the image size is large, typically thousands to tens of thousands of pixels on a side; (2) the target size is small, typically tens of pixels; (3) the number of images is large, and target detection must be carried out on hundreds or even thousands of images.
Faster R-CNN is usually chosen as the detection network because of its good practical performance on large-size images. The Faster R-CNN detection network consists of a backbone network and a candidate region generation network: the backbone network extracts image features, while the candidate region generation network generates target candidate regions, screens target candidate frames, and produces the final detection result. The basic steps of detection are: (1) extract image features; (2) screen candidate regions; (3) perform regression correction on the candidate regions; (4) classify the targets.
At present, target detection tasks based on deep neural networks are generally implemented on GPU platforms. For large-size images, sending the whole image into the detection network at once runs into two problems: memory exhaustion and poor detection quality. On one hand, because the image is large, forward inference consumes a large amount of memory, and with limited GPU memory the whole computation is difficult to complete. On the other hand, if the whole picture is sent into the detection network, operations such as pooling gradually shrink the feature map after feature extraction, so the number of pixels occupied by an actual target on the feature map also shrinks, and accuracy drops sharply when candidate regions are generated. Therefore, conventional detection methods cannot be used for object detection in large-size images. In addition, target detection on large-size images is slow; in application scenarios with strict real-time requirements, existing detection methods cannot meet actual needs as image sizes and image counts grow.
Disclosure of Invention
The application aims to provide a large-size image target detection method and device and a computer readable storage medium. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
According to an aspect of an embodiment of the present application, there is provided a large-size image target detection method, including:
preprocessing an original image to obtain a plurality of groups of small images;
inputting the plurality of groups of small images into a target detection network for batch target detection to obtain a detection result of each small image;
screening the detection result of the small image;
and mapping the screened detection result back to the original image to obtain a target detection result of the original image.
Further, the preprocessing the original image to obtain a plurality of groups of small images includes:
dividing an original image into a plurality of small images;
and grouping the plurality of small images to obtain a plurality of groups of small images.
Further, the dividing the original image into a plurality of small images includes:
and dividing the original image into a plurality of small images with the same size according to a preset dividing size and a preset overlapping area size.
Further, the target detection network comprises a faster regional convolutional neural network; the faster regional convolutional neural network comprises a backbone network, a region generation network, a Proposal layer, a region-of-interest pooling layer and a regression correction classification layer which are sequentially connected, wherein the region-of-interest pooling layer is connected with the backbone network.
Further, the inputting the plurality of groups of small images into a target detection network for batch target detection to obtain a detection result of each small image includes:
inputting each group of small images into the backbone network to obtain a characteristic diagram of each small image;
inputting the feature maps into the region generation network and the region of interest pooling layer respectively;
obtaining a candidate frame of each small image through the area generation network;
screening the candidate frame through the Proposal layer;
inputting the screened candidate boxes into the region-of-interest pooling layer and simultaneously pooling the candidate boxes with the feature map of the small image;
and sending the pooled feature map and the candidate frame to a regression correction classification layer together to obtain the detection result of each small image.
Further, the screening the candidate box through the Proposal layer includes:
removing the candidate frames with confidence scores lower than a preset threshold; the confidence score is obtained through the region generation network;
sorting the remaining candidate frames by confidence score, and selecting the candidate frames ranked within a top preset number;
carrying out boundary processing on the top preset number of candidate frames, and eliminating the parts exceeding the image boundary;
screening the candidate frames subjected to boundary processing through a non-maximum suppression algorithm;
and storing the candidate frames obtained after screening.
Further, the screening the detection result of the small image includes:
deleting the candidate frames with the confidence scores lower than a preset threshold value from the detection result of the small image; the confidence score is obtained through the region generation network;
sorting the remaining candidate frames in the detection result of the small image by confidence score, and selecting the candidate frames ranked within a top preset number;
carrying out boundary processing on the selected candidate frames, and eliminating the parts exceeding the image boundary;
and screening the candidate frames subjected to boundary processing according to the confidence score and the intersection-over-union ratio by using a non-maximum suppression algorithm, retaining the candidate frames meeting a preset score threshold condition and a preset intersection-over-union threshold condition.
Further, the mapping the screened detection result back to the original image to obtain a target detection result of the original image includes:
mapping the candidate frames meeting the preset score threshold condition and the preset intersection-over-union threshold condition back to the original image according to their positions in the original image before segmentation, to obtain the coordinates of the candidate frames in the original image; and the target detection result of the original image comprises the coordinates of all the candidate frames in the original image.
Further, before inputting the screened candidate boxes into the region-of-interest pooling layer to be pooled simultaneously with the feature maps of the small images, the method further comprises: contiguously storing the detection results of the small images, and setting an index tag for the detection result of each small image.
According to another aspect of the embodiments of the present application, there is provided a large-size image target detection apparatus including:
the preprocessing module is used for preprocessing the original image to obtain a plurality of groups of small images;
the detection module is used for inputting the plurality of groups of small images into a target detection network for batch target detection to obtain the detection result of each small image;
the screening module is used for screening the detection result of the small image;
and the mapping module is used for mapping the screened detection result back to the original image to obtain a target detection result of the original image.
According to another aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program, which is executed by a processor, to implement the above-described large-size image object detection method.
The technical scheme provided by one aspect of the embodiment of the application can have the following beneficial effects:
according to the large-size image target detection method provided by the embodiment of the application, the image is divided into a plurality of groups of small images, the small images are detected in groups through the target detection network, the detection result of the small images is mapped back to the original image, the target detection result of the original image is obtained, the target detection speed of the image is high, the detection result accuracy is high, and the method is particularly suitable for the target detection of the large-size image.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application, or may be learned by the practice of the embodiments. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 shows a flow chart of a large-size image target detection method according to an embodiment of the present application;
FIG. 2 is a block diagram illustrating a large-size image target detection apparatus according to an embodiment of the present application;
FIG. 3 is a flow chart of a large-size image target detection method according to another embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the steps of processing candidate blocks and feature maps of the ROI pooling layer and the regression correction classification layer in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
When the Faster R-CNN network is used for target detection, the common practice is to send pictures into the network one by one for detection, but in some application scenarios a large number of pictures must be processed rapidly at the same time. In the Faster R-CNN design, data is stored and organized in the NCHW form, where H and W represent the height and width of the image, C represents the number of channels, and N represents the number of pictures entering the network. This memory layout is widely adopted by frameworks that use GPUs for inference, such as Caffe, TensorFlow, and PyTorch. However, N has a different meaning in the layers following the ROI Pooling layer of the Faster R-CNN network; we denote it here as N_O. Since the ROI Pooling layer outputs a pooled feature map for each candidate region, the layout at this point is N_O×C×H×W with N_O = N×R, where N denotes the number of input detection pictures, R denotes the number of candidate regions per picture, and H and W are the dimensions of the pooled candidate-region feature maps.
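To make the two layouts concrete, here is a shape-only sketch; the batch size, channel counts, and region count are arbitrary illustrative values, not values fixed by this application:

```python
import torch

N, C, H, W = 4, 3, 1000, 1000        # four input pictures, standard NCHW layout
batch = torch.zeros(N, C, H, W)

R = 300                               # candidate regions per picture
pool_h = pool_w = 7                   # pooled candidate-region size
N_O = N * R                           # the leading dimension after ROI pooling
# After ROI pooling the layout is N_O x C' x H' x W', where C' is the
# backbone's output channel count (512 for VGG16) and H' x W' is the pool size.
pooled = torch.zeros(N_O, 512, pool_h, pool_w)
```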
At present, batch detection is realized by sending a plurality of pictures into the detection network, extracting features through the backbone network, obtaining a score for the frame corresponding to each anchor after the feature maps pass through the RPN, screening the candidate frames of each picture in the Proposal layer according to the scores and the IoU condition, and then sending the screening result to the ROI Pooling layer for subsequent processing. When the Proposal layer screens the different images of a batch separately, the positions and number of the obtained candidate frames necessarily differ, because the image content of each image differs. The general method is to set a maximum candidate-frame count, ROI_max, for each picture, ensuring that the number of candidate frames obtained by every picture stays within this maximum. Since the coordinate values of the candidate frames generally occupy a contiguous storage space, this has the advantage that the address of each picture's candidate frame coordinates can be conveniently found from the maximum value in the ROI Pooling layer for the ROI pooling calculation. From the foregoing, the ROI-pooled feature map dimension is N_O×C×H×W with N_O = N×ROI_max. This batch detection mode ensures that all candidate frames in every image can be sent to the subsequent network without obstruction. However, it is also very memory- and time-consuming when batch-detecting large-size images, since not all pictures produce enough candidate frames to reach ROI_max: redundant invalid candidate boxes are sent into the subsequent network, wasting storage space and computing resources. Although ROI_max could be reduced to save storage and computation, the number of candidate frames in an image is not known before the result is obtained, and an unreasonable ROI_max value easily causes omissions, thereby reducing detection accuracy. Therefore, the batch detection method must be improved to meet the requirements of large-size image batch detection.
As shown in fig. 1, an embodiment of the present application provides a large-size image target detection method 01, including:
and S10, preprocessing the original image to obtain a plurality of groups of small images.
In certain embodiments, step S10 includes:
S101, dividing an original image into a plurality of small images;
S102, grouping the small images to obtain a plurality of groups of small images.
The dividing of the original image into a plurality of small images includes:
and dividing the original image into a plurality of small images with the same size according to a preset dividing size and a preset overlapping area size.
In some embodiments, preprocessing the original image mainly comprises segmenting and grouping the large-size image. The original image is a large-size image, i.e., an image with more than one million pixels. Segmentation divides the large-size image into a plurality of small images of the same size according to a fixed size; an appropriate segmentation size and overlapping-area size must be chosen. The overlapping region is set during segmentation so that an actual target cut by a segmentation boundary is not missed: when the image is divided in the horizontal and vertical directions, each subsequent segmented region overlaps the previous one to a certain extent. Grouping then groups the segmented small images; the number of small images in each group is determined by the number of images the GPU can detect simultaneously. The small images of one group are taken as one batch and sent into the target detection network together for detection.
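As an illustration of this segmentation-and-grouping step, the following is a minimal sketch, assuming a tile size of 1000, an overlap of 200, and a batch size of 8 (all illustrative values, not values fixed by this application):

```python
import numpy as np

def split_image(image, tile=1000, overlap=200):
    """Cut an H x W x C image into fixed-size overlapping tiles.

    Returns the tiles plus the (x, y) offset of each tile in the
    original image, which is needed later to map detections back.
    """
    stride = tile - overlap
    h, w = image.shape[:2]
    tiles, offsets = [], []
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            # Clamp so border tiles keep the full tile size.
            y0, x0 = min(y, max(h - tile, 0)), min(x, max(w - tile, 0))
            tiles.append(image[y0:y0 + tile, x0:x0 + tile])
            offsets.append((x0, y0))
    return tiles, offsets

def group_tiles(tiles, offsets, batch_size=8):
    """Group tiles into batches sized to what the GPU can detect at once."""
    for i in range(0, len(tiles), batch_size):
        yield (np.stack(tiles[i:i + batch_size]),   # N x H x W x C batch
               offsets[i:i + batch_size])
```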
And S20, inputting the groups of small images, one group at a time, into a deep neural network for batch target detection, and obtaining the detection result of each small image.
The deep neural network is a target detection network based on Faster R-CNN (Faster area convolutional neural network), and comprises a class of target detection networks using different backbones.
The Faster R-CNN (faster regional convolutional neural network) of the present embodiment includes a backbone network, an RPN (region generation network), a Proposal layer, an ROI pooling layer (region-of-interest pooling layer), and a regression correction classification layer, which are connected in sequence, where the ROI pooling layer is connected to the backbone network. The backbone network may be a VGG16 network. The regression correction classification layer comprises a plurality of convolution layers.
Batch detection can process a plurality of images simultaneously and obtain their detection results simultaneously. In the initial stage of detection, a group of small images is sent to the detection network: the image data of the small images are sent to GPU memory together and participate in neural network operations such as convolution and pooling, including the operations of the Proposal layer and the ROI Pooling layer (region-of-interest pooling layer). On the GPU, these computations occur within the same execution cycle of the forward inference process.
In certain embodiments, step S20 includes:
S201, inputting each group of small images into the backbone network to obtain a feature map of each small image;
S202, inputting the feature maps into the RPN layer and the ROI Pooling layer respectively;
S203, obtaining the candidate frames of each small image and a confidence score for each candidate frame through the RPN;
S204, performing primary screening on the candidate frame corresponding to each anchor point through the Proposal layer;
S205, inputting the candidate frames left after primary screening into the ROI Pooling layer of the faster regional convolutional neural network, and simultaneously performing the ROI pooling operation on the candidate frames of the small images and the feature maps of the small images;
S206, sending the feature maps of the candidate regions after the ROI pooling operation to the regression correction classification layer together, finally obtaining the detection result of each small image.
The detection result of each small image comprises a feature map, a candidate box and a confidence score of the candidate box corresponding to each small image.
In the operation of the Proposal layer of the Faster R-CNN network, the candidate frames of each small image are screened separately according to the confidence scores obtained through the RPN layer. The primary screening of the candidate frame corresponding to each anchor point through the Proposal layer comprises the following steps (a code sketch follows this list):
First, remove the candidate frames whose confidence scores are below a preset threshold. The threshold may be set in advance before detection starts.
Second, sort the remaining candidate frames by confidence score and select the top S_max candidate frames; the value of S_max may be preset before detection starts.
Third, perform boundary processing on these top S_max candidate frames, clipping the parts that exceed the image boundary.
Fourth, screen the confidence scores and IoU (Intersection-over-Union) values of all candidate frames of each small picture processed by the above steps using the Non-Maximum Suppression (NMS) algorithm. Both the score threshold and the IoU threshold for this screening may be preset before detection.
Fifth, store the candidate frames obtained in the fourth step in a contiguous storage space for use by the subsequent calculation process.
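As a concrete illustration of these five steps, here is a hedged sketch using PyTorch and torchvision's NMS operator; the threshold values and S_max are placeholder assumptions, and boxes are assumed to be in [x1, y1, x2, y2] layout:

```python
import torch
from torchvision.ops import nms

def proposal_filter(boxes, scores, img_w, img_h,
                    score_thresh=0.05, s_max=300, iou_thresh=0.7):
    """Screen the candidate frames of one small image, mirroring the
    five Proposal-layer steps: score threshold, top-S_max selection,
    boundary clipping, NMS, and contiguous storage of the survivors."""
    # Step 1: drop frames whose confidence score is below the threshold.
    keep = scores >= score_thresh
    boxes, scores = boxes[keep], scores[keep]
    # Step 2: sort by score and keep the top S_max frames.
    order = scores.argsort(descending=True)[:s_max]
    boxes, scores = boxes[order], scores[order]
    # Step 3: clip the parts of each frame that exceed the image boundary.
    boxes[:, 0::2] = boxes[:, 0::2].clamp(0, img_w - 1)   # x1, x2
    boxes[:, 1::2] = boxes[:, 1::2].clamp(0, img_h - 1)   # y1, y2
    # Step 4: non-maximum suppression on the clipped frames.
    keep = nms(boxes, scores, iou_thresh)
    # Step 5: return the survivors contiguously for the next stage.
    return boxes[keep].contiguous(), scores[keep]
```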
In certain embodiments, the screening of the candidate boxes by the Proposal layer comprises:
S2041, removing the candidate frames with confidence scores lower than a preset threshold value;
S2042, sorting the remaining candidate frames by confidence score, and selecting the candidate frames ranked within a top preset number;
S2043, carrying out boundary processing on the top preset number of candidate frames, and eliminating the parts exceeding the image boundary;
S2044, screening the candidate frames subjected to boundary processing through a non-maximum suppression algorithm;
and S2045, storing the candidate frames obtained after screening.
In the operation of the ROI Pooling layer (region-of-interest pooling layer) in the Faster R-CNN network, all feature maps of each small image participate together in the ROI (region-of-interest) pooling operation. The ROI-pooled feature map of each small picture is sent to the subsequent regression correction network and classification network, and the detection results are finally obtained at the same time.
And S30, screening the detection result of the small image.
Step S30 is a second screening step, in which the candidate frames in the detection results are screened again by score against a preset threshold: candidate frames below the score threshold are removed, and the remaining candidate frames are further screened with the NMS algorithm. The method of this embodiment thus performs two screenings in total: the primary screening of the candidate frames in the Proposal layer, and the secondary screening of the detection results of the small images.
The screening of the detection result of the small image comprises the following steps:
S301, deleting the candidate frames with confidence scores lower than a preset threshold value from the detection result of the small image; the confidence score is obtained through the region generation network;
S302, sorting the remaining candidate frames in the detection result of the small image by confidence score, and selecting the candidate frames ranked within a top preset number;
S303, carrying out boundary processing on the selected candidate frames, and eliminating the parts exceeding the image boundary;
S304, screening the candidate frames subjected to boundary processing according to the confidence score and the intersection-over-union value by using a non-maximum suppression algorithm, retaining the candidate frames meeting a preset score threshold condition and a preset intersection-over-union threshold condition.
And S40, mapping the screened detection result back to the original image to obtain the target detection result of the original image.
The mapping is a coordinate mapping: the coordinates of the candidate regions obtained from each picture are mapped back to the original image according to the picture's position before segmentation, yielding the coordinates of the candidate frames in the original image and hence the target detection result of the original image, as sketched below.
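A minimal sketch of this coordinate mapping, assuming each tile's (x, y) offset was recorded during segmentation and boxes are in [x1, y1, x2, y2] layout:

```python
def map_back(boxes, tile_offset):
    """Map boxes detected in one tile back to original-image coordinates
    by adding the tile's (x, y) offset inside the original image."""
    x_off, y_off = tile_offset
    mapped = boxes.clone()
    mapped[:, 0::2] += x_off   # x1, x2
    mapped[:, 1::2] += y_off   # y1, y2
    return mapped
```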
The large-size image target detection method provided by the embodiment divides the image into a plurality of groups of small images, detects the small images in groups through the target detection network, and maps the detection results of the small images back to the original image to obtain the target detection results of the original image.
In some embodiments, before inputting the screened candidate boxes into the region-of-interest pooling layer to be pooled simultaneously with the feature maps of the small images, the large-size image target detection method further includes: contiguously storing the detection results of the small images, and setting an index tag for the detection result of each small image.
As shown in fig. 2, the present embodiment further provides a large-size image target detection apparatus, including:
the preprocessing module 100 is configured to preprocess an original image to obtain a plurality of groups of small images;
the detection module 200 is configured to input the plurality of groups of small images into a target detection network for batch target detection, so as to obtain a detection result of each small image;
a screening module 300, configured to screen a detection result of the small image;
and a mapping module 400, configured to map the screened detection result back to the original image to obtain a target detection result of the original image.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program which is executed by a processor to implement the above-described large-size image object detection method.
As shown in fig. 3, another embodiment of the present application provides a large-size image target detection method 02, including:
First, the large-size image is preprocessed and divided into small images; each small image is input into a deep neural network, and the detection results of a plurality of small images are obtained simultaneously using the batch detection method; finally, a post-processing step maps the result coordinates back to the original image to obtain the final detection result of the large image.
The small pictures are input into the backbone network for processing, which outputs a feature map; the feature map is input into the RPN (region generation network) to obtain candidate frames and the confidence score of each candidate frame.
The Proposal layer screens the candidate frame corresponding to each anchor point according to the confidence scores of the candidate frames obtained from the RPN layer, retaining the candidate frames most likely to contain a target. This includes several steps:
1) Screen once according to a confidence score threshold.
2) Sort the remaining candidate frames by confidence score, then select the top N according to the maximum candidate-frame number.
3) Screen overlapping candidate boxes with the NMS algorithm.
4) Send the candidate frames of the different small pictures to the next layer for calculation.
As shown in fig. 4, the ROI Pooling layer is used as follows: given the candidate frames and the image feature maps, the position in the corresponding feature map is located for each candidate frame and a pooling operation is performed there. Different candidate frames have different sizes, and ROI Pooling converts the feature map region corresponding to each candidate frame into a feature map of the same fixed size, which facilitates subsequent calculation.
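For reference, torchvision provides an ROI pooling operator of exactly this kind; the sketch below shows its use with an R × 5 box matrix. Note that torchvision expects the picture index in the first column, whereas the matrix described later in this embodiment places it last; all shapes and coordinates here are illustrative:

```python
import torch
from torchvision.ops import roi_pool

feats = torch.randn(2, 256, 64, 64)               # N x C x H x W feature maps
rois = torch.tensor([[0, 10., 10., 200., 120.],   # [batch_idx, x1, y1, x2, y2]
                     [0, 300., 40., 480., 310.],
                     [1, 50., 60., 150., 400.]])
# spatial_scale maps image coordinates onto the feature map (e.g. 1/16
# for a VGG16 backbone with four 2x poolings before this layer).
pooled = roi_pool(feats, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                               # torch.Size([3, 256, 7, 7])
```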
In some embodiments, the detection results of the different small images in the batch detection process are stored contiguously, and an index tag is set for the detection result of each small image to distinguish them. This step is performed after the screened results of the small pictures are obtained in the Proposal layer.
Generally speaking, the candidate frames of batch-detected pictures are stored by setting a maximum candidate-frame count. Because each small picture yields a different number of candidate frames, setting the maximum according to the small picture with the most candidate frames wastes space, and during computation the redundant slots of small pictures that do not reach the maximum also occupy computing resources. If the maximum is set too small, targets in small pictures whose candidate-frame count exceeds it are easily dropped, causing missed detections.
With the improved contiguous-storage method of this embodiment, storage follows the actual number of candidate frames of each picture, and an index identifying the small picture it belongs to is set for each candidate frame, which avoids both situations. The method reduces the amount of computation and the storage space, accelerates batch detection, and achieves a higher detection throughput under the same computing resources, while also avoiding the adverse effect of the candidate-frame maximum on detection accuracy.
Finally, when mapping back to the original image, the position in the original image must also be determined according to the index tag of each candidate frame.
In the Proposal layer, each input picture generates candidate frames; let the number of candidate frames generated by the n-th picture be r_n, where n = 1, 2, …, N. The inputs to the ROI Pooling layer are the ROIs (regions of interest) and the feature maps, of dimensions R×4 and N×c×h×w respectively, and the output is R×c×pool_h×pool_w. ROI Pooling generates a pooled feature map of the corresponding region for each candidate frame; the dimensions of these feature maps are pool_h×pool_w, and R represents the number of candidate regions.
In the case of batch detection, the calculation after the Proposal layer involves candidate frames generated by a plurality of pictures, and each picture yields a different number of them. If all the candidate frames were sent to the ROI Pooling layer at once, it could not be determined which picture's feature map each candidate frame should be pooled against during the ROI pooling operation. The simplest implementation is to set a maximum value ROI_max on the number of candidate frames generated by each picture; then, when ROI pooling is performed, the corresponding feature map can be determined according to ROI_max. Let the output of the Proposal layer be an ROI matrix of size N×ROI_max×4; the position of the ROIs corresponding to feature map n can then be determined as
Addr = n × ROI_max × 4;
and the output of ROI Pooling at this point is (N×ROI_max)×c×pool_h×pool_w. Although this method makes address calculation convenient for ROI Pooling and ensures that the number of output ROIs is the same for every picture, it wastes storage and computing resources: if the number of ROIs of a picture is much smaller than ROI_max, the excess output is still sent to the subsequent network for calculation.
For Faster R-CNN with VGG16 as the backbone, this way of uniformly setting the output size has limited influence, since VGG16 has only two fully connected layers after ROI Pooling. However, for Faster R-CNN with a backbone like ResNet-101, the redundant output significantly increases the amount of computation and thus seriously affects the computation speed, because when ResNet-101 serves as the backbone a significant portion of the convolution calculation lies after the ROI Pooling layer. If all these convolutions take N×ROI_max as the first dimension of their input, the computation involved is considerable; for example, with N = 8 and ROI_max = 300, the head always processes 2400 pooled regions even if the pictures actually produced far fewer candidate frames. ROI_max could be reduced to avoid unnecessary calculation, but, as noted above, an unreasonably small value causes missed detections.
For this case, the implementation of the Proposal layer is modified in the batch target detection algorithm. Because each picture generates a different number of candidate frames, feeding these candidate frames to the subsequent network separately is clearly inefficient. Therefore, the candidate frames that survive score sorting and NMS are marked, and the picture index corresponding to each candidate frame is recorded. The candidate frames are then placed in a candidate frame matrix of dimension R×5, where the first 4 numbers in a row represent the coordinates of the candidate frame and the last number represents the index of the picture it belongs to, with
R = r_1 + r_2 + … + r_N.
After this matrix is obtained, it is sent to ROI Pooling for further calculation. In addition to the candidate frame matrix, the Proposal layer also generates an index vector that records the number of candidate frames of each picture. During result post-processing, the final NMS screening can be performed according to this index vector, and the results are mapped back to the original picture by calculating their positions. A sketch of this packing step follows.
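A minimal sketch of this packing step (the function and variable names are illustrative, not from this application):

```python
import torch

def pack_proposals(per_image_boxes):
    """Pack variable-length candidate-frame lists into one R x 5 matrix.

    per_image_boxes: list of N tensors, the n-th of shape (r_n, 4).
    Returns the packed (R, 5) matrix with R = r_1 + ... + r_N, where the
    first 4 columns are box coordinates and the last column is the index
    of the picture each frame belongs to (note: torchvision's roi_pool
    expects this index in column 0 instead), plus the index vector that
    records how many candidate frames each picture produced.
    """
    rows = []
    for n, boxes in enumerate(per_image_boxes):
        idx = torch.full((boxes.shape[0], 1), float(n))
        rows.append(torch.cat([boxes, idx], dim=1))
    packed = torch.cat(rows, dim=0)                        # (R, 5)
    counts = torch.tensor([b.shape[0] for b in per_image_boxes])
    return packed, counts
```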
In the implementation of the ROI Pooling layer, since every candidate frame carries the index of its source picture, the corresponding feature map can be determined according to the index during the pooling operation. After this optimization, the output of the ROI Pooling layer becomes R×c×pool_h×pool_w, where in the general case
R < N×ROI_max.
therefore, the optimization method can greatly increase the calculation speed of batch detection.
It should be noted that:
the term "module" is not intended to be limited to a particular physical form. Depending on the particular application, a module may be implemented as hardware, firmware, software, and/or combinations thereof. Furthermore, different modules may share common components or even be implemented by the same component. There may or may not be clear boundaries between the various modules.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: this application is intended to cover such departures from the present disclosure as come within known or customary practice in the art to which this invention pertains. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the creation apparatus of a virtual machine according to embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
The above-mentioned embodiments only express the embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (11)

1. A large-size image target detection method is characterized by comprising the following steps:
preprocessing an original image to obtain a plurality of groups of small images;
inputting the plurality of groups of small images into a target detection network for batch target detection to obtain a detection result of each small image;
screening the detection result of the small image;
and mapping the screened detection result back to the original image to obtain a target detection result of the original image.
2. The method of claim 1, wherein pre-processing the original image to obtain a plurality of groups of small images comprises:
dividing an original image into a plurality of small images;
and grouping the plurality of small images to obtain a plurality of groups of small images.
3. The method of claim 2, wherein the segmenting the original image into a plurality of small images comprises:
and dividing the original image into a plurality of small images with the same size according to a preset dividing size and a preset overlapping area size.
4. The method of claim 1, wherein the target detection network comprises a faster regional convolutional neural network; the faster regional convolutional neural network comprises a backbone network, a region generation network, a Proposal layer, a region-of-interest pooling layer and a regression correction classification layer which are sequentially connected, wherein the region-of-interest pooling layer is connected with the backbone network.
5. The method according to claim 4, wherein the inputting the plurality of groups of small images into a target detection network for batch target detection to obtain a detection result of each small image comprises:
inputting each group of small images into the backbone network to obtain a characteristic diagram of each small image;
inputting the feature maps into the region generation network and the region of interest pooling layer respectively;
obtaining a candidate frame of each small image through the area generation network;
screening the candidate frame through the Proposal layer;
inputting the screened candidate boxes into the region-of-interest pooling layer and simultaneously pooling the candidate boxes with the feature map of the small image;
and sending the pooled feature map and the candidate frame to a regression correction classification layer together to obtain the detection result of each small image.
6. The method of claim 5, wherein the screening the candidate boxes through the Proposal layer comprises:
removing the candidate frames with confidence scores lower than a preset threshold; the confidence score is obtained through the region generation network;
sorting the remaining candidate frames by confidence score, and selecting the candidate frames ranked within a top preset number;
carrying out boundary processing on the top preset number of candidate frames, and eliminating the parts exceeding the image boundary;
screening the candidate frames subjected to boundary processing through a non-maximum suppression algorithm;
and storing the candidate frames obtained after screening.
7. The method of claim 5, wherein the screening the detection results of the small images comprises:
deleting the candidate frames with the confidence scores lower than a preset threshold value from the detection result of the small image; the confidence score is obtained through the region generation network;
sorting the remaining candidate frames in the detection result of the small image by confidence score, and selecting the candidate frames ranked within a top preset number;
carrying out boundary processing on the selected candidate frames, and eliminating the parts exceeding the image boundary;
and screening the candidate frames subjected to boundary processing according to the confidence score and the intersection-over-union ratio by using a non-maximum suppression algorithm, retaining the candidate frames meeting a preset score threshold condition and a preset intersection-over-union threshold condition.
8. The method of claim 7, wherein the mapping the filtered detection results back to the original image to obtain target detection results for the original image comprises:
mapping the candidate frames meeting the preset score threshold condition and the preset intersection-over-union threshold condition back to the original image according to their positions in the original image before segmentation, to obtain the coordinates of the candidate frames in the original image; and the target detection result of the original image comprises the coordinates of all the candidate frames in the original image.
9. The method of claim 1, wherein before inputting the screened candidate boxes into the region-of-interest pooling layer to be pooled simultaneously with the feature maps of the small images, the method further comprises: contiguously storing the detection results of the small images, and setting an index tag for the detection result of each small image.
10. A large-size image object detecting apparatus, comprising:
the preprocessing module is used for preprocessing the original image to obtain a plurality of groups of small images;
the detection module is used for inputting the plurality of groups of small images into a target detection network for batch target detection to obtain the detection result of each small image;
the screening module is used for screening the detection result of the small image;
and the mapping module is used for mapping the screened detection result back to the original image to obtain a target detection result of the original image.
11. A computer-readable storage medium having stored thereon a computer program, characterized in that the program is executed by a processor to implement the large-size image object detection method according to any one of claims 1 to 9.
CN202010053891.2A 2020-01-17 2020-01-17 Large-size image target detection method, device and computer readable storage medium Active CN111242066B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010053891.2A CN111242066B (en) 2020-01-17 2020-01-17 Large-size image target detection method, device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010053891.2A CN111242066B (en) 2020-01-17 2020-01-17 Large-size image target detection method, device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111242066A true CN111242066A (en) 2020-06-05
CN111242066B CN111242066B (en) 2023-09-05

Family

ID=70879519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010053891.2A Active CN111242066B (en) 2020-01-17 2020-01-17 Large-size image target detection method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111242066B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419451A (en) * 2020-12-04 2021-02-26 上海联影医疗科技股份有限公司 Image reconstruction method, device, equipment and storage medium
CN113643364A (en) * 2021-07-05 2021-11-12 珠海格力电器股份有限公司 Image target detection method, device and equipment
CN117218515A (en) * 2023-09-19 2023-12-12 人民网股份有限公司 Target detection method, device, computing equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106960195A (en) * 2017-03-27 2017-07-18 深圳市丰巨泰科电子有限公司 A kind of people counting method and device based on deep learning
CN107368845A (en) * 2017-06-15 2017-11-21 华南理工大学 A kind of Faster R CNN object detection methods based on optimization candidate region
US20180214105A1 (en) * 2017-01-31 2018-08-02 Siemens Healthcare Gmbh System and method breast cancer detection with x-ray imaging
CN109145872A (en) * 2018-09-20 2019-01-04 北京遥感设备研究所 A kind of SAR image Ship Target Detection method merged based on CFAR with Fast-RCNN
US20190065885A1 (en) * 2017-08-29 2019-02-28 Beijing Samsung Telecom R&D Center Object detection method and system
CN109583369A (en) * 2018-11-29 2019-04-05 北京邮电大学 A kind of target identification method and device based on target area segmentation network
CN109815868A (en) * 2019-01-15 2019-05-28 腾讯科技(深圳)有限公司 A kind of image object detection method, device and storage medium
US20190286932A1 (en) * 2018-03-14 2019-09-19 Adobe Inc. Detecting objects using a weakly supervised model
US20190347828A1 (en) * 2018-05-09 2019-11-14 Beijing Kuangshi Technology Co., Ltd. Target detection method, system, and non-volatile storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180214105A1 (en) * 2017-01-31 2018-08-02 Siemens Healthcare Gmbh System and method breast cancer detection with x-ray imaging
CN106960195A (en) * 2017-03-27 2017-07-18 深圳市丰巨泰科电子有限公司 A kind of people counting method and device based on deep learning
CN107368845A (en) * 2017-06-15 2017-11-21 华南理工大学 A kind of Faster R CNN object detection methods based on optimization candidate region
US20190065885A1 (en) * 2017-08-29 2019-02-28 Beijing Samsung Telecom R&D Center Object detection method and system
US20190286932A1 (en) * 2018-03-14 2019-09-19 Adobe Inc. Detecting objects using a weakly supervised model
US20190347828A1 (en) * 2018-05-09 2019-11-14 Beijing Kuangshi Technology Co., Ltd. Target detection method, system, and non-volatile storage medium
CN109145872A (en) * 2018-09-20 2019-01-04 北京遥感设备研究所 A kind of SAR image Ship Target Detection method merged based on CFAR with Fast-RCNN
CN109583369A (en) * 2018-11-29 2019-04-05 北京邮电大学 A kind of target identification method and device based on target area segmentation network
CN109815868A (en) * 2019-01-15 2019-05-28 腾讯科技(深圳)有限公司 A kind of image object detection method, device and storage medium

Non-Patent Citations (9)

* Cited by examiner, † Cited by third party
Title
DUYGU SARIKAYA ET AL.: "Detection and Localization of Robotic Tools in Robot-Assisted Surgery Videos Using Deep Neural Networks for Region Proposal and Detection", pages 1 - 9 *
JUEPENG ZHENG ET AL.: "Large-scale oil palm tree detection from high-resolution remote sensing images using faster-rcnn", IGARSS 2019 - 2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM, pages 1422 - 1425 *
KANG M ET AL.: "A modified faster R-CNN based on CFAR algorithm for SAR ship detection", pages 1 - 4 *
KAREN SIMONYAN ET AL.: "Very Deep Convolutional Networks for Large-Scale Image Recognition", ARXIV PREPRINT ARXIV:1409.1556V6, pages 1 - 14 *
LU JIAN ET AL. (卢健等): "A survey of object detection based on deep learning", Electronics Optics & Control, vol. 27, no. 05, pages 56 - 63 *
PANG ZHENGBIN ET AL. (庞征斌等): "A parallel computing model for large-size sliding-window applications", vol. 33, no. 02, pages 140 - 144 *
WANG KAI ET AL. (王凯等): "Small object detection in images based on improved Faster R-CNN", vol. 43, no. 20, pages 77 - 80 *
TAN ZHENYU ET AL. (谭振宇等): "An aircraft target detection algorithm combining non-top-level feature maps and adaptive thresholds", Journal of Geomatics Science and Technology (测绘科学技术学报), vol. 36, no. 04, pages 382 - 387 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419451A (en) * 2020-12-04 2021-02-26 上海联影医疗科技股份有限公司 Image reconstruction method, device, equipment and storage medium
CN112419451B (en) * 2020-12-04 2022-09-16 上海联影医疗科技股份有限公司 Image reconstruction method, device and equipment and storage medium
CN113643364A (en) * 2021-07-05 2021-11-12 珠海格力电器股份有限公司 Image target detection method, device and equipment
CN117218515A (en) * 2023-09-19 2023-12-12 人民网股份有限公司 Target detection method, device, computing equipment and storage medium
CN117218515B (en) * 2023-09-19 2024-05-03 人民网股份有限公司 Target detection method, device, computing equipment and storage medium

Also Published As

Publication number Publication date
CN111242066B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN110738207B (en) Character detection method for fusing character area edge information in character image
CN111242066A (en) Large-size image target detection method and device and computer readable storage medium
CN110991560B (en) Target detection method and system combining context information
CN109543662A (en) Object detection method, system, device and the storage medium proposed based on region
CN111310746B (en) Text line detection method, model training method, device, server and medium
CN112149694B (en) Image processing method, system, storage medium and terminal based on convolutional neural network pooling module
CN111368636A (en) Object classification method and device, computer equipment and storage medium
CN112883926B (en) Identification method and device for form medical images
CN106202086A (en) A kind of picture processing, acquisition methods, Apparatus and system
CN111401293A (en) Gesture recognition method based on Head lightweight Mask scanning R-CNN
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN111738133A (en) Model training method, target detection method, device, electronic equipment and readable storage medium
CN113705461A (en) Face definition detection method, device, equipment and storage medium
Sureshkumar et al. Deep learning framework for component identification
CN113158860A (en) Deep learning-based multi-dimensional output face quality evaluation method and electronic equipment
CN112907750A (en) Indoor scene layout estimation method and system based on convolutional neural network
CN112614108A (en) Method and device for detecting nodules in thyroid ultrasound image based on deep learning
CN116310688A (en) Target detection model based on cascade fusion, and construction method, device and application thereof
CN116091784A (en) Target tracking method, device and storage medium
CN114419078B (en) Surface defect region segmentation method and device based on convolutional neural network
CN113971764B (en) Remote sensing image small target detection method based on improvement YOLOv3
CN110969602B (en) Image definition detection method and device
CN114550062A (en) Method and device for determining moving object in image, electronic equipment and storage medium
Murata et al. Segmentation of Cell Membrane and Nucleus using Branches with Different Roles in Deep Neural Network.
CN113221731A (en) Multi-scale remote sensing image target detection method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant