CN109829909B - Target detection method, device and storage medium - Google Patents

Target detection method, device and storage medium

Info

Publication number
CN109829909B
CN109829909B CN201910101098.2A CN201910101098A
Authority
CN
China
Prior art keywords
candidate
layer
frames
candidate frames
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910101098.2A
Other languages
Chinese (zh)
Other versions
CN109829909A (en)
Inventor
陈海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenlan Robot Shanghai Co ltd
Original Assignee
Deep Blue Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Deep Blue Technology Shanghai Co Ltd filed Critical Deep Blue Technology Shanghai Co Ltd
Priority to CN201910101098.2A priority Critical patent/CN109829909B/en
Publication of CN109829909A publication Critical patent/CN109829909A/en
Application granted granted Critical
Publication of CN109829909B publication Critical patent/CN109829909B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a target detection method, a target detection device and a storage medium, relates to the field of target detection, and aims to solve the problem that in the prior art the accuracy of target detection using the Faster R-CNN model cannot meet the detection requirement. The method comprises the following steps: an ensemble operation is performed on the candidate frames produced by the RPN layers of a plurality of Faster R-CNN models, and the ensembled candidate frames are input into the ROI pooling layer. An ensemble operation is then performed again on the output frames of the ROI pooling layers to obtain the final output result. In this way, more candidate frames and output frames can be obtained by performing ensemble operations at the RPN layer and the ROI pooling layer respectively, thereby improving the accuracy of target detection with Faster R-CNN.

Description

Target detection method, device and storage medium
Technical Field
The present application relates to the field of object detection, and in particular, to an object detection method, an object detection apparatus, and a storage medium.
Background
Target detection, also called target extraction, is image segmentation based on the geometric and statistical characteristics of targets; it combines the segmentation and identification of targets, and its accuracy and real-time performance are important capabilities of the whole system. Especially in complex scenes where multiple targets need to be processed in real time, automatic target extraction and identification are particularly important.
In order to realize target detection, Faster R-CNN (Faster Region-based Convolutional Neural Network) can be adopted to perform target detection on the image to be detected. However, in the prior art, the accuracy of target detection using Faster R-CNN cannot meet the detection requirement.
Disclosure of Invention
The embodiments of the application provide a target detection method, a target detection device and a storage medium, which are used for solving the problem that in the prior art the accuracy of target detection using Faster R-CNN cannot meet the detection requirement.
In a first aspect, an embodiment of the present application provides a target detection method, where the method includes:
passing the feature mapping of an image to be detected through the Region Proposal Network (RPN) layer of each of a plurality of network models to obtain a plurality of candidate frames of each network model;
performing an ensemble operation on the plurality of candidate frames of each network model to obtain ensembled candidate frames;
inputting the ensembled candidate frames and the image content corresponding to the candidate frames to a region of interest pooling (ROI pooling) layer in each network model, and determining an output frame of each network model;
and performing an ensemble operation on the output frames of the network models to obtain candidate frames of the target to be detected in the image to be detected.
In a second aspect, an embodiment of the present application provides an object detection apparatus, including:
the obtaining module is used for passing the feature mapping of the image to be detected through the region proposal network layers of the plurality of network models to obtain a plurality of candidate frames of each network model;
a first ensemble module, configured to perform an ensemble operation on the plurality of candidate frames of each network model to obtain ensembled candidate frames;
the first determining module is used for inputting the ensembled candidate frames and the image contents corresponding to the candidate frames to the region of interest pooling layer in each network model and determining the output frame of each network model;
and the second ensemble module is used for performing an ensemble operation on the output frames of the network models to obtain candidate frames of the target to be detected in the image to be detected.
In a third aspect, another embodiment of the present application further provides a computing device, comprising at least one processor; and
a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the object detection method provided by the embodiments of the application.
In a fourth aspect, another embodiment of the present application further provides a computer storage medium, where the computer storage medium stores computer-executable instructions for causing a computer to execute an object detection method in an embodiment of the present application.
According to the target detection method, device and storage medium provided above, an ensemble operation is performed on the candidate frames of the RPN layers in a plurality of Faster R-CNN models, and the ensembled candidate frames are input into the ROI pooling layer. An ensemble operation is then performed again on the output frames of the ROI pooling layers to obtain the final output result. In this way, more candidate frames and output frames can be obtained by performing ensemble operations at the RPN layer and the ROI pooling layer respectively, thereby improving the accuracy of target detection with Faster R-CNN.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic diagram of a prior art FPN layer structure in an embodiment of the present application;
FIG. 2 is a first schematic structural diagram of an improved FPN layer according to an embodiment of the present application;
FIG. 3 is a second schematic structural diagram of an improved FPN layer according to an embodiment of the present application;
FIG. 4 is a first flowchart illustrating a target detection method according to an embodiment of the present application;
FIG. 5 is a second flowchart illustrating a target detection method according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an improved method for outputting results in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a target detection apparatus in an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Detailed Description
In order to solve the problem that in the prior art the accuracy of target detection using Faster R-CNN cannot meet the detection requirement, the embodiments of the application provide a target detection method, a target detection device and a storage medium. In order to better understand the technical solution provided by the embodiments of the present application, the basic principle of the solution is briefly described below:
The candidate frames of the RPN layers in multiple Faster R-CNN models are subjected to an ensemble operation to obtain ensembled candidate frames, and the ensembled candidate frames are input into the ROI pooling layer. An ensemble operation is then performed again on the output frames of the ROI pooling layers to obtain the final output result. Therefore, more candidate frames and output frames can be obtained by performing ensemble operations at the RPN layer and the ROI pooling layer respectively, so that the accuracy of target detection with Faster R-CNN is improved.
In the prior art, the Faster R-CNN model can be divided into three layers, namely an FPN (Feature Pyramid Network) layer, an RPN (Region Proposal Network) layer, and an ROI pooling layer. The structure of the FPN layer is shown in FIG. 1. The FPN layer is divided into three portions: a bottom-up portion, a top-down portion, and a lateral connection portion. Wherein:
The bottom-up part extracts the features of the image to obtain feature maps of the image. The size of the feature map becomes smaller after passing through successive convolutional layers, so that a feature pyramid is formed.
The top-down part up-samples the acquired feature map of each layer, so that the up-sampled feature map has the same size as the feature map of the previous layer.
The lateral connection part fuses the up-sampled result with the feature map of the same size generated by the bottom-up part.
As shown in FIG. 1, layer 1 is the image, layers 2, 3 and 4 are feature maps, and layers 2', 3' and 4' are the fused feature maps. The fused feature map of layer 4' is obtained by performing dimension reduction on the feature map of layer 4; the fused feature map of layer 3' is obtained by adding the map obtained by up-sampling the fused feature map of layer 4' to the map obtained by dimension reduction of the layer-3 feature map; the fused feature map of layer 2' is obtained by adding the map obtained by up-sampling the fused feature map of layer 3' to the map obtained by dimension reduction of the layer-2 feature map.
After the fusion, a convolution with a 3 × 3 kernel (the size of the convolution kernel can be determined according to actual conditions in a specific implementation) is performed on each fusion result so as to eliminate the aliasing effect of up-sampling. In this way, the feature mappings P4, P3 and P2 of the fused feature maps of the respective layers are finally obtained and used for target detection on the image to be detected.
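A minimal PyTorch-style sketch of the prior-art fusion just described may make the data flow easier to follow. The channel counts, the use of 1 × 1 convolutions for dimension reduction and of nearest-neighbour interpolation for up-sampling are illustrative assumptions rather than details taken from the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StandardFPNFusion(nn.Module):
    """Prior-art FPN fusion of FIG. 1 (layers 2-4 of the bottom-up path)."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 convolutions performing the "dimension reduction" of each feature map.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions applied after fusion to suppress up-sampling aliasing.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c2, c3, c4):
        # c2, c3, c4: bottom-up feature maps of layers 2-4, c4 being the smallest.
        p4 = self.lateral[2](c4)                                          # layer 4'
        p3 = self.lateral[1](c3) + F.interpolate(p4, size=c3.shape[-2:])  # layer 3'
        p2 = self.lateral[0](c2) + F.interpolate(p3, size=c2.shape[-2:])  # layer 2'
        # P4, P3, P2: the feature mappings used for detection.
        return self.smooth[2](p4), self.smooth[1](p3), self.smooth[0](p2)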
In the embodiment of the present application, the structure of the FPN layer is modified as shown in FIG. 2, where the top-down part and the lateral connection part in particular are modified.
In the top-down part, the acquired feature map of each layer is both up-sampled and down-sampled, so that the up-sampled feature map has the same size as the feature map of the previous layer and the down-sampled feature map has the same size as the feature map of the next layer.
In the lateral connection part, the feature map generated by the bottom-up part for a given layer, the down-sampled result of the previous layer, and the up-sampled result of the next layer are merged.
As shown in FIG. 2, layer 1 is the image, layers 2, 3 and 4 are feature maps, and layers 2', 3' and 4' are the fused feature maps. The fused feature map of layer 4' is obtained by adding the map obtained by dimension reduction of the layer-4 feature map to the map obtained by down-sampling and dimension reduction of the layer-3 feature map; the fused feature map of layer 3' is obtained by adding the map obtained by up-sampling the fused feature map of layer 4', the map obtained by dimension reduction of the layer-3 feature map, and the map obtained by down-sampling and dimension reduction of the layer-2 feature map; the fused feature map of layer 2' is obtained by adding the map obtained by up-sampling the fused feature map of layer 3' to the map obtained by dimension reduction of the layer-2 feature map.
Therefore, compared with the feature mapping in the prior art, the finally obtained feature mapping contains more image information and semantic information, and the accuracy of target detection is improved.
In order to further improve the accuracy of target detection, in the embodiment of the present application the structure of the FPN layer is modified as shown in FIG. 3, where layer 1 is the image, layers 2, 3 and 4 are feature maps, and layers 2', 3' and 4' are the fused feature maps.
The fused feature map of layer 4' is obtained by adding the map obtained by dimension reduction of the layer-4 feature map to the map obtained by down-sampling and dimension reduction of the layer-3 feature map; the fused feature map of layer 3' is obtained by adding the map obtained by up-sampling and dimension reduction of the layer-4 feature map, the map obtained by dimension reduction of the layer-3 feature map, and the map obtained by down-sampling and dimension reduction of the layer-2 feature map; the fused feature map of layer 2' is obtained by adding the map obtained by up-sampling and dimension reduction of the layer-3 feature map to the map obtained by dimension reduction of the layer-2 feature map. Compared with FIG. 2, the feature maps obtained in FIG. 3 contain more image information and semantic information, thereby improving the accuracy of target detection.
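Under the same assumptions as the previous sketch, the FIG. 3 variant can be written by letting every fused layer sum the dimension-reduced map of its own layer with the resized, dimension-reduced maps of its adjacent layers; using nearest-neighbour interpolation for both the up-sampling and the down-sampling is again an assumption.

class BidirectionalFPNFusion(StandardFPNFusion):
    """Fusion of FIG. 3: each layer also receives its neighbours' raw feature maps."""
    def forward(self, c2, c3, c4):
        # Dimension reduction of each bottom-up feature map.
        l2, l3, l4 = (lat(c) for lat, c in zip(self.lateral, (c2, c3, c4)))
        resize = lambda x, ref: F.interpolate(x, size=ref.shape[-2:])
        p4 = l4 + resize(l3, l4)                    # layer 4': own + down-sampled layer 3
        p3 = l3 + resize(l4, l3) + resize(l2, l3)   # layer 3': own + layer 4 + layer 2
        p2 = l2 + resize(l3, l2)                    # layer 2': own + up-sampled layer 3
        # In the FIG. 2 variant the up-sampled terms would come from the already
        # fused maps p4 / p3 instead of the raw lateral maps l4 / l3.
        return self.smooth[2](p4), self.smooth[1](p3), self.smooth[0](p2)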
The detection of the target will be described in detail with reference to specific examples. Fig. 4 is a schematic flow chart of a target detection method, which includes the following steps:
step 401: and inputting the image to be detected into the laminated multilayer convolution layers for processing to obtain a characteristic diagram corresponding to each convolution layer.
Step 402: and performing feature fusion on the feature map of each convolutional layer and the feature map of the appointed adjacent convolutional layer thereof to obtain a fused feature map of the convolutional layer, wherein the appointed adjacent convolutional layer of at least one convolutional layer is 2.
In the embodiment of the application, only the feature map of a certain upper layer and the feature map after down-sampling of the upper layer of the layer can be fused, and the other layers are still fused according to the prior art; the solution used in the present application may also be applied to each layer, which is not limited in the present application.
Step 403: and for each fused feature map, carrying out convolution processing on the fused feature map according to a preset convolution kernel to obtain the feature mapping of the fused feature map.
Step 404: and carrying out target detection on the image to be detected according to the obtained characteristic mapping.
Therefore, the acquired feature mapping contains more image information and semantic information, and the accuracy of target detection with the Faster R-CNN model is improved.
In the embodiment of the present application, the step 402 can be divided into three cases:
the first condition is as follows: the convolutional layers are the 2 nd convolutional layers, such as the 2 nd convolutional layers in fig. 3, in the order of processing data.
And when the convolutional layer is the 2 nd convolutional layer, performing feature fusion on the feature map of the convolutional layer and the feature map of the 3 rd convolutional layer to obtain a fused feature map of the convolutional layer.
Because the 2 nd convolutional layer is the lowest layer, the characteristic diagram corresponding to the 2 nd convolutional layer is the characteristic diagram of the last layer, so that the characteristic diagram only needs to be subjected to characteristic fusion with the characteristic diagram of the 3 rd convolutional layer, and the characteristic diagram does not need to be fused with the characteristic diagram of the previous layer.
Case two: the convolutional layer is the last convolutional layer, such as layer 4 in fig. 3.
And when the convolutional layer is the last convolutional layer, performing feature fusion on the feature map of the convolutional layer and the feature map of the previous convolutional layer to obtain a fused feature map of the convolutional layer.
Because the last convolutional layer is the top layer, the feature map corresponding to the last convolutional layer is the feature map of the top layer, and therefore feature fusion only needs to be performed with the feature map of the previous convolutional layer, and feature fusion with the feature map after fusion of the next layer is not needed.
In the embodiment of the present application, when the convolutional layer is the last convolutional layer, the fusion of the feature maps of this layer can be specifically implemented as steps A1-A3:
Step A1: performing down-sampling on the feature map of the previous convolutional layer to obtain a down-sampled feature map, wherein the down-sampled feature map has the same size as the feature map of the current convolutional layer.
Step A2: performing dimension reduction on the feature map of the current convolutional layer and on the down-sampled feature map of the previous convolutional layer.
Step A3: adding the dimension-reduced feature map of the current convolutional layer to the dimension-reduced, down-sampled feature map of the previous convolutional layer to obtain the fused feature map of the current convolutional layer.
In this way, the fused feature map of the top layer is obtained by fusing the feature map of that layer with the feature map of its previous layer, and therefore carries more image information and semantic information.
Case three: the convolutional layer is a convolutional layer other than the 2nd and last convolutional layers, such as layer 3 in FIG. 3.
And when the convolutional layer is a convolutional layer except the 2 nd and last convolutional layers, performing feature fusion on the feature map of the convolutional layer, the feature map of the previous convolutional layer and the feature map of the next convolutional layer to obtain a fused feature map of the convolutional layer.
In the embodiment of the present application, when the convolutional layer is a convolutional layer except the 2 nd and last convolutional layers, the merging of the layer feature maps can be specifically implemented as steps B1-B3:
step B1: performing down-sampling on the feature map of the previous volume of the lamination layer to obtain a down-sampled feature map; and upsampling the feature map of the next volume of the lamination layer to obtain an upsampled fused feature map; the feature map after downsampling and the feature map after upsampling are the same in size as the feature map of the convolutional layer.
Step B2: and performing dimensionality reduction on the feature map of the convolutional layer, the feature map obtained after down-sampling of the previous convolutional layer and the feature map obtained after up-sampling of the next convolutional layer.
Step B3: and adding the feature map of the convolution layer after the dimension reduction processing, the feature map after the down sampling of the previous convolution layer and the feature map after the up sampling of the next convolution layer to obtain a fused feature map of the convolution layer.
In this way, when the convolutional layer is a convolutional layer other than the 2nd and last convolutional layers, the fused feature map of the layer is obtained by fusing the feature map of the layer with the feature maps of its previous and next layers, and therefore carries more image information and semantic information.
Therefore, by specifying how the feature maps of the layers are fused, the feature mapping of each layer contains more image information and semantic information, and the accuracy of target detection with the Faster R-CNN model is improved.
In the embodiment of the present application, after the feature map obtained by fusing the layers according to the scheme of the present application is obtained, the fusion may be performed again, which may specifically be implemented as steps C1-C4:
step C1: the number of times of fusion is increased by a specified value.
Step C2: and judging whether the increased fusion frequency reaches an expected value.
Step C3: if the expected value has not been reached, performing the following for the fused feature map of each layer: performing feature fusion on the fused feature map of the convolutional layer and the fused feature maps of its designated adjacent convolutional layers to obtain a re-fused feature map of the convolutional layer, wherein at least one convolutional layer has two designated adjacent convolutional layers.
Step C4: returning to the step of increasing the number of times of fusion by the specified value.
In the embodiment of the present application, the expected value may be set to 4 times: a record is made each time a fusion operation is performed; if the recorded number has not reached 4, the fusion operation is performed again, and once it reaches 4 the output feature mapping is determined.
Therefore, the fused feature map of each layer is fused again, so that the feature mapping finally obtained for each layer carries more image information and semantic information than the feature mapping obtained by a single fusion operation, and the accuracy of target detection with the Faster R-CNN model is improved.
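A minimal sketch of the repeated-fusion loop of steps C1-C4, assuming the fusion count is incremented by 1 per pass and, following the example above, an expected value of 4 passes. The callable fuse_once is a hypothetical stand-in for one pass of the per-layer fusion described earlier; neither name comes from the patent.

def fuse_repeatedly(feature_maps, fuse_once, expected=4):
    """Apply the per-layer fusion until the fusion count reaches the expected value."""
    fused = feature_maps
    count = 0
    while True:
        fused = fuse_once(fused)  # fuse every layer with its designated adjacent layers
        count += 1                # C1: increase the fusion count
        if count >= expected:     # C2: has the count reached the expected value?
            return fused          # yes: these are the output feature mappings
        # C3/C4: not yet reached, so fuse the already-fused feature maps again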
Having introduced the modifications to the FPN layer, the improvement of the RPN layer is described below. In the prior art, after the feature mapping of the image to be detected is generated by the FPN layer, it is input into the RPN layer to obtain a plurality of candidate frames of the object to be detected, and the obtained candidate frames are then input into the ROI pooling layer for processing, so as to obtain an output result.
In the embodiment of the application, the image to be detected is input into a plurality of Faster R-CNN models to obtain the candidate frames output by the RPN layer of each model; an ensemble operation is performed on the candidate frames of the models, and the remaining candidate frames are input into the ROI pooling layer of each model, after which an ensemble operation is performed again to obtain the output result. FIG. 5 is a flow chart of the improved method for the RPN layer, comprising the following steps:
Step 501: passing the feature mapping of the image to be detected through the region proposal network layers of the plurality of network models to obtain a plurality of candidate frames of each network model.
The plurality of network models are Faster R-CNN models with different initial states.
In one embodiment, the plurality of network models may also be other network models having an RPN layer.
Step 502: performing an ensemble operation on the plurality of candidate frames of each network model to obtain the ensembled candidate frames.
Step 503: inputting the ensembled candidate frames and the image content corresponding to the candidate frames into the region of interest pooling layer of each network model, and determining the output frame of each network model.
Step 504: performing an ensemble operation on the output frames of the network models to obtain the candidate frames of the target to be detected in the image to be detected.
Therefore, more candidate frames and output frames can be obtained by performing ensemble operations at the RPN layer and the ROI pooling layer respectively, so that the accuracy of target detection with the Faster R-CNN model is improved.
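A high-level sketch of steps 501-504 under assumed interfaces: each model is taken to be a Faster R-CNN-style network exposing hypothetical propose() and roi_pooling_head() methods standing in for its RPN layer and its ROI pooling layer plus detection head, and merge_boxes stands in for the ensemble operation (a concrete version is sketched after steps D1-D2 below). None of these names come from the patent.

def ensemble_detect(feature_mapping, image, models, merge_boxes):
    # Step 501: candidate frames from the RPN layer of every model.
    per_model_proposals = [m.propose(feature_mapping) for m in models]
    # Step 502: ensemble operation over all candidate frames.
    pooled = merge_boxes([box for boxes in per_model_proposals for box in boxes])
    # Step 503: feed the ensembled candidates to each model's ROI pooling layer.
    per_model_outputs = [m.roi_pooling_head(pooled, image) for m in models]
    # Step 504: ensemble operation over the output frames of all models.
    return merge_boxes([box for boxes in per_model_outputs for box in boxes])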
In the embodiment of the present application, step 502 may be specifically implemented as steps D1-D2:
step D1: a plurality of candidate frames of each network model are compared against each other.
The two candidate frames compared can be from the same model or from different network models.
Step D2: if the overlapping area of the two compared candidate frames is larger than a preset threshold, determining the candidate frame with the higher confidence of the two as a member of the ensembled candidate frames.
In one embodiment, suppose each of two network models generates 4 candidate frames, that is, the candidate frames of model 1 are 1, 2, 3 and 4, and the candidate frames of model 2 are 5, 6, 7 and 8. The 8 obtained candidate frames are compared with each other, checking whether the overlapping area of the two compared candidate frames is larger than the preset threshold. In a specific implementation, the comparison can be performed according to the coordinate positions of the candidate frames, that is, candidate frames whose coordinate positions are substantially the same in the two models are compared; the candidate frames of the models may also be compared at random.
In one embodiment, after all candidate frames have been compared, the remaining candidate frames are output to the ROI pooling layers of model 1 and model 2 respectively. For example, if after the comparison the overlapping area of candidate frame 2 of model 1 and candidate frame 5 of model 2 is larger than the preset threshold and the confidence of candidate frame 2 in model 1 is higher, candidate frame 5 of model 2 is removed. Thus, the 7 candidate frames 1, 2, 3, 4, 6, 7 and 8 are input into the ROI pooling layers of model 1 and model 2 respectively. In this way, model 1 and model 2 each obtain more candidate frames, thereby improving accuracy.
Therefore, by ensembling and de-duplicating the candidate frames of the multiple models, each model can acquire more candidate frames, and accuracy is improved.
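A sketch of the ensemble operation of steps D1-D2, usable as the merge_boxes callable above. Candidate frames from all models are compared pairwise and, whenever two frames overlap by more than the preset threshold, only the higher-confidence frame is kept. Frames are assumed to be (x1, y1, x2, y2, confidence) tuples, and treating the threshold as an absolute overlap area is likewise an assumption.

def intersection_area(a, b):
    """Overlapping area of two frames given as (x1, y1, x2, y2, confidence)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def merge_candidate_boxes(boxes, threshold):
    kept = []
    # Visiting frames from high to low confidence guarantees that, of any pair
    # whose overlap exceeds the threshold, the lower-confidence frame is dropped.
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(intersection_area(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept

In the example above, the eight frames of model 1 and model 2 are pooled; if only frame 2 and frame 5 overlap beyond the threshold and frame 2 has the higher confidence, frame 5 is dropped and the remaining seven frames are fed to the ROI pooling layers of both models.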
The improvement of the FPN layer and RPN layer is introduced above, and the improvement of the output result is further explained below. Fig. 6 is a flow chart of an improved method of outputting results, comprising the steps of:
step 601: candidate frames contained in another candidate frame are eliminated.
Step 602: and determining the remaining candidate frames as the candidate frames where the detected target is located.
Therefore, by adopting NMS (Non-Maximum Suppression), small candidate frames contained in large candidate frames in the output result are removed and repeated detections of the same target are filtered out, so that the accuracy of the Faster R-CNN model in target detection is improved.
In the embodiment of the present application, each candidate frame is obtained together with a score. To remove a candidate frame completely contained in another candidate frame, the candidate frame with the highest score is determined first; then, for each candidate frame with a lower score, it is calculated whether the ratio of the overlapping area between that candidate frame and the highest-scoring candidate frame to the area of that candidate frame is larger than a preset ratio; if it is larger than the preset ratio, the candidate frame is rejected, otherwise the candidate frame is retained. In this way, small candidate frames contained in the highest-scoring candidate frame are eliminated.
Then the candidate frame with the highest score among the remaining candidate frames is selected and the above operation is repeated. After the selection is finished, the selected candidate frames are output.
In the embodiment of the present application, the preset ratio may be set to a value between 0.9 and 1, so that a small candidate frame completely contained in a large candidate frame, as well as a small candidate frame most of whose area lies inside the large candidate frame, can be eliminated. In this way, the removal of small candidate frames contained in large candidate frames is realized, and the accuracy of the Faster R-CNN model in target detection is improved.
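A sketch of the containment suppression of FIG. 6, reusing intersection_area from the earlier sketch. For each highest-scoring frame in turn, any lower-scoring frame whose overlap with it exceeds the preset ratio of that frame's own area is rejected; frames are again assumed to be (x1, y1, x2, y2, score) tuples, and the default ratio of 0.9 follows the lower end of the 0.9-1 range given above.

def suppress_contained_boxes(boxes, ratio=0.9):
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    output = []
    while remaining:
        best = remaining.pop(0)  # the candidate frame with the highest score
        output.append(best)
        survivors = []
        for box in remaining:
            area = max(box[2] - box[0], 0.0) * max(box[3] - box[1], 0.0)
            # Reject the frame if (nearly) all of its area lies inside `best`.
            if intersection_area(best, box) / max(area, 1e-9) <= ratio:
                survivors.append(box)
        remaining = survivors
    return output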
Based on the same inventive concept, the embodiment of the application also provides a target detection device. As shown in fig. 7, the apparatus includes:
an obtaining module 701, configured to pass the feature mapping of an image to be detected through the region proposal network layers of multiple network models to obtain multiple candidate frames of each network model;
a first ensemble module 702, configured to perform an ensemble operation on the multiple candidate frames of each network model to obtain ensembled candidate frames;
a first determining module 703, configured to input the ensembled candidate frames and their corresponding image content into the region of interest pooling layer of each network model, and determine an output frame of each network model;
and a second ensemble module 704, configured to perform an ensemble operation on the output frames of the network models to obtain candidate frames of the target to be detected in the image to be detected.
Further, the first ensemble module 702 includes:
the comparison unit is used for comparing a plurality of candidate frames of each network model;
the first determining unit is used for determining the candidate frame with the higher confidence of the two compared candidate frames as a member of the ensembled candidate frames if the overlapping area of the two candidate frames is larger than a preset threshold.
Further, the apparatus further comprises:
the removing module is used for removing a candidate frame contained in another candidate frame after the second ensemble module obtains the candidate frames of the target to be detected in the image to be detected;
and the second determining module is used for determining the remaining candidate frames as the candidate frame where the detected target is located.
Further, each candidate frame is obtained together with a score, and the removing module comprises:
a second determination unit configured to determine a candidate frame with a highest score;
a computing unit for performing, for each candidate box below a highest score: calculating whether the ratio of the overlapping area of the candidate frame and the candidate frame with the highest score to the area of the candidate frame is larger than a preset ratio or not;
the rejecting unit is used for rejecting the candidate frame if the ratio is larger than a preset ratio;
a reserving unit, configured to reserve the candidate frame if the ratio is not greater than the preset ratio;
and the returning unit is used for forming a to-be-processed set by the reserved candidate frames if the number of the reserved candidate frames is more than 1, and returning to execute the step of determining the candidate frame with the highest score aiming at the to-be-processed set.
Having described the method and apparatus for object detection of an exemplary embodiment of the present application, a computing apparatus according to another exemplary embodiment of the present application is next described.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "system."
In some possible implementations, a computing device may include at least one processor, and at least one memory, according to embodiments of the application. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps 501-504 of the object detection method according to various exemplary embodiments of the present application described above in the present specification.
The computing device 80 according to this embodiment of the present application is described below with reference to fig. 8. The computing device 80 shown in fig. 8 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present application. The computing device may be, for example, a cell phone, a tablet computer, or the like.
As shown in fig. 8, computing device 80 is embodied in the form of a general purpose computing device. Components of computing device 80 may include, but are not limited to: the at least one processor 81, the at least one memory 82, and a bus 83 connecting the various system components including the memory 82 and the processor 81.
Bus 83 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, and a processor or local bus using any of a variety of bus architectures.
The memory 82 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)821 and/or cache memory 822, and may further include Read Only Memory (ROM) 823.
Memory 82 may also include a program/utility 825 having a set (at least one) of program modules 824, such program modules 824 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 80 may also communicate with one or more external devices 84 (e.g., pointing devices, etc.), with one or more devices that enable a user to interact with computing device 80, and/or with any devices (e.g., routers, modems, etc.) that enable computing device 80 to communicate with one or more other computing devices. Such communication may be through input/output (I/O) interfaces 85. Also, computing device 80 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) through network adapter 86. As shown, network adapter 86 communicates with other modules for computing device 80 over bus 83. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 80, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In some possible embodiments, the various aspects of the object detection method provided in this application may also be implemented in the form of a program product, which includes program code for causing a computer device, when the program product is run on the computer device, to perform the steps in the object detection method according to various exemplary embodiments of this application described above in this specification, for example steps 501-504 shown in FIG. 5.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The object detection method of the embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user equipment, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Moreover, although the operations of the methods of the present application are depicted in the drawings in a sequential order, this does not require or imply that these operations must be performed in this order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a manner that causes the instructions stored in the computer-readable memory to produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (8)

1. A method of object detection, the method comprising:
passing the feature mapping of an image to be detected through the region proposal network layers of a plurality of network models to obtain a plurality of candidate frames of each network model;
performing an ensemble operation on the plurality of candidate frames of each network model to obtain ensembled candidate frames;
inputting the ensembled candidate frames and the image content corresponding to the candidate frames into a region of interest pooling layer in each network model, and determining an output frame of each network model;
performing an ensemble operation on the output frames of the network models to obtain candidate frames of the target to be detected in the image to be detected;
wherein performing the ensemble operation on the plurality of candidate frames of each network model to obtain the ensembled candidate frames specifically comprises:
comparing a plurality of candidate frames of each network model;
if the overlapping area of two compared candidate frames is larger than a preset threshold, determining the candidate frame with the higher confidence of the two as a member of the ensembled candidate frames;
the feature mapping of the image to be detected is obtained by the following method:
inputting the image to be detected into stacked multi-layer convolutional layers for processing to obtain a feature map corresponding to each convolutional layer;
for each convolutional layer, performing feature fusion on the feature map of the convolutional layer and the feature maps of the designated adjacent convolutional layers of the convolutional layer to obtain a fused feature map of the convolutional layer, wherein at least one convolutional layer has two designated adjacent convolutional layers;
and for each fused feature map, carrying out convolution processing on the fused feature map according to a preset convolution kernel to obtain the feature mapping of the fused feature map.
2. The method of claim 1, wherein after obtaining a candidate frame of the object to be detected in the image to be detected, the method further comprises:
eliminating a candidate frame contained in another candidate frame;
and determining the remaining candidate frames as the candidate frames where the detected target is located.
3. The method according to claim 2, wherein each candidate frame is obtained together with a score, and eliminating a candidate frame contained in another candidate frame specifically comprises:
determining a candidate box with the highest score;
performing, for each candidate box below the highest score:
calculating whether the ratio of the overlapping area of the candidate frame and the candidate frame with the highest score to the area of the candidate frame is larger than a preset ratio or not;
if the ratio is larger than the preset ratio, the candidate frame is rejected;
if the ratio is not larger than the preset ratio, the candidate frame is reserved;
if the number of the reserved candidate boxes is more than 1, the reserved candidate boxes form a to-be-processed set, and the step of determining the candidate box with the highest score is executed for the to-be-processed set.
4. An object detection apparatus, characterized in that the apparatus comprises:
the obtaining module is used for passing the feature mapping of the image to be detected through the region proposal network layers of the plurality of network models to obtain a plurality of candidate frames of each network model;
a first ensemble module, configured to perform an ensemble operation on the multiple candidate frames of each network model to obtain ensembled candidate frames;
the first determining module is used for inputting the ensembled candidate frames and the image contents corresponding to the candidate frames to the region of interest pooling layer in each network model and determining the output frame of each network model;
the second ensemble module is used for performing an ensemble operation on the output frames of the network models to obtain candidate frames of the target to be detected in the image to be detected;
wherein the first ensemble module comprises:
the comparison unit is used for comparing a plurality of candidate frames of each network model;
the first determining unit is used for determining the candidate frame with the higher confidence of the two compared candidate frames as a member of the ensembled candidate frames if the overlapping area of the two candidate frames is larger than a preset threshold;
The obtaining module is further configured to: input the image to be detected into the stacked multi-layer convolutional layers for processing to obtain a feature map corresponding to each convolutional layer; for each convolutional layer, perform feature fusion on the feature map of the convolutional layer and the feature maps of the designated adjacent convolutional layers of the convolutional layer to obtain a fused feature map of the convolutional layer, wherein at least one convolutional layer has two designated adjacent convolutional layers; and for each fused feature map, perform convolution processing on the fused feature map according to a preset convolution kernel to obtain the feature mapping of the fused feature map.
5. The apparatus of claim 4, further comprising:
the removing module is used for removing a candidate frame completely contained in another candidate frame after the second ensemble module obtains the candidate frames of the target to be detected in the image to be detected;
and the second determining module is used for determining the remaining candidate frames as the candidate frame where the detected target is located.
6. The apparatus of claim 5, wherein each candidate frame is obtained with a score, and wherein the removing module comprises:
a second determination unit configured to determine a candidate frame with a highest score;
a computing unit for performing, for each candidate frame below the highest score: calculating whether the overlapping area of the candidate frame and the candidate frame with the highest score is the same as the area of the candidate frame;
a rejecting unit, configured to reject the candidate frame if the areas are the same;
a reserving unit, configured to reserve the candidate frame if the areas are different;
and the returning unit is used for forming a to-be-processed set by the reserved candidate frames if the number of the reserved candidate frames is more than 1, and returning to execute the step of determining the candidate frame with the highest score aiming at the to-be-processed set.
7. A computer-readable medium having stored thereon computer-executable instructions for performing the method of any one of claims 1-3.
8. A computing device, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-3.
CN201910101098.2A 2019-01-31 2019-01-31 Target detection method, device and storage medium Active CN109829909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910101098.2A CN109829909B (en) 2019-01-31 2019-01-31 Target detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910101098.2A CN109829909B (en) 2019-01-31 2019-01-31 Target detection method, device and storage medium

Publications (2)

Publication Number Publication Date
CN109829909A CN109829909A (en) 2019-05-31
CN109829909B true CN109829909B (en) 2021-06-29

Family

ID=66862165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910101098.2A Active CN109829909B (en) 2019-01-31 2019-01-31 Target detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109829909B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699337B (en) * 2019-10-22 2022-07-29 北京易真学思教育科技有限公司 Equation correction method, electronic device and computer storage medium
CN111062249A (en) * 2019-11-11 2020-04-24 北京百度网讯科技有限公司 Vehicle information acquisition method and device, electronic equipment and storage medium
CN113011441B (en) * 2021-03-23 2023-10-24 华南理工大学 Target detection method, system, device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268869A (en) * 2018-02-13 2018-07-10 北京旷视科技有限公司 Object detection method, apparatus and system
CN109034183A (en) * 2018-06-07 2018-12-18 北京飞搜科技有限公司 A kind of object detection method, device and equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10860837B2 (en) * 2015-07-20 2020-12-08 University Of Maryland, College Park Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108268869A (en) * 2018-02-13 2018-07-10 北京旷视科技有限公司 Object detection method, apparatus and system
CN109034183A (en) * 2018-06-07 2018-12-18 北京飞搜科技有限公司 A kind of object detection method, device and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; Shaoqing Ren et al.; arXiv; 2016-01-06; pp. 1-14 *
Multi-Network Fusion Based on CNN for Facial Expression Recognition; Chao Li et al.; CSECE 2018; 2018-12-31; vol. 80; pp. 166-169 *

Also Published As

Publication number Publication date
CN109829909A (en) 2019-05-31

Similar Documents

Publication Publication Date Title
CN109816671B (en) Target detection method, device and storage medium
JP7186287B2 (en) Image processing method and apparatus, electronic equipment and storage medium
CN109829909B (en) Target detection method, device and storage medium
CN110555847B (en) Image processing method and device based on convolutional neural network
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
JP2022549728A (en) Target detection method and device, electronic device, and storage medium
CN114238904B (en) Identity recognition method, and training method and device of dual-channel hyper-resolution model
CN110349167A (en) A kind of image instance dividing method and device
CN105308618A (en) Face recognition with parallel detection and tracking, and/or grouped feature motion shift tracking
CN115375692B (en) Workpiece surface defect segmentation method, device and equipment based on boundary guidance
CN111860248B (en) Visual target tracking method based on twin gradual attention-guided fusion network
CN114913325B (en) Semantic segmentation method, semantic segmentation device and computer program product
CN106027854A (en) United filtering denoising method which is applied to a camera and is applicable to be realized in FPGA (Field Programmable Gate Array)
CN111929688B (en) Method and equipment for determining radar echo prediction frame sequence
CN113963167B (en) Method, device and computer program product applied to target detection
CN114416184B (en) In-memory computing method and device based on virtual reality equipment
CN112052863B (en) Image detection method and device, computer storage medium and electronic equipment
CN112668399B (en) Image processing method, fingerprint information extraction method, device, equipment and medium
CN112686267A (en) Image semantic segmentation method and device
CN112990327A (en) Feature fusion method, device, apparatus, storage medium, and program product
CN112132031A (en) Vehicle money identification method and device, electronic equipment and storage medium
CN115049895B (en) Image attribute identification method, attribute identification model training method and device
US11295543B2 (en) Object detection in an image
CN115359460B (en) Image recognition method and device for vehicle, vehicle and storage medium
CN115545938B (en) Method, device, storage medium and equipment for executing risk identification service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240508

Address after: Room 6227, No. 999, Changning District, Shanghai 200050

Patentee after: Shenlan robot (Shanghai) Co.,Ltd.

Country or region after: China

Address before: Unit 1001, 369 Weining Road, Changning District, Shanghai, 200336 (9th floor of actual floor)

Patentee before: DEEPBLUE TECHNOLOGY (SHANGHAI) Co.,Ltd.

Country or region before: China