CN111144407A

CN111144407A - Target detection method, system, device and readable storage medium

Info

Publication number: CN111144407A
Application number: CN201911332544.7A
Authority: CN
Inventors: 张润泽; 郭振华; 赵雅倩
Original assignee: Inspur Beijing Electronic Information Industry Co Ltd
Current assignee: Inspur Beijing Electronic Information Industry Co Ltd
Priority date: 2019-12-22
Filing date: 2019-12-22
Publication date: 2020-05-12

Abstract

The application discloses a target detection method, system, device and readable storage medium. The method comprises: acquiring an image to be detected; inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N; and inputting the output feature map of each level of the K-th independent backbone network into a target detection network. Instead of training a backbone network from scratch, the application cascades mature independent backbone networks so that their high-level and low-level features are fused, which improves target detection accuracy and saves the cost of training a backbone network.

Description

Target detection method, system, device and readable storage medium

Technical Field

The present invention relates to the field of computer vision, and in particular, to a method, a system, an apparatus, and a readable storage medium for target detection.

Background

Target detection occupies a very important position in computer vision: it is one of the fundamental tasks of the field and is currently a research hotspot.

Generally, a target detection framework includes a backbone network (Backbone), a Feature Pyramid Network (FPN), a Region Proposal Network (RPN), and task-specific head networks (Heads). The more representative the features extracted by the backbone network, the better the resulting target detection performance. However, the cost of designing a complex backbone network that can extract powerful features is extremely high, and how to solve this technical problem is an issue that those skilled in the art need to address.

Disclosure of Invention

In view of the above, the present invention provides a method, a system, a device and a readable storage medium for object detection. The specific scheme is as follows:

a method of target detection, comprising:

acquiring an image to be detected;

inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N;

and inputting the output feature map of each level of the K-th independent backbone network into a target detection network.

Preferably, the process of preprocessing the output feature map of the j-th level network module of the i-th independent backbone network specifically includes:

preprocessing the output feature map of the j-th level network module of the i-th independent backbone network so that it is consistent in resolution and channel number with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network.

Preferably, the process of preprocessing the output feature map of the j-th level network module of the i-th independent backbone network specifically includes:

performing a 1 × 1 convolution operation on the output feature map of the j-th level network module of the i-th independent backbone network;

and performing an up-sampling operation on the output feature map of the j-th level network module of the i-th independent backbone network.

Preferably, the process of performing the up-sampling operation on the output feature map of the j-th level network module of the i-th independent backbone network specifically includes:

performing nearest-neighbor interpolation, bilinear interpolation or bicubic interpolation on the output feature map of the j-th level network module of the i-th independent backbone network.
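The difference between two of the interpolation options named above can be illustrated on a single-channel feature map. The following is a minimal pure-Python sketch, not the patented implementation: the function names are hypothetical, the scale factor of 2 is an assumption, and the bilinear variant assumes an "align corners" sampling convention, which the source does not specify.

```python
def upsample_nearest(fmap, scale=2):
    """Nearest-neighbor up-sampling: every value is copied into a scale x scale block."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out


def upsample_bilinear(fmap, scale=2):
    """Bilinear up-sampling; the corner-aligned sampling grid is an assumption."""
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = h * scale, w * scale

    def source_coord(index, n_in, n_out):
        # Map an output index back to a fractional input coordinate.
        return index * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0

    out = []
    for y in range(out_h):
        sy = source_coord(y, h, out_h)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(out_w):
            sx = source_coord(x, w, out_w)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


blocky = upsample_nearest([[1, 2], [3, 4]])
smooth = upsample_bilinear([[0.0, 3.0], [6.0, 9.0]])
```

Nearest-neighbor interpolation only repeats existing values, while bilinear interpolation blends the four surrounding values, so the second result changes gradually between the original samples.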

Preferably, the independent backbone network is specifically resnet50, resnet101, resnext152 or senet154.

Preferably, the network module is specifically a stage network module;

each of the independent backbone networks further comprises:

and the stem network module is positioned in front of the N network modules.

Preferably, the target detection network specifically includes an FPN network and/or an RPN network and/or a Heads network.

Correspondingly, the invention also discloses a target detection system, which comprises:

the input module is used for acquiring an image to be detected;

the cascade backbone module is used for inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N;

and the target detection module is used for inputting the output feature map of each level of the K-th independent backbone network into a target detection network.

Correspondingly, the invention also discloses a target detection device, which comprises:

a memory for storing a computer program;

a processor for implementing the steps of the object detection method as claimed in any one of the above when executing the computer program.

Accordingly, the present invention also discloses a readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the object detection method as described in any one of the above.

The application discloses a target detection method, which comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N; and inputting the output feature map of each level of the K-th independent backbone network into a target detection network. Instead of training a backbone network from scratch, the application cascades mature independent backbone networks so that their high-level and low-level features are fused, which improves target detection accuracy and saves the cost of training a backbone network.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

FIG. 1 is a flowchart illustrating steps of a method for detecting a target according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a network architecture according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a target detection system according to an embodiment of the present invention;

FIG. 4 is a structural distribution diagram of a target detection device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Generally, the target detection framework comprises a backbone network, an FPN, an RPN and Heads, and if the backbone network can extract more representative features, the performance of corresponding target detection is better. But the cost of designing a complex backbone network that can extract powerful features is extremely high. According to the method and the device, the backbone network does not need to be trained from the beginning, but the mature independent backbone networks are cascaded, so that the high-level features and the low-level features of the backbone networks are fused, the target detection precision is improved, and the cost for training the backbone networks is saved.

The embodiment of the invention discloses a target detection method, which is shown in figure 1 and comprises the following steps:

S1: acquiring an image to be detected;

S2: inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N;

It can be understood that the cascaded backbone network in this embodiment is composed of a plurality of independent backbone networks, each with the same network structure. The first K-1 independent backbone networks serve as auxiliary backbone networks, the K-th independent backbone network serves as the main backbone network, and the output feature maps of the K-th backbone network serve as the input feature maps of the target detection network. Each independent backbone network may be any model with weights pre-trained on the ImageNet data set, such as resnet50 or resnet101; if video memory is sufficient, a deeper network model such as resnext152 or senet154 may be selected. Each independent backbone network comprises N network modules, specifically stage network modules, and further comprises a stem network module in front of the N network modules. Taking resnet50 as an example and referring to the schematic diagram of the network structure shown in FIG. 2, in each independent backbone network conv1 denotes the stem network module, which receives the image to be detected; the image then passes through four stage network modules, each composed of a plurality of residual modules, and the resolution of the output feature map of each stage network module is half that of its input feature map.
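To make the wiring concrete, the cascade connections implied by the ranges of i and j can be enumerated. The sketch below is only illustrative: the helper names are hypothetical, and indices are 1-based to match the description.

```python
def cascade_connections(K, N):
    """List every cascade link as ((i, j), (i + 1, j)).

    The first tuple is the source: stage j of backbone i, whose output is preprocessed.
    The second tuple is the target: stage j of backbone i + 1, which receives the sum.
    """
    return [((i, j), (i + 1, j)) for i in range(1, K) for j in range(1, N + 1)]


def backbone_roles(K):
    """Backbones 1..K-1 are auxiliary; backbone K is the main backbone."""
    return {i: ("main" if i == K else "auxiliary") for i in range(1, K + 1)}


# Two cascaded backbones with four stages each, as in the resnet50 example.
links = cascade_connections(K=2, N=4)
roles = backbone_roles(K=2)
```

With K backbones and N stages there are (K-1) × N links, and only the stage outputs of backbone K are passed on to the target detection network.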

A high-resolution feature map carries weaker semantic information but loses less information, whereas a low-resolution feature map carries stronger semantic information but loses more information because of feature selection. This embodiment therefore fuses high-resolution and low-resolution features through the cascade connections between the independent backbone networks, specifically:

$$x_j^{i+1} = F_j^{i+1}\left(x_{j-1}^{i+1} + g\left(x_j^{i}\right)\right), \qquad 1 \le i < K,\; 1 \le j \le N$$

where $x_j^{i}$ is the output feature map of the j-th level network module of the i-th independent backbone network, $g(x_j^{i})$ is the result of preprocessing $x_j^{i}$, $x_{j-1}^{i+1}$ is the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, $x_j^{i+1}$ is the output feature map of the j-th level network module of the (i+1)-th independent backbone network, and $F_j^{i+1}$ is the internal calculation that the j-th level network module of the (i+1)-th independent backbone network performs on its input feature map.
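The data flow of this fusion can be traced with a toy forward pass. The following is a minimal sketch under stated assumptions: feature maps are replaced by plain numbers so that the sum is trivially well defined, and the names `cascade_forward`, `stems`, `stages` and `preprocess` are hypothetical rather than taken from the source.

```python
def cascade_forward(image, stems, stages, preprocess):
    """Run K cascaded backbones and return the stage outputs of the last one.

    stems[i] is the stem callable of backbone i (0-based here).
    stages[i][j] is the callable of stage j of backbone i.
    preprocess maps a stage output of the previous backbone to the shape of
    the feature map it is added to.
    """
    previous_outputs = None
    for stem, backbone_stages in zip(stems, stages):
        x = stem(image)  # output of the stem, i.e. "level 0" of this backbone
        outputs = []
        for j, stage in enumerate(backbone_stages):
            if previous_outputs is not None:
                # Add the preprocessed same-level output of the previous backbone
                # to the output of the preceding level of the current backbone.
                x = x + preprocess(previous_outputs[j])
            x = stage(x)
            outputs.append(x)
        previous_outputs = outputs
    return previous_outputs


def double(v):
    return 2 * v


# Toy setting: K = 2 backbones, N = 2 stages, every stage doubles its input.
stems = [lambda v: v, lambda v: v]
stages = [[double, double], [double, double]]
main_outputs = cascade_forward(1.0, stems, stages, preprocess=lambda v: v / 2)
```

The first backbone runs unmodified; every later backbone receives, at each stage input, one extra term from the backbone before it. Only the outputs of the last backbone are returned, matching the role of the K-th backbone as the main backbone.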

The process of preprocessing the output feature map of the j-th level network module of the i-th independent backbone network specifically includes:

preprocessing the output feature map of the j-th level network module of the i-th independent backbone network so that it is consistent in resolution and channel number with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network.

Further, the process of preprocessing the output feature map of the j-th level network module of the i-th independent backbone network specifically includes:

performing a 1 × 1 convolution operation on the output feature map of the j-th level network module of the i-th independent backbone network;

and performing an up-sampling operation on the output feature map of the j-th level network module of the i-th independent backbone network.

The process of performing the up-sampling operation on the output feature map of the j-th level network module of the i-th independent backbone network specifically includes:

performing nearest-neighbor interpolation, bilinear interpolation or bicubic interpolation on the output feature map of the j-th level network module of the i-th independent backbone network.

It can be understood that the main purpose of the preprocessing is to make the resolution and channel number of the two output feature maps consistent; the preprocessing may also serve other optimization purposes, which are not limited here. Processing other than the convolution operation and the up-sampling operation may also be used to unify the resolution and channel number of the two output feature maps, and the up-sampling operation may use calculation methods other than the interpolation calculations.
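How a 1 × 1 convolution and an up-sampling step together bring one feature map to the channel number and resolution of another can be sketched in pure Python. This is an illustrative sketch only: feature maps are nested lists indexed as [channel][row][column], the convolution has no bias, nearest-neighbor up-sampling by a factor of 2 is assumed, and all function names are hypothetical.

```python
def conv1x1(fmap, weight):
    """1 x 1 convolution: mixes channels per pixel. weight is [C_out][C_in]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[[sum(weight[o][c] * fmap[c][y][x] for c in range(c_in))
              for x in range(w)]
             for y in range(h)]
            for o in range(len(weight))]


def upsample_nearest(fmap, scale=2):
    """Nearest-neighbor up-sampling applied to every channel."""
    return [[[row[x // scale] for x in range(len(row) * scale)]
             for row in channel for _ in range(scale)]
            for channel in fmap]


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def fuse(aux_output, main_previous, weight, scale=2):
    """Preprocess aux_output, check that shapes agree, then sum element-wise."""
    g = upsample_nearest(conv1x1(aux_output, weight), scale)
    assert shape(g) == shape(main_previous), "resolution or channel number differs"
    return [[[a + b for a, b in zip(row_g, row_m)]
             for row_g, row_m in zip(ch_g, ch_m)]
            for ch_g, ch_m in zip(g, main_previous)]


# A 2-channel 1x1 map is reduced to 1 channel and enlarged to 2x2 before the sum.
aux_output = [[[1.0]], [[3.0]]]
main_previous = [[[1.0, 2.0], [3.0, 4.0]]]
fused = fuse(aux_output, main_previous, weight=[[0.5, 0.5]])
```

The 1 × 1 convolution changes only the channel number and the up-sampling changes only the resolution, so the two steps can be applied in either order with the same resulting shape; the order used here keeps the convolution on the smaller map.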

S3: inputting the output feature map of each level of the K-th independent backbone network into the target detection network.

It can be understood that the target detection network in this embodiment may adopt any target detection framework, and may specifically include an FPN network and/or an RPN network and/or a Heads network. The subsequent preset processing steps are completed by the target detection network; the specific operations belong to the prior art and are not described here again.

The embodiment of the application discloses a target detection method, which comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N; and inputting the output feature map of each level of the K-th independent backbone network into the target detection network. Instead of training a backbone network from scratch, the application cascades mature independent backbone networks so that their high-level and low-level features are fused, which improves target detection accuracy.

The embodiment of the invention discloses a specific target detection system, and compared with the previous embodiment, the technical scheme is further explained and optimized in the embodiment. Specifically, the method comprises the following steps:

In this embodiment, a simple model network is built with the current mainstream deep learning framework PyTorch and the deep learning model library torchvision, and the pre-trained weights of the independent backbone networks are loaded from the torchvision library without retraining.

This embodiment was run in an experimental environment with eight V100 GPUs, using the COCO dataset. Each GPU processes 2 pictures per batch, and the initial learning rate is 0.01. During training, horizontal flipping is used for data augmentation, the short edge of each picture is set to 800, and the long edge is set to 1333. During testing, for a fair comparison, the Soft-NMS method is not used.
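The input settings above can be sketched as follows. The source only states the two edge lengths; the aspect-ratio-preserving rule used here (scale so that the short edge reaches 800 unless that would push the long edge past 1333) is an assumption borrowed from common detection practice, and the function and key names are hypothetical.

```python
# Settings stated in the text; the key names are illustrative only.
TRAIN_SETTINGS = {
    "gpus": 8,
    "images_per_gpu": 2,
    "initial_learning_rate": 0.01,
    "short_edge": 800,
    "long_edge": 1333,
}


def resized_size(width, height, short_edge=800, long_edge=1333):
    """Return the (width, height) after aspect-ratio-preserving resizing.

    Assumed rule: use the largest scale that keeps the short edge at or below
    short_edge and the long edge at or below long_edge.
    """
    scale = min(short_edge / min(width, height), long_edge / max(width, height))
    return round(width * scale), round(height * scale)


def horizontal_flip(image):
    """Flip an image given as a list of pixel rows from left to right."""
    return [row[::-1] for row in image]


typical = resized_size(640, 480)      # limited by the short edge
elongated = resized_size(1000, 200)   # limited by the long edge
flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
```

When horizontal flipping is used for detection training, the bounding-box annotations have to be mirrored together with the image; that step is omitted from this sketch.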

The COCO test-2017 data set results are shown in Table 1.

Table 1: COCO test-2017 dataset results

Method	Backbone	AP_box	AP₅₀	AP₇₅
Cascade RCNN	Resnet101	42.8	62.1	46.3
Cascaded backbone network (K2)	Resnet101	44.1	62.3	47.9

Method	Backbone	AP_mask	AP₅₀	AP₇₅
Mask RCNN	Resnet101	35.9	57.9	38.0
Cascaded backbone network (K2)	Resnet101	36.9	59.5	39.2

Two sets of experiments were performed to compare the algorithm of this embodiment with other algorithms on target detection and instance segmentation, respectively. The first set uses the Cascade RCNN network model, which performs well on target detection, as the baseline; the second set uses the Mask RCNN network model, which performs well on instance segmentation, as the baseline; the backbone network is Resnet101 in all cases. Table 1 shows that, with all other network configurations of the model unchanged, the method of this embodiment improves on the reference method by about one percentage point for both target detection and instance segmentation: AP_box rises from 42.8 to 44.1 and AP_mask rises from 35.9 to 36.9.
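The gains can be read directly off Table 1. The short sketch below only redoes that subtraction; the dictionary layout and the name `gains` are illustrative, and the numbers are copied from the table.

```python
# Values copied from Table 1 (COCO test-2017), all with a Resnet101 backbone.
DETECTION = {
    "Cascade RCNN": {"AP": 42.8, "AP50": 62.1, "AP75": 46.3},
    "Cascaded backbone network": {"AP": 44.1, "AP50": 62.3, "AP75": 47.9},
}
SEGMENTATION = {
    "Mask RCNN": {"AP": 35.9, "AP50": 57.9, "AP75": 38.0},
    "Cascaded backbone network": {"AP": 36.9, "AP50": 59.5, "AP75": 39.2},
}


def gains(baseline, cascaded):
    """Difference in percentage points, rounded to one decimal place."""
    return {metric: round(cascaded[metric] - baseline[metric], 1) for metric in baseline}


box_gains = gains(DETECTION["Cascade RCNN"], DETECTION["Cascaded backbone network"])
mask_gains = gains(SEGMENTATION["Mask RCNN"], SEGMENTATION["Cascaded backbone network"])
```

For target detection the largest gain is at the stricter AP₇₅ threshold and the smallest at AP₅₀, while for instance segmentation the gains at AP₅₀ and AP₇₅ both exceed the gain in overall AP.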

Correspondingly, the present invention also discloses a target detection system, as shown in fig. 3, including:

the input module 01 is used for acquiring an image to be detected;

the cascade backbone module 02 is used for inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N;

and the target detection module 03 is used for inputting the output feature map of each level of the K-th independent backbone network into the target detection network.

In the embodiment, the backbone network does not need to be trained from the beginning, but the mature independent backbone networks are cascaded, so that the high-level features and the low-level features of the backbone networks are fused, and the target detection precision is improved.

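In some specific embodiments, the cascade backbone module 02 is specifically configured to: preprocess the output feature map of the j-th level network module of the i-th independent backbone network so that it is consistent in resolution and channel number with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network.

In some specific embodiments, the independent backbone network is specifically resnet50, resnet101, resnext152, or senet154.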

In some specific embodiments, the network module is specifically a stage network module;

each independent backbone network further comprises:

and the stem network module is positioned in front of the N network modules.

In some specific embodiments, the target detection network specifically includes an FPN network and/or an RPN network and/or a Heads network.

Correspondingly, the invention also discloses a target detection device, which is shown in FIG. 4 and comprises a processor 11 and a memory 12; wherein the processor 11 implements the following steps when executing the computer program stored in the memory 12:

acquiring an image to be detected;

inputting the image to be detected into a cascaded backbone network, wherein the cascaded backbone network comprises K independent backbone networks, each independent backbone network comprises N network modules, and the output feature map of the j-th level network module of the i-th independent backbone network is preprocessed, summed with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network, and input into the j-th level network module of the (i+1)-th independent backbone network, where 1 ≤ i < K and 1 ≤ j ≤ N;

and inputting the output feature map of each level of the K-th independent backbone network into the target detection network.

According to the method and the device, the backbone network does not need to be trained from the beginning, the mature independent backbone networks are cascaded, the high-level features and the low-level features of the backbone networks are fused, and the target detection precision is improved.

In some specific embodiments, when the processor 11 executes the computer subprogram stored in the memory 12, the following step may be specifically implemented: preprocessing the output feature map of the j-th level network module of the i-th independent backbone network so that it is consistent in resolution and channel number with the output feature map of the (j-1)-th level network module of the (i+1)-th independent backbone network.

In some specific embodiments, when the processor 11 executes the computer subprogram stored in the memory 12, the following steps may be specifically implemented:

each independent backbone network further comprises:

and the stem network module is positioned in front of the N network modules.

Further, the target detection apparatus in this embodiment may further include:

the input interface 13 is configured to obtain a computer program imported from the outside, store the obtained computer program in the memory 12, and further be configured to obtain various instructions and parameters transmitted by an external terminal device, and transmit the instructions and parameters to the processor 11, so that the processor 11 performs corresponding processing by using the instructions and parameters. In this embodiment, the input interface 13 may specifically include, but is not limited to, a USB interface, a serial interface, a voice input interface, a fingerprint input interface, a hard disk reading interface, and the like.

And an output interface 14, configured to output various data generated by the processor 11 to a terminal device connected thereto, so that other terminal devices connected to the output interface 14 can acquire various data generated by the processor 11. In this embodiment, the output interface 14 may specifically include, but is not limited to, a USB interface, a serial interface, and the like.

A communication unit 15 for establishing a remote communication connection between the target detection device and an external server, so that the target detection device can mount an image file to the external server. In this embodiment, the communication unit 15 may specifically include, but is not limited to, a remote communication unit based on a wireless communication technology or a wired communication technology.

A keyboard 16 for acquiring various parameter data or instructions that a user enters in real time through keystrokes.

And the display 17 is used for displaying relevant information of the target detection process in real time so that a user can know the target detection condition in time.

The mouse 18 may be used to assist the user in entering data and to simplify the user's operation.

Further, embodiments of the present application also disclose a computer-readable storage medium, where the computer-readable storage medium includes Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable hard disk, CD-ROM, or any other form of storage medium known in the art. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of:

acquiring an image to be detected;

each independent backbone network further comprises:

and the stem network module is positioned in front of the N network modules.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above detailed description is provided for a target detection method, a system, an apparatus and a readable storage medium, and the principle and the implementation of the present invention are explained in this document by applying specific examples, and the description of the above examples is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A method of object detection, comprising:

acquiring an image to be detected;

2. The object detection method of claim 1, wherein the step of preprocessing the output characteristic diagram of the jth network module of the ith independent backbone network specifically comprises:

3. The object detection method of claim 2, wherein the step of preprocessing the output characteristic map of the jth network module of the ith independent backbone network specifically comprises:

4. The method according to claim 3, wherein the process of performing an upsampling operation on the output characteristic map of the jth network module of the ith independent backbone network specifically includes:

5. The object detection method according to any of claims 1 to 4, characterized in that the independent backbone network is in particular resnet50, resnet101, resnext152 or senet154.

6. The object detection method according to claim 5,

the network module is specifically a stage network module;

each of the independent backbone networks further comprises:

and the stem network module is positioned in front of the N network modules.

7. The object detection method according to claim 6, wherein the object detection network specifically comprises an FPN network and/or an RPN network and/or a Heads network.

8. An object detection system, comprising:

the input module is used for acquiring an image to be detected;

9. An object detection device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the object detection method according to any one of claims 1 to 7 when executing the computer program.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the object detection method according to any one of claims 1 to 7.