CN116229142A - Online target detection method and system based on depth feature matching - Google Patents
- Publication number: CN116229142A (application CN202211664409.4A)
- Authority: CN (China)
- Prior art keywords: image, feature map, regression, target sample, network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/764: Image or video recognition using machine-learning classification, e.g. of video objects
- G06V10/765: Image or video recognition using rules for classification or partitioning the feature space
- G06N3/08: Neural networks; learning methods
- G06V10/40: Extraction of image or video features
- G06V10/82: Image or video recognition using neural networks
- G06V2201/07: Target detection
- Y02T10/40: Engine management systems (climate-change mitigation indexing)
Abstract
The invention provides an online target detection method and system based on depth feature matching, comprising the following steps: acquiring a new class target sample image and an image to be detected; inputting the new class target sample image and the image to be detected into a trained neural network backbone network and region generation network to obtain a classification activation feature map and a regression activation feature map of the image to be detected; and obtaining the category and position frame of the new class target sample image in the image to be detected based on the classification activation feature map and the regression activation feature map. The invention avoids the complicated data acquisition, data annotation and offline model training that a new target class would otherwise require: one or a few new class target sample images, passed through the neural network backbone network and the region generation network, suffice to obtain the category and position frame of the new class target in the image to be detected. The method is therefore particularly suitable for video surveillance, unmanned aerial vehicle ground observation, and high-value target discovery in remote sensing images.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to an online target detection method and system based on depth feature matching.
Background
Deep learning has been widely applied across many fields of society. Target detection is one of the fundamental problems in computer vision, and the best-performing detectors today are based on deep learning. Current detectors depend on large amounts of annotated data for training; large datasets exist for categories such as faces, pedestrians and vehicles, but for new categories and new samples such datasets are difficult to construct. Like most deep learning algorithms, deep-learning-based target detection requires large quantities of labeled data for supervised training, and when labeled data are scarce, the accuracy and generalization ability of the algorithm are difficult to guarantee. When enough samples can be obtained, the data can be labeled manually, automatically or semi-automatically to produce a large annotated dataset. In some application scenarios, however, many samples are simply unavailable: for example, when searching for a suspicious person in surveillance video, only one or a few pictures of that person exist, large-scale deep learning training is impossible, an effective deep learning model is hard to obtain, and the applicability of deep-learning-based target detection is greatly limited.
At present, targets for which only one or a few samples exist are usually detected by template matching. Template matching is the most basic pattern recognition method and the most fundamental and commonly used matching technique in image processing: the pattern of a specific object is sought and localized in an image, thereby recognizing the object. Template matching has inherent limitations, chiefly that it can only handle translation; if the target in the image is rotated or changed in size, the algorithm's performance drops sharply. In other words, template matching generally lacks rotation invariance. Deep learning methods, by contrast, are now widely applied, and the feature maps extracted by deep neural networks exhibit translation and rotation invariance, which has led to very good results in the field of target detection.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides an online target detection method based on depth feature matching, which comprises the following steps:
acquiring a new class target sample image and an image to be detected;
inputting the new class target sample image and the image to be detected into the trained neural network backbone network and region generation network to obtain a classification activation feature map and a regression activation feature map of the image to be detected;
obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map;
the new class target sample image is an image cropped from the image to be detected; the neural network backbone network and the region generation network are trained on old class target sample images and images to be detected, and the trained weights are retained.
Preferably, the region generation network includes a classification branch and a regression branch, and its training includes:
acquiring an old class target sample image and an image to be detected;
inputting the old class target sample image into the region generation network classification branch to generate a first classification feature map, and, based on the first classification feature map, calculating the cross entropy loss between the preset anchor frame categories and the true category labels of the targets in the image to be detected;
inputting the old class target sample image into the region generation network regression branch to generate a first regression feature map, and, based on the first regression feature map, calculating the loss between the preset anchor frame positions and the true target positions in the image to be detected;
and training the region generation network with stochastic gradient descent on the cross entropy loss and the regression loss, obtaining the weights of the region generation network.
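The training objective above can be sketched in code. This is a hedged illustration, not the patent's implementation: the function names and the exact label and probability conventions are assumptions.

```python
import numpy as np

def anchor_classification_loss(p, c):
    """Cross entropy over anchor scores.
    p: predicted probability that each anchor contains the target, shape (N,)
    c: ground-truth labels (1 = target, 0 = background), shape (N,)"""
    eps = 1e-7
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(c * np.log(p) + (1 - c) * np.log(1 - p))

def smooth_l1(x, sigma=3.0):
    """Smooth L1 regression loss applied elementwise to anchor offsets."""
    beta = 1.0 / sigma ** 2
    x = np.asarray(x)
    return np.where(np.abs(x) < beta,
                    0.5 * sigma ** 2 * x ** 2,
                    np.abs(x) - 0.5 * beta)

def detection_loss(p, c, deltas, lam=1.0):
    """Combined objective: classification loss plus weighted regression loss."""
    return anchor_classification_loss(p, c) + lam * smooth_l1(deltas).sum()
```

In practice both terms would be minimized jointly by stochastic gradient descent over many sample/search-image pairs.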
Preferably, inputting the new class target sample image and the image to be detected into the trained neural network backbone network and region generation network to obtain the classification activation feature map and the regression activation feature map of the image to be detected includes:
inputting the new class target sample image into the trained neural network backbone network to obtain a first depth feature map; inputting the first depth feature map into the region generation network classification branch to generate a classification convolution kernel, and into the region generation network regression branch to generate a regression convolution kernel;
inputting the image to be detected into the trained neural network backbone network to obtain a second depth feature map; inputting the second depth feature map into the region generation network classification branch and generating the classification activation feature map with the classification convolution kernel, and inputting the second depth feature map into the region generation network regression branch and generating the regression activation feature map with the regression convolution kernel.
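The activation feature maps are produced by sliding the kernels derived from the sample image's features over the feature maps of the image to be detected. A minimal NumPy sketch of this valid cross-correlation follows; the shapes and names are illustrative, and a real implementation would use a framework's batched convolution instead of Python loops.

```python
import numpy as np

def cross_correlate(search_feat, kernel):
    """Valid cross-correlation of one kernel over a search feature map.
    search_feat: (C, H, W) depth features of the image to be detected
    kernel:      (C, h, w) kernel built from the sample image's features
    returns:     (H - h + 1, W - w + 1) activation map"""
    C, H, W = search_feat.shape
    _, h, w = kernel.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search_feat[:, i:i + h, j:j + w] * kernel)
    return out

# With the shapes quoted in the embodiment (256 channels, 22x22 search
# features, 4x4 kernels), each kernel yields a 19x19 activation map:
activation = cross_correlate(np.random.rand(256, 22, 22), np.random.rand(256, 4, 4))
assert activation.shape == (19, 19)
```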
Preferably, obtaining the category and position frame of the new class target sample image in the image to be detected based on the classification activation feature map and the regression activation feature map includes:
activating the values of the classification activation feature map with an activation function and mapping the result onto the image to be detected, giving the category of the new class target sample image;
and mapping the regression activation feature map onto the image to be detected based on the anchor frame positions obtained during training, giving the position frame of the new class target sample image.
Preferably, the activation function is as follows:

softmax(x_i) = exp(x_i - max(x)) / Σ_{j=1}^{C} exp(x_j - max(x))

where softmax(x_i) is the activation of the i-th feature, x_i is the i-th feature, x_j is the j-th feature, max(x) is the maximum value among the input features, and C is the number of categories.
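Subtracting max(x) inside the exponentials, as the formula does, is the standard numerical-stability trick: the shift cancels in the ratio but prevents overflow in exp. A minimal sketch:

```python
import numpy as np

def softmax(x):
    """Softmax with the max(x) subtraction shown in the formula above."""
    e = np.exp(x - np.max(x))
    return e / e.sum()
```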
Based on the same inventive concept, the invention also provides an online target detection system based on depth feature matching, which comprises:
the device comprises an image acquisition module, a feature map acquisition module and a target detection module;
the image acquisition module is used for acquiring a new class target sample image and an image to be detected;
the feature map acquisition module is used for inputting the new class target sample image and the image to be detected into the trained neural network backbone network and region generation network to obtain the classification activation feature map and the regression activation feature map of the image to be detected;
the target detection module is used for obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map;
the new class target sample image is an image cropped from the image to be detected; the neural network backbone network and the region generation network are trained on old class target sample images and images to be detected, and the trained weights are retained.
Preferably, the region generation network in the feature map acquisition module includes a classification branch and a regression branch, and its training includes:
acquiring an old class target sample image and an image to be detected;
inputting the old class target sample image into the region generation network classification branch to generate a first classification feature map, and, based on the first classification feature map, calculating the cross entropy loss between the preset anchor frame categories and the true category labels of the targets in the image to be detected;
inputting the old class target sample image into the region generation network regression branch to generate a first regression feature map, and, based on the first regression feature map, calculating the loss between the preset anchor frame positions and the true target positions in the image to be detected;
and training the region generation network with stochastic gradient descent on the cross entropy loss and the regression loss, obtaining the weights of the region generation network.
Preferably, the feature map obtaining module is specifically configured to:
inputting the new class target sample image into the trained neural network backbone network to obtain a first depth feature map; inputting the first depth feature map into the region generation network classification branch to generate a classification convolution kernel, and into the region generation network regression branch to generate a regression convolution kernel;
inputting the image to be detected into the trained neural network backbone network to obtain a second depth feature map; inputting the second depth feature map into the region generation network classification branch and generating the classification activation feature map with the classification convolution kernel, and inputting the second depth feature map into the region generation network regression branch and generating the regression activation feature map with the regression convolution kernel.
Preferably, the target detection module is specifically configured to:
activating the values of the classification activation feature map with an activation function and mapping the result onto the image to be detected, giving the category of the new class target sample image;
and mapping the regression activation feature map onto the image to be detected based on the anchor frame positions obtained during training, giving the position frame of the new class target sample image.
Preferably, the activation function of the target detection module is as follows:

softmax(x_i) = exp(x_i - max(x)) / Σ_{j=1}^{C} exp(x_j - max(x))

where softmax(x_i) is the activation of the i-th feature, x_i is the i-th feature, x_j is the j-th feature, max(x) is the maximum value among the input features, and C is the number of categories.
Compared with the closest prior art, the invention has the following beneficial effects:
the invention provides an online target detection method and system based on depth feature matching, comprising the following steps: acquiring a new class target sample image and an image to be detected; inputting the new class target sample image and the image to be tested into a neural network backbone network and an area generating network after training is completed, and obtaining a class activation feature map and a regression activation feature map of the image to be tested; obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map; the new class target sample image is an image obtained from the image to be detected; training the neural network backbone network and the area generating network based on the old class target sample image and the image to be tested and obtaining trained weights; the invention avoids complex data acquisition, data annotation and off-line training of an algorithm model of a new class of target sample image, and obtains the class and position frame of the new class of target sample image in the image to be detected by detecting one or a small number of new class of target sample images through a neural network backbone network and a regional generation network, thereby being particularly suitable for video monitoring, unmanned aerial vehicle earth detection and high-value target discovery of remote sensing images.
Drawings
FIG. 1 is a schematic flow chart of an online target detection method based on depth feature matching;
FIG. 2 is a training flowchart of an online target detection algorithm based on depth feature matching in an embodiment provided by the invention;
FIG. 3 is a flowchart of an online target detection algorithm based on depth feature matching in an embodiment provided by the invention;
FIG. 4 is a schematic diagram of a classification characteristic diagram according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a regression feature map according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an online object detection system based on depth feature matching according to the present invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
Example 1:
the on-line target detection method based on depth feature matching provided by the invention is shown in fig. 1, and comprises the following steps:
step 1: acquiring a new class target sample image and an image to be detected;
step 2: inputting the new class target sample image and the image to be detected into the trained neural network backbone network and region generation network to obtain a classification activation feature map and a regression activation feature map of the image to be detected;
step 3: obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map;
the new class target sample image is an image cropped from the image to be detected; the neural network backbone network and the region generation network are trained on old class target sample images and images to be detected, and the trained weights are retained.
Specifically, step 1 includes:
according to the method, when the unknown type target is detected, the target can be detected in the image only by one new type target sample image, so that complicated data acquisition, data labeling and offline training of an algorithm model of the new type target sample image are avoided.
Specifically, step 2 includes:
and selecting image data covering various categories as much as possible from a public database or an actually collected data set, intercepting the category corresponding to the label from the original image to serve as an old category target sample image, and taking the original image as an image to be detected to form an old category target sample image and an image pair to be detected.
Neural network backbone networks include, but are not limited to: AlexNet, VGGNet, GoogLeNet, ResNet (residual network), ResNeXt, ResNeSt, DenseNet (densely connected network), SqueezeNet, ShuffleNet, MobileNet, EfficientNet, and Transformer. The region generation network includes a classification branch and a regression branch.
The training stage for the neural network backbone network and the region generation network, as shown in fig. 2, includes:
The existing old class target sample image is resized to 1×3×127×127 and input into the neural network backbone network (AlexNet in this embodiment), yielding a depth feature map of the old class target sample image of size 1×256×6×6. The classification branch and the regression branch of the region generation network each contain a convolution unit comprising a two-dimensional convolution, a normalization operation and an activation operation. The depth feature map is input into the classification branch to obtain a classification feature map of the old class sample image of size 1×2560×4×4, which after a suitable matrix reshaping yields a classification convolution kernel of size 10×256×4×4; the depth feature map is likewise input into the regression branch to obtain a regression feature map of size 1×5120×4×4, which after reshaping yields a regression convolution kernel of size 20×256×4×4.
The image to be detected is resized to 1×3×271×271; its resolution must exceed that of the old class target sample image, preferably by more than a factor of two. It is then input into AlexNet, yielding a depth feature map of size 1×256×24×24. This depth feature map is input into the region generation network classification branch to generate a classification feature map of the image to be detected of size 1×256×22×22, and into the regression branch to generate a regression feature map of size 1×256×22×22. Convolving the classification feature map of the image to be detected with the classification convolution kernel of the old class sample image gives the classification activation feature map of the image to be detected, of size 1×10×19×19 with 2k channels; convolving the regression feature map with the regression convolution kernel gives the regression activation feature map, of size 1×20×19×19 with 4k output channels.
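The 19×19 activation maps and the channel counts follow directly from valid-correlation arithmetic. A quick check of the shape bookkeeping; the 3×3 convolution taking the 24×24 depth features to 22×22 branch features is an assumption consistent with the sizes quoted, not stated in the patent.

```python
def valid_corr_size(feat, kernel, stride=1):
    """Spatial output size of a valid (no padding) convolution/correlation."""
    return (feat - kernel) // stride + 1

k = 5                                  # anchors per position
assert 2 * k == 10                     # classification activation channels
assert 4 * k == 20                     # regression activation channels
assert valid_corr_size(22, 4) == 19    # 4x4 kernels over 22x22 features -> 19x19
assert valid_corr_size(24, 3) == 22    # a 3x3 conv unit would map 24x24 to 22x22
```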
The number of anchor frames is set to 5, with aspect ratios [0.33, 0.5, 1, 2, 3]. The cross entropy loss is computed over the five anchor frames at each of the 19×19 pixel positions of the classification feature map:

L_cls = -(1/N) Σ_{i=1}^{N} [c_i log(p_i) + (1 - c_i) log(1 - p_i)]

where L_cls is the cross entropy loss function, N is the number of samples, c_i is the class label of sample i (1 if it contains the target, 0 otherwise), and p_i is the predicted probability that sample i is 1;
The smooth L1 loss is computed over the five anchor frames at each pixel position of the regression feature map:

smooth_L1(x, σ) = 0.5 σ² x²  if |x| < 1/σ²,  otherwise |x| - 1/(2σ²)

where smooth_L1(x, σ) is the target detection regression loss function, x is the input feature, and σ is an adjustment parameter;
The offset of the center-point abscissa between the target frame and the anchor frame is:

δ[0] = (T_x - A_x) / A_w

where δ[0] is the offset of the center-point abscissa between the target frame and the anchor frame, T_x is the abscissa of the target frame's center point, A_x is the abscissa of the anchor frame's center point, and A_w is the width of the anchor frame;
The offset of the center-point ordinate between the target frame and the anchor frame is:

δ[1] = (T_y - A_y) / A_h

where δ[1] is the offset of the center-point ordinate between the target frame and the anchor frame, T_y is the ordinate of the target frame's center point, A_y is the ordinate of the anchor frame's center point, and A_h is the height of the anchor frame;
The width offset between the target frame and the anchor frame is:

δ[2] = ln(T_w / A_w)

where δ[2] is the width offset between the target frame and the anchor frame and T_w is the width of the target frame;
The height offset between the target frame and the anchor frame is:

δ[3] = ln(T_h / A_h)

where δ[3] is the height offset between the target frame and the anchor frame and T_h is the height of the target frame;
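The four offsets define an encoding of a target frame relative to an anchor frame, and decoding inverts it. A sketch assuming (cx, cy, w, h) center/size boxes and logarithmic width/height offsets (the log form is the standard convention consistent with the definitions above, not spelled out in the surviving text):

```python
import numpy as np

def encode(target, anchor):
    """Offsets delta[0..3] between a target frame and an anchor frame,
    both given as (cx, cy, w, h) center/size tuples."""
    tx, ty, tw, th = target
    ax, ay, aw, ah = anchor
    return np.array([(tx - ax) / aw,      # delta[0]
                     (ty - ay) / ah,      # delta[1]
                     np.log(tw / aw),     # delta[2]
                     np.log(th / ah)])    # delta[3]

def decode(delta, anchor):
    """Inverse mapping: recover the predicted frame from the offsets."""
    ax, ay, aw, ah = anchor
    return np.array([ax + delta[0] * aw,
                     ay + delta[1] * ah,
                     aw * np.exp(delta[2]),
                     ah * np.exp(delta[3])])
```

Round-tripping a box through encode and decode returns the original box, which is what makes the offsets usable as regression targets.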
The loss function of the target detection algorithm is:

f_loss = L_cls + λ Σ_{i=0}^{3} smooth_L1(δ[i], σ)

where f_loss is the loss function of the target detection algorithm, λ is a hyperparameter, and smooth_L1(δ[i], σ) is the target detection regression loss of δ[i];
The loss function of the target detection algorithm is differentiated, and stochastic gradient descent (SGD) is used to obtain the weights of the neural network backbone network and the region generation network; these trained weights are used in the detection process below.
The process of online detection of the new class target sample image, as shown in fig. 3, includes:
The new class target sample image is resized to 1×3×127×127 and input into the neural network backbone network, giving a depth feature map of the new class target sample image. The backbone network here is identical to that of the training stage and loads the weights obtained during training. The depth feature map of the new class sample image is input into the region generation network classification branch to obtain a classification feature map of size 1×2560×4×4, which after a suitable matrix reshaping yields a classification convolution kernel of size 10×256×4×4; it is also input into the region generation network regression branch to obtain a regression feature map of size 1×5120×4×4, which after reshaping yields a regression convolution kernel of size 20×256×4×4.
The resolution of the image to be detected is adjusted to 1×3×271×271; the required resolution of the image to be detected is higher than that of the new class target sample image, preferably more than 2 times higher. The image to be detected is then input into AlexNet to obtain a depth feature map of the image to be detected with resolution 1×256×24×24. The depth feature map of the image to be detected is input into the region generation network classification branch to generate a classification feature map of the image to be detected with resolution 1×256×22×22, and into the region generation network regression branch to generate a regression feature map of the image to be detected with resolution 1×256×22×22. The classification convolution kernel of the new class target sample image (10×256×4×4) is convolved over the classification feature map of the image to be detected (1×256×22×22) to obtain the classification activation feature map of the image to be detected, with resolution 1×10×19×19 and 2k channels; the regression convolution kernel of the new class target sample image (20×256×4×4) is convolved over the regression feature map of the image to be detected (1×256×22×22) to obtain the regression activation feature map of the image to be detected, with resolution 1×20×19×19 and 4k output channels. In the invention, the number of channels of the feature map is adjusted through the convolution operation unit of the region generation network, so that the input feature map required by subsequent operations is obtained. After the neural network backbone and the region generation network have been trained in advance, when an unknown class target needs to be detected, the new class target sample image only needs to be passed through the trained backbone and region generation network, and the required classification activation feature map and regression activation feature map are obtained rapidly.
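The kernel-generation and cross-correlation steps above can be sketched as follows. This is a minimal illustration assuming a PyTorch implementation: the tensor shapes follow the text (k = 5 anchors per position), but the branch outputs are stand-in random tensors and the cross-correlation is a plain `conv2d`.

```python
import torch
import torch.nn.functional as F

k = 5  # anchors per position, as in the text

# Template-branch outputs for the new class sample (shapes from the text).
cls_feat = torch.randn(1, 2 * k * 256, 4, 4)   # 1 x 2560 x 4 x 4
reg_feat = torch.randn(1, 4 * k * 256, 4, 4)   # 1 x 5120 x 4 x 4

# "Matrix reshaping operation": turn template features into conv kernels.
cls_kernel = cls_feat.view(2 * k, 256, 4, 4)   # 10 x 256 x 4 x 4
reg_kernel = reg_feat.view(4 * k, 256, 4, 4)   # 20 x 256 x 4 x 4

# Branch feature maps of the image to be detected.
cls_search = torch.randn(1, 256, 22, 22)
reg_search = torch.randn(1, 256, 22, 22)

# Cross-correlation: convolve the search features with the template kernels.
cls_activation = F.conv2d(cls_search, cls_kernel)   # 1 x 10 x 19 x 19 (2k channels)
reg_activation = F.conv2d(reg_search, reg_kernel)   # 1 x 20 x 19 x 19 (4k channels)
```

Note that the 19×19 spatial size falls out of the convolution arithmetic (22 − 4 + 1 = 19), matching the activation-map resolutions stated above.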
Specifically, step 3 includes:
The number and shapes of the anchor frames are consistent with the training stage: there are 5 anchor shapes per position, laid out on a grid of width 19 and height 19, and the aspect ratios of the anchor frames are [0.33, 0.5, 1, 2, 3]. For the classification feature map output by the region generation network classification branch, as shown in fig. 4, the feature tensor of the classification feature map is:

CLS = { cls_{i,j}^l }

where CLS is the feature tensor of the classification feature map, i is the i-th abscissa corresponding to the anchor frame, j is the j-th ordinate corresponding to the anchor frame, cls_{i,j}^l is the l-th target probability value, i ∈ [0, 19), j ∈ [0, 19), l ∈ [0, 10);
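The anchor layout described above (5 aspect ratios on a 19×19 grid) can be enumerated as follows; the base side length is a hypothetical value, since the text does not specify the anchor scale.

```python
ratios = [0.33, 0.5, 1, 2, 3]    # aspect ratios from the text
grid_w, grid_h = 19, 19          # anchor grid width and height
base = 64                        # hypothetical base side length (not given in the text)

# Five anchor shapes with roughly constant area and varying aspect ratio.
shapes = [(base / r ** 0.5, base * r ** 0.5) for r in ratios]

# One anchor of every shape centred on each grid position (i, j).
anchors = [(i, j, w, h) for i in range(grid_w)
                        for j in range(grid_h)
                        for (w, h) in shapes]
```

This yields 19 × 19 × 5 = 1805 candidate anchors, one set of 5 shapes per position, matching the index ranges i ∈ [0, 19), j ∈ [0, 19) used above.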
The odd channels represent the probability that a target is present in the anchor frame at that position. The softmax activation function is used to select the several largest category feature values on the odd channels; the softmax activation function is as follows:

softmax(x_i) = e^(x_i − max(x)) / Σ_{j=1}^{C} e^(x_j − max(x))

where softmax(x_i) is the activation value of the i-th feature, x_i is the i-th feature, x_j is the j-th feature, max(x) is the maximum value among the input features, and C is the number of categories.
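A direct sketch of this max-subtracted softmax in plain Python (subtracting max(x) before exponentiating is what keeps the computation numerically stable):

```python
import math

def softmax(x):
    """Stable softmax: subtract max(x) before exponentiating to avoid overflow."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # probabilities over C = 3 categories
```

Even for large inputs such as `softmax([1000.0, 1000.0])` the subtraction keeps every exponent at or below zero, so no overflow occurs.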
Let the positions corresponding to the several largest category feature values be:

CLS* = { (i, j, l) : cls_{i,j}^l is among the several largest category feature values }

where CLS* is the set of positions corresponding to the several largest category feature values;
For the regression feature map output by the region generation network regression branch, as shown in fig. 5, the feature tensor of the regression feature map is:

REG = { (dx_{i,j}^p, dy_{i,j}^p, dw_{i,j}^p, dh_{i,j}^p) }

where REG is the feature tensor of the regression feature map, i is the i-th abscissa corresponding to the anchor frame, j is the j-th ordinate corresponding to the anchor frame, dx_{i,j}^p is the p-th offset of the abscissa of the center point between the target frame and the anchor frame, dy_{i,j}^p is the p-th offset of the ordinate of the center point between the target frame and the anchor frame, dw_{i,j}^p is the p-th offset of the width between the target frame and the anchor frame, dh_{i,j}^p is the p-th offset of the height between the target frame and the anchor frame, with i ∈ [0, 19), j ∈ [0, 19), p ∈ [0, 5);
The anchor frame set obtained according to the region generation network classification branch is:

ANCHOR* = { (x_i^an, y_j^an, w^an, h^an) }

where ANCHOR* is the set of anchor frames obtained by the region generation network classification branch, x_i^an is the i-th abscissa of the center point of the anchor frame, y_j^an is the j-th ordinate of the center point of the anchor frame, w^an is the width of the anchor frame, and h^an is the height of the anchor frame;
The set of position offsets and width and height offsets of the corresponding target frame positions relative to the anchor frames can be obtained from the regression feature map as:

REGRESSION* = { (dx^l, dy^l, dw^l, dh^l) }

where REGRESSION* is the set of position offsets and width and height offsets of the corresponding target frame positions relative to the anchor frames obtained from the regression feature map, dx^l is the l-th offset of the abscissa of the center point between the target frame and the anchor frame, dy^l is the l-th offset of the ordinate of the center point between the target frame and the anchor frame, dw^l is the l-th offset of the width between the target frame and the anchor frame, and dh^l is the l-th offset of the height between the target frame and the anchor frame;
The abscissa x* of the position of the mapped target frame is computed as:

x* = x_i^an + dx^l · w^an

where x* is the abscissa of the position of the mapped target frame;

the ordinate y* of the position of the mapped target frame is computed as:

y* = y_j^an + dy^l · h^an

where y* is the ordinate of the position of the mapped target frame;

the width w* of the mapped target frame is computed as:

w* = w^an · e^(dw^l)

where w* is the width of the mapped target frame;

and the height h* of the mapped target frame is computed as:

h* = h^an · e^(dh^l)

where h* is the height of the mapped target frame;
Through the above calculation, the position (x*, y*) and size (w*, h*) of the mapped target frame are obtained, and the mapped target positions and sizes are passed through a non-maximum suppression (NMS) algorithm to obtain the final target positions and sizes. In this way, the classification activation feature map and regression activation feature map of the image to be detected are mapped back onto the original image to be detected, realizing online detection of a new class; the method is particularly suitable for video monitoring, unmanned aerial vehicle ground detection, and high-value target discovery in remote sensing images.
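The decode-then-suppress step can be sketched as follows. This assumes the standard RPN-style box decoding (additive center offsets scaled by anchor size, exponential width/height offsets) consistent with the formulas above, and a plain greedy NMS; it is an illustrative sketch, not code from the patent.

```python
import math

def decode(anchor, offsets):
    """Map (dx, dy, dw, dh) offsets back onto an anchor (cx, cy, w, h)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))

def iou(a, b):
    """Intersection-over-union of two centre-format boxes (cx, cy, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```

With zero offsets, `decode` returns the anchor unchanged, and two heavily overlapping candidates collapse to the higher-scoring one under `nms`.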
Example 2:
based on the same inventive concept, the invention also provides an online target detection system based on depth feature matching, as shown in fig. 6:
the device comprises an image acquisition module, a feature map acquisition module and a target detection module;
the image acquisition module is used for acquiring a new class target sample image and an image to be detected;
the feature map acquisition module is used for inputting the new class target sample image and the image to be tested into a neural network backbone network and a region generating network after training is completed, and obtaining a classified activation feature map and a regression activation feature map of the image to be tested;
the target detection module is used for obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map;
the new class target sample image is an image obtained from the image to be detected; the training of the neural network backbone network and the area generating network is training based on the old class target sample image and the image to be tested, and the trained weight is obtained.
Preferably, the region generation network in the feature map acquisition module includes a region generation network classification branch and a region generation network regression branch, and the training of the region generation network includes:
acquiring an old class target sample image and an image to be detected;
inputting the old class target sample image into the region generation network classification branch to generate a first classification feature image, and calculating a cross entropy loss function of a preset anchor frame class and a target real class label in the image to be detected based on the first classification feature image;
inputting the old class target sample image into the region generation network regression branch to generate a first regression feature map, and calculating a loss function between the preset anchor frame positions and the real target positions in the image to be detected based on the first regression feature map;
and training the region generation network by a stochastic gradient descent method based on the cross entropy loss function and the loss function, to obtain the weights of the region generation network.
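A minimal sketch of the two-term training objective described above, assuming a PyTorch setup: cross-entropy for the anchor classification term is stated in the text, while the smooth-L1 form of the position loss is an assumption (the text only says "loss function").

```python
import torch
import torch.nn as nn

cls_loss_fn = nn.CrossEntropyLoss()   # anchor class vs. real class label
reg_loss_fn = nn.SmoothL1Loss()       # anchor position vs. real position (assumed form)

# Stand-ins for branch outputs over 8 sampled anchors.
cls_logits = torch.randn(8, 2, requires_grad=True)   # foreground / background
cls_labels = torch.randint(0, 2, (8,))
reg_pred   = torch.randn(8, 4, requires_grad=True)   # (dx, dy, dw, dh)
reg_target = torch.randn(8, 4)

# Combined objective; gradients from backward() drive the SGD update.
loss = cls_loss_fn(cls_logits, cls_labels) + reg_loss_fn(reg_pred, reg_target)
loss.backward()
```

In practice the two stand-in tensors would be the first classification and first regression feature maps, and `torch.optim.SGD` would consume the resulting gradients.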
Preferably, the feature map obtaining module is specifically configured to:
inputting the new class target sample image into a trained neural network backbone network to obtain a first depth feature image, inputting the first depth feature image into the region generation network classification branch to generate a classification convolution kernel, and inputting the first depth feature image into the region generation network regression branch to generate a regression convolution kernel;
inputting the image to be tested into a trained neural network backbone network to obtain a second depth feature map, inputting the second depth feature map into the region generation network classification branch, generating a classification activation feature map based on the classification convolution kernel, inputting the second depth feature map into the region generation network regression branch, and generating a regression activation feature map based on the regression convolution kernel.
Preferably, the target detection module is specifically configured to:
activating the values of the classified activation feature images through an activation function, so that the classified activation feature images are mapped onto the images to be detected, and the categories of the new category target sample images are obtained;
and mapping the regression activation feature map to the image to be detected based on the anchor frame position obtained in advance during training, and obtaining a position frame of the new class target sample image.
Preferably, the activation function of the target detection module is as follows:
softmax(x_i) = e^(x_i − max(x)) / Σ_{j=1}^{C} e^(x_j − max(x))

where softmax(x_i) is the activation value of the i-th feature, x_i is the i-th feature, x_j is the j-th feature, max(x) is the maximum value among the input features, and C is the number of categories.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the scope of protection thereof, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that various changes, modifications or equivalents may be made to the specific embodiments of the application after reading the present invention, and these changes, modifications or equivalents are within the scope of protection of the claims appended hereto.
Claims (10)
1. An online target detection method based on depth feature matching is characterized by comprising the following steps:
acquiring a new class target sample image and an image to be detected;
inputting the new class target sample image and the image to be tested into a neural network backbone network and an area generating network after training is completed, and obtaining a class activation feature map and a regression activation feature map of the image to be tested;
obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map;
the new class target sample image is an image obtained from the image to be detected; the training of the neural network backbone network and the area generating network is training based on the old class target sample image and the image to be tested, and the trained weight is obtained.
2. The method of claim 1, wherein the region generation network comprises a region generation network classification branch and a region generation network regression branch, the training of the region generation network comprising:
acquiring an old class target sample image and an image to be detected;
inputting the old class target sample image into the region generation network classification branch to generate a first classification feature image, and calculating a cross entropy loss function of a preset anchor frame class and a target real class label in the image to be detected based on the first classification feature image;
inputting the old class target sample image into the region generation network regression branch to generate a first regression feature map, and calculating a loss function between the preset anchor frame positions and the real target positions in the image to be detected based on the first regression feature map;
and training the region generation network by a stochastic gradient descent method based on the cross entropy loss function and the loss function, to obtain the weights of the region generation network.
3. The method of claim 2, wherein inputting the new class target sample image and the image to be tested into the trained neural network backbone network and the region generation network to obtain a class activation feature map and a regression activation feature map of the image to be tested, comprises:
inputting the new class target sample image into a trained neural network backbone network to obtain a first depth feature image, inputting the first depth feature image into the region generation network classification branch to generate a classification convolution kernel, and inputting the first depth feature image into the region generation network regression branch to generate a regression convolution kernel;
inputting the image to be tested into a trained neural network backbone network to obtain a second depth feature map, inputting the second depth feature map into the region generation network classification branch, generating a classification activation feature map based on the classification convolution kernel, inputting the second depth feature map into the region generation network regression branch, and generating a regression activation feature map based on the regression convolution kernel.
4. The method of claim 3, wherein the obtaining the class and location box of the new class target sample image in the image to be measured based on the class activation feature map and the regression activation feature map comprises:
activating the values of the classified activation feature images through an activation function, so that the classified activation feature images are mapped onto the images to be detected, and the categories of the new category target sample images are obtained;
and mapping the regression activation feature map to the image to be detected based on the anchor frame position obtained in advance during training, and obtaining a position frame of the new class target sample image.
6. An online object detection system based on depth feature matching, comprising:
the device comprises an image acquisition module, a feature map acquisition module and a target detection module;
the image acquisition module is used for acquiring a new class target sample image and an image to be detected;
the feature map acquisition module is used for inputting the new class target sample image and the image to be tested into a neural network backbone network and a region generating network after training is completed, and obtaining a classified activation feature map and a regression activation feature map of the image to be tested;
the target detection module is used for obtaining the category and the position frame of the new category target sample image in the image to be detected based on the category activation feature map and the regression activation feature map;
the new class target sample image is an image obtained from the image to be detected; the training of the neural network backbone network and the area generating network is training based on the old class target sample image and the image to be tested, and the trained weight is obtained.
7. The system of claim 6, wherein the region generation network of the feature map acquisition module includes a region generation network classification branch and a region generation network regression branch, the training of the region generation network comprising:
acquiring an old class target sample image and an image to be detected;
inputting the old class target sample image into the region generation network classification branch to generate a first classification feature image, and calculating a cross entropy loss function of a preset anchor frame class and a target real class label in the image to be detected based on the first classification feature image;
inputting the old class target sample image into the region generation network regression branch to generate a first regression feature map, and calculating a loss function between the preset anchor frame positions and the real target positions in the image to be detected based on the first regression feature map;
and training the region generation network by a stochastic gradient descent method based on the cross entropy loss function and the loss function, to obtain the weights of the region generation network.
8. The system of claim 7, wherein the feature map acquisition module is specifically configured to:
inputting the new class target sample image into a trained neural network backbone network to obtain a first depth feature image, inputting the first depth feature image into the region generation network classification branch to generate a classification convolution kernel, and inputting the first depth feature image into the region generation network regression branch to generate a regression convolution kernel;
inputting the image to be tested into a trained neural network backbone network to obtain a second depth feature map, inputting the second depth feature map into the region generation network classification branch, generating a classification activation feature map based on the classification convolution kernel, inputting the second depth feature map into the region generation network regression branch, and generating a regression activation feature map based on the regression convolution kernel.
9. The system of claim 8, wherein the object detection module is specifically configured to:
activating the values of the classified activation feature images through an activation function, so that the classified activation feature images are mapped onto the images to be detected, and the categories of the new category target sample images are obtained;
and mapping the regression activation feature map to the image to be detected based on the anchor frame position obtained in advance during training, and obtaining a position frame of the new class target sample image.
10. The system of claim 9, wherein the activation function of the object detection module is as follows:
softmax(x_i) = e^(x_i − max(x)) / Σ_{j=1}^{C} e^(x_j − max(x))

where softmax(x_i) is the activation value of the i-th feature, x_i is the i-th feature, x_j is the j-th feature, max(x) is the maximum value among the input features, and C is the number of categories.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211664409.4A CN116229142A (en) | 2022-12-23 | 2022-12-23 | Online target detection method and system based on depth feature matching |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116229142A true CN116229142A (en) | 2023-06-06 |
Family
ID=86590114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211664409.4A Pending CN116229142A (en) | 2022-12-23 | 2022-12-23 | Online target detection method and system based on depth feature matching |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116229142A (en) |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |