CN117611830A - Random class target positioning and counting method based on few sample labeling - Google Patents
- Publication number
- CN117611830A (application CN202311370741.4A)
- Authority
- CN
- China
- Prior art keywords
- sample
- size
- suggested
- features
- point
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/40: Extraction of image or video features
- G06F16/532: Query formulation, e.g. graphical querying
- G06T7/0002: Inspection of images, e.g. flaw detection
- G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06T2207/30242: Counting objects in image
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a method for positioning and counting targets of any class based on few-sample labeling, which comprises the following steps: acquiring multiple layers of output features of an input image, and performing a region-of-interest pooling operation, on the output features of the different layers, for at least one sample target box given for the input image; aligning the output features of the different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$; performing a first mutual attention calculation on the $N_{ex}$ sample features and integrating size information into the sample features; spatially interacting the resulting $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map; performing channel interaction with the sample features to determine the weight of each channel in the query feature map and obtain a query correlation map; and performing target-class positioning and count prediction according to the query correlation map. The invention effectively models more discriminative representative features of the target class, and retains and fully uses the size information of the samples.
Description
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method for positioning and counting targets of any class based on few-sample labeling.
Background
Visual object counting has great practical value in daily life and industrial production, where it is often used to count crowds, vehicles, workpieces and the like. The dominant research paradigm uses a dedicated model per task: one model can detect only one kind of object, so the paradigm cannot handle object classes for which no specifically pre-trained counting model exists. Moreover, training such a model usually requires a large amount of labeled data, and cases in which very few labeled samples are available are difficult to handle.
A new research paradigm has therefore been proposed, namely class-agnostic counting (CAC): given an image and a small number of target samples (typically 1 to 3), the model must count the objects of the class indicated by the samples. The model must be able to handle objects of any class without being trained separately for each new class, which effectively solves the above problems: there is no need to retrain the model for new classes, nor to label a large number of additional samples of the class to be counted.
The current research direction within this paradigm is regression based on density map estimation. Specifically, GMN pools the sample features, concatenates the pooled result with the query features, and then learns a regression head to perform point-to-point feature comparison. FamNet uses the sample features to generate convolution kernels, obtains feature correlations for prediction, and uses an adaptive loss to fine-tune model parameters at test time. Shi et al. propose a joint learning framework, BMNet+, which combines representation learning and similarity metrics to generate a density map with more accurate counting results. SAFECount uses the support (sample) features to enhance the query (image) features so that the extracted features are more refined, thereby improving the generated density map.
The above methods can indeed complete the object counting task when a small number of sample objects are given. In complex real scenes, however, the object count or a density map alone cannot meet the needs of downstream tasks: the count is only a coarse indicator, whereas in some scenes the positions of the objects are very important. In industrial production, for example, object positions must be known for accurate production scheduling, so in many scenes even a density map cannot meet practical requirements. It should also be noted that the counting accuracy of current methods still leaves room for improvement.
Disclosure of Invention
The invention provides a method for positioning and counting targets of any class based on few-sample labeling, which aims to overcome the shortcomings and deficiencies of the prior art.
In order to achieve the above purpose of the present invention, the following technical scheme is adopted:
a method for positioning and counting targets of any class based on few-sample labeling, which comprises the following steps:
extracting hierarchical sample features of an input image: acquiring multiple layers of output features of the input image, performing a region-of-interest pooling operation, on the output features of the different layers, for at least one sample target box given for the input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;
aligning the channel numbers of the output features of the different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row of the feature matrix is the feature representation of one sample;
performing a first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and integrating size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange among different sample features;
spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;
performing channel interaction with the sample features to determine the weight of each channel in the query feature map, and weighting each channel of the query feature map accordingly to obtain a query correlation map, thereby mining the correlation between the image and the sample features;
performing target-class positioning prediction according to the query correlation map, which includes predicting suggested points, predicting the confidence of each suggested point and predicting the suggested box size of each suggested point;
and finally screening the suggested points by confidence, and taking the number of suggested points whose confidence exceeds a set threshold as the counting result, which is the final prediction result.
Preferably, performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, so as to realize information exchange among different sample features, includes: using $L_{ca}$ Transformer Encoder layers to perform the first mutual attention calculation on the $n \times L$ sample features,
the calculation process of the first mutual attention calculation is as follows:
$$z'_l = \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \dots, L_{ca}$$
$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \dots, L_{ca}$$
wherein $z_0$ denotes the initial features of the attention calculation, formed from the individual sample features together with their size embeddings; $z'_l$ denotes the intermediate result of the layer-$l$ attention calculation; $z_{l-1}$ denotes the result of the layer-$(l-1)$ attention calculation; $z_l$ denotes the result of the layer-$l$ attention calculation; MHSA denotes a multi-head self-attention layer; LN denotes a normalization layer; and MLP denotes a multi-layer perceptron.
Further, integrating the size information into the features by means of size embedding includes:
first, collecting statistics on the range and frequency of the sample sizes occurring in the image training set;
then dividing the size range into K size ranges by equal-frequency binning, i.e. so that the occurrence frequency is the same in each range, and obtaining 2K size-embedding feature representations corresponding to the different size ranges;
in use, looking up the size ranges corresponding to the current width and height to obtain the corresponding height and width size embeddings $E_h$ and $E_w$, wherein the dimension of each of $E_w$ and $E_h$ is half the feature dimension of the sample, and the two are concatenated to form the final size embedding of the sample.
Preferably, spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map includes:
using the obtained $N_{ex}$ sample features as the representation template set of the target class, i.e. generating Key $F_K$ and Value $F_V$;
using the output features of the last layer as the image representation features, with the feature at each position serving as one element of the query set to form Query $F_Q$, and sending $F_Q$, $F_K$ and $F_V$ together to a feature query module for a second mutual attention calculation, which finally yields the query feature map.
Further, the calculation process of the second mutual attention calculation is as follows:
wherein $E_{pos}$ denotes the position embedding (Position Embedding), which uses the sine-cosine representation commonly used by the Transformer; MHSA denotes a multi-head self-attention layer, LN denotes a normalization layer, and MLP denotes a multi-layer perceptron; the remaining symbols denote, respectively, the initial features of the attention calculation (including the features with position embedding), the intermediate result of the layer-$l$ attention calculation, the result of the layer-$(l-1)$ attention calculation, and the result of the layer-$l$ attention calculation.
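The core of the spatial interaction, in which every image position queries the set of sample features, can be sketched as single-head cross-attention. This is only an illustration under simplifying assumptions: the helper names `flatten_feature_map` and `cross_attention` are hypothetical, and the learned projections, multiple heads, position embedding $E_{pos}$, normalization and MLP of the decoder layers are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def flatten_feature_map(fmap: List[Matrix]) -> Matrix:
    """Turn a [C][H][W] feature map into H*W query rows of length C (F_Q)."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[channel[r][c] for channel in fmap]
            for r in range(height) for c in range(width)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(f_q: Matrix, f_k: Matrix, f_v: Matrix) -> Matrix:
    """Each query attends over all sample features at once (scaled dot product)."""
    dim = len(f_q[0])
    out = []
    for query in f_q:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in f_k]
        weights = softmax(scores)
        out.append([sum(w * value[c] for w, value in zip(weights, f_v))
                    for c in range(len(f_v[0]))])
    return out
```

Because all $N_{ex}$ sample features sit in one Key/Value set, the number of samples only changes the length of `f_k` and `f_v`, not the structure of the computation.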
Preferably, performing target-class positioning prediction according to the query correlation map, which includes predicting suggested points, predicting the confidence of each suggested point and predicting the suggested box size of each suggested point, includes:
first, defining the query correlation map, whose size is $H_r \times W_r$; each pixel of the query correlation map corresponds to a patch of size $s \times s$ in the original image, and the patch represents the region associated with that point;
for each patch, defining $k$ anchor points, so that for the whole query correlation map an anchor point set $A = \{A_i \mid i \in \{1, \dots, H_r \times W_r \times k\}\}$ is obtained, where the coordinates of an anchor point are $A_i = (x_i, y_i)$;
for each anchor point $A_i$, using an offset head to predict the offset between the suggested point and the anchor point $A_i$; the final suggested point coordinates are obtained from the anchor point coordinates and this offset, wherein $\lambda$ is the scaling factor;
and, for each anchor point, using a classification head to generate the confidence $C_i$ of each suggested point; using a size head to generate the width and height of the potential target for each point, which are multiplied by the scaling factor $\beta$ to obtain the final target size; based on the suggested point coordinates and the corresponding target size, a suggested box centered at the suggested point coordinates, with the corresponding width and height, is obtained;
$k$ suggested points, with their corresponding confidences and suggested boxes, are generated for each pixel, so that $H_r \times W_r \times k$ suggested points in total, i.e. the set $P$, are obtained, together with the corresponding confidence set $C$ and suggested box size set $S$.
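The anchor layout and the decoding of head outputs can be sketched as follows. Several details are assumptions, since the corresponding formulas are not legible in this text: anchors are placed on a regular grid inside each patch (so $k$ is a square number), the suggested point is taken as the anchor plus $\lambda$ times the predicted offset, and the functions `make_anchors` and `decode` are hypothetical names.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def make_anchors(h_r: int, w_r: int, s: int, k_side: int) -> List[Point]:
    """Place k = k_side**2 anchors on a regular grid inside every s x s patch."""
    step = s / k_side
    anchors = []
    for row in range(h_r):
        for col in range(w_r):
            for i in range(k_side):
                for j in range(k_side):
                    anchors.append((col * s + (j + 0.5) * step,
                                    row * s + (i + 0.5) * step))
    return anchors


def decode(anchors: List[Point], offsets: List[Point],
           raw_sizes: List[Point], lam: float, beta: float):
    """Turn raw head outputs into suggested points, target sizes and boxes."""
    # Assumed form: suggested point = anchor + lam * offset.
    points = [(x + lam * dx, y + lam * dy)
              for (x, y), (dx, dy) in zip(anchors, offsets)]
    # Raw width and height are multiplied by the scaling factor beta.
    sizes = [(beta * w, beta * h) for w, h in raw_sizes]
    # Each suggested box is centered on its suggested point.
    boxes: List[Box] = [(px - w / 2, py - h / 2, px + w / 2, py + h / 2)
                        for (px, py), (w, h) in zip(points, sizes)]
    return points, sizes, boxes
```

For a correlation map of size $H_r \times W_r$, `make_anchors` returns $H_r \times W_r \times k$ anchors, matching the size of the sets $P$, $C$ and $S$ described above.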
Further, after screening the suggested points by confidence and taking the number of suggested points whose confidence exceeds the set threshold (0.5) as the counting result, the method further includes: matching the suggested point set with the target point set, and optimizing the target-class positioning prediction process based on the matching result.
Still further, matching the suggested point set with the target point set includes:
matching each target point in the target point set to one suggested point in the suggested point set, with $\zeta(i)$ denoting the index of the suggested point matched to target point $G_i$, i.e. suggested point $P_{\zeta(i)}$ is the predicted position of target point $G_i$; and matching each sample target box in the sample target box set to a suggested box size.
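The text does not state how the matching $\zeta$ is computed. The sketch below is therefore only one possible reading: it assumes a one-to-one assignment that minimizes the total Euclidean distance, and it searches all assignments by brute force, which is workable only for tiny examples (a practical implementation would more likely use the Hungarian algorithm). The function name `match_targets` is hypothetical.

```python
import itertools
import math
from typing import List, Tuple

Point = Tuple[float, float]


def match_targets(targets: List[Point], proposals: List[Point]) -> List[int]:
    """Return zeta, where zeta[i] is the index of the proposal matched to target i.

    Assumption: one-to-one matching with total Euclidean distance as the cost.
    """
    best_cost, best = math.inf, ()
    for perm in itertools.permutations(range(len(proposals)), len(targets)):
        cost = sum(math.dist(targets[i], proposals[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best)
```

Suggested points left unmatched would then act as negatives when the confidence is optimized.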
Still further, optimizing the target-class positioning prediction process based on the matching result includes:
for the confidence, optimizing with cross entropy, the loss function being defined as $L_{cls}$;
for the coordinates, calculating the loss with the mean square error, the loss function being defined as $L_{loc}$;
for the sample target box sizes, calculating the loss with the Manhattan distance, the loss function being defined as $L_{size}$;
the total loss function to be optimized is calculated as follows:
$$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{size}$$
wherein $\lambda_1$ and $\lambda_2$ are weight parameters.
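A minimal sketch of the three loss terms and their weighted sum is given below. The reductions are assumptions: the cross entropy is taken as binary (target versus background) and every term is averaged over its inputs, neither of which is specified in the text; all function names are hypothetical.

```python
import math
from typing import List, Tuple

Pair = Tuple[float, float]


def cross_entropy(confidences: List[float], labels: List[int],
                  eps: float = 1e-12) -> float:
    """L_cls: binary cross entropy averaged over all suggested points."""
    total = 0.0
    for c, y in zip(confidences, labels):
        total += y * math.log(max(c, eps)) + (1 - y) * math.log(max(1 - c, eps))
    return -total / len(confidences)


def mean_squared_error(pred: List[Pair], target: List[Pair]) -> float:
    """L_loc: squared Euclidean distance averaged over matched point pairs."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target)) / len(pred)


def manhattan(pred: List[Pair], target: List[Pair]) -> float:
    """L_size: |dw| + |dh| averaged over matched box sizes."""
    return sum(abs(pw - tw) + abs(ph - th)
               for (pw, ph), (tw, th) in zip(pred, target)) / len(pred)


def total_loss(l_cls: float, l_loc: float, l_size: float,
               lambda_1: float, lambda_2: float) -> float:
    """L = L_cls + lambda_1 * L_loc + lambda_2 * L_size."""
    return l_cls + lambda_1 * l_loc + lambda_2 * l_size
```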
A system for implementing the method for positioning and counting targets of any class based on few-sample labeling, the system comprising: a hierarchical sample collaborative enhancement module, a feature interaction module and a multi-head positioning module;
the hierarchical sample collaborative enhancement module comprises a hierarchical sample extraction module, a cross-scale alignment module and a feature collaborative enhancement module;
the hierarchical sample extraction module is used for extracting hierarchical sample features of an input image: acquiring multiple layers of output features of the input image, performing a region-of-interest pooling operation, on the output features of the different layers, for at least one sample target box given for the input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;
the cross-scale alignment module is used for aligning the channel numbers of the output features of the different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row is the feature representation of one sample;
the feature collaborative enhancement module is used for performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and integrating size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange among different sample features;
the feature interaction module comprises a spatial interaction module and a channel interaction module;
the spatial interaction module is used for spatially interacting the obtained $N_{ex} = n \times L$ sample features with the output features of the last layer to obtain a query feature map;
the channel interaction module is used for performing channel interaction with the obtained sample features to determine the weight of each channel in the query feature map, and weighting each channel of the query feature map accordingly to obtain a query correlation map, thereby mining the correlation between the image and the sample features;
the multi-head positioning module is used for performing target-class positioning prediction according to the query correlation map, which includes predicting suggested points, predicting the confidence of each suggested point and predicting the suggested box size of each suggested point; and for screening the suggested points by confidence and taking the number of suggested points whose confidence exceeds a set threshold as the counting result, which is the final prediction result.
The beneficial effects of the invention are as follows:
the method can be used directly for target detection tasks of any class under very limited labeling, and meets the requirements of complex downstream tasks better than existing methods.
In the method, sample features of different scales are effectively extracted during hierarchical sample feature extraction and collaborative enhancement, and the feature representations from different samples are shared interactively through the first mutual attention calculation, so that more discriminative representative features of the target class are effectively modeled. In addition, size information is integrated into the feature representations by means of size embedding before the first mutual attention calculation, so that the model retains and fully uses the size information of the samples.
The invention also performs feature interaction between the image features and any number of sample features in parallel along both the spatial and the channel dimensions, thereby improving the computational efficiency of the model and its ability to match the sample features comprehensively.
Drawings
FIG. 1 is a flow chart of the method for locating and counting arbitrary targets based on few sample labels according to the present invention.
FIG. 2 is a system schematic block diagram of the arbitrary class target locating and counting method based on the few sample labeling according to the present invention.
Detailed Description
Further advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, with reference to the accompanying drawings and the preferred embodiments. The invention may also be practiced or applied in other, different embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the invention. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
Example 1
As shown in FIG. 1, a method for positioning and counting targets of any class based on few-sample labeling includes the following steps:
extracting hierarchical sample features of an input image: acquiring multiple layers of output features of the input image, performing a region-of-interest pooling operation, on the output features of the different layers, for at least one sample target box given for the input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;
aligning the channel numbers of the output features of the different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row of the feature matrix is the feature representation of one sample;
performing a first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and integrating size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange among different sample features;
spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;
performing channel interaction with the sample features to determine the weight of each channel in the query feature map, and weighting each channel of the query feature map accordingly to obtain a query correlation map, thereby mining the correlation between the image and the sample features;
performing target-class positioning prediction according to the query correlation map, which includes predicting suggested points, predicting the confidence of each suggested point and predicting the suggested box size of each suggested point;
and finally screening the suggested points by confidence, and taking the number of suggested points whose confidence exceeds a set threshold (0.5) as the counting result, which is the final prediction result.
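The final screening step reduces to a threshold filter. The following minimal sketch assumes that "exceeding" means strictly greater than the threshold; the function name `count_targets` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def count_targets(points: List[Point], confidences: List[float],
                  threshold: float = 0.5) -> Tuple[List[Point], int]:
    """Keep suggested points whose confidence exceeds the threshold and count them."""
    kept = [p for p, c in zip(points, confidences) if c > threshold]
    return kept, len(kept)
```

The retained points give the positions of the counted targets, so positioning and counting come from the same prediction.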
Since only about 1 to 3 labeled target boxes of the target objects are available in this embodiment, features that are sufficiently rich and that characterize the specified class must be extracted from them for the subsequent similarity calculation. To this end, not only are all samples used, but every layer of output features of the feature extractor (4 layers) is also fully utilized. Specifically, a region-of-interest pooling (RoI Pooling) operation is performed on each extracted layer of output features, pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$. Assuming that the feature extractor outputs $L$ layers of output features, $n \times L$ output features of the target-class objects are obtained for $n$ samples.
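How $n$ sample boxes and $L$ feature layers yield $n \times L$ pooled vectors can be sketched as follows. The sketch assumes max pooling inside the box and a known stride per layer; the text does not specify either, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]
Box = Tuple[int, int, int, int]       # (x0, y0, x1, y1) in image pixels


def roi_pool_1x1(fmap: FeatureMap, box: Box, stride: int) -> List[float]:
    """Max-pool the region covered by `box` down to one value per channel."""
    x0, y0, x1, y1 = box
    height, width = len(fmap[0]), len(fmap[0][0])
    # Map image coordinates to feature-map cells, keeping at least one cell.
    c0 = min(x0 // stride, width - 1)
    r0 = min(y0 // stride, height - 1)
    c1 = min(max(-(-x1 // stride), c0 + 1), width)   # ceiling division
    r1 = min(max(-(-y1 // stride), r0 + 1), height)
    return [max(channel[r][c] for r in range(r0, r1) for c in range(c0, c1))
            for channel in fmap]


def extract_sample_features(layers: Sequence[FeatureMap],
                            strides: Sequence[int],
                            boxes: Sequence[Box]) -> List[List[float]]:
    """Return one pooled vector per (sample box, layer) pair: n * L in total."""
    return [roi_pool_1x1(fmap, box, stride)
            for box in boxes
            for fmap, stride in zip(layers, strides)]
```

The pooled vectors have different lengths $C_i$ per layer, which is why a per-layer mapping is needed afterwards to align them to a common channel number $C_N$.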
In this embodiment, for the image to be counted, only a small number (at least one) of target object regions in the image need to be selected, and the method of the invention can then more accurately complete the counting, positioning and boxing of the other objects in the image.
In a specific embodiment, different objects of the same class may have very different appearances within the same image, so when an object does not closely resemble the given examples, capturing the commonalities and differences among the examples is particularly important. Up to this point, however, the features from different examples have not seen each other; without information interaction, the commonalities and differences among the examples cannot be mined and exploited. A mutual attention calculation is therefore performed on all $N_{ex} = n \times L$ acquired features to further optimize the feature representation, using $L_{ca}$ Transformer Encoder layers to perform the first mutual attention calculation on these feature vectors.
In a specific embodiment, performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, so as to realize information exchange among different sample features, includes: using $L_{ca}$ Transformer Encoder layers to perform the first mutual attention calculation on the $N_{ex} = n \times L$ sample features,
the calculation process of the first mutual attention calculation is as follows:
$$z'_l = \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \dots, L_{ca}$$
$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \dots, L_{ca}$$
wherein $z_0$ denotes the initial features of the attention calculation, formed from the individual sample features together with their size embeddings; $z'_l$ denotes the intermediate result of the layer-$l$ attention calculation; $z_{l-1}$ denotes the result of the layer-$(l-1)$ attention calculation; $z_l$ denotes the result of the layer-$l$ attention calculation; MHSA denotes a multi-head self-attention layer; LN denotes a normalization layer; and MLP denotes a multi-layer perceptron.
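One such pre-normalization layer can be sketched in plain Python to make the two residual updates concrete. This is a simplified illustration, not the patented network: attention is single-head with identity Q/K/V projections, the normalization has no learned scale or shift, the MLP uses ReLU, and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Matrix:
    """Single-head attention with identity projections (a simplification)."""
    dim = len(x[0])
    scores = matmul(x, [list(col) for col in zip(*x)])  # x times x transposed
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, x)


def mlp(x: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def encoder_layer(z_prev: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    z_mid = add(self_attention(layer_norm(z_prev)), z_prev)  # z'_l
    return add(mlp(layer_norm(z_mid), w1, w2), z_mid)        # z_l
```

Stacking `encoder_layer` $L_{ca}$ times, starting from the sample features combined with their size embeddings as $z_0$, lets every sample feature attend to every other one.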
This embodiment introduces size embedding because ROI Pooling compresses the spatial dimensions of the features during feature extraction, which causes the model to lose the size information of the sample objects. The size information is therefore incorporated into the features here by means of size embedding.
In this embodiment, incorporating the size information into the features by size embedding includes:
first, counting the range and frequency of the sample sizes that occur in the image training set;
then, dividing the size range by equal frequency into K size ranges, so that the same number of samples falls into each range (without loss of generality, the upper bound of the last range is positive infinity). Because the distinction between width and height is clearly important for some classes of objects (for example books, whose width and height differ markedly), width and height are counted separately, giving 2K size-embedding feature representations corresponding to the different size ranges.
At use time, the size intervals corresponding to the current width and height are looked up to obtain the corresponding height and width size embeddings $E_h$ and $E_w$, where the dimension of each of $E_w$ and $E_h$ is half of the sample feature dimension; the two are concatenated to form the final size embedding $E_s$ of the sample. The concatenation formula is as follows:
$E_s = [E_h, E_w]$
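The equal-frequency bucketing and the lookup can be sketched as follows. This is a minimal illustration, assuming the cut points are taken directly from the sorted training sizes and that the embedding tables are plain lists standing in for learned parameters; `equal_frequency_edges`, `bucket`, `size_embedding`, and all numeric values are hypothetical.

```python
import bisect

def equal_frequency_edges(values, k):
    # Return k-1 interior cut points so that each of the k ranges holds
    # about the same number of training values.
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

def bucket(value, edges):
    # Index of the range containing value; the last range is unbounded above.
    return bisect.bisect_right(edges, value)

def size_embedding(width, height, w_edges, h_edges, w_table, h_table):
    e_h = h_table[bucket(height, h_edges)]
    e_w = w_table[bucket(width, w_edges)]
    return e_h + e_w  # list concatenation: E_s = [E_h, E_w]

# Hypothetical sample sizes collected from a training set.
train_widths = [8, 10, 12, 20, 24, 30, 48, 64, 90]
train_heights = [6, 9, 15, 18, 25, 33, 40, 70, 120]
K = 3
w_edges = equal_frequency_edges(train_widths, K)
h_edges = equal_frequency_edges(train_heights, K)

# 2K embedding vectors, each of dimension C/2 = 2 (stand-ins for learned parameters).
w_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
h_table = [[1.0, 1.1], [1.2, 1.3], [1.4, 1.5]]

e_s = size_embedding(25, 200, w_edges, h_edges, w_table, h_table)
```

Because the last range has no upper bound, a sample larger than anything seen in training still maps to a valid embedding.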
After the above calculations are completed, the same number of sample features as were input is obtained, but with more complete representations; they are used in the feature query module as a rich set of target-class-specific features for the feature query.
In this embodiment, the feature query module is implemented with a single Transformer Decoder of $L_{fq}$ layers. Unlike previous methods, which perform the interaction calculation sample by sample, the method of this embodiment is more efficient and can interact with any number of sample feature representations simultaneously.
Spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map includes:
taking the obtained $N_{ex}$ sample features as the representation template set of the target class, i.e. generating Key $F_K$ and Value $F_V$; the target class is the class of the samples, that is, the class to be counted.
The output features of the last layer are taken as the image representation features, and the feature at each position is taken as one element of the query set, forming Query $F_Q$; $F_Q$ is sent to the feature query module together with $F_K$ and $F_V$ for the second mutual attention calculation, which finally yields the query feature map.
Preferably, the calculation process of the second mutual attention calculation is as follows:
wherein $E_{pos}$ denotes the position embedding (Position Embedding), using the sine and cosine representation commonly used by Transformers; MHSA denotes a multi-head self-attention layer, LN denotes the normalization layer, and MLP denotes a multi-layer perceptron; the remaining symbols denote, respectively, the initial features of the attention calculation (including the position-embedded sample features), the intermediate result of the $l$-th layer of the attention calculation, the result of the $(l-1)$-th layer, and the result of the $l$-th layer.
It should be noted that, to prevent the model from overfitting the training set images, one multi-head attention layer of the Transformer Decoder, namely the self-attention layer, has been removed, so that no self-attention is computed over the image features.
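The roles of Query, Key, and Value in this step can be illustrated with a small cross-attention sketch. It is not the patent's exact formulation: learned projections, position embedding, multiple heads, the residual connection, and the MLP are all omitted, and `cross_attention` and the feature values are hypothetical. It shows every image position attending to all sample features in a single pass, with no self-attention among the image positions.

```python
import math

def cross_attention(queries, keys, values):
    # Each query (one image position) attends over all keys (sample features).
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

# Last-layer image feature map flattened to H*W = 4 positions, C = 2 (illustrative values).
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# N_ex = 3 sample features serve as both Key and Value.
sample_features = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

query_map = cross_attention(image_features, sample_features, sample_features)
```

The output keeps one feature per image position, so it can be reshaped back into a feature map of the same spatial size regardless of how many sample features are supplied.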
In implementation, channel interaction is performed on the sample features to determine the weight of each channel in the query feature map. Specifically, global average pooling is applied in the spatial dimension to all sample features obtained through the second mutual attention calculation, and a Squeeze-and-Excitation operation is then applied in the feature dimension to output the weights corresponding to the different channels of the query feature map. Each channel of the query feature map is weighted accordingly to obtain the query correlation map, which mines the correlation between the image and the sample features.
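A rough sketch of this channel re-weighting is given below. It assumes that the pooling reduces to an average over the sample features (each being a single C-dimensional vector), and that the Squeeze-and-Excitation step is a two-layer bottleneck with ReLU and sigmoid; `channel_weights`, `reweight`, and the weight matrices are hypothetical stand-ins for learned parameters.

```python
import math

def channel_weights(sample_features, w_reduce, w_expand):
    n = len(sample_features)
    c = len(sample_features[0])
    # Squeeze: average the sample features into one C-dimensional descriptor.
    pooled = [sum(f[ch] for f in sample_features) / n for ch in range(c)]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, pooled))) for row in w_reduce]
    return [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
            for row in w_expand]

def reweight(query_map, weights):
    # Scale every channel of every position by its channel weight.
    return [[x * w for x, w in zip(position, weights)] for position in query_map]

sample_features = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 0.0, 0.0]]  # C = 4
w_reduce = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]         # 4 -> 2
w_expand = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # 2 -> 4

weights = channel_weights(sample_features, w_reduce, w_expand)
query_map = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]        # 2 positions, C = 4
correlation_map = reweight(query_map, weights)
```

Each weight lies between 0 and 1, so channels that the samples mark as uninformative are suppressed rather than removed.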
In a specific embodiment, the feature map is reduced in size relative to the original image. Performing target category positioning prediction according to the query correlation map, which comprises predicting suggested points, predicting the confidence of each suggested point, and predicting the suggested frame size of each suggested point, includes:
first, defining the query correlation map obtained by channel re-weighting, whose size is $H_r \times W_r$; each pixel of the query correlation map corresponds to a patch of size $s \times s$ in the original image, and the patch represents the region associated with that point;
for each patch, defining k anchor points, which gives the anchor set $A = \{A_i \mid i \in \{1, \dots, H_r \times W_r \times k\}\}$ for the whole query correlation map, where the coordinates of an anchor are $A_i = (x_i, y_i)$;
for each anchor $A_i$, using the offset head to predict the offset between the suggested point and the anchor $A_i$; the final suggested point coordinates are obtained from the anchor coordinates and this offset, where $\alpha$ is a scaling factor;
and, for each anchor, using the classification head to generate the confidence $C_i$ of each suggested point, and using the size head to generate the width and height of the potential target at each point, which are multiplied by the scaling factor $\beta$ to obtain the final target size. From the suggested point coordinates and the corresponding target size, a suggested frame centered on the suggested point coordinates and having the predicted width and height is obtained. Here $\alpha = \beta = 100$.
k suggested points, with their confidences and suggested frames, are generated for each pixel, so there are $H_r \times W_r \times k$ suggested points in total, forming the set P, together with the corresponding confidence set C and suggested frame size set S. Each point in the suggested point set has a corresponding coordinate, confidence, and suggested frame size.
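The anchor layout and the decoding of the head outputs can be sketched as follows. Two points are assumptions rather than statements of the source: the k anchors are placed on a regular grid inside each patch, and a suggested point is formed as the anchor plus the scaled offset. `make_anchors`, `decode`, and the numeric values are hypothetical.

```python
def make_anchors(h_r, w_r, s, k_side):
    # k = k_side * k_side anchors on a regular grid inside each s x s patch
    # (the grid layout is an assumption made for this sketch).
    anchors = []
    step = s / k_side
    for row in range(h_r):
        for col in range(w_r):
            for ay in range(k_side):
                for ax in range(k_side):
                    anchors.append((col * s + (ax + 0.5) * step,
                                    row * s + (ay + 0.5) * step))
    return anchors

def decode(anchors, offsets, sizes, alpha=100.0, beta=100.0):
    # Suggested point = anchor + alpha * offset (assumed form);
    # suggested frame size = beta * raw size-head output.
    points = [(ax + alpha * dx, ay + alpha * dy)
              for (ax, ay), (dx, dy) in zip(anchors, offsets)]
    frames = [(beta * w, beta * h) for w, h in sizes]
    return points, frames

anchors = make_anchors(h_r=2, w_r=3, s=8, k_side=2)
offsets = [(0.01, -0.02)] * len(anchors)   # stand-in for offset-head outputs
sizes = [(0.2, 0.3)] * len(anchors)        # stand-in for size-head outputs
points, frames = decode(anchors, offsets, sizes)
```

With a 2 x 3 correlation map and k = 4, the sketch produces 24 anchors, one suggested point per anchor, matching the count $H_r \times W_r \times k$.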
Finally, the suggested points are screened by confidence: the suggested points whose confidence exceeds 0.5 form the final prediction set, and the number of suggested points in this set is taken as the counting result, i.e. the final prediction result.
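The screening step amounts to a threshold filter followed by a count, as in the short sketch below; `count_targets` and the example values are hypothetical.

```python
def count_targets(points, confidences, frames, threshold=0.5):
    # Keep only suggested points whose confidence strictly exceeds the threshold.
    kept = [(p, c, f) for p, c, f in zip(points, confidences, frames) if c > threshold]
    return len(kept), kept

points = [(3.0, 0.0), (12.0, 7.5), (20.0, 4.0), (30.0, 9.0)]
confidences = [0.9, 0.5, 0.51, 0.1]
frames = [(20.0, 30.0), (18.0, 25.0), (22.0, 31.0), (5.0, 5.0)]

count, kept = count_targets(points, confidences, frames)
```

A point with confidence exactly 0.5 is not counted here, because the text requires the confidence to exceed the threshold.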
The invention includes not only an offset head and a classification head, which generate the suggested points and their confidences, but also a size head that predicts the target size at the position of each suggested point; it can therefore predict not only the target point but also the exact region occupied by the corresponding target. The offset head, classification head, and size head each use a three-layer fully convolutional design.
In this embodiment, after screening the suggested points by confidence and counting the number of suggested points whose confidence exceeds the set threshold (0.5) as the counting result, the method further includes: matching the suggested point set with the target point set, and optimizing the target category positioning prediction process based on the matching result. A target point is the coordinate of a target given by the dataset.
Further, matching the suggested point set with the target point set includes: matching each target point in the target point set to one suggested point in the suggested point set, where $\zeta(i)$ denotes the index of the suggested point matched to target point $G_i$, i.e. suggested point $P_{\zeta(i)}$ is the predicted position of target point $G_i$; and matching each sample target frame in the sample target frame set to one suggested frame size.
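The text does not name a matching algorithm or cost, so the following is only one possible sketch. It assumes a one-to-one assignment that minimizes the summed cost, with the cost taken as the Euclidean distance minus a weighted confidence, and it searches all assignments by brute force, which is only practical for very small sets; a practical implementation would more likely use a Hungarian-style solver. `match` and `conf_weight` are hypothetical.

```python
import itertools
import math

def match(targets, points, confidences, conf_weight=1.0):
    # Return zeta, where zeta[i] is the index of the suggested point assigned to target i.
    best_cost, best = math.inf, None
    for assignment in itertools.permutations(range(len(points)), len(targets)):
        cost = sum(math.dist(targets[i], points[j]) - conf_weight * confidences[j]
                   for i, j in enumerate(assignment))
        if cost < best_cost:
            best_cost, best = cost, assignment
    return list(best)

targets = [(10.0, 10.0), (50.0, 50.0)]
points = [(48.0, 52.0), (11.0, 9.0), (80.0, 80.0), (12.0, 30.0)]
confidences = [0.9, 0.8, 0.7, 0.2]

zeta = match(targets, points, confidences)
```

Suggested points that receive no target remain unmatched and are treated as negatives during optimization.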
Still further, optimizing the target category positioning prediction process based on the matching result includes:
for the confidence, optimizing with cross entropy, the loss function being defined as $L_{cls}$;
for the coordinates, calculating the loss with the mean square error, the loss function being defined as $L_{loc}$;
for the sample target frame sizes, calculating the loss with the Manhattan distance, the loss function being defined as $L_{size}$.
The loss functions are defined as follows:
wherein $N$, $M$, and $N_{ex}$ denote the number of suggested points, the number of target points, and the number of sample target frames, respectively; an indicator denotes whether the current predicted point is matched to a ground-truth point; $\gamma_1$ is the penalty factor for unmatched suggested points; the remaining symbols denote the x and y coordinates of the $i$-th target point, the x and y coordinates of the suggested point matched to the $i$-th target point, the width and height of the suggested frame of the suggested point matched to the $i$-th target point, and the width and height of the $i$-th sample among the small number (1 to 3) of given samples. Because the dataset gives sizes only for the targets corresponding to the samples, $L_{size}$ is calculated only for the suggested frames matched to the target points that correspond to samples.
The total loss function used for optimization is calculated as follows:
$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{size}$
where $\lambda_1$ and $\lambda_2$ are weight parameters.
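How the three terms combine can be sketched as below. The individual formulas are assumptions chosen to agree with the verbal description (cross entropy with factor $\gamma_1$ on unmatched points, mean square error over matched coordinates, Manhattan distance over the frames matched to sample targets); the normalization constants and the values of `gamma1`, `lam1`, and `lam2` are illustrative only, and `losses` is a hypothetical helper.

```python
import math

def losses(conf, points, frames, targets, zeta, sample_frames, gamma1=0.5):
    matched = set(zeta)
    n, m = len(points), len(targets)
    # L_cls: cross entropy; unmatched suggested points are weighted by gamma1.
    l_cls = 0.0
    for j, c in enumerate(conf):
        l_cls -= math.log(c) if j in matched else gamma1 * math.log(1.0 - c)
    l_cls /= n
    # L_loc: mean squared distance between each target and its matched point.
    l_loc = sum((points[zeta[i]][0] - gx) ** 2 + (points[zeta[i]][1] - gy) ** 2
                for i, (gx, gy) in enumerate(targets)) / m
    # L_size: Manhattan distance, only for targets that are annotated samples.
    l_size = sum(abs(frames[zeta[i]][0] - w) + abs(frames[zeta[i]][1] - h)
                 for i, (w, h) in sample_frames.items()) / len(sample_frames)
    return l_cls, l_loc, l_size

conf = [0.9, 0.8, 0.1]
points = [(48.0, 52.0), (11.0, 9.0), (80.0, 80.0)]
frames = [(20.0, 22.0), (9.0, 11.0), (5.0, 5.0)]
targets = [(10.0, 10.0), (50.0, 50.0)]
zeta = [1, 0]                        # target i is matched to suggested point zeta[i]
sample_frames = {0: (10.0, 10.0)}    # only target 0 has an annotated frame

lam1, lam2 = 0.1, 0.05               # illustrative weights, not values from the source
l_cls, l_loc, l_size = losses(conf, points, frames, targets, zeta, sample_frames)
total = l_cls + lam1 * l_loc + lam2 * l_size
```

Only target 0 contributes to the size term in this example, mirroring the statement that sizes are supervised only where a sample annotation exists.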
The invention supervises the predicted sizes with a small number of sample annotation frames; introducing this size supervision improves the counting accuracy, so that the number of targets is predicted accurately while target frames are predicted as well.
Compared with density-map-based counting schemes, the method provided by the invention can predict not only the number of target objects but also the positions of the target points and the corresponding target regions, so it can even be used directly for target detection under arbitrary categories with very few annotations, and can further meet the requirements of complex downstream tasks.
Extensive experiments show that the method, which realizes spatial interaction and channel interaction, outperforms the most advanced methods on popular class-agnostic counting (CAC) benchmarks and also achieves good positioning accuracy and target frame generation. Experiments further show that the inference time of the invention is clearly better than that of other methods.
Example 2
Based on the arbitrary-category target positioning and counting method based on few sample labeling described in embodiment 1, the invention further provides a system for that method. As shown in fig. 2, the system includes: a hierarchical sample collaborative enhancement module, a feature interaction module, and a multi-head positioning module;
the hierarchical sample collaborative enhancement module comprises a hierarchical sample extraction module, a cross-scale alignment module, and feature collaborative enhancement;
the hierarchical sample extraction module is used for acquiring output features of a plurality of layers for an input image, performing a region-of-interest pooling operation on the output features of the different layers using at least 1 sample target frame of the given input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;
the cross-scale alignment module is used for aligning the number of feature channels of the output features of the different layers using different mapping layers, to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row represents one sample feature representation;
the feature collaborative enhancement is used for performing the first mutual attention calculation on the obtained $n \times L$ sample features and incorporating the size information into the features by size embedding for collaborative enhancement, so as to realize information exchange between different sample features;
the feature interaction module comprises a spatial interaction module and a channel interaction module;
the spatial interaction module is used for spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;
the channel interaction module is used for performing channel interaction on the sample features obtained through the first mutual attention calculation, determining the weight of each channel in the query feature map, and obtaining the query correlation map, thereby mining the correlation between the image and the sample features;
the multi-head positioning module is used for performing target category positioning prediction according to the query correlation map, including predicting the suggested points, predicting the confidence of each suggested point, and predicting the suggested frame size of each suggested point.
The hierarchical sample extraction module comprises a feature extractor for acquiring the output features of the plurality of layers for the input image, and a region-of-interest pooling layer for performing the region-of-interest pooling operation on the output features of the different layers using the sample target frames of the given input image.
The multi-head positioning module comprises an offset head, a classification head, and a size head.
It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.
Claims (10)
1. A method for positioning and counting targets of any category based on few sample labeling, characterized in that the method comprises the following steps:
extracting hierarchical sample features of an input image: acquiring output features of a plurality of layers for the input image, performing a region-of-interest pooling operation on the output features of the different layers using at least 1 sample target frame of the given input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;
aligning the number of feature channels of the output features of the different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row of the feature matrix represents one sample feature representation;
performing a first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and incorporating size information into the sample features by size embedding for collaborative enhancement, so as to realize information exchange between different sample features;
spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;
performing channel interaction on the sample features and determining the weight of each channel in the query feature map, so that each channel of the query feature map is weighted to obtain a query correlation map, thereby mining the correlation between the image and the sample features;
performing target category positioning prediction according to the query correlation map, including predicting suggested points, predicting the confidence of each suggested point, and predicting the suggested frame size of each suggested point;
and finally screening the suggested points by confidence, counting the number of suggested points whose confidence exceeds a set threshold as the counting result, and taking the counting result as the final prediction result.
2. The method for positioning and counting targets of any category based on few sample labeling according to claim 1, characterized in that: performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, so as to realize information exchange between different sample features, includes: using $L_{ca}$ Transformer Encoder layers to perform the first mutual attention calculation on the $n \times L$ sample features,
the calculation process of the first mutual attention calculation is as follows:
$z'_l = \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \dots, L_{ca}$
$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \dots, L_{ca}$
where $z_0$ denotes the initial features of the attention calculation, built from the $i$-th sample features together with their size embeddings; $z'_l$ denotes the intermediate result of the $l$-th layer of the attention calculation; $z_{l-1}$ denotes the result of the $(l-1)$-th layer; $z_l$ denotes the result of the $l$-th layer; MHSA denotes a multi-head self-attention layer; LN denotes the normalization layer; and MLP denotes a multi-layer perceptron.
3. The method for positioning and counting targets of any category based on few sample labeling according to claim 2, characterized in that: incorporating the size information into the features by size embedding includes:
first, counting the range and frequency of the sample sizes that occur in the image training set;
then, dividing the size range by equal frequency into K size ranges, so that the same number of samples falls into each range, and obtaining 2K size-embedding feature representations corresponding to the different size ranges;
at use time, looking up the size intervals corresponding to the current width and height to obtain the corresponding height and width size embeddings $E_h$ and $E_w$, where the dimension of each of $E_w$ and $E_h$ is half of the sample feature dimension, and concatenating the two to form the final size embedding of the sample.
4. The method for positioning and counting targets of any category based on few sample labeling according to claim 1, characterized in that: spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain the query feature map includes:
taking the obtained $N_{ex}$ sample features as the representation template set of the target class, i.e. generating Key $F_K$ and Value $F_V$;
taking the output features of the last layer as the image representation features and the feature at each position as one element of the query set, forming Query $F_Q$, and sending $F_Q$ together with $F_K$ and $F_V$ to the feature query module for the second mutual attention calculation, which finally yields the query feature map.
5. The method for positioning and counting targets of any category based on few sample labeling according to claim 4, characterized in that: the calculation process of the second mutual attention calculation is as follows:
wherein $E_{pos}$ denotes the position embedding (Position Embedding), using the sine and cosine representation commonly used by Transformers; MHSA denotes a multi-head self-attention layer, LN denotes the normalization layer, and MLP denotes a multi-layer perceptron; the remaining symbols denote, respectively, the initial features of the attention calculation (including the position-embedded sample features), the intermediate result of the $l$-th layer of the attention calculation, the result of the $(l-1)$-th layer, and the result of the $l$-th layer.
6. The method for positioning and counting targets of any category based on few sample labeling according to claim 1, characterized in that: performing target category positioning prediction according to the query correlation map, which comprises predicting suggested points, predicting the confidence of each suggested point, and predicting the suggested frame size of each suggested point, includes:
first, defining the query correlation map, whose size is $H_r \times W_r$; each pixel of the query correlation map corresponds to a patch of size $s \times s$ in the original image, and the patch represents the region associated with that point;
for each patch, defining k anchor points, which gives the anchor set $A = \{A_i \mid i \in \{1, \dots, H_r \times W_r \times k\}\}$ for the whole query correlation map, where the coordinates of an anchor are $A_i = (x_i, y_i)$;
for each anchor $A_i$, using the offset head to predict the offset between the suggested point and the anchor $A_i$; the final suggested point coordinates are obtained from the anchor coordinates and this offset, where $\alpha$ is the scaling factor;
and, for each anchor, using the classification head to generate the confidence $C_i$ of each suggested point, and using the size head to generate the width and height of the potential target at each point, which are multiplied by the scaling factor $\beta$ to obtain the final target size; from the suggested point coordinates and the corresponding target size, a suggested frame centered on the suggested point coordinates and having the predicted width and height is obtained;
k suggested points, with their confidences and suggested frames, are generated for each pixel, so there are $H_r \times W_r \times k$ suggested points in total, forming the set P, together with the corresponding confidence set C and suggested frame size set S.
7. The method for positioning and counting targets of any category based on few sample labeling according to any one of claims 1 to 6, characterized in that: after screening the suggested points by confidence and counting the number of suggested points whose confidence exceeds the set threshold (0.5) as the counting result, the method further includes: matching the suggested point set with the target point set, and optimizing the target category positioning prediction process based on the matching result.
8. The method for positioning and counting targets of any category based on few sample labeling according to claim 7, characterized in that: matching the suggested point set with the target point set includes:
matching each target point in the target point set to one suggested point in the suggested point set, where $\zeta(i)$ denotes the index of the suggested point matched to target point $G_i$, i.e. suggested point $P_{\zeta(i)}$ is the predicted position of target point $G_i$; and matching each sample target frame in the sample target frame set to one suggested frame size.
9. The method for positioning and counting targets of any category based on few sample labeling according to claim 8, characterized in that: optimizing the target category positioning prediction process based on the matching result includes:
for the confidence, optimizing with cross entropy, the loss function being defined as $L_{cls}$;
for the coordinates, calculating the loss with the mean square error, the loss function being defined as $L_{loc}$;
for the sample target frame sizes, calculating the loss with the Manhattan distance, the loss function being defined as $L_{size}$;
the total loss function used for optimization is calculated as follows:
$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{size}$
where $\lambda_1$ and $\lambda_2$ are weight parameters.
10. A system for the method for positioning and counting targets of any category based on few sample labeling according to any one of claims 1 to 9, characterized in that the system comprises: a hierarchical sample collaborative enhancement module, a feature interaction module, and a multi-head positioning module;
the hierarchical sample collaborative enhancement module comprises a hierarchical sample extraction module, a cross-scale alignment module, and feature collaborative enhancement;
the hierarchical sample extraction module is used for extracting hierarchical sample features of an input image: acquiring output features of a plurality of layers for the input image, performing a region-of-interest pooling operation on the output features of the different layers using at least 1 sample target frame of the given input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;
the cross-scale alignment module is used for aligning the number of feature channels of the output features of the different layers using different mapping layers, to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row represents one sample feature representation;
the feature collaborative enhancement is used for performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features and incorporating the size information into the sample features by size embedding for collaborative enhancement, so as to realize information exchange between different sample features;
the feature interaction module comprises a spatial interaction module and a channel interaction module;
the spatial interaction module is used for spatially interacting the obtained $N_{ex} = n \times L$ sample features with the output features of the last layer to obtain a query feature map;
the channel interaction module is used for performing channel interaction on the obtained sample features and determining the weight of each channel in the query feature map, so that each channel of the query feature map is weighted to obtain a query correlation map, thereby mining the correlation between the image and the sample features;
the multi-head positioning module is used for performing target category positioning prediction according to the query correlation map, including predicting the suggested points, predicting the confidence of each suggested point, and predicting the suggested frame size of each suggested point; and for screening the suggested points by confidence, counting the number of suggested points whose confidence exceeds a set threshold as the counting result, and taking the counting result as the final prediction result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311370741.4A CN117611830A (en) | 2023-10-20 | 2023-10-20 | Random class target positioning and counting method based on few sample labeling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117611830A true CN117611830A (en) | 2024-02-27 |
Family
ID=89955045
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311370741.4A Pending CN117611830A (en) | 2023-10-20 | 2023-10-20 | Random class target positioning and counting method based on few sample labeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117611830A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117809293A (en) * | 2024-03-01 | 2024-04-02 | 电子科技大学 | Small sample image target counting method based on deep neural network |
CN117809293B (en) * | 2024-03-01 | 2024-05-03 | 电子科技大学 | Small sample image target counting method based on deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||