CN117611830A

CN117611830A - Arbitrary class target positioning and counting method based on few sample labeling

Info

Publication number: CN117611830A
Application number: CN202311370741.4A
Authority: CN
Inventors: 吴贺丰; 陈燕栋; 王可泽; 林倞
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2023-10-20
Filing date: 2023-10-20
Publication date: 2024-02-27

Abstract

The invention discloses a method for positioning and counting targets of arbitrary categories based on few-sample labeling, comprising the following steps: acquiring multiple layers of output features of an input image, and performing a region-of-interest pooling operation on the output features of different layers for at least one sample target frame given for the input image; aligning the output features of different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$; performing a first mutual attention calculation on the resulting $N_{ex}$ sample features and integrating size information into the sample features; spatially interacting the $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map; performing channel interaction on the sample features to determine the weight of each channel in the query feature map and obtain a query correlation map; and performing target category positioning and count prediction according to the query correlation map. The invention effectively models more discriminative representation features of the target category, and it retains and fully exploits the size information of the samples.

Description

Arbitrary class target positioning and counting method based on few sample labeling

Technical Field

The invention relates to the technical field of computer vision, in particular to a method for positioning and counting any type of targets based on few sample labels.

Background

Visual target counting has significant practical value in daily life and industrial production, where it is often used to count crowds, vehicles, workpieces and the like. The mainstream research paradigm trains a dedicated model for each category: one model can only detect objects of a single category, so categories without a purpose-trained counting model cannot be handled effectively. Moreover, training such a model usually requires a large amount of labeled data, which makes it difficult to handle cases where very few labeled samples are available.

A new research paradigm has therefore been proposed, namely class-agnostic counting (CAC, Class-Agnostic Counting), which requires a model, given an image and a small number of target samples (typically 1 to 3), to count the objects of the category to which the samples belong. The model must be able to handle objects of any category without being trained separately for each new category, which effectively solves the above problems: the model need not be retrained for new categories, and a large number of samples of the category to be counted need not be additionally labeled.

The current research direction of this paradigm is regression based on density map estimation. Specifically, GMN pools the sample features, concatenates the pooled result with the query features, and then learns a regression head to perform point-to-point feature comparison. FamNet uses the sample features to generate convolution kernels, obtains feature correlations for prediction, and uses an adaptive loss to fine-tune the model parameters at test time. Shi et al. propose a joint learning framework, BMNet+, which combines representation learning and similarity metrics to generate a density map with more accurate counting results. SAFECount uses the support (sample) features to enhance the query (image) features so that the extracted features are more refined, thereby improving the generated density map.

The above methods can indeed complete the target counting task given a small number of sample targets. In complex real-world scenes, however, the target count or a density map alone cannot meet the needs of downstream tasks: the count is only a coarse indicator, whereas in some scenes the position information of the targets is very important. In industrial production, for example, the positions of the targets must be known for accurate production scheduling, so in many scenes even a density map cannot meet practical needs. It should also be noted that current methods still have room for improvement in counting accuracy.

Disclosure of Invention

The invention provides a method for positioning and counting targets of arbitrary categories based on few-sample labeling, which aims to overcome the above shortcomings and deficiencies of the prior art.

In order to achieve the above purpose of the present invention, the following technical scheme is adopted:

an arbitrary category target positioning and counting method based on few sample labeling, which comprises the following steps:

extracting hierarchical sample features of an input image: acquiring multiple layers of output features of the input image, performing a region-of-interest pooling operation on the output features of different layers for at least one sample target frame given for the input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;

performing feature channel number alignment on the output features of different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row of the feature matrix represents the feature representation of one sample;

performing a first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and integrating size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange among different sample features;

spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;

performing channel interaction on the sample features and determining the weight of each channel in the query feature map, so that each channel of the query feature map is weighted to obtain a query correlation map, thereby mining the correlation between the image and the sample features;

performing target category positioning prediction according to the query correlation map, including predicting suggested points, predicting the confidence of each suggested point and predicting the suggestion frame size of each suggested point;

and finally screening the suggested points through the confidence coefficient, counting the number of the suggested points with the confidence coefficient exceeding a set threshold value as a counting result, and taking the counting result as a final prediction result.

Preferably, performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features so as to realize information exchange among different sample features comprises: performing the first mutual attention calculation on the $n \times L$ sample features through $L_{ca}$ Transformer Encoder layers,

the calculation process of the first mutual attention calculation being as follows:

$z'_l = \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \dots, L_{ca}$

$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \dots, L_{ca}$

wherein $z_0$ denotes the initial features of the attention calculation, in which the $i$-th entry is the $i$-th sample feature combined with its size embedding; $z'_l$ denotes the intermediate result of the $l$-th layer attention calculation; $z_{l-1}$ denotes the result of the $(l-1)$-th layer attention calculation; $z_l$ denotes the result of the $l$-th layer attention calculation; MHSA denotes the multi-head self-attention layer; LN denotes the normalization layer; and MLP denotes a multi-layer perceptron.

Further, incorporating the size information into the feature in combination with the size embedding includes:

firstly, counting the range and frequency of the occurrence of the size of the sample in the image training set;

then dividing the size range into K size intervals by equal-frequency binning, i.e. such that sizes occur with the same frequency in each interval, and obtaining 2K size embedding feature representations corresponding to the different size intervals;

in use, looking up the size intervals corresponding to the current width and height to obtain the corresponding height and width size embeddings $E_h$ and $E_w$, wherein the dimension of each of $E_w$ and $E_h$ is half the feature dimension of the sample, and concatenating the two as the final size embedding of the sample.

Preferably, spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map comprises:

using the obtained $N_{ex}$ sample features as the representation template set of the target category, i.e. generating the Key $F_K$ and the Value $F_V$;

using the output features of the last layer as the image representation features, with the feature at each position serving as one element of the query set to form the Query $F_Q$, feeding $F_Q$ together with $F_K$ and $F_V$ into a feature query module for a second mutual attention calculation, and finally obtaining the query feature map.

Further, the calculation process of the second mutual attention calculation is as follows:

wherein $E_{pos}$ denotes the position embedding (Position Embedding), using the sine-cosine representation commonly used by Transformers; MHSA denotes the multi-head self-attention layer, LN denotes the normalization layer, and MLP denotes a multi-layer perceptron; the remaining symbols denote, respectively, the initial features of the attention calculation (including the position-embedded features), the intermediate result of the $l$-th layer attention calculation, the result of the $(l-1)$-th layer attention calculation, and the result of the $l$-th layer attention calculation.

Preferably, performing target category positioning prediction according to the query correlation map, including predicting suggested points, predicting the confidence of each suggested point and predicting the suggestion frame size of each suggested point, comprises:

first, the query correlation map is defined to have size $H_r \times W_r$; each pixel on the query correlation map corresponds to a patch of size $s \times s$ in the original image, the patch representing the region associated with that point;

for each patch, $k$ anchor points are defined, so that for the whole query correlation map an anchor point set $A = \{A_i \mid i \in \{1, \dots, H_r \times W_r \times k\}\}$ is obtained, where the coordinates of an anchor point are $A_i = (x_i, y_i)$;

for each anchor point $A_i$, an offset head predicts the offset between the suggested point and the anchor point $A_i$, and the final suggested point coordinates are obtained from the anchor point coordinates and the predicted offset, where $\alpha$ is the scaling factor;

and, for each anchor point, a classification head generates the confidence $C_i$ of each suggested point, and a size head generates the width and height of the potential target at each point, which are multiplied by a scaling factor $\beta$ to obtain the final target size; from the suggested point coordinates and the corresponding target size, a suggestion frame centered on the suggested point with the final target width and height is obtained;

$k$ suggested points with corresponding confidences and suggestion frames are generated for each pixel, giving $H_r \times W_r \times k$ suggested points in total, i.e. the set $P$, together with the corresponding confidence set $C$ and suggestion frame size set $S$.

Further, after screening the suggested points by confidence, counting the number of suggested points with confidence exceeding a set threshold (0.5) as a count result, the method further includes: and matching the suggested point set and the target point set, and optimizing the target category positioning prediction process based on a matching result.

Still further, matching the suggested point set with the target point set comprises:

matching each target point in the target point set to one of the suggested points in the suggested point set, where $\zeta(i)$ denotes the index of the suggested point matched to target point $G_i$, i.e. suggested point $P_{\zeta(i)}$ is the predicted position of target point $G_i$; and matching each sample target frame in the sample target frame set to one of the suggestion frame sizes.

Still further, optimizing the target category location prediction process based on the matching result includes:

for confidence, cross entropy is used for optimization, and the loss function is defined as $L_{cls}$;

for coordinates, the loss is calculated using the mean square error, and the loss function is defined as $L_{loc}$;

for sample target frame sizes, the loss is calculated using the Manhattan distance, and the loss function is defined as $L_{size}$;

the total loss function for optimization is calculated as follows:

$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{size}$

wherein $\lambda_1$ and $\lambda_2$ are weight parameters.

A system for the arbitrary category target positioning and counting method based on few sample labeling, the system comprising: a hierarchical sample collaborative enhancement module, a feature interaction module and a multi-head positioning module;

the hierarchical sample collaborative enhancement module comprises a hierarchical sample extraction module, a cross-scale alignment module and a feature collaborative enhancement module;

the hierarchical sample extraction module is used for extracting hierarchical sample features of an input image: acquiring multiple layers of output features of the input image, performing a region-of-interest pooling operation on the output features of different layers for at least one sample target frame given for the input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;

the cross-scale alignment module is used for performing feature channel number alignment on the output features of different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row represents the feature representation of one sample;

the feature collaborative enhancement module is used for performing a first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and integrating size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange among different sample features;

the feature interaction module comprises a spatial interaction module and a channel interaction module;

the spatial interaction module is used for spatially interacting the obtained $N_{ex} = n \times L$ sample features with the output features of the last layer to obtain a query feature map;

the channel interaction module is used for performing channel interaction on the obtained sample features and determining the weight of each channel in the query feature map, so that each channel of the query feature map is weighted to obtain a query correlation map, thereby mining the correlation between the image and the sample features;

the multi-head positioning module is used for performing target category positioning prediction according to the query correlation map, including predicting suggested points, predicting the confidence of each suggested point and predicting the suggestion frame size of each suggested point; and for screening the suggested points by confidence, counting the number of suggested points whose confidence exceeds a set threshold as the counting result, and taking the counting result as the final prediction result.

The beneficial effects of the invention are as follows:

The method can be used directly for target detection tasks on arbitrary categories with very few labels, and it meets the requirements of complex downstream tasks better than existing methods.

In the method, the hierarchical sample feature extraction and collaborative enhancement process effectively extracts sample features at different scales, and the first mutual attention calculation lets the feature representations from different samples interact and share information, so that more discriminative representation features of the target category are effectively modeled. In addition, size information is fused into the feature representations by size embedding before the first mutual attention calculation, so that the model does not lose the size information of the samples and makes full use of it.

The invention also performs feature interaction between the image features and any number of sample features in parallel along both the spatial and channel dimensions, improving both the computational efficiency of the model and its ability to comprehensively match the sample features.

Drawings

FIG. 1 is a flow chart of the method for locating and counting arbitrary targets based on few sample labels according to the present invention.

FIG. 2 is a system schematic block diagram of the arbitrary class target locating and counting method based on the few sample labeling according to the present invention.

Detailed Description

Further advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, with reference to the accompanying drawings and the preferred embodiments. The invention may also be practiced or applied in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.

It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.

Example 1

As shown in fig. 1, a method for positioning and counting arbitrary category targets based on few sample labeling includes the following steps:

extracting hierarchical sample features of an input image: acquiring multiple layers of output features of the input image, performing a region-of-interest pooling operation on the output features of different layers for at least one sample target frame given for the input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;

performing feature channel number alignment on the output features of different layers using different mapping layers to obtain a feature matrix of size $(n \times L) \times C_N$, wherein each row of the feature matrix represents the feature representation of one sample;

performing a first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features, and integrating size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange among different sample features;

and finally screening the suggested points through the confidence coefficient, counting the number of the suggested points with the confidence coefficient exceeding a set threshold value (0.5) as a counting result, and taking the counting result as a final prediction result.

Since only about 1 to 3 labeled target frames of the target objects are available in this embodiment, features that are sufficiently rich and can characterize the specified category must be extracted from them for the subsequent similarity calculation. For this purpose, not only are all samples used, but every layer of output features (4 layers) of the feature extractor is also fully utilized. Specifically, a region-of-interest pooling (RoIPooling) operation is performed on each extracted layer of output features to pool each sample region feature into a feature of size $1 \times 1 \times C_i$. Assuming the feature extractor outputs $L$ layers of output features, $n \times L$ output features of the target category objects are obtained for $n$ samples.
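The hierarchical extraction described above can be illustrated with a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the patented implementation: feature maps are nested lists indexed as `[channel][row][column]`, each level has a hypothetical integer `stride` relative to the image, and the pooling is taken to be a max over the box region (a common reading of RoIPooling with a 1×1 output).

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]      # indexed as [C][H][W]
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2) in image pixels


def roi_max_pool_1x1(fmap: FeatureMap, box: Box, stride: int) -> List[float]:
    """Max-pool the box region of one feature level down to a single C_i-vector."""
    h, w = len(fmap[0]), len(fmap[0][0])
    x1, y1, x2, y2 = box
    # Map image coordinates to feature-map cells and clamp them to the map.
    c0 = min(max(int(x1 // stride), 0), w - 1)
    r0 = min(max(int(y1 // stride), 0), h - 1)
    c1 = min(max(int(x2 // stride), c0), w - 1)
    r1 = min(max(int(y2 // stride), r0), h - 1)
    return [
        max(ch[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))
        for ch in fmap
    ]


def hierarchical_sample_features(
    levels: List[FeatureMap], strides: List[int], boxes: List[Box]
) -> List[List[float]]:
    """Return n * L pooled vectors, one per (sample box, feature level) pair."""
    return [
        roi_max_pool_1x1(fmap, box, stride)
        for box in boxes
        for fmap, stride in zip(levels, strides)
    ]
```

With n sample boxes and L feature levels this returns n × L vectors whose lengths still differ per level (C_i), which is why the alignment step with per-level mapping layers follows.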

In this embodiment, for the image to be counted, only a small amount (at least one) of target object areas in the image need to be selected, and the counting, positioning and framing tasks of other objects in the image can be more accurately completed by the method of the present invention.

In a specific embodiment, different objects of the same category may have very different appearances in the same image, and some may not closely resemble the given samples, so capturing the commonalities and differences among the different samples is particularly important. Up to this point, however, the features from different samples have not seen each other; without information interaction, the commonalities and differences among the samples cannot be exploited. We therefore perform a mutual attention calculation over all $N_{ex} = n \times L$ acquired features to further optimize the feature representation, using $L_{ca}$ Transformer Encoder layers to perform the first mutual attention calculation on these feature vectors.

In a specific embodiment, performing the first mutual attention calculation on the obtained $N_{ex} = n \times L$ sample features so as to realize information exchange among different sample features comprises: performing the first mutual attention calculation on the $N_{ex} = n \times L$ sample features through $L_{ca}$ Transformer Encoder layers,

$z'_l = \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \dots, L_{ca}$

$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \dots, L_{ca}$

wherein $z_0$ denotes the initial features of the attention calculation, in which the $i$-th entry is the $i$-th sample feature combined with its size embedding; $z'_l$ denotes the intermediate result of the $l$-th layer attention calculation; $z_{l-1}$ denotes the result of the $(l-1)$-th layer attention calculation; $z_l$ denotes the result of the $l$-th layer attention calculation; MHSA denotes the multi-head self-attention layer; LN denotes the normalization layer; and MLP denotes a multi-layer perceptron.
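The two update equations describe a pre-normalization residual block. A minimal sketch of one such layer in plain Python may make the data flow concrete. Several simplifications are assumptions made here for brevity and are not taken from the patent: a single attention head stands in for MHSA, LN has no learned scale or bias, the MLP uses ReLU, and all weight matrices (`wq`, `wk`, `wv`, `w1`, `w2`) are hypothetical placeholders.

```python
import math
from typing import List

Vec = List[float]
Mat = List[Vec]


def layer_norm(x: Vec, eps: float = 1e-5) -> Vec:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matmul(a: Mat, b: Mat) -> Mat:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(x: Vec) -> Vec:
    m = max(x)
    e = [math.exp(v - m) for v in x]
    total = sum(e)
    return [v / total for v in e]


def self_attention(z: Mat, wq: Mat, wk: Mat, wv: Mat) -> Mat:
    """Single-head scaled dot-product attention over the N_ex sample features."""
    q, k, v = matmul(z, wq), matmul(z, wk), matmul(z, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)


def mlp(z: Mat, w1: Mat, w2: Mat) -> Mat:
    hidden = [[max(0.0, h) for h in row] for row in matmul(z, w1)]
    return matmul(hidden, w2)


def add(a: Mat, b: Mat) -> Mat:
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder_layer(z_prev: Mat, wq: Mat, wk: Mat, wv: Mat, w1: Mat, w2: Mat) -> Mat:
    # z'_l = MHSA(LN(z_{l-1})) + z_{l-1}
    z_mid = add(self_attention([layer_norm(r) for r in z_prev], wq, wk, wv), z_prev)
    # z_l = MLP(LN(z'_l)) + z'_l
    return add(mlp([layer_norm(r) for r in z_mid], w1, w2), z_mid)
```

Stacking `encoder_layer` $L_{ca}$ times, starting from the size-embedded sample features as $z_0$, corresponds to the loop over $l = 1, \dots, L_{ca}$. Because every row attends to every other row, each sample feature can draw on all the others, which is the information exchange the text refers to.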

The present embodiment introduces size embedding because the ROI Pooling is used to compress the spatial dimensions of the features during feature extraction, which causes the model to lose the size information of the sample object. It is necessary to incorporate size information into the features here by way of size embedding.

In this embodiment, the method for integrating the size information into the feature in combination with the size embedding includes:

The range of sample sizes is divided into K size intervals by equal-frequency binning, i.e. such that sizes occur with the same frequency in each interval (without loss of generality, the upper bound of the last interval is positive infinity). Because distinguishing width from height is clearly important for some types of objects (such as books, whose width and height differ markedly), the width and the height are counted separately, yielding 2K size embedding feature representations corresponding to the different size intervals.

In use, the size intervals corresponding to the current width and height are looked up to obtain the corresponding height and width size embeddings $E_h$ and $E_w$, where the dimension of each of $E_w$ and $E_h$ is half the feature dimension of the sample, and the two are concatenated as the final size embedding of the sample. The specific concatenation formula is as follows:

$E_s = [E_h, E_w]$
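The size embedding lookup can be sketched as follows. This is a minimal illustration assuming the embeddings are stored as two plain lookup tables (`h_table` and `w_table`, each with K rows of half the feature dimension); in a trained model these rows would be learned parameters, and the function names here are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_frequency_edges(values: Sequence[float], k: int) -> List[float]:
    """Return k-1 interior boundaries so each of the k intervals holds about equal counts."""
    ordered = sorted(values)
    return [ordered[(i * len(ordered)) // k] for i in range(1, k)]


def size_bin(value: float, edges: Sequence[float]) -> int:
    """Index of the interval containing value; the last interval is open-ended."""
    return bisect_right(edges, value)


def size_embedding(
    width: float,
    height: float,
    w_edges: Sequence[float],
    h_edges: Sequence[float],
    w_table: List[List[float]],
    h_table: List[List[float]],
) -> List[float]:
    """E_s = [E_h, E_w]: concatenate the looked-up height and width embeddings."""
    e_h = h_table[size_bin(height, h_edges)]
    e_w = w_table[size_bin(width, w_edges)]
    return e_h + e_w
```

Width and height each get their own K intervals and K table rows, giving the 2K embeddings mentioned above; any value beyond the last boundary falls into the final interval, matching the positive-infinity upper bound.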

After the above calculations, sample features equal in number to the inputs but with more complete representations are obtained; they are used in the feature query module as a rich, target-category-specific feature set for the feature query.

In this embodiment, the feature query module is implemented using only a Transformer Decoder with $L_{fq}$ layers. Unlike previous methods, which perform the interaction calculation sample by sample, the method of this embodiment is more efficient and can interact with any number of sample feature representations simultaneously.

Spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map comprises:

using the obtained $N_{ex}$ sample features as the representation template set of the target category, i.e. generating the Key $F_K$ and the Value $F_V$. The target category is the category of the samples, i.e. the category that needs to be counted.

Preferably, the calculation process of the second mutual attention calculation is as follows:

It should be noted that, to prevent the model from overfitting to the training set images, we remove the multi-head self-attention layer from the Transformer Decoder, thereby avoiding self-attention computation over the image features.

In implementation, channel interaction is performed on the sample features to determine the weight of each channel in the query feature map. Specifically, global average pooling is applied over the spatial dimension to all sample features obtained from the second mutual attention calculation, and a Squeeze-and-Excitation operation is then applied along the feature dimension to output the weights of the different channels of the query feature map. Each channel of the query feature map is weighted accordingly to obtain the query correlation map, thereby mining the correlation between the image and the sample features.
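The channel interaction can be illustrated with a small Squeeze-and-Excitation style sketch in plain Python. The details below are assumptions for illustration: the pooling step is interpreted as averaging the sample feature vectors into one C-vector, the excitation is a two-layer bottleneck (ReLU followed by sigmoid), and `w_down` and `w_up` are hypothetical weight matrices of shapes C × C/r and C/r × C.

```python
import math
from typing import List

Vec = List[float]
Mat = List[Vec]


def channel_weights(sample_feats: Mat, w_down: Mat, w_up: Mat) -> Vec:
    """Squeeze the sample features to one C-vector, then excite to per-channel weights."""
    n = len(sample_feats)
    squeezed = [sum(col) / n for col in zip(*sample_feats)]
    hidden = [
        max(0.0, sum(s * w for s, w in zip(squeezed, col))) for col in zip(*w_down)
    ]
    logits = [sum(h * w for h, w in zip(hidden, col)) for col in zip(*w_up)]
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def reweight(query_map: List[Mat], weights: Vec) -> List[Mat]:
    """Scale every channel of a [C][H][W] query feature map by its weight."""
    return [
        [[value * wt for value in row] for row in channel]
        for channel, wt in zip(query_map, weights)
    ]
```

Since the weights depend only on the sample features, channels that are informative for the given category are emphasized across the whole image in a single pass, independent of how many samples were provided.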

In a specific embodiment, the feature map is smaller than the original image. To better explain the method, performing target category positioning prediction according to the query correlation map, including predicting suggested points, predicting the confidence of each suggested point and predicting the suggestion frame size of each suggested point, comprises:

first, the query correlation map obtained by channel re-weighting is defined to have size $H_r \times W_r$; each pixel on the query correlation map corresponds to a patch of size $s \times s$ in the original image, the patch representing the region associated with that point;

for each patch, $k$ anchor points are defined, so that for the whole query correlation map an anchor point set $A = \{A_i \mid i \in \{1, \dots, H_r \times W_r \times k\}\}$ is obtained, where the coordinates of an anchor point are $A_i = (x_i, y_i)$;

for each anchor point $A_i$, an offset head predicts the offset between the suggested point and the anchor point $A_i$, and the final suggested point coordinates are obtained from the anchor point coordinates and the predicted offset, where $\alpha$ is the scaling factor;

and, for each anchor point, a classification head generates the confidence $C_i$ of each suggested point, and a size head generates the width and height of the potential target at each point, which are multiplied by a scaling factor $\beta$ to obtain the final target size; from the suggested point coordinates and the corresponding target size, a suggestion frame centered on the suggested point with the final target width and height is obtained. Here $\alpha = \beta = 100$.

$k$ suggested points with corresponding confidences and suggestion frames are generated for each pixel, giving $H_r \times W_r \times k$ suggested points in total, i.e. the set $P$, together with the corresponding confidence set $C$ and suggestion frame size set $S$. Each point in the suggested point set has a corresponding coordinate, confidence and suggestion frame size.
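The anchor and decoding steps can be sketched in plain Python. Two points are assumptions rather than statements from the source: the anchors are laid out on a regular `k_side × k_side` grid inside each patch (the text only says k anchors per patch), and the suggested point is taken to be the anchor plus the predicted offset scaled by α. The defaults α = β = 100 follow the embodiment.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def make_anchors(h_r: int, w_r: int, s: int, k_side: int) -> List[Point]:
    """Place k = k_side**2 anchors on a regular grid inside every s x s patch."""
    step = s / k_side
    anchors = []
    for row in range(h_r):
        for col in range(w_r):
            for i in range(k_side):
                for j in range(k_side):
                    x = col * s + (j + 0.5) * step
                    y = row * s + (i + 0.5) * step
                    anchors.append((x, y))
    return anchors


def decode(
    anchors: List[Point],
    offsets: List[Point],
    sizes: List[Point],
    alpha: float = 100.0,
    beta: float = 100.0,
):
    """Turn raw head outputs into suggested points and (x1, y1, x2, y2) frames."""
    points = [
        (ax + alpha * dx, ay + alpha * dy)
        for (ax, ay), (dx, dy) in zip(anchors, offsets)
    ]
    frames = [
        (px - beta * w / 2, py - beta * h / 2, px + beta * w / 2, py + beta * h / 2)
        for (px, py), (w, h) in zip(points, sizes)
    ]
    return points, frames
```

For a correlation map of size $H_r \times W_r$ this yields $H_r \times W_r \times k$ anchors, and therefore the same number of suggested points and suggestion frames.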

Finally, the suggested points are screened by confidence: the suggested points whose confidence exceeds 0.5 form the final predicted set, and the number of suggested points in this set is the counting result.

The invention not only includes an offset head and a classification head for generating the suggested points and their confidences, but also a size head for predicting the size at the position of each suggested point, so that it predicts not only the target point but also the exact region occupied by the corresponding target. In addition, a three-layer fully convolutional design is used for the offset, classification and size heads.
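The screening step is a simple threshold filter; a minimal sketch in Python follows, with `count_targets` as a hypothetical helper name and the 0.5 threshold taken from the embodiment.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def count_targets(
    points: List[Point], confidences: List[float], threshold: float = 0.5
) -> Tuple[List[Point], int]:
    """Keep suggested points whose confidence exceeds the threshold and count them."""
    kept = [p for p, c in zip(points, confidences) if c > threshold]
    return kept, len(kept)
```

The comparison is strict, so a suggested point with confidence exactly equal to the threshold is discarded.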

In this embodiment, after screening the suggested points by confidence, counting the number of suggested points whose confidence exceeds a set threshold (0.5) as a count result, the method further includes: matching the suggested point set and the target point set; and optimizing the target category positioning prediction process based on the matching result. The target point represents the coordinates of each target given by the dataset.

Further, matching the suggested point set with the target point set comprises: matching each target point in the target point set to one of the suggested points in the suggested point set, where $\zeta(i)$ denotes the index of the suggested point matched to target point $G_i$, i.e. suggested point $P_{\zeta(i)}$ is the predicted position of target point $G_i$; and matching each sample target frame in the sample target frame set to one of the suggestion frame sizes.
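The source does not state which matching algorithm or cost is used, so the following Python sketch is only a stand-in: it assumes a one-to-one assignment that minimizes the total Euclidean distance, found by exhaustive search, which is practical only for very small sets. The function name `match_targets` is hypothetical.

```python
import math
from itertools import permutations
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def match_targets(targets: List[Point], proposals: List[Point]) -> Optional[List[int]]:
    """Return zeta, where zeta[i] is the index of the suggested point matched to target i."""
    best, best_cost = None, math.inf
    # Try every way of assigning distinct suggested points to the targets.
    for perm in permutations(range(len(proposals)), len(targets)):
        cost = sum(math.dist(targets[i], proposals[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = list(perm), cost
    return best
```

For realistic numbers of suggested points a polynomial-time assignment solver such as the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) would be the usual choice, and the cost could also take confidence into account.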

For confidence, cross entropy is used for optimization, and the loss function is defined as $L_{cls}$.

The loss function is defined as follows:

wherein $N$, $M$ and $N_{ex}$ denote the number of suggested points, the number of target points and the number of sample target frames, respectively; an indicator denotes whether the current predicted point is matched to a ground-truth point; $\gamma_1$ is the penalty factor for unmatched suggested points; the coordinate terms are the x and y coordinates of the $i$-th target point and the x and y coordinates of the suggested point to which the $i$-th target point is matched; the size terms are the width and height of the suggestion frame of the suggested point to which the $i$-th target point is matched, and the width and height of the $i$-th sample among the small number (1 to 3) of samples. Since the dataset gives sizes only for the targets corresponding to the samples, $L_{size}$ is calculated only for the suggestion frames matched to the target points of the samples.

The optimized total loss function is calculated as follows:

$$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{size}$$

where $\lambda_1$ and $\lambda_2$ are weight parameters.
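The weighted combination can be sketched as follows. Only the form of the total loss and the roles of the three terms come from the text; the concrete expressions are assumptions made for illustration: a cross-entropy term in which unmatched suggested points are weighted by `gamma1`, a squared Euclidean distance for the location term, and an L1 difference for the size term. All names, the default `gamma1` and the weight values are hypothetical.

```python
from math import log
from typing import Dict, List, Tuple

Pair = Tuple[float, float]


def counting_losses(
    conf: List[float],          # confidence of each suggested point
    points: List[Pair],         # (x, y) of each suggested point
    sizes: List[Pair],          # (w, h) of each suggested box
    targets: List[Pair],        # (x, y) of each target point
    zeta: List[int],            # zeta[i]: suggested point matched to target i
    exemplars: Dict[int, Pair], # target index -> (w, h), only for sample targets
    gamma1: float = 0.5,
    eps: float = 1e-8,
) -> Tuple[float, float, float]:
    """Return (L_cls, L_loc, L_size) under the assumptions stated above."""
    n, m = len(conf), len(targets)
    matched = set(zeta)
    # Cross entropy: matched points are positives, the rest are down-weighted negatives.
    l_cls = -sum(
        log(conf[j] + eps) if j in matched else gamma1 * log(1.0 - conf[j] + eps)
        for j in range(n)
    ) / n
    # Location: squared distance between each target and its matched suggested point.
    l_loc = sum(
        (targets[i][0] - points[zeta[i]][0]) ** 2
        + (targets[i][1] - points[zeta[i]][1]) ** 2
        for i in range(m)
    ) / m
    # Size: only targets that have a sample box contribute.
    l_size = sum(
        abs(w - sizes[zeta[i]][0]) + abs(h - sizes[zeta[i]][1])
        for i, (w, h) in exemplars.items()
    ) / len(exemplars)
    return l_cls, l_loc, l_size


def total_loss(l_cls: float, l_loc: float, l_size: float,
               lambda1: float, lambda2: float) -> float:
    """L = L_cls + lambda1 * L_loc + lambda2 * L_size."""
    return l_cls + lambda1 * l_loc + lambda2 * l_size


# Hypothetical example: two suggested points, one target that is also a sample.
l_cls, l_loc, l_size = counting_losses(
    conf=[0.9, 0.1],
    points=[(1.0, 1.0), (5.0, 5.0)],
    sizes=[(4.0, 6.0), (2.0, 2.0)],
    targets=[(1.0, 2.0)],
    zeta=[0],
    exemplars={0: (5.0, 5.0)},
)
total = total_loss(l_cls, l_loc, l_size, lambda1=0.5, lambda2=0.1)
print(l_cls, l_loc, l_size, total)
```

The weights 0.5 and 0.1 are placeholders; the text does not give values for the weight parameters.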

The invention uses a small number of sample annotation boxes to supervise the predicted size. Introducing this size supervision improves counting accuracy, so that target boxes can be predicted while the number of targets is still predicted accurately.

Compared with counting schemes based on a density map, the method of the invention predicts not only the number of target objects but also the positions of the target points and the corresponding target regions. It can therefore be used directly for target detection tasks on arbitrary categories with very few annotations, and can further meet the requirements of complex downstream tasks.

Extensive experiments show that the method, which implements spatial interaction and channel interaction, outperforms state-of-the-art methods on popular class-agnostic counting (CAC) benchmarks and also achieves good positioning accuracy and target box generation. In addition, the experiments show that the inference time of the invention is significantly shorter than that of other methods.

Example 2

Based on the method for positioning and counting any type of targets based on few sample labeling described in Embodiment 1, the invention further provides a system for that method. As shown in Fig. 2, the system includes: a layered sample collaborative enhancement module, a feature interaction module and a multi-head positioning module;

the hierarchical sample extraction module is used for acquiring a plurality of layers of output features of an input image, performing a region-of-interest pooling operation on the output features, at the different layers, of at least 1 sample target frame of the given input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;

the cross-scale alignment module is used for extracting the layered sample features of the input image and aligning the number of feature channels of the output features of the different layers using different mapping layers, to obtain a feature matrix of size $(n \times L) \times C_N$, in which each row is the feature representation of one sample;

the feature collaborative enhancement module is used for performing the first mutual attention calculation on the obtained $n \times L$ sample features and integrating the size information into the features by means of size embedding for collaborative enhancement, so as to realize information exchange between different sample features;

the spatial interaction module is used for spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;

the channel interaction module is used for performing channel interaction on the sample features obtained by the first mutual attention calculation, determining the weight of each channel in the query feature map, and obtaining the query correlation diagram, thereby mining the correlation between the image and the sample features;

the multi-head positioning module is used for performing target category positioning prediction according to the query correlation diagram, including predicting the suggested points, predicting the confidence of each suggested point and predicting the suggested box size of each suggested point.

The hierarchical sample extraction module comprises a feature extractor for acquiring the plurality of layers of output features of the input image, and a region-of-interest pooling layer for pooling those output features over the sample target frames of the given input image at the different layers.
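The extraction and alignment steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: feature maps are nested lists of shape C x H x W, region-of-interest pooling is taken to be an average over the box (the text does not say whether average or max pooling is used), each mapping layer is a plain weight matrix without bias, and the rows of the result are ordered sample by sample. All function names and values are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # C x H x W
Box = Tuple[int, int, int, int]       # (x0, y0, x1, y1), end-exclusive


def roi_pool_1x1(feature: FeatureMap, box: Box) -> List[float]:
    """Average each channel over the box, giving a 1 x 1 x C_i feature."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    return [
        sum(sum(row[x0:x1]) for row in channel[y0:y1]) / area
        for channel in feature
    ]


def project(vector: List[float], weight: List[List[float]]) -> List[float]:
    """Map a length-C_i vector to length C_N with a C_N x C_i weight matrix."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def exemplar_matrix(
    features: List[FeatureMap],
    boxes_per_layer: List[List[Box]],
    weights: List[List[List[float]]],
) -> List[List[float]]:
    """Build the (n x L) x C_N matrix: one row per (sample, layer) pair."""
    n = len(boxes_per_layer[0])
    return [
        project(roi_pool_1x1(features[l], boxes_per_layer[l][k]), weights[l])
        for k in range(n)
        for l in range(len(features))
    ]


# Hypothetical input: L = 2 layers with C_0 = 2 and C_1 = 3 channels, n = 1 sample, C_N = 2.
feat0 = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
feat1 = [[[3.0] * 2 for _ in range(2)] for _ in range(3)]
boxes = [[(0, 0, 2, 2)], [(0, 0, 1, 1)]]  # the same sample box in each layer's coordinates
weights = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0, 1.0], [0.0, 0.0, 1.0]]]

matrix = exemplar_matrix([feat0, feat1], boxes, weights)
print(matrix)
```

Using a separate weight matrix per layer is what lets features with different channel counts $C_i$ end up in rows of the same length $C_N$.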

The multi-head positioning module comprises an offset head, a classification head and a size head.
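How the three heads could be combined into suggested points and suggested boxes is sketched below. The sketch assumes that the offset head predicts a displacement that is added to each anchor point, and that the box is centered on the resulting suggested point; the scaling factor `beta` applied to the size head output is described in claim 6. The function name, the dictionary layout and the values are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[float, float]


def decode_proposals(
    anchors: List[Pair],      # anchor point coordinates (x, y)
    offsets: List[Pair],      # offset head output (dx, dy) per anchor
    confidences: List[float], # classification head output per anchor
    raw_sizes: List[Pair],    # size head output (w, h) per anchor
    beta: float = 1.0,        # scaling factor applied to the predicted size
) -> List[Dict[str, object]]:
    """Turn per-anchor head outputs into suggested points with boxes."""
    proposals = []
    for (ax, ay), (dx, dy), c, (w, h) in zip(anchors, offsets, confidences, raw_sizes):
        cx, cy = ax + dx, ay + dy      # suggested point (assumed: anchor + offset)
        bw, bh = beta * w, beta * h    # final target size
        box = (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
        proposals.append({"point": (cx, cy), "confidence": c, "box": box})
    return proposals


# Hypothetical head outputs for a single anchor point.
decoded = decode_proposals(
    anchors=[(8.0, 8.0)],
    offsets=[(1.0, -2.0)],
    confidences=[0.7],
    raw_sizes=[(2.0, 4.0)],
    beta=2.0,
)
print(decoded[0])
```

The box is returned in corner form (x_min, y_min, x_max, y_max) purely for convenience; a center-plus-size form carries the same information.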

It is to be understood that the above examples of the present invention are provided by way of illustration only and not by way of limitation of the embodiments of the present invention. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the invention are desired to be protected by the following claims.

Claims

1. A method for positioning and counting any type of targets based on few sample labeling, characterized in that the method comprises the following steps:

spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map;

2. The method for locating and counting any type of targets based on few-sample labeling according to claim 1, characterized in that: performing the first mutual attention calculation on the $N_{ex} \times L$ sample features, so as to realize information exchange between different sample features, comprises: performing the first mutual attention calculation on these sample features with $L_{ca}$ Transformer encoder layers,

$$z'_l = \mathrm{MHSA}(\mathrm{LN}(z_{l-1})) + z_{l-1}, \quad l = 1, \ldots, L_{ca}$$

$$z_l = \mathrm{MLP}(\mathrm{LN}(z'_l)) + z'_l, \quad l = 1, \ldots, L_{ca}$$

where $z_0$ denotes the initial features of the attention calculation (including the sample features with the size embedding), whose components are the individual sample features and the size embedding; $z'_l$ denotes the intermediate result of the $l$-th layer attention calculation; $z_{l-1}$ denotes the result of the $(l-1)$-th layer attention calculation; $z_l$ denotes the result of the $l$-th layer attention calculation; MHSA denotes a multi-head self-attention layer; LN denotes a normalization layer; and MLP denotes a multi-layer perceptron.

3. The method for locating and counting any type of targets based on few-sample labeling according to claim 2, characterized in that: integrating the size information into the features by means of size embedding comprises:

4. The method for locating and counting any type of targets based on few-sample labeling according to claim 1, characterized in that: spatially interacting the obtained $N_{ex}$ sample features with the output features of the last layer to obtain a query feature map comprises:

taking the output features of the last layer as the image representation features, taking the feature at each position as one element of the query set to form the Query $F_Q$, and feeding $F_Q$ together with $F_K$ and $F_V$ into a feature query module for a second mutual attention calculation, finally obtaining the query feature map.

5. The method for locating and counting any type of targets based on few-sample labeling according to claim 4, characterized in that: the second mutual attention calculation is computed as follows:

where $E_{pos}$ denotes the position embedding, which uses the sine-cosine encoding commonly used in Transformers; MHSA denotes a multi-head self-attention layer, LN denotes a normalization layer, and MLP denotes a multi-layer perceptron; the remaining terms are, in order, the initial features of the attention calculation (including the sample features with the position embedding), the intermediate result of the $l$-th layer attention calculation, the result of the $(l-1)$-th layer attention calculation, and the result of the $l$-th layer attention calculation.

6. The method for locating and counting any type of targets based on few-sample labeling according to claim 1, characterized in that: performing target category positioning prediction according to the query correlation diagram, including predicting the suggested points, predicting the confidence of each suggested point and predicting the suggested box size of each suggested point, comprises:

and, for each anchor point, using the classification head to generate a confidence $C_i$ for each suggested point; using the size head to generate the width and height of the potential target at each point, which are then multiplied by the scaling factor $\beta$ to obtain the final target size; and, from the constructed suggested point coordinates and the corresponding target sizes, obtaining a suggested box that is centered on the suggested point coordinates and has the predicted width and height;

7. The method for positioning and counting any category of targets based on few sample labeling according to any one of claims 1 to 6, characterized in that: after the suggested points are screened by confidence and the number of suggested points whose confidence exceeds a set threshold (0.5) is taken as the counting result, the method further comprises: matching the suggested point set with the target point set, and optimizing the target category positioning prediction process based on the matching result.

8. The method for locating and counting any type of targets based on few-sample labeling according to claim 7, characterized in that: matching the suggested point set with the target point set comprises:

9. The method for locating and counting any type of targets based on few-sample labeling according to claim 8, characterized in that: optimizing the target category positioning prediction process based on the matching result comprises:

The optimized total loss function is calculated as follows:

$$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{size}$$

where $\lambda_1$ and $\lambda_2$ are weight parameters.

10. A system for the method for positioning and counting any type of targets based on few sample labeling according to any one of claims 1 to 9, characterized in that the system comprises: a layered sample collaborative enhancement module, a feature interaction module and a multi-head positioning module;

the layered sample extraction module is used for extracting the layered sample features of an input image: acquiring a plurality of layers of output features of the input image, performing a region-of-interest pooling operation on the output features, at the different layers, of at least 1 sample target frame of the given input image, and pooling each sample region feature into a multi-scale feature of size $1 \times 1 \times C_i$;

the feature collaborative enhancement module is used for performing the first mutual attention calculation on the $N_{ex} \times L$ sample features and integrating the size information into the sample features by means of size embedding for collaborative enhancement, so as to realize information exchange between different sample features;