CN115311449A - Weak supervision image target positioning analysis system based on class reactivation mapping chart - Google Patents

Weak supervision image target positioning analysis system based on class reactivation mapping chart

Info

Publication number
CN115311449A
CN115311449A
Authority
CN
China
Prior art keywords
class
foreground
image
reactivation
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210864306.6A
Other languages
Chinese (zh)
Inventor
Zhang Yuejie (张玥杰)
Xu Jilan (徐际岚)
Liu Jingzheng (刘靖正)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN202210864306.6A
Publication of CN115311449A
Legal status: Pending

Classifications

    • G06V 10/245: Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/40: Extraction of image or video features
    • G06V 10/762: Image or video recognition or understanding using pattern recognition or machine learning, using clustering, e.g. of similar faces in social networks
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/809: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of classification results, e.g. where the classifiers operate on the same input data
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks


Abstract

The invention belongs to the technical field of image processing, and specifically relates to a weakly supervised image target localization analysis system based on class reactivation maps. The invention comprises a category context feature learning module, a class map reactivation module, and a class reactivation map calibration module. The category context feature learning module extracts image features with a convolutional neural network and generates an initial class map that serves as an index for learning class context features; the class map reactivation module takes the class context features as cluster centers, applies an expectation-maximization algorithm to cluster the image pixel features, and takes the hidden variables as the class reactivation maps; the class reactivation map calibration module calibrates the foreground and background activation values of the class reactivation maps and fuses them with the initial class maps. The invention effectively resolves the confusion between the foreground and background activation values of the initial class map, makes the two clearly distinguishable, and improves target localization results when only image class labels are available as supervision.

Description

Weak supervision image target positioning analysis system based on class reactivation mapping chart
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a weakly supervised image target localization analysis system based on class reactivation maps.
Background
In recent years, deep learning has achieved impressive results in a variety of computer vision applications. As one of the problems most worth exploring, image target localization aims to locate the key objects in a given image, a task that plays a crucial role in image content analysis and scene understanding. From this problem setting, the weakly supervised target localization task is derived. Compared with traditional image target localization, this task has no object location labels during training, only image category labels, which carry weaker semantic information. Weakly supervised target localization is therefore harder than the traditional task, but better suited to practical application: the Internet offers a huge number of image-category label pairs but very few finely labeled object locations, and the development of weakly supervised object localization techniques makes it possible to learn from large amounts of Internet data.
The mainstream approach to weakly supervised image target localization is to train a classification model and compute a class map (class activation map) to localize the image target. In the class map, the foreground is typically the region whose activation value exceeds a threshold τ, and the rest is defined as background. However, the class map usually localizes only the most discriminative region of the target, resulting in incomplete localization. The main reason is that during training the classification model only needs to attend to the most discriminative part of the image to make its classification decision; it obtains the classification result without attending to all regions of the object. The class map therefore lacks the ability to localize the target completely. Existing methods address this problem with one-stage and two-stage designs. One-stage methods include: (1) repeatedly erasing the most discriminative region of the image while training the classification model, forcing the model to attend to other parts; (2) constraining the model to cover more of the object by adding a regularization term to the loss function; and (3) adding an attention module to perceive the remaining parts of the object. Although these methods have done well at alleviating incomplete localization, they remain confined to performing weakly supervised localization within the classification framework based on class maps, which frames the problem as "which pixels contribute to the final class prediction". Two-stage methods additionally train a target localizer on top of the first stage, decoupling the two subtasks of target localization and image classification. This line of work observes that, in one-stage methods, early training perceives the overall extent of the target well but classifies poorly, while later training classifies accurately but localizes incompletely; decoupling localization from classification therefore lets both achieve good results at the same time. The system of the present invention can be applied to both one-stage and two-stage methods. To solve the above problems, it is necessary to introduce a completely new target localization paradigm beyond sole reliance on a classification model, to increase the separation between the foreground and background parts of the class map, and to improve weakly supervised image target localization results.
Disclosure of Invention
The invention aims to provide a weakly supervised image target localization system based on class reactivation maps, to solve the problem of incomplete foreground localization in current weakly supervised image target localization.
The invention provides a weakly supervised image target localization system based on class reactivation maps, comprising a category context feature learning module, a class map reactivation module, and a class reactivation map calibration module. The category context feature learning module extracts image features and generates an initial class map that serves as an index for learning class context features; the class map reactivation module receives the image features and the class context features, distinguishes foreground from background by pixel-level clustering, generates a class reactivation map, and inputs it to the class reactivation map calibration module; the class reactivation map calibration module locates coarse foreground and background regions according to the class map and uses them to guide the calibration of the foreground and background activation values of the class reactivation map.
In the invention, the category context feature learning module comprises an image feature extraction network and a fully connected neural network classifier. The image feature extraction network performs hierarchical feature extraction on the image using a VGG16, Inception-V3, or ResNet50 deep convolutional neural network, generating a spatial feature vector f of dimension h × w × 1,024. The feature vector f is fed into the fully connected neural network classifier, which performs a weighted summation of the spatial feature vector f with the fully connected network weights w for class c to obtain an initial class map $M_c$ of dimension h × w. The process can be represented as:

$$M_c(i,j) = \sum_{k} w_k^{c}\, f_k(i,j) \tag{1}$$

where $f_k$ is the k-th component of the spatial feature vector f and $w_k^{c}$ is the k-th component of the weight w corresponding to the c-th class. Based on the class map, the classifier's final class prediction for the image can be expressed as:

$$\hat{y}_c = \sum_{i,j} M_c(i,j) \tag{2}$$

where (i, j) denotes a spatial position. From equation (2), solving the weakly supervised image target localization problem with the fully connected neural network classifier and the class map can be generalized to answering "which pixels contribute to the class prediction". The class map is normalized to the [0,1] interval and binarized with a threshold τ: every position in the normalized class map whose value exceeds τ is taken as foreground, and the rest as background. Incomplete localization results from the class map focusing too heavily on the salient regions of the object.
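
The computation in equations (1) and (2) is simple enough to sketch directly. The following is a minimal sketch assuming NumPy arrays; the function and variable names are illustrative, not taken from the patent:

```python
import numpy as np

def class_map(f: np.ndarray, w_c: np.ndarray) -> np.ndarray:
    """Initial class map M_c(i, j) = sum_k w_c[k] * f[i, j, k], as in equation (1)."""
    return np.tensordot(f, w_c, axes=([2], [0]))        # (h, w, K) x (K,) -> (h, w)

def class_score(M_c: np.ndarray) -> float:
    """Class prediction as the spatial sum of the class map, as in equation (2)."""
    return float(M_c.sum())

def binarize(M_c: np.ndarray, tau: float = 0.5) -> np.ndarray:
    """Normalize the class map to [0, 1] and threshold it: above tau is foreground."""
    M = (M_c - M_c.min()) / (M_c.max() - M_c.min() + 1e-8)
    return (M > tau).astype(np.float32)
```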
The present invention further proposes to maintain class-wise context feature vectors. For each class c, the foreground and background context feature vectors are denoted $V_c^{fg}$ and $V_c^{bg}$ respectively (superscripts fg and bg denote foreground and background, here and below). The context feature vectors are d-dimensional and serve as the per-class cluster centers that summarize the common foreground and background features of that class. First, the invention binarizes the initial class map:

$$\tilde{M}_c^{fg} = \mathbb{1}\!\left(M_c > \delta\right), \qquad \tilde{M}_c^{bg} = \mathbb{1}\!\left(M_c \le \delta\right) \tag{3}$$

where δ denotes a threshold and $\mathbb{1}(\cdot)$ the indicator function. $\tilde{M}_c^{fg}$ and $\tilde{M}_c^{bg}$ serve as rough estimates of the foreground and the background. For each sample with deep features F, the foreground and background features are obtained with the estimates $\tilde{M}_c^{fg}$ and $\tilde{M}_c^{bg}$ respectively, and the context feature vectors are updated with their means. The specific process can be expressed as:

$$V_c^{fg} \leftarrow \lambda\, V_c^{fg} + (1-\lambda)\, \frac{\sum_{i,j} \tilde{M}_c^{fg}(i,j)\, F_{ij}}{\left\|\tilde{M}_c^{fg}\right\|_0} \tag{4}$$

$$V_c^{bg} \leftarrow \lambda\, V_c^{bg} + (1-\lambda)\, \frac{\sum_{i,j} \tilde{M}_c^{bg}(i,j)\, F_{ij}}{\left\|\tilde{M}_c^{bg}\right\|_0} \tag{5}$$

where $F_{ij}$ is the value of the feature F at spatial position (i, j), and $\|\cdot\|_0$ counts the non-zero entries. The foreground and background context feature vectors are updated with momentum, with momentum parameter λ. Momentum updating ensures the context features change slowly and retain more historical features.
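
A minimal sketch of the momentum update in equations (3) to (5), under the same illustrative naming assumptions (F is the h × w × d deep feature map, M_c the initial class map of the ground-truth class):

```python
import numpy as np

def update_context(V_fg, V_bg, F, M_c, delta=0.5, lam=0.9):
    """Momentum update of the per-class foreground/background context vectors."""
    mask_fg = M_c > delta                   # equation (3): coarse foreground estimate
    mask_bg = ~mask_fg                      # complement: coarse background estimate
    if mask_fg.any():                       # mean feature over estimated foreground pixels
        V_fg = lam * V_fg + (1 - lam) * F[mask_fg].mean(axis=0)   # equation (4)
    if mask_bg.any():                       # mean feature over estimated background pixels
        V_bg = lam * V_bg + (1 - lam) * F[mask_bg].mean(axis=0)   # equation (5)
    return V_fg, V_bg
```

Averaging the selected pixel features with `mean(axis=0)` is exactly the $\|\cdot\|_0$-normalized masked sum written in equations (4) and (5).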
In the invention, the class map reactivation module reactivates the class map, raising the activation values of the foreground part and widening the separation between the foreground and background activation values, so that target localization becomes more accurate. The module defines the reactivation problem as a parameter estimation problem based on a Gaussian mixture model and solves it with the expectation-maximization algorithm, which extends maximum likelihood estimation to probabilistic models containing hidden variables. Specifically:

For each sample x, the goal is to maximize the likelihood:

$$\theta^{*} = \arg\max_{\theta} \prod_{i,j} p\!\left(x_{ij} \mid \theta\right) \tag{6}$$

where $\theta = \{a^{fg}, a^{bg}, V^{fg}, V^{bg}\}$ are the model parameters, with superscripts fg and bg denoting foreground and background respectively. Each image pixel $x_{ij}$ obeys a probability mixture model composed of a foreground Gaussian distribution and a background Gaussian distribution:

$$p\!\left(x_{ij} \mid \theta\right) = a^{fg}\, p^{fg}\!\left(x_{ij} \mid V^{fg}\right) + a^{bg}\, p^{bg}\!\left(x_{ij} \mid V^{bg}\right) \tag{7}$$

where the mixing weights $a^{fg}, a^{bg}$ are real numbers in [0,1] satisfying $a^{fg} + a^{bg} = 1$, and the foreground and background base models $p^{fg}$ and $p^{bg}$ measure the image features against the learned class context feature vectors. For implementation efficiency, the invention does not adopt the radial basis function of the Gaussian mixture model but takes cosine similarity as the measure:

$$p^{fg}\!\left(x_{ij} \mid V^{fg}\right) \propto \exp\!\left(\frac{\left\langle F_{ij}, V^{fg}\right\rangle}{\left\|F_{ij}\right\| \left\|V^{fg}\right\| \sigma}\right) \tag{8}$$

$$p^{bg}\!\left(x_{ij} \mid V^{bg}\right) \propto \exp\!\left(\frac{\left\langle F_{ij}, V^{bg}\right\rangle}{\left\|F_{ij}\right\| \left\|V^{bg}\right\| \sigma}\right) \tag{9}$$

where the hyper-parameter σ controls the degree of smoothing.
Next, the model is solved using the expectation-maximization algorithm. Hidden variables $Z^{fg}$ and $Z^{bg}$ are defined to represent the probability that the image pixel at position (i, j) belongs to the foreground and to the background, respectively.

The expectation-maximization algorithm first empirically assigns an initial distribution to each class (i.e., hidden variable) by assuming distribution parameters $V^{fg}$ and $V^{bg}$, and computes the expected value of the hidden variables of each data point from those distribution parameters (E-step); it then computes the maximum-likelihood values of the distribution parameters from that assignment and recomputes the hidden-variable expectations of each data point accordingly (M-step); the two steps cycle until convergence.

First, in the E-step, the parameters of the current model are used to compute the posterior distribution of the hidden variables, i.e. $Z^{fg(t)}$ and $Z^{bg(t)}$. At each iteration t (1 ≤ t ≤ T), assuming the model parameters are fixed, the hidden variables are computed as:

$$Z_{ij}^{fg(t)} = \frac{a^{fg(t)}\, p^{fg}\!\left(x_{ij} \mid V^{fg(t)}\right)}{a^{fg(t)}\, p^{fg}\!\left(x_{ij} \mid V^{fg(t)}\right) + a^{bg(t)}\, p^{bg}\!\left(x_{ij} \mid V^{bg(t)}\right)} \tag{10}$$

$$Z_{ij}^{bg(t)} = 1 - Z_{ij}^{fg(t)} \tag{11}$$

From the clustering perspective, equation (10) computes the similarity between the pixel-wise features and the foreground and background context feature vectors and assigns each pixel a soft label, i.e., its probability of belonging to the foreground or the background. This differs from the usual expectation-maximization algorithm, which uses random initialization: here the cluster centers are obtained from class context feature learning, so they carry more definite and richer class-level semantic information about the image.

In the M-step, the goal is to adjust the context feature vectors to fit the current image features. This is done by maximizing the expected likelihood of the image features under the computed hidden-variable values. The parameter update process is represented as:

$$V^{fg(t+1)} = \frac{\sum_{i,j} Z_{ij}^{fg(t)}\, F_{ij}}{\sum_{i,j} Z_{ij}^{fg(t)}} \tag{12}$$

$$V^{bg(t+1)} = \frac{\sum_{i,j} Z_{ij}^{bg(t)}\, F_{ij}}{\sum_{i,j} Z_{ij}^{bg(t)}} \tag{13}$$

$$a^{fg(t+1)} = \frac{1}{hw} \sum_{i,j} Z_{ij}^{fg(t)} \tag{14}$$

$$a^{bg(t+1)} = 1 - a^{fg(t+1)} \tag{15}$$

where $V^{fg}$ and $V^{bg}$ are updated by a weighted average of the features, and $a^{fg}$ and $a^{bg}$ are updated by the number of effective pixels.

The E-step and M-step alternate until convergence. At that point the hidden variables $Z^{fg(T)}$ and $Z^{bg(T)}$ represent the probabilities that each pixel feature belongs to the foreground and to the background. The hidden variables can successfully complete the reactivation because probabilities replace the original activation values during clustering: the larger the probability, the more the pixel belongs to the foreground. In addition, using feature clustering instead of the class map from the original classification model also prevents the implicit bias toward over-attending to local regions introduced by the global average pooling layer.
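
The whole reactivation loop fits in a few lines. Below is a minimal sketch of equations (8) to (15), assuming the illustrative conventions above; the initial mixing weights of 0.5 and the default σ and T are assumptions, not values prescribed by the patent:

```python
import numpy as np

def cosine_kernel(F, V, sigma=0.1):
    """Base model p(x_ij | V) proportional to exp(cos(F_ij, V) / sigma), eqs. (8)-(9)."""
    Fn = F / (np.linalg.norm(F, axis=-1, keepdims=True) + 1e-8)
    Vn = V / (np.linalg.norm(V) + 1e-8)
    return np.exp(Fn @ Vn / sigma)                       # (h, w)

def reactivate(F, V_fg, V_bg, T=10, sigma=0.1):
    """EM over the two-component mixture; returns Z_fg, Z_bg after T iterations."""
    a_fg = a_bg = 0.5                                    # assumed initial mixing weights
    for _ in range(T):
        # E-step: posterior foreground/background probability per pixel, eqs. (10)-(11)
        p_fg = a_fg * cosine_kernel(F, V_fg, sigma)
        p_bg = a_bg * cosine_kernel(F, V_bg, sigma)
        Z_fg = p_fg / (p_fg + p_bg + 1e-8)
        Z_bg = 1.0 - Z_fg
        # M-step: refit cluster centers and mixing weights, eqs. (12)-(15)
        V_fg = (Z_fg[..., None] * F).sum(axis=(0, 1)) / (Z_fg.sum() + 1e-8)
        V_bg = (Z_bg[..., None] * F).sum(axis=(0, 1)) / (Z_bg.sum() + 1e-8)
        a_fg = float(Z_fg.mean())
        a_bg = 1.0 - a_fg
    return Z_fg, Z_bg
```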
In the invention, the class reactivation map calibration module calibrates the class reactivation maps. Because the expectation-maximization algorithm is used in the class map reactivation module, when the resulting hidden variables $Z^{fg(T)}$ and $Z^{bg(T)}$ serve as the initial reactivation maps, the foreground activation values are not guaranteed to exceed the background activation values, so calibration is required. That is, $Z^{fg(T)}$ is taken as the class reactivation map if and only if its foreground activation value is greater than the background activation value; otherwise $Z^{bg(T)}$ is selected. The module uses the initial class activation map as a guide for calibration. From equation (3), the coarse foreground part $\tilde{M}^{fg}$ and background part $\tilde{M}^{bg}$ can be obtained. The invention decides by estimating the average probabilities $\mu^{fg}$ and $\mu^{bg}$ with which $Z^{fg(T)}$ and $Z^{bg(T)}$ belong to the foreground:

$$\mu^{fg} = \frac{\sum_{i,j} \tilde{M}^{fg}(i,j)\, Z_{ij}^{fg(T)}}{\left\|\tilde{M}^{fg}\right\|_0} \tag{16}$$

$$\mu^{bg} = \frac{\sum_{i,j} \tilde{M}^{fg}(i,j)\, Z_{ij}^{bg(T)}}{\left\|\tilde{M}^{fg}\right\|_0} \tag{17}$$

where the foreground part is given by $\tilde{M}^{fg}$. Although $\tilde{M}^{fg}$ marks only part of the foreground, this does not affect the coarse foreground and background regions: the salient foreground region can still serve as a foreground cue. According to equations (16) and (17), the class reactivation map with the higher average foreground probability is the one that belongs to the foreground. The computation of the calibrated foreground reactivation map $\hat{M}^{fg}$ can be expressed as:

$$\hat{M}^{fg} = \begin{cases} Z^{fg(T)}, & \mu^{fg} \ge \mu^{bg} \\ Z^{bg(T)}, & \text{otherwise} \end{cases} \tag{18}$$

To further distinguish the foreground and background activation values, the class reactivation map and the initial class map are fused in the class reactivation map calibration module; the computation can be expressed as:

$$M_c^{final} = M_c \odot \hat{M}^{fg} \tag{19}$$

After normalization and thresholding, the resulting final class reactivation map can generate a mask or bounding box marking the image target location.
The working procedure of the weakly supervised image target localization analysis system based on class reactivation maps is as follows:
(1) Train the deep convolutional neural network model in the category context feature learning module, represent the image with it, and extract the deep image feature representation; apply the fully connected network classifier to the deep image features by weighted summation to obtain the initial class map; compute the average foreground and background feature vectors from the initial class map and the image features, and update the class context features with momentum.
(2) Take the deep image features and the class context features as the input of the class map reactivation module; according to the similarity between the deep image features and the class context features, cluster each pixel feature of the image into foreground or background features with the expectation-maximization algorithm, and take the foreground and background hidden variables as the class reactivation maps.
(3) Take the foreground and background class reactivation maps as the input of the class reactivation map calibration module; according to the coarse foreground and background localization of the initial class map, distinguish the foreground from the background of the reactivation maps and calibrate the class reactivation map; at the same time, fuse the initial class map and the class reactivation map to obtain the final image target localization result.
Throughout the whole process, no target localization labels are needed: weakly supervised image target localization is completed using only image class labels, and accurate localization results are obtained.
The advantages of the invention include:
first, the problem of incomplete localization in exploring the initial class map is due to the confusion of the activation values of its foreground and background parts. In order to solve the problem, a weak supervision image target positioning analysis system based on a class reactivation mapping chart is provided, and accurate and complete image target positioning is achieved;
secondly, a category context feature learning module is provided for the first time, and momentum updating is carried out on foreground and background context features through the image features and the initial category mapping graph, so that each category of foreground and background context features can represent common features of the image foreground and background parts of the category;
and thirdly, a category map reactivation module is firstly provided, reactivation is used as a parameter estimation problem of the Gaussian mixture model, an expectation maximization algorithm is used for solving, and the obtained hidden variable is used as a category reactivation map. On the basis, a class reactivation map calibration module is provided, and the initial class map is calibrated and fused for the class reactivation map. The obtained final class reactivation mapping map can obviously distinguish the foreground from the background;
finally, the optimal weak supervision image target positioning result is obtained from the public data sets ImageNet, CUB and OpenImages, and the positioning result has interpretability.
Drawings
FIG. 1 is a system diagram of the present invention.
FIG. 2 is an overall framework diagram of the model of the present invention.
Detailed Description
As is known in the art, most previous studies face the same problem: incomplete target localization in the initial class map generated with a classification model. The present invention investigates this problem in depth and finds that the incomplete target localization is caused by the confusion between the activation values of the foreground and background parts of the initial class map. To address it, the invention proposes a weakly supervised image target localization analysis system based on class reactivation maps to achieve accurate localization; within the system, a deep convolutional neural network is combined with the traditional expectation-maximization algorithm to solve weakly supervised target localization from a novel clustering paradigm. The proposed weakly supervised image target localization system is applicable to all one-stage and two-stage localization models and can significantly improve localization accuracy.
The invention will be described in detail hereinafter with reference to the drawings.
As shown in FIG. 1, the class reactivation map-based weakly supervised image target localization analysis system of the present invention includes a category context feature learning module, a class map reactivation module, and a class reactivation map calibration module. Its workflow is as follows:
Step one: the category context feature learning module first extracts hierarchical features from the image, generates the spatial feature vector, and feeds it into the fully connected neural network classifier; the spatial feature vector is weighted and summed with the fully connected network weights to obtain the initial class map $M_c$. The process can be represented as:

$$M_c(i,j) = \sum_{k} w_k^{c}\, f_k(i,j) \tag{1}$$

On this basis, the class context features $V_c^{fg}$ and $V_c^{bg}$ are defined to represent the common features of the foreground and the background of the c-th class, respectively. From the initial class map, the foreground region $\tilde{M}_c^{fg}$ and the background region $\tilde{M}_c^{bg}$ are obtained, multiplied element-wise with the deep image features, and averaged to give the foreground-region and background-region features, which are used to update the class context features. The update process is represented as:

$$V_c^{fg} \leftarrow \lambda\, V_c^{fg} + (1-\lambda)\, \frac{\sum_{i,j} \tilde{M}_c^{fg}(i,j)\, F_{ij}}{\left\|\tilde{M}_c^{fg}\right\|_0} \tag{4}$$

$$V_c^{bg} \leftarrow \lambda\, V_c^{bg} + (1-\lambda)\, \frac{\sum_{i,j} \tilde{M}_c^{bg}(i,j)\, F_{ij}}{\left\|\tilde{M}_c^{bg}\right\|_0} \tag{5}$$
Step two: the class map reactivation module reactivates the class map, raising the activation values of the foreground part and widening the separation between the foreground and background activation values, so that target localization becomes more accurate. First, class map reactivation is treated as a parameter estimation problem for the Gaussian mixture model: each image pixel $x_{ij}$ obeys the probability mixture model composed of a foreground Gaussian distribution and a background Gaussian distribution,

$$p\!\left(x_{ij} \mid \theta\right) = a^{fg}\, p^{fg}\!\left(x_{ij} \mid V^{fg}\right) + a^{bg}\, p^{bg}\!\left(x_{ij} \mid V^{bg}\right) \tag{7}$$

The problem is solved by the expectation-maximization algorithm. Hidden variables $Z^{fg}$ and $Z^{bg}$ are introduced to represent the probability that the image pixel at position (i, j) belongs to the foreground and to the background, respectively; the algorithm completes learning by alternately iterating the E-step and the M-step. In the E-step, the hidden variables are estimated with the model parameters held fixed:

$$Z_{ij}^{fg(t)} = \frac{a^{fg(t)}\, p^{fg}\!\left(x_{ij} \mid V^{fg(t)}\right)}{a^{fg(t)}\, p^{fg}\!\left(x_{ij} \mid V^{fg(t)}\right) + a^{bg(t)}\, p^{bg}\!\left(x_{ij} \mid V^{bg(t)}\right)}, \qquad Z_{ij}^{bg(t)} = 1 - Z_{ij}^{fg(t)}$$

In the M-step, the model parameters are updated with the hidden variables held fixed:

$$V^{fg(t+1)} = \frac{\sum_{i,j} Z_{ij}^{fg(t)}\, F_{ij}}{\sum_{i,j} Z_{ij}^{fg(t)}}, \qquad V^{bg(t+1)} = \frac{\sum_{i,j} Z_{ij}^{bg(t)}\, F_{ij}}{\sum_{i,j} Z_{ij}^{bg(t)}}$$

After T iterations, the model converges. The resulting hidden variables $Z^{fg(T)}$ and $Z^{bg(T)}$, taken as the initial class reactivation maps, distinguish the foreground from the background markedly.
Step three: the class reactivation map calibration module calibrates the class reactivation maps and fuses in the initial class map. The calibration uses the initial class map as the foreground index $\tilde{M}^{fg}$ and computes the average probabilities with which $Z^{fg(T)}$ and $Z^{bg(T)}$ belong to the foreground:

$$\mu^{fg} = \frac{\sum_{i,j} \tilde{M}^{fg}(i,j)\, Z_{ij}^{fg(T)}}{\left\|\tilde{M}^{fg}\right\|_0} \tag{16}$$

$$\mu^{bg} = \frac{\sum_{i,j} \tilde{M}^{fg}(i,j)\, Z_{ij}^{bg(T)}}{\left\|\tilde{M}^{fg}\right\|_0} \tag{17}$$

The class reactivation map with the higher average probability is selected as the calibrated class reactivation map and fused with the initial class map, further sharpening the separation between the foreground and background activation values and making the foreground localization more accurate.
The public datasets CUB, ILSVRC, and OpenImages were used for the experiments. The CUB dataset contains 200 bird species; its training and test sets contain 5,994 and 5,794 images respectively. ILSVRC contains 1.2 million training images and 50,000 test images. Both CUB and ILSVRC provide object bounding boxes as labels, so the maximum box accuracy (MaxBoxAccV2) metric is adopted for them. The OpenImages dataset contains 100 classes, with 29,819, 2,500, and 5,000 images in the training, validation, and test sets respectively; it provides only pixel-level labels, so the threshold-independent pixel-level average precision (PxAP) metric is used. The feature extractor of the category context feature learning module adopts VGG16, Inception-V3, or ResNet50 pre-trained on the ImageNet dataset; input images are first scaled to 256 × 256 resolution and randomly cropped to 224 × 224. The total number of training epochs is set to 50, 6, and 10 on CUB, ILSVRC, and OpenImages respectively. The initial learning rate is 0.001 and is reduced by a factor of 10 every 15, 2, and 3 epochs respectively. The neural network classifier of the category context feature learning module is randomly initialized, and its learning rate is set to 10 times the base learning rate. Model parameters are updated with stochastic mini-batch gradient descent with batch size 32. The comparison methods are the classical weakly supervised object localization models CAM, HaS, ACoL, SPG, ADL, and CutMix. The experimental results on CUB, ILSVRC, and OpenImages are shown in Tables 1, 2, and 3 respectively, where HaS, ACoL, SPG, ADL, CutMix, and CREAM (the present invention) are reported relative to CAM (the class map baseline). On all three datasets the method is clearly superior to the others and improves markedly over the class map; on CUB in particular, the network with ResNet as the feature extractor improves over CAM by 10.5. The invention exhibits more accurate localization capability both in maximum box accuracy at box granularity and in pixel-level average precision at pixel granularity.
In summary, under the premise of using only weak, image-level annotation, the present invention provides a novel class reactivation map-based weakly supervised image target localization analysis system for the problem of confusion between the foreground and background activation values of class maps. Through its three modules (class context feature learning, an expectation-maximization algorithm that completes the reactivation of the initial class map, and class reactivation map calibration), it achieves accurate and complete localization of image targets, making it possible to perform image analysis with the large-scale, coarse-grained annotated data available on the Internet.
TABLE 1 (CUB)
[The table is rendered as an image in the source and its values are not recoverable.]
TABLE 2 (ILSVRC)

Method         VGG    Inception   ResNet   Average
Center Gauss   48.9   48.9        48.9     48.9
CAM            60.0   63.7        63.7     62.4
HaS            +0.6   +0.3        -0.3     +0.2
ACoL           -2.6   +0.3        -1.4     -1.2
SPG            -0.1   -0.1        -0.4     -0.2
ADL            -0.2   -2.0        +0.0     -0.7
CutMix         -0.6   +0.5        -0.4     -0.2
CREAM          +6.2   +2.1        +3.7     +5.1
TABLE 3 (OpenImages)

Method         VGG    Inception   ResNet   Average
Center Gauss   54.4   54.4        54.4     54.4
CAM            58.3   63.2        58.5     60.0
HaS            -0.2   -5.1        -2.6     -2.6
ACoL           -4.0   -6.0        -1.2     -3.7
SPG            +0.0   -0.9        -1.8     -0.9
ADL            +0.4   -6.4        -3.3     -3.1
CutMix         -0.2   -0.7        -0.8     -0.6
CREAM          +3.7   +1.4        +6.2     +3.8

Claims (7)

1. A weakly supervised image target localization analysis system based on class reactivation maps, characterized by comprising a category context feature learning module, a class map reactivation module, and a class reactivation map calibration module; the category context feature learning module extracts image features and generates an initial class map that serves as an index for learning class context features; the class map reactivation module receives the image features and the class context features, distinguishes foreground from background by pixel-level clustering, generates a class reactivation map, and inputs it to the class reactivation map calibration module; the class reactivation map calibration module locates coarse foreground and background regions according to the class map and uses them to guide the calibration of the foreground and background activation values of the class reactivation map.
2. The weakly supervised image target localization analysis system of claim 1, wherein the category context feature learning module includes an image feature extraction network and a fully connected neural network classifier; the image feature extraction network uses a VGG16, Inception-V3, or ResNet50 deep convolutional neural network to extract hierarchical features of the image, generating a spatial feature vector f of dimension h × w × 1,024; the feature vector f is fed into the fully connected neural network classifier, which performs a weighted summation of the spatial feature vector f with the fully connected network weights w for class c, obtaining an initial class map $M_c$ of dimension h × w; the process is represented as:

$$M_c(i,j) = \sum_{k} w_k^{c}\, f_k(i,j) \tag{1}$$

where $f_k$ is the k-th component of the spatial feature vector f and $w_k^{c}$ is the k-th component of the weight w corresponding to the c-th class; based on the class map, the classifier's final class prediction for the image is represented as:

$$\hat{y}_c = \sum_{i,j} M_c(i,j) \tag{2}$$

where (i, j) denotes a spatial position; as can be seen from equation (2), solving the weakly supervised image target localization problem using the neural network classifier and the class map can be generalized to answering "which pixels contribute to the class prediction"; the class map is normalized to the [0,1] interval and binarized with a threshold τ; every position in the class map whose value exceeds τ is regarded as the foreground part, and the rest as the background part.
3. The weakly supervised image target localization analysis system of claim 2, wherein generating the initial class map as an index for learning class context features is specifically: for each class c, the foreground and background context feature vectors are denoted $V_c^{fg}$ and $V_c^{bg}$ respectively; the context feature vectors are d-dimensional and serve as the per-class cluster centers that summarize the common foreground and background features of that class; first, the initial class map is binarized:

$$\tilde{M}_c^{fg} = \mathbb{1}\!\left(M_c > \delta\right), \qquad \tilde{M}_c^{bg} = \mathbb{1}\!\left(M_c \le \delta\right) \tag{3}$$

where δ denotes a threshold and $\mathbb{1}(\cdot)$ the indicator function; $\tilde{M}_c^{fg}$ and $\tilde{M}_c^{bg}$ serve as rough estimates of the foreground and background; for each sample with deep features F, the foreground and background features are obtained with the estimates $\tilde{M}_c^{fg}$ and $\tilde{M}_c^{bg}$ respectively, and the context feature vectors are updated with their means; the specific process is:

$$V_c^{fg} \leftarrow \lambda\, V_c^{fg} + (1-\lambda)\, \frac{\sum_{i,j} \tilde{M}_c^{fg}(i,j)\, F_{ij}}{\left\|\tilde{M}_c^{fg}\right\|_0} \tag{4}$$

$$V_c^{bg} \leftarrow \lambda\, V_c^{bg} + (1-\lambda)\, \frac{\sum_{i,j} \tilde{M}_c^{bg}(i,j)\, F_{ij}}{\left\|\tilde{M}_c^{bg}\right\|_0} \tag{5}$$

where $F_{ij}$ is the value of the feature F at spatial position (i, j), and $\|\cdot\|_0$ counts the non-zero entries; the foreground and background context feature vectors are updated with momentum, with momentum parameter λ; momentum updating ensures the context features change slowly, retaining more historical features.
4. The weakly supervised image target localization analysis system of claim 3, wherein the class map reactivation module reactivates the class map to raise the activation values of the foreground part and widen the separation between the foreground and background activation values, making target localization more accurate; the module defines the reactivation problem as a parameter estimation problem based on a Gaussian mixture model and solves it with the expectation-maximization algorithm, which extends maximum likelihood estimation to probabilistic models containing hidden variables; specifically:

for each sample x, the goal is to maximize the likelihood:

$$\theta^{*} = \arg\max_{\theta} \prod_{i,j} p\!\left(x_{ij} \mid \theta\right) \tag{6}$$

where $\theta = \{a^{fg}, a^{bg}, V^{fg}, V^{bg}\}$ are the model parameters (fg and bg denote foreground and background, respectively); each image pixel $x_{ij}$ obeys a probability mixture model composed of a foreground Gaussian distribution and a background Gaussian distribution:

$$p\!\left(x_{ij} \mid \theta\right) = a^{fg}\, p^{fg}\!\left(x_{ij} \mid V^{fg}\right) + a^{bg}\, p^{bg}\!\left(x_{ij} \mid V^{bg}\right) \tag{7}$$

where the mixing weights $a^{fg}, a^{bg}$ are real numbers in [0,1] satisfying $a^{fg} + a^{bg} = 1$; the foreground and background base models $p^{fg}$ and $p^{bg}$ measure the image features against the learned class context feature vectors; cosine similarity is taken as the measure:

$$p^{fg}\!\left(x_{ij} \mid V^{fg}\right) \propto \exp\!\left(\frac{\left\langle F_{ij}, V^{fg}\right\rangle}{\left\|F_{ij}\right\| \left\|V^{fg}\right\| \sigma}\right) \tag{8}$$

$$p^{bg}\!\left(x_{ij} \mid V^{bg}\right) \propto \exp\!\left(\frac{\left\langle F_{ij}, V^{bg}\right\rangle}{\left\|F_{ij}\right\| \left\|V^{bg}\right\| \sigma}\right) \tag{9}$$

where the hyper-parameter σ controls the degree of smoothing.
5. The weakly supervised image target localization analysis system of claim 4, wherein solving the problem using the expectation-maximization algorithm is specifically:

hidden variables $Z^{fg}$ and $Z^{bg}$ are defined to represent the probability that the image pixel at position (i, j) belongs to the foreground and to the background, respectively;

E-step: the expectation-maximization algorithm first empirically assigns an initial distribution to each class, i.e., hidden variable, by assuming distribution parameters $V^{fg}$ and $V^{bg}$, and computes the expected value of the hidden variables of each data point from those distribution parameters;

M-step: the maximum-likelihood values of the distribution parameters are then computed from that assignment, and the hidden-variable expectations of each data point are recomputed accordingly; the two steps cycle until convergence;

in the E-step, the parameters of the current model are used to compute the posterior distribution of the hidden variables, i.e. $Z^{fg(t)}$ and $Z^{bg(t)}$; at each iteration t (1 ≤ t ≤ T), assuming the model parameters are fixed, the hidden variables are computed as:

$$Z_{ij}^{fg(t)} = \frac{a^{fg(t)}\, p^{fg}\!\left(x_{ij} \mid V^{fg(t)}\right)}{a^{fg(t)}\, p^{fg}\!\left(x_{ij} \mid V^{fg(t)}\right) + a^{bg(t)}\, p^{bg}\!\left(x_{ij} \mid V^{bg(t)}\right)} \tag{10}$$

$$Z_{ij}^{bg(t)} = 1 - Z_{ij}^{fg(t)} \tag{11}$$

from the clustering perspective, equation (10) computes the similarity between the pixel-wise features and the foreground and background context feature vectors and assigns each pixel a soft label, i.e., the probability of belonging to the foreground or the background;

in the M-step, the goal is to adjust the context feature vectors to match the current image features, by maximizing the expected likelihood of the image features under the computed hidden-variable values; the parameter update process is represented as:

$$V^{fg(t+1)} = \frac{\sum_{i,j} Z_{ij}^{fg(t)}\, F_{ij}}{\sum_{i,j} Z_{ij}^{fg(t)}} \tag{12}$$

$$V^{bg(t+1)} = \frac{\sum_{i,j} Z_{ij}^{bg(t)}\, F_{ij}}{\sum_{i,j} Z_{ij}^{bg(t)}} \tag{13}$$

$$a^{fg(t+1)} = \frac{1}{hw} \sum_{i,j} Z_{ij}^{fg(t)} \tag{14}$$

$$a^{bg(t+1)} = 1 - a^{fg(t+1)} \tag{15}$$

where $V^{fg}$ and $V^{bg}$ are updated by a weighted average of the features, and $a^{fg}$ and $a^{bg}$ are updated by the number of effective pixels;

the E-step and M-step alternate until convergence.
6. The weakly supervised image target localization analysis system of claim 5, wherein the class reactivation map calibration module calibrates the class reactivation map; because the expectation-maximization algorithm is used in the class map reactivation module, when the resulting hidden variables $Z^{fg(T)}$ and $Z^{bg(T)}$ serve as the initial reactivation maps, the foreground activation values cannot be guaranteed to exceed the background activation values, so calibration is required; that is, $Z^{fg(T)}$ is taken as the class reactivation map if and only if its foreground activation value is greater than the background activation value; otherwise $Z^{bg(T)}$ is selected; specifically, the initial class activation map is used as a guide for calibration; from equation (3), the coarse foreground part $\tilde{M}^{fg}$ and background part $\tilde{M}^{bg}$ can be obtained; the decision is made by estimating the average probabilities $\mu^{fg}$ and $\mu^{bg}$ with which $Z^{fg(T)}$ and $Z^{bg(T)}$ belong to the foreground:

$$\mu^{fg} = \frac{\sum_{i,j} \tilde{M}^{fg}(i,j)\, Z_{ij}^{fg(T)}}{\left\|\tilde{M}^{fg}\right\|_0} \tag{16}$$

$$\mu^{bg} = \frac{\sum_{i,j} \tilde{M}^{fg}(i,j)\, Z_{ij}^{bg(T)}}{\left\|\tilde{M}^{fg}\right\|_0} \tag{17}$$

where the foreground part is given by $\tilde{M}^{fg}$; according to equations (16) and (17), the class reactivation map with the higher average foreground probability is the one that belongs to the foreground; the computation of the calibrated foreground reactivation map $\hat{M}^{fg}$ is expressed as:

$$\hat{M}^{fg} = \begin{cases} Z^{fg(T)}, & \mu^{fg} \ge \mu^{bg} \\ Z^{bg(T)}, & \text{otherwise} \end{cases} \tag{18}$$

to further distinguish the foreground and background activation values, the class reactivation map and the initial class map are fused in the class reactivation map calibration module; the computation is expressed as:

$$M_c^{final} = M_c \odot \hat{M}^{fg} \tag{19}$$

after normalization and thresholding, the resulting final class reactivation map generates a mask or bounding box marking the image target location.
7. The weakly supervised image target localization analysis system of claim 6, wherein the workflow is:
(1) training the deep convolutional neural network model in the category context feature learning module, representing the image with it, and extracting the deep image feature representation; applying the fully connected network classifier to the deep image features by weighted summation to obtain the initial class map; computing the average foreground and background feature vectors from the initial class map and the image features, and updating the class context features with momentum;
(2) taking the deep image features and the class context features as the input of the class map reactivation module; according to the similarity between the deep image features and the class context features, clustering each pixel feature of the image into foreground or background features with the expectation-maximization algorithm, and taking the foreground and background hidden variables as the class reactivation maps;
(3) taking the foreground and background class reactivation maps as the input of the class reactivation map calibration module; according to the coarse foreground and background localization of the initial class map, distinguishing the foreground from the background of the reactivation maps and calibrating the class reactivation map; and at the same time, fusing the initial class map and the class reactivation map to obtain the final image target localization result.
CN202210864306.6A 2022-07-20 2022-07-20 Weak supervision image target positioning analysis system based on class reactivation mapping chart Pending CN115311449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210864306.6A CN115311449A (en) 2022-07-20 2022-07-20 Weak supervision image target positioning analysis system based on class reactivation mapping chart

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210864306.6A CN115311449A (en) 2022-07-20 2022-07-20 Weak supervision image target positioning analysis system based on class reactivation mapping chart

Publications (1)

Publication Number Publication Date
CN115311449A true CN115311449A (en) 2022-11-08

Family

ID=83856260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210864306.6A Pending CN115311449A (en) 2022-07-20 2022-07-20 Weak supervision image target positioning analysis system based on class reactivation mapping chart

Country Status (1)

Country Link
CN (1) CN115311449A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908296A (en) * 2022-11-10 2023-04-04 深圳大学 Medical image class activation mapping evaluation method and device, computer equipment and storage medium
CN115908296B (en) * 2022-11-10 2023-09-22 深圳大学 Medical image class activation mapping evaluation method, device, computer equipment and storage medium
CN116563953A (en) * 2023-07-07 2023-08-08 中国科学技术大学 Bottom-up weak supervision time sequence action detection method, system, equipment and medium
CN116563953B (en) * 2023-07-07 2023-10-20 中国科学技术大学 Bottom-up weak supervision time sequence action detection method, system, equipment and medium


Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination