CN115546466A - Weak supervision image target positioning method based on multi-scale significant feature fusion - Google Patents
Weak supervision image target positioning method based on multi-scale significant feature fusion
- Publication number
- CN115546466A (application number CN202211201019.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- pyramid
- layer
- network
- cam
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/245—Aligning, centring, orientation detection or correction of the image by locating a pattern; Special marks for positioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/765—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
A weak supervision image target positioning method based on multi-scale salient feature fusion, belonging to the field of computer vision. To solve two problems, the laborious ROI annotation of small-target images and insufficient CAM activation, the invention focuses on optimizing the class activation maps output by a classification network under weak supervision. The invention fuses information at two levels: (1) the lowest-level feature map of a convolutional neural network carries weak semantic information but strong position information, so it is fused with the highest-level feature map to obtain the final feature map of the classification network; (2) because the classification network's sensitivity to ROIs of different scales differs, the resulting class activation maps also differ; fusing the complementary object information in these activation maps improves the localization of the target region in the image and generates more accurate pseudo labels for the segmentation task.
Description
Technical Field
The invention relates to a weak supervision image target positioning method based on multi-scale salient feature fusion, and belongs to the field of computer vision.
Background
Localization and segmentation of image regions of interest (ROIs) is a classic problem in computer vision, and ROI localization and segmentation on natural images has made great progress. However, for some non-natural images in specific fields (e.g., medical images, pollen grain images), the ROIs are smaller than in natural images, so localization and segmentation methods designed for natural images are not fully applicable to such images. Small-target localization and segmentation for such domain-specific images is therefore of great significance.
At present, the mainstream deep-learning-based small-target localization and segmentation methods fall into fully supervised learning and weakly supervised learning. Z. Ning et al. [1] use saliency maps of the foreground and background of breast ultrasound images to guide a main network and an auxiliary network to learn foreground and background salient representations respectively, and finally fuse the features of the two networks to enhance the morphology-learning ability of the segmentation network. However, fully supervised deep learning generally requires a large number of labeled data sets, and acquiring pixel-level labels of images is tedious and time-consuming; data sets with only category information are comparatively easy to obtain, so much work realizes target localization and segmentation using only image-level labels, i.e., weakly supervised methods. However, the class activation map (CAM) obtained from the classification network in weakly supervised learning covers only the most discriminative part of the image and cannot indicate the complete target area, i.e., the localization accuracy of the CAM is low (insufficient activation). To address this, Li Y et al. [2] first use prior knowledge of breast anatomy to constrain the classification network's search space for breast lesion tissue, and then modify the CAM with a level-set algorithm. But this ignores an important piece of information: for targets of different scales, the discriminative regions captured by the classification network are not consistent.
To solve the two problems of laborious ROI annotation of small-target images and insufficient CAM activation, the invention focuses on optimizing the class activation maps output by a classification network under weak supervision. The invention fuses information at two levels: (1) the lowest-level feature map of a convolutional neural network carries weak semantic information but strong position information, so it is fused with the highest-level feature map to obtain the final feature map of the classification network; (2) because the classification network's sensitivity to ROIs of different scales differs, the resulting class activation maps also differ; fusing the complementary object information in these activation maps improves the localization of the target region and generates more accurate pseudo labels for the segmentation task.
Reference documents:
[1] Z. Ning, S. Zhong, Q. Feng, W. Chen and Y. Zhang, "SMU-Net: Saliency-Guided Morphology-Aware U-Net for Breast Lesion Segmentation in Ultrasound Image," IEEE Transactions on Medical Imaging, vol. 41, no. 2, pp. 476-490, Feb. 2022, doi: 10.1109/TMI.2021.3116087.
[2] Li Y, Liu Y, Huang L, Wang Z, Luo J. Deep weakly-supervised breast tumor segmentation in ultrasound images with explicit anatomical constraints. Med Image Anal. 2022 Feb;76:102315. doi: 10.1016/j.media.2021.102315. Epub 2021 Nov 28. PMID: 34902792.
Disclosure of the Invention
Aiming at the problems that existing small-target image localization and segmentation based on fully supervised learning requires laborious annotation, and that single-scale weakly supervised approaches suffer from insufficient CAM activation, the invention designs a weakly supervised image target positioning method based on multi-scale salient feature fusion. Specifically, an image pyramid is constructed to obtain three images of different scales, from which multi-scale CAMs of the same image are obtained and fused; the fused CAM is then used as weak supervision information to train a segmentation network.
The weakly supervised image target positioning method based on multi-scale salient feature fusion comprises five stages. The first stage is image preprocessing, which mainly unifies the resolution of the images in the data set. The second stage is the construction of the image pyramid: the input image is taken as the source image, downsampling constructs the pyramid top layer, upsampling constructs the pyramid bottom layer, and the final number of pyramid layers is determined. The third stage is the acquisition of the classifier feature maps: a classifier is trained for the images of each pyramid layer, and for each classifier the highest-level feature map is spliced with the lowest-level feature map to obtain a fused feature map. The fourth stage is the fusion of the multi-scale CAMs: the multi-scale CAMs of the same image are obtained from the weighted sum of the feature maps of each layer, then all CAMs are aligned and finally fused into the final CAM of the source image. The fifth stage is the prediction of the target area: the fused CAM is converted into a pseudo binary label, the pseudo label is used to train a segmentation network, and the target area is finally predicted by the segmentation network.
The specific scheme of the invention is shown in figure 2.
Step 1: image pre-processing
The purpose of image preprocessing is to unify the size of all images in the data set. The data targeted by the invention are mainly small-target image data, such as public breast image data sets and pollen image data sets. If the image resolutions in the data set are not uniform, the sizes of the feature maps obtained by the subsequent classification network also differ, and the parameters of the fully connected layer in the classification network cannot adapt to feature maps of different sizes, so all input images must be fixed to a uniform size.
Step 2: image pyramid construction
In this step, the images in the data set are taken as source images, and three scale transformations of each input image are obtained by constructing a Gaussian pyramid. To obtain information that is both more global and finer-grained than the original image, the Gaussian pyramid constructed by the invention mixes downsampling and upsampling.
Step 2.1, image pyramid top layer construction: taking the input image as the source image, first apply Gaussian smoothing with a 5 × 5 Gaussian kernel, then downsample the smoothed image by removing the even rows and columns of the image matrix; the result is an image 1/4 the size of the input image, which serves as the image pyramid top layer.
Step 2.2, image pyramid bottom layer construction: taking the input image as the source image, first expand the image to twice its size in each direction, filling the newly added rows and columns with zeros; then multiply the 5 × 5 Gaussian kernel by 4 and convolve it with the enlarged image to obtain approximate values for the newly added pixels. The result is an image 4 times the size of the input image, which serves as the image pyramid bottom layer.
Step 2.3, determining the number of image pyramid layers: the images of the different pyramid layers are numbered starting from 0, with image resolution decreasing as the layer number increases. The image pyramid constructed by the invention has 3 layers; the original image is located in the middle layer, with layer number 1.
And step 3: classifier feature map acquisition
In the step, a classifier is respectively trained for three images with different scales in an image pyramid so as to obtain class activation maps with three different scales of the same image.
Step 3.1, training a classification network: the invention selects classical ResNet50 as a classification network for judging the category of the input image. Since there are three images of different scales in the image pyramid, it is finally necessary to train one classifier for each of the three image datasets of different scales.
Step 3.2, fusing the high-low layer characteristic diagrams: for each classification network, the superficial receptive field is small, and low-level geometric information such as texture, edge and the like is extracted; and the high-level receptive field is large, and more global and deeper semantic information is extracted. Therefore, the invention aligns and splices the highest layer features and the lowest layer features in each classification network, and prompts the network to enhance the low-level features of the small target object so as to obtain the final fusion feature map of the network.
And 4, step 4: multi-scale CAM fusion
The step obtains the CAMs of the three classification networks, and the CAMs are aligned and then fused to finally obtain a fused CAM image corresponding to the image.
Step 4.1, obtaining by the classification network CAM: and (3) multiplying the final fusion characteristic graph obtained in the step (3.2) by a weight matrix of a full connection layer in the classification network to obtain the CAM. Because the invention uses three classification networks, three CAMs with different scales are finally obtained for each source image to form a CAM pyramid.
Step 4.2 multiple CAM alignment: and aligning the CAMs with different scales based on the size of the source image so as to facilitate the subsequent fusion operation.
Step 4.3 multiple CAM fusions: for any pixel in the fused CAM, the invention adopts the following judgment mechanism: if the activation value of at least two independent CAMs at the point relative to a certain category is larger than or equal to the threshold value, the pixel point is considered to belong to the category. If the pixel point is not allocated to any category after passing through the judgment mechanism, ignoring the pixel point; and if the pixel point is allocated to a plurality of categories, allocating the pixel point to the category corresponding to the maximum average activation value of the three independent CAMs at the point.
Step 5: ROI prediction
In this step, the fused CAM obtained in step 4.3 is converted into pseudo labels, a localization/segmentation network for the image ROI is trained on the pseudo labels, and the ROI is finally predicted with that network.
Step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask used for training the segmentation network. The invention adopts the following rule: if a pixel in the fused CAM belongs to the non-target class, its value is set to 0, otherwise it is set to 1.
Step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the segmentation architecture selected by the method is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
Compared with the prior art, the invention has the beneficial effects that:
1. The weakly supervised image target positioning method based on multi-scale salient feature fusion avoids the pixel-level ROI annotation required under fully supervised learning, greatly reducing the data annotation workload.
2. By splicing the highest-level and lowest-level feature maps obtained by each classification network, the method strengthens the network's learning of the low-level features of small target objects, so that the network attends to more features of the small target.
3. By constructing a pyramid that mixes downsampling and upsampling of the source image, the method simultaneously obtains features that are more global and finer-grained than the original image, and fuses them to obtain a more complete CAM.
Drawings
FIG. 1 is a schematic diagram of an image pyramid constructed in the present invention.
FIG. 2 is an overall flow chart of the proposed method of the present invention.
Detailed Description
The following detailed description of an embodiment of the invention is made with reference to FIG. 2:
The weakly supervised image target positioning method based on multi-scale salient feature fusion comprises five stages. The first stage is image preprocessing, mainly unifying the resolution of the data set. The second stage is the construction of the image pyramid: the input image is taken as the source image, downsampling constructs the pyramid top layer, upsampling constructs the pyramid bottom layer, and the final number of pyramid layers is determined. The third stage is the acquisition of the classifier feature maps: a classifier is trained for the images of each pyramid layer, and for each classifier the highest-level feature map is spliced with the lowest-level feature map to obtain a fused feature map. The fourth stage is the fusion of the multi-scale CAMs: the multi-scale CAMs of the same image are obtained from the weighted sum of the feature maps of each layer, all CAMs are aligned, and they are finally fused into the final CAM of the source image. The fifth stage is the prediction of the ROI: the fused CAM is first converted into a pseudo binary label, the pseudo label is used to train a segmentation network, and the ROI is finally predicted by the segmentation network.
Specifically, the method comprises the following steps:
step 1: image pre-processing
The purpose of image preprocessing is to unify the size of all images in the data set. The data targeted by the invention are mainly small-target image data, such as public breast image data sets and pollen image data sets. If the image resolutions in the data set are not uniform, the feature maps produced by the last convolution layer of the subsequent classification network differ in size, while the parameter dimensions connecting the fully connected layer to the previous layer are fixed in advance; that is, the fully connected layer cannot adapt to different feature map sizes, so all input images must be fixed to one size. To minimize the loss of image information and facilitate the convolution operations in the subsequent classification networks, all images are set to 512 × 512.
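A minimal sketch of this preprocessing step is given below (an illustration, not part of the patent text); only the 512 × 512 target size comes from the description, and the interpolation method is an assumption.

```python
import cv2

def preprocess(image):
    """Resize an image (H x W [x C] numpy array) to the common 512 x 512 size."""
    # INTER_LINEAR is an assumed choice; the patent only fixes the output size.
    return cv2.resize(image, (512, 512), interpolation=cv2.INTER_LINEAR)
```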
Step 2: image pyramid construction
This step obtains three scale transformations of the input image by constructing a Gaussian pyramid. To obtain information that is both more global and finer-grained than the original image, the constructed Gaussian pyramid mixes downsampling and upsampling. Specifically, the construction comprises two parts: first, the width and height of the input original image are downsampled to 50% of the original through the Gaussian pyramid, yielding a 256 × 256 image as the pyramid top layer; second, the width and height of the input original image are upsampled to 200% of the original, yielding a 1024 × 1024 image as the pyramid bottom layer.
Step 2.1, image pyramid top layer construction: for a given 512 × 512 original image, downsampling constructs the top layer of the Gaussian pyramid from an image 1/4 the size of the original, i.e. with resolution 256 × 256. The process is given by formula (1): first, Gaussian smoothing is applied once to the 512 × 512 original; unlike simple averaging, Gaussian smoothing assigns higher weights to pixels closer to the center point when computing the weighted average of surrounding pixels. The smoothed image is then downsampled by removing the even rows and columns of the image matrix, giving a 256 × 256 image.
G_l(x, y) = Σ_{m=-2..2} Σ_{n=-2..2} W(m, n) · G_{l-1}(2x + m, 2y + n),  1 ≤ l ≤ L, 0 ≤ x ≤ R_l, 0 ≤ y ≤ C_l    (1)
where G_l is the image of the l-th layer of the Gaussian pyramid (layer numbering starts from 0), L is the layer number of the pyramid top, R_l and C_l are the numbers of rows and columns of the l-th layer image, and W(m, n) is the value in the m-th row and n-th column of the Gaussian filter template, whose size is generally 5 × 5. The two-dimensional, separable 5 × 5 Gaussian kernel widely used in the unsharp masking algorithm is selected to smooth the original image; its values are given in (2).
W = (1/256) · [1 4 6 4 1; 4 16 24 16 4; 6 24 36 24 6; 4 16 24 16 4; 1 4 6 4 1]    (2)
Step 2.2, image pyramid bottom layer construction: for a given 512 × 512 original image, upsampling constructs the lowest layer of the Gaussian pyramid from an image 4 times the size of the original, i.e. with resolution 1024 × 1024. The process is: first, the image is expanded to twice its size in each direction, with the newly added rows and columns filled with zeros; then the Gaussian kernel used for downsampling is multiplied by 4 and convolved with the enlarged image to obtain approximate values for the newly added pixels, finally giving a 1024 × 1024 image.
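The two constructions above can be sketched, for illustration only, with OpenCV's pyrDown/pyrUp, which apply a 5 × 5 Gaussian kernel together with even row/column removal (downsampling) or zero insertion followed by convolution with the 4× kernel (upsampling); mapping them directly onto steps 2.1 and 2.2 is our assumption.

```python
import cv2

def build_pyramid(image_512):
    """Return {layer_number: image} for the 3-layer mixed pyramid of step 2.

    image_512 is assumed to be the 512 x 512 source image from step 1.
    Layer 0 = 1024 x 1024 (bottom), layer 1 = 512 x 512 (source image),
    layer 2 = 256 x 256 (top), matching the numbering of step 2.3.
    """
    top = cv2.pyrDown(image_512)     # Gaussian smoothing + even row/column removal
    bottom = cv2.pyrUp(image_512)    # zero insertion + convolution with the 4x kernel
    return {0: bottom, 1: image_512, 2: top}
```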
Step 2.3, determining the number of image pyramid layers: after the image pyramid is constructed, the layer number l in the Gaussian pyramid corresponding to an image of resolution w × h is determined by formula (3).
l = l_0 + log₂(512 / w)    (3)
where l_0 is the layer number of the 512 × 512 original image in the image pyramid; since the three scales of images in the Gaussian pyramid are 1024, 512 and 256, the layer number corresponding to the original image is l_0 = 1. From formula (3), a 1024 × 1024 image corresponds to layer 0, i.e. the lowest layer of the Gaussian pyramid, and a 256 × 256 image corresponds to layer 2, i.e. the topmost layer of the Gaussian pyramid.
And step 3: classifier feature map acquisition
In the step, a classifier is respectively trained for three images with different scales in an image pyramid so as to obtain class activation maps with three different scales of the same image.
Step 3.1, training a classification network: the classification network is used for judging the class of the input image, the classification network selected by the invention is ResNet50, the ResNet50 network comprises 49 convolution layers and 1 full-connection layer, each residual block has three layers of convolution, and the residual structure of the network can directly connect the input to the subsequent network layer so as to avoid information loss. The invention respectively trains a classifier for images with three resolutions of 256 × 256, 512 × 512 and 1024 × 1024 in an image pyramid, and the classifier is marked as R 1 、R 2 、R 3 。
Step 3.2, fusion of high-level and low-level feature maps: in each classification network, the shallow feature maps have small receptive fields and extract local, general features such as image texture and edges, i.e. low-level geometric information; as the network deepens, the receptive fields of the high-level feature maps grow and deeper, more global features are extracted, i.e. high-level semantic information. The invention therefore splices the highest-level and lowest-level features of each classification network into the final feature map output by the network, so that the network enhances the low-level features of small target objects. For classifier R_1, let the highest-level feature map obtained by its network be F_h^1 and the lowest-level feature map be F_l^1; the final feature map f_1 of classifier R_1 is then obtained from equation (4).
f_1 = UP(F_h^1) ⊕ F_l^1    (4)
where UP is the upsampling operation, i.e. the highest-level feature map is upsampled to the same size as the lowest-level feature map to facilitate subsequent processing, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition. In the same way, the final feature maps f_2 and f_3 of classifiers R_2 and R_3 are obtained from equations (5) and (6).
f_2 = UP(F_h^2) ⊕ F_l^2    (5)
f_3 = UP(F_h^3) ⊕ F_l^3    (6)
where F_h^2 and F_l^2 are the highest-level and lowest-level feature maps obtained by classifier R_2, F_h^3 and F_l^3 are those obtained by classifier R_3, UP is the upsampling operation, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition.
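As an illustration of step 3.2 (one possible reading, not the patent's exact network), the sketch below wires a ResNet50 so that the highest-level feature map (layer4) is upsampled and added element by element to the lowest-level map (layer1); the 1 × 1 convolution used to match channel counts is an added assumption, since the patent only specifies a lateral, element-wise connection.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FusionClassifier(nn.Module):
    """ResNet50 classifier returning logits and the fused feature map f_k."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet50()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.lateral = nn.Conv2d(256, 2048, kernel_size=1)   # assumed channel matching
        self.fc = nn.Linear(2048, num_classes)                # supplies the CAM weights

    def forward(self, x):
        low = self.layer1(self.stem(x))                       # lowest-level feature map
        high = self.layer4(self.layer3(self.layer2(low)))     # highest-level feature map
        high_up = F.interpolate(high, size=low.shape[-2:],
                                mode="bilinear", align_corners=False)
        fused = high_up + self.lateral(low)                   # element-by-element addition
        logits = self.fc(F.adaptive_avg_pool2d(fused, 1).flatten(1))
        return logits, fused                                  # fused map feeds step 4.1

# One such classifier would be trained per pyramid scale (R_1, R_2, R_3).
```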
Step 4: Multi-scale CAM fusion
This step acquires the CAMs of the three classification networks, aligns them and fuses them, finally outputting the fused CAM corresponding to the image.
Step 4.1, CAM acquisition from the classification networks: for classifier R_1, the activation value M_1^c(x, y) of a spatial pixel u(x, y) of the 256 × 256 image with respect to class c is obtained from equation (7).
M_1^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^1(x, y)    (7)
where i is the channel index of the last convolution layer of the classification network, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^1(x, y) is the value at position (x, y) of channel i of the final fused feature map of classifier R_1. Likewise, the activation values M_2^c(x, y) and M_3^c(x, y) of a pixel u(x, y) with respect to class c for classifiers R_2 and R_3 are obtained from equations (8) and (9), respectively.
M_2^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^2(x, y)    (8)
M_3^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^3(x, y)    (9)
where i is the channel index of the last convolution layer, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^2(x, y) and f_i^3(x, y) are the values at position (x, y) of channel i of the final fused feature maps of classifiers R_2 and R_3.
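For illustration, equations (7)-(9) amount to a channel-wise weighted sum of the fused feature map, which can be sketched as below; `model` is assumed to be the FusionClassifier sketched above, and the ReLU plus min-max normalization are common practice rather than something stated in the patent.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def compute_cam(model, image, target_class):
    """M_c(x, y) = sum_i w_i^c * f_i(x, y) for a single (C, H, W) image tensor."""
    _, fused = model(image.unsqueeze(0))          # fused feature map: (1, K, H, W)
    weights = model.fc.weight[target_class]       # (K,) weights w_i^c for class c
    cam = torch.einsum("k,khw->hw", weights, fused.squeeze(0))
    cam = F.relu(cam)                             # keep positive evidence (assumption)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```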
Step 4.2, multi-CAM alignment: since the inputs of classifiers R_1, R_2 and R_3 are the three layers of the image pyramid, the sizes of the three obtained class activation maps also form an activation-map pyramid. To fuse the three CAMs of different scales they must be aligned; the invention resizes all CAMs to the size of the original input image, i.e. 512 × 512.
Step 4.3, multi-CAM fusion: the three aligned CAMs are fused into the final CAM. For a pixel u(x, y) of the fused class activation map M_agg, the fusion mechanism of the invention is as follows: if at least two of the independent activation maps have an activation value for class c at this point that is greater than or equal to the threshold θ (θ ∈ [0.5, 0.7]), the pixel in M_agg is considered to belong to class c. If the pixel is not assigned to any class by this mechanism, it is ignored; if it is assigned to several classes, the class cla(x, y) to which it finally belongs is determined by formula (10).
cla(x, y) = index( max( [ (1/P) Σ_{j=1}^{P} M_j^0(x, y), (1/P) Σ_{j=1}^{P} M_j^1(x, y), …, (1/P) Σ_{j=1}^{P} M_j^N(x, y) ] ) )    (10)
where j is the pyramid layer index, P is the total number of pyramid layers (here P = 3), N is the number of classes into which the data set is divided (excluding the background class), and M_j^c(x, y) is the activation value of pixel u(x, y) with respect to class c in the feature map obtained from the j-th pyramid layer. (1/P) Σ_j M_j^0(x, y) is the average activation value of pixel u(x, y) over the P feature maps for the background class (class number 0), (1/P) Σ_j M_j^c(x, y) is its average activation value for class c, and (1/P) Σ_j M_j^N(x, y) is its average activation value for class N. index is the index-taking operation, i.e. the index of the maximum value in the array, which is also the class to which the pixel belongs; for example, if the 0-th average activation value in the array is the largest, the returned index is 0, meaning the pixel belongs to class 0.
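One possible reading of steps 4.2-4.3 is sketched below: the per-class CAMs of each scale are resized to 512 × 512, a pixel is assigned class c only if at least two of the three CAMs reach the threshold θ for c, and ties between several supported classes are broken by the largest average activation, approximating formula (10). The array layout and the fallback of unsupported pixels to background are our assumptions.

```python
import numpy as np
import cv2

def fuse_cams(cams, theta=0.5, size=(512, 512)):
    """cams: list of P arrays, each (N+1, h, w) with per-class activations in
    [0, 1] (class 0 = background). Returns a (512, 512) integer label map."""
    aligned = np.stack([
        np.stack([cv2.resize(c.astype(np.float32), size) for c in level])
        for level in cams
    ])                                              # (P, N+1, 512, 512)
    votes = (aligned >= theta).sum(axis=0)          # per-class votes per pixel
    mean_act = aligned.mean(axis=0)                 # average activation per class
    supported = votes >= 2                          # classes backed by >= 2 CAMs
    masked = np.where(supported, mean_act, -np.inf) # drop unsupported classes
    label = masked.argmax(axis=0)                   # largest mean activation wins
    label[~supported.any(axis=0)] = 0               # no supported class -> background
    return label
```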
Step 5: ROI prediction
In this step, the fused CAM obtained in step 4.3 is converted into pseudo labels, a segmentation network is trained on the pseudo labels, and the ROI is finally predicted with that network.
Step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask M̂ used for training the segmentation network. The value of pixel u(x, y) in the pseudo binary mask M̂ is determined by formula (11).
M̂(x, y) = 0 if cla(x, y) = 0, and M̂(x, y) = 1 otherwise    (11)
where cla(x, y) is the class to which pixel u(x, y) belongs, and cla(x, y) = 0 indicates that the pixel belongs to the non-target class.
Step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the segmentation architecture selected by the method is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
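A minimal sketch of steps 5.1-5.2 follows (assumptions, not the patent's exact training recipe): the fused label map is binarized via formula (11) and used to supervise a U-Net; `unet` stands for any U-Net implementation returning one logit map, and the loss choice is illustrative.

```python
import torch
import torch.nn.functional as F

def to_pseudo_mask(label_map):
    """Formula (11): 0 for the non-target (background) class, 1 otherwise."""
    return (torch.as_tensor(label_map) != 0).float()

def train_step(unet, optimizer, image, label_map):
    mask = to_pseudo_mask(label_map)[None, None]     # (1, 1, H, W) pseudo binary label
    pred = unet(image.unsqueeze(0))                  # (1, 1, H, W) predicted logits
    loss = F.binary_cross_entropy_with_logits(pred, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```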
The method is mainly intended for small-target image data, such as medical images with small focal regions and pollen image data. By fusing the multi-scale salient features obtained from the image pyramid, it strengthens the position and contour information of the small target object and thereby improves the performance of small-target localization and segmentation under weak supervision. The steps described in the specific embodiments may be modified without departing from the basic spirit of the invention. The present embodiments are therefore to be considered illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description.
Claims (2)
1. A weakly supervised image target positioning method based on multi-scale salient feature fusion, characterized by comprising the following steps:
step 1: image pre-processing
The purpose of image pre-processing is to unify the size of all images within a data set;
step 2: image pyramid construction
Taking the images in the data set as source images, three scale transformations of the input image are obtained by constructing a Gaussian pyramid; to obtain information more global and finer-grained than the original image, the constructed Gaussian pyramid adopts a structure mixing downsampling and upsampling;
step 2.1, image pyramid top layer construction: taking the input image as the source image, first apply Gaussian smoothing with a 5 × 5 Gaussian kernel, then downsample the smoothed image by removing the even rows and columns of the image matrix, finally obtaining an image 1/4 the size of the input image, which is taken as the image pyramid top layer;
step 2.2, image pyramid bottom layer construction: taking the input image as the source image, first expand the image to twice its size in each direction, filling the newly added rows and columns with zeros; then multiply the 5 × 5 Gaussian kernel by 4 and convolve it with the enlarged image to obtain approximate values for the newly added pixels; finally an image 4 times the size of the input image is obtained and taken as the image pyramid bottom layer;
step 2.3, determining the number of image pyramid layers: the images of the different pyramid layers are numbered starting from 0, with image resolution decreasing as the layer number increases; the constructed image pyramid has 3 layers, and the original image is located in the middle layer, with layer number 1;
and step 3: classifier feature map acquisition
Respectively training a classifier aiming at three images with different scales in an image pyramid to obtain class activation graphs of the same image with three different scales;
step 3.1, training a classification network: selecting classical ResNet50 as a classification network for judging the class of the input image; because three images with different scales exist in the image pyramid, a classifier is required to be trained for three image data sets with different scales respectively;
step 3.2, fusing the high-low layer characteristic diagrams:
aligning and splicing the highest layer features and the lowest layer features in each classification network to promote the network to enhance the low-level features of the small target object so as to obtain a final fusion feature map of the network;
Step 4: Multi-scale CAM fusion
The CAMs of the three classification networks are acquired, aligned and then fused, finally obtaining the fused CAM corresponding to the image;
step 4.1, CAM acquisition from the classification networks: the CAM is obtained by multiplying the final fused feature map obtained in step 3.2 by the weight matrix of the fully connected layer of the classification network; since three classification networks are used, three CAMs of different scales are finally obtained for each source image, forming a CAM pyramid;
step 4.2, multi-CAM alignment: the CAMs of different scales are aligned based on the size of the source image to facilitate the subsequent fusion;
step 4.3, multi-CAM fusion: for any pixel in the fused CAM, the following decision mechanism is adopted: if the activation values of at least two of the independent CAMs at that point for a certain category are greater than or equal to the threshold, the pixel is considered to belong to that category; if the pixel is not assigned to any category by this mechanism, it is ignored; if the pixel is assigned to several categories, it is assigned to the category with the largest average activation value of the three independent CAMs at that point;
Step 5: ROI prediction
first the fused CAM obtained in step 4.3 is converted into pseudo labels, then a localization/segmentation network for the image ROI is trained on the pseudo labels, and finally the ROI is predicted with that network;
step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask for training the segmentation network; the following rule is adopted: if a pixel in the fused CAM belongs to the non-target class, its value is set to 0, otherwise it is set to 1;
step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the selected segmentation architecture is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
2. The weakly supervised image target positioning method based on multi-scale salient feature fusion according to claim 1, characterized in that:
step 1: image pre-processing
The purpose of image pre-processing is to unify the size of all images in the dataset; all images were sized 512 x 512;
step 2: image pyramid construction
The construction comprises two parts: first, the width and height of the input original image are downsampled to 50% of the original through the Gaussian pyramid, yielding a 256 × 256 image as the pyramid top layer; second, the width and height of the input original image are upsampled to 200% of the original, yielding a 1024 × 1024 image as the pyramid bottom layer; the specific steps are as follows:
step 2.1, image pyramid top layer construction:
for a given 512 × 512 original image, downsampling constructs the top layer of the Gaussian pyramid from an image 1/4 the size of the original, with corresponding resolution 256 × 256; the process is given by formula (1): first, Gaussian smoothing is applied once to the 512 × 512 original; unlike simple averaging, Gaussian smoothing assigns higher weights to pixels closer to the center point when computing the weighted average of surrounding pixels; the smoothed image is then downsampled by removing the even rows and columns of the image matrix, giving a 256 × 256 image;
G_l(x, y) = Σ_{m=-2..2} Σ_{n=-2..2} W(m, n) · G_{l-1}(2x + m, 2y + n),  1 ≤ l ≤ L, 0 ≤ x ≤ R_l, 0 ≤ y ≤ C_l    (1)
where G_l is the image of the l-th layer of the Gaussian pyramid (layer numbering starts from 0), L is the layer number of the pyramid top, R_l and C_l are the numbers of rows and columns of the l-th layer image, and W(m, n) is the value in the m-th row and n-th column of the Gaussian filter template, whose size is generally 5 × 5; the two-dimensional, separable 5 × 5 Gaussian kernel widely used in the unsharp masking algorithm is selected to smooth the original image, with values as shown in (2);
W = (1/256) · [1 4 6 4 1; 4 16 24 16 4; 6 24 36 24 6; 4 16 24 16 4; 1 4 6 4 1]    (2)
step 2.2, image pyramid bottom layer construction:
for a given 512 × 512 original image, upsampling constructs the lowest layer of the Gaussian pyramid from an image 4 times the size of the original, with corresponding resolution 1024 × 1024; the process is: first, the image is expanded to twice its size in each direction, with the newly added rows and columns filled with zeros; then the Gaussian kernel used for downsampling is multiplied by 4 and convolved with the enlarged image to obtain approximate values for the newly added pixels, finally giving a 1024 × 1024 image;
step 2.3, determining the pyramid layer number of the image:
after the image pyramid is constructed, the layer number l in the Gaussian pyramid corresponding to an image of resolution w × h is determined by formula (3);
l = l_0 + log₂(512 / w)    (3)
where l_0 is the layer number of the 512 × 512 original image in the image pyramid; since the three scales of images in the Gaussian pyramid are 1024, 512 and 256, the layer number corresponding to the original image is l_0 = 1; from formula (3), a 1024 × 1024 image corresponds to layer 0, i.e. the lowest layer of the Gaussian pyramid, and a 256 × 256 image corresponds to layer 2, i.e. the topmost layer of the Gaussian pyramid;
and step 3: classifier feature map acquisition
Respectively training a classifier aiming at three images with different scales in an image pyramid to obtain class activation graphs of the same image with three different scales, wherein the steps are as follows;
step 3.1, training a classification network: the classification network is used for judging the class of the input image, the selected classification network is ResNet50, the ResNet50 network comprises 49 convolution layers and 1 full-connection layer, and each residual block has three layers of convolution; respectively training a classifier for images with three resolutions of 256 × 256, 512 × 512 and 1024 × 1024 in the image pyramid, and marking as R 1 、R 2 、R 3 ;
step 3.2, fusion of high-level and low-level feature maps:
the highest-level and lowest-level features of each classification network are spliced into the final feature map output by the network, so that the network enhances the low-level features of small target objects; for classifier R_1, let the highest-level feature map obtained by its network be F_h^1 and the lowest-level feature map be F_l^1; the final feature map f_1 of classifier R_1 is then obtained from formula (4);
f_1 = UP(F_h^1) ⊕ F_l^1    (4)
where UP is the upsampling operation, i.e. the highest-level feature map is upsampled to the same size as the lowest-level feature map to facilitate subsequent processing, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition; in the same way, the final feature maps f_2 and f_3 of classifiers R_2 and R_3 are obtained from formulas (5) and (6);
f_2 = UP(F_h^2) ⊕ F_l^2    (5)
f_3 = UP(F_h^3) ⊕ F_l^3    (6)
where F_h^2 and F_l^2 are the highest-level and lowest-level feature maps obtained by classifier R_2, F_h^3 and F_l^3 are those obtained by classifier R_3, UP is the upsampling operation, and ⊕ denotes the lateral connection of the feature maps, i.e. element-by-element addition;
Step 4: Multi-scale CAM fusion
The CAMs of the three classification networks are acquired, aligned and fused, the final output being the fused CAM corresponding to the image, specifically as follows:
step 4.1, CAM acquisition from the classification networks: for classifier R_1, the activation value M_1^c(x, y) of a spatial pixel u(x, y) of the 256 × 256 image with respect to class c is obtained from formula (7);
M_1^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^1(x, y)    (7)
where i is the channel index of the last convolution layer of the classification network, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^1(x, y) is the value at position (x, y) of channel i of the final fused feature map of classifier R_1; likewise, the activation values M_2^c(x, y) and M_3^c(x, y) of a pixel u(x, y) with respect to class c for classifiers R_2 and R_3 are obtained from formulas (8) and (9), respectively;
M_2^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^2(x, y)    (8)
M_3^c(x, y) = Σ_{i=1}^{K} w_i^c · f_i^3(x, y)    (9)
where i is the channel index of the last convolution layer, K is the number of channels of that layer, w_i^c is the weight corresponding to class c for channel i, and f_i^2(x, y) and f_i^3(x, y) are the values at position (x, y) of channel i of the final fused feature maps of classifiers R_2 and R_3;
step 4.2, multi-CAM alignment: since the inputs of classifiers R_1, R_2 and R_3 are the three layers of the image pyramid, the sizes of the three obtained class activation maps also form an activation-map pyramid; to fuse the three CAMs of different scales they must be aligned, and all CAMs are set to the size of the original input image, i.e. 512 × 512;
step 4.3, multi-CAM fusion: the three aligned CAMs are fused into the final CAM; for a pixel u(x, y) of the fused class activation map M_agg, the fusion mechanism is as follows: if at least two of the independent activation maps have an activation value for class c at this point that is greater than or equal to the threshold θ, θ ∈ [0.5, 0.7], the pixel in M_agg is considered to belong to class c; if the pixel is not assigned to any class by this mechanism, it is ignored; if it is assigned to several classes, the class cla(x, y) to which it finally belongs is determined by formula (10);
cla(x, y) = index( max( [ (1/P) Σ_{j=1}^{P} M_j^0(x, y), (1/P) Σ_{j=1}^{P} M_j^1(x, y), …, (1/P) Σ_{j=1}^{P} M_j^N(x, y) ] ) )    (10)
where j is the pyramid layer index, P is the total number of pyramid layers (here P = 3), N is the number of classes into which the data set is divided (excluding the background class), and M_j^c(x, y) is the activation value of pixel u(x, y) with respect to class c in the feature map obtained from the j-th pyramid layer; (1/P) Σ_j M_j^0(x, y) is the average activation value of pixel u(x, y) over the P feature maps for the background class (numbered 0), (1/P) Σ_j M_j^c(x, y) is its average activation value for class c, and (1/P) Σ_j M_j^N(x, y) is its average activation value for class N; index is the index-taking operation, i.e. the index of the maximum value in the array, which is also the class to which the pixel belongs; for example, if the 0-th average activation value in the array is the largest, the returned index is 0, meaning the pixel belongs to class 0;
Step 5: ROI prediction
first the fused CAM obtained in step 4.3 is converted into pseudo labels, a segmentation network is trained on the pseudo labels, and finally the ROI is predicted with that network; the specific steps are as follows:
step 5.1, conversion of the fused CAM into pseudo labels: the fused CAM is converted into a pseudo binary mask M̂ for training the segmentation network; the value of pixel u(x, y) in the pseudo binary mask M̂ is determined by formula (11);
M̂(x, y) = 0 if cla(x, y) = 0, and M̂(x, y) = 1 otherwise    (11)
where cla(x, y) is the class to which pixel u(x, y) belongs, and cla(x, y) = 0 indicates that the pixel belongs to the non-target class;
step 5.2, training and prediction of the segmentation network: an image segmentation network is trained on the pseudo binary labels obtained in step 5.1; the selected segmentation architecture is U-Net, and finally the trained network performs ROI segmentation prediction on the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211201019.3A CN115546466A (en) | 2022-09-28 | 2022-09-28 | Weak supervision image target positioning method based on multi-scale significant feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211201019.3A CN115546466A (en) | 2022-09-28 | 2022-09-28 | Weak supervision image target positioning method based on multi-scale significant feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115546466A true CN115546466A (en) | 2022-12-30 |
Family
ID=84731704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211201019.3A Pending CN115546466A (en) | 2022-09-28 | 2022-09-28 | Weak supervision image target positioning method based on multi-scale significant feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115546466A (en) |
-
2022
- 2022-09-28 CN CN202211201019.3A patent/CN115546466A/en active Pending
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116665095A (en) * | 2023-05-18 | 2023-08-29 | 中国科学院空间应用工程与技术中心 | Method and system for detecting motion ship, storage medium and electronic equipment |
CN116665095B (en) * | 2023-05-18 | 2023-12-22 | 中国科学院空间应用工程与技术中心 | Method and system for detecting motion ship, storage medium and electronic equipment |
CN117079103A (en) * | 2023-10-16 | 2023-11-17 | 暨南大学 | Pseudo tag generation method and system for neural network training |
CN117079103B (en) * | 2023-10-16 | 2024-01-02 | 暨南大学 | Pseudo tag generation method and system for neural network training |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111784671B (en) | Pathological image focus region detection method based on multi-scale deep learning | |
CN114120102A (en) | Boundary-optimized remote sensing image semantic segmentation method, device, equipment and medium | |
CN112308860A (en) | Earth observation image semantic segmentation method based on self-supervision learning | |
US8351676B2 (en) | Digital image analysis using multi-step analysis | |
CN109685801B (en) | Skin mirror image processing method combining texture features and deep neural network information | |
CN115546466A (en) | Weak supervision image target positioning method based on multi-scale significant feature fusion | |
CN112036231B (en) | Vehicle-mounted video-based lane line and pavement indication mark detection and identification method | |
CN113034505A (en) | Glandular cell image segmentation method and device based on edge perception network | |
CN114092439A (en) | Multi-organ instance segmentation method and system | |
CN114332572B (en) | Method for extracting breast lesion ultrasonic image multi-scale fusion characteristic parameters based on saliency map-guided hierarchical dense characteristic fusion network | |
CN116645592B (en) | Crack detection method based on image processing and storage medium | |
CN114048822A (en) | Attention mechanism feature fusion segmentation method for image | |
CN110648331A (en) | Detection method for medical image segmentation, medical image segmentation method and device | |
CN112348059A (en) | Deep learning-based method and system for classifying multiple dyeing pathological images | |
CN112686902A (en) | Two-stage calculation method for brain glioma identification and segmentation in nuclear magnetic resonance image | |
CN116883650A (en) | Image-level weak supervision semantic segmentation method based on attention and local stitching | |
CN117635628B (en) | Sea-land segmentation method based on context attention and boundary perception guidance | |
CN116630971A (en) | Wheat scab spore segmentation method based on CRF_Resunate++ network | |
CN118230166A (en) | Corn canopy organ identification method and canopy phenotype detection method based on improved Mask2YOLO network | |
CN114494786A (en) | Fine-grained image classification method based on multilayer coordination convolutional neural network | |
CN117291935A (en) | Head and neck tumor focus area image segmentation method and computer readable medium | |
Cogan et al. | Deep understanding of breast density classification | |
Arefin et al. | Deep learning approach for detecting and localizing brain tumor from magnetic resonance imaging images | |
CN114862883A (en) | Target edge extraction method, image segmentation method and system | |
Shahzad et al. | Semantic segmentation of anaemic RBCs using multilevel deep convolutional encoder-decoder network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |