CN111179270A - Image co-segmentation method and device based on attention mechanism


Info

Publication number
CN111179270A
Authority
CN
China
Prior art keywords
image
segmentation
images
feature map
feature
Prior art date
Legal status
Pending
Application number
CN201911147678.1A
Other languages
Chinese (zh)
Inventor
李甲
付程晗
赵一凡
赵沁平
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN201911147678.1A
Publication of CN111179270A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention provides an attention mechanism-based image co-segmentation method and device. The method comprises the following steps: the N images most similar to a target image in an image group to be segmented are determined to form co-segmentation image pairs with it, which prevents a large number of co-segmentation image pairs from entering the co-segmentation network for computation, reduces the computational complexity of the system and saves co-segmentation time for the image group. Each image pair is then input into a co-segmentation model in which the two similar images learn each other's important feature channels, and the segmentation result of the target image is obtained. Computing on image pairs with the co-segmentation model improves the efficiency of image co-segmentation and achieves a good segmentation effect.

Description

Image co-segmentation method and device based on attention mechanism
Technical Field
The invention relates to an image processing technology, in particular to an image co-segmentation method and device based on an attention mechanism.
Background
Image segmentation is one of the hot topics of image processing and computer vision; it is a computer vision task that marks specified regions according to the content of an image, and it is the basis of image analysis and understanding as well as of image feature extraction and recognition. Image co-segmentation, also called collaborative image segmentation, refers to segmenting a common object (often also referred to as the foreground) from a set of images that all contain that object.
Current co-segmentation methods mainly fall into two groups: traditional machine learning methods, such as Markov random fields and conditional random fields, and neural-network-based co-segmentation methods. A typical neural-network-based co-segmentation method inputs a group of images to be co-segmented, obtains an initial co-segmentation result through the co-segmentation network, and obtains the final co-segmentation result through further processing.
However, existing neural-network-based co-segmentation methods have low segmentation efficiency and a poor segmentation effect.
Disclosure of Invention
The invention provides an attention mechanism-based image co-segmentation method and device, which are used for solving the problems of low image co-segmentation efficiency and poor segmentation effect.
In one aspect, the present invention provides an attention mechanism-based image co-segmentation method, including:
acquiring a group of images to be segmented, wherein the group of images to be segmented comprises a plurality of images;
determining N images to be segmented of a target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
inputting the co-segmentation image pairs into a co-segmentation model, and performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed by the target image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented;
determining an average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain a prediction weight of the attention characteristic graph of the target image; multiplying the predicted weight and the semantic features of the target image to obtain the weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
Optionally, determining N images to be segmented of the target image in the image group to be segmented includes:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as the N images to be segmented.
Optionally, the performing operation on the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image includes:
for each co-segmentation image pair, performing feature compression processing on semantic feature maps of two images of the co-segmentation image pair according to feature channels of the semantic feature maps to obtain a first feature vector;
performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map;
the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph;
multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image;
according to the feature channel of the second feature map, performing feature compression processing on the second feature map to obtain a second feature vector;
performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
the second weight of each feature channel is up-sampled to obtain a third feature map, the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair;
determining an average value of features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image, wherein the prediction weight comprises the following steps:
acquiring attention feature maps of N images to be segmented of the target image obtained through calculation;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
Optionally, before inputting the co-segmentation image pair into the co-segmentation model, the method further includes:
preprocessing a training image set to obtain a labeling object of each image in the training image set;
obtaining a real result of each image according to the labeling object;
determining N images to be segmented of each image in the training image set, and forming N co-segmented image pairs of each image in the training image set;
inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
the attention calculation includes: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph; multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image; the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map of the second attention operation;
carrying out up-sampling on the weight calibration of the current image to obtain a segmentation result of the target image;
calculating a loss between a segmentation result of the current image and the true result;
and reversely propagating the loss of the current image to the co-segmentation model.
Optionally, the preprocessing the training image set to obtain the labeled object of each image in the training image set includes:
when the images in the training image set comprise a plurality of objects, selecting one object as an annotation object according to the sizes of the objects;
the obtaining of the real result of each image according to the labeling object includes:
and cutting the labeled object of each image by using a minimum bounding box, and filling other areas without the labeled object by using pixel points to obtain a binary image of the labeled object, wherein the binary image is a real result of the image.
In another aspect, the present invention provides an attention-based image co-segmentation apparatus, including:
the device comprises an acquisition module, a segmentation module and a segmentation module, wherein the acquisition module is used for acquiring a group of images to be segmented, and the group of images to be segmented comprises a plurality of images;
the determining module is used for determining N images to be segmented of the target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
the co-segmentation module is to:
performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed by the target image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented;
determining an average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain a prediction weight of the attention characteristic graph of the target image; multiplying the predicted weight and the semantic features of the target image to obtain the weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
Optionally, the determining module is specifically configured to:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as the N images to be segmented.
Optionally, the co-segmentation module performs an operation on the semantic feature maps of the two images of each co-segmented image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, and the operation includes:
for each co-segmentation image pair, performing feature compression processing on semantic feature maps of two images of the co-segmentation image pair according to feature channels of the semantic feature maps to obtain a first feature vector;
performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map;
the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph;
multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image;
according to the feature channel of the second feature map, performing feature compression processing on the second feature map to obtain a second feature vector;
performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
the second weight of each feature channel is up-sampled to obtain a third feature map, the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair;
the common segmentation module determines an average value of features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image, and the common segmentation module comprises:
acquiring attention feature maps of N images to be segmented of the target image obtained through calculation;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
Optionally, the image co-segmentation apparatus based on attention mechanism further includes a training module:
the training module is configured to:
preprocessing a training image set to obtain a labeling object of each image in the training image set, and obtaining a real result of each image according to the labeling object;
determining N images to be segmented of each image in the training image set, and forming N co-segmented image pairs of each image in the training image set;
inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
the attention calculation includes: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph; multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image; the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map of the second attention operation;
carrying out up-sampling on the weight calibration of the current image to obtain a segmentation result of the target image;
calculating a loss between a segmentation result of the current image and the true result;
and reversely propagating the loss of the current image to the co-segmentation model.
Optionally, the training module performs preprocessing on a training image set to obtain a labeled object of each image in the training image set, including:
when the images in the training image set comprise a plurality of objects, selecting one object as an annotation object according to the sizes of the objects;
the obtaining of the real result of each image according to the labeling object includes:
and cutting the labeled object of each image by using a minimum bounding box, and filling other areas without the labeled object by using pixel points to obtain a binary image of the labeled object, wherein the binary image is a real result of the image.
In yet another aspect, the present invention provides an attention-based image co-segmentation apparatus, comprising:
a memory for storing processor-executable instructions;
a processor for implementing the method described in the above aspect when executing the instructions.
In yet another aspect, the present invention provides a computer-readable storage medium having stored thereon computer-executable instructions for implementing a method as described in the above aspect when executed by a processor.
According to the image co-segmentation method based on the attention mechanism provided by the invention, the N most similar images in the image group to be segmented are determined to form co-segmentation image pairs, which prevents a large number of co-segmentation image pairs from entering the co-segmentation network for computation, reduces the computational complexity of the system and saves image co-segmentation time. The image pairs are input into a co-segmentation model in which the two similar images learn each other's important feature channels, and the segmentation result of the target image is obtained. Computing on the image pairs with the co-segmentation model can improve the efficiency of image co-segmentation and achieve a good segmentation effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a diagram of a scene for target object extraction for a set of images provided by the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an image co-segmentation method based on attention mechanism according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of an image co-segmentation method based on attention mechanism according to the present invention;
FIG. 4 is a flowchart of a third embodiment of an image co-segmentation method based on attention mechanism according to the present invention;
FIG. 5 is a schematic structural diagram of an image co-segmentation method based on an attention mechanism according to the present invention;
FIG. 6 is a schematic structural diagram of a first embodiment of an image co-segmentation apparatus based on an attention mechanism according to the present invention;
fig. 7 is a block diagram of an image co-segmentation apparatus based on an attention mechanism according to the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
The invention provides an attention mechanism-based image co-segmentation method, which can improve the efficiency of image co-segmentation and achieve a better co-segmentation effect.
Fig. 1 is a scene diagram of extracting a target object from a set of images according to the present invention. The image co-segmentation method based on the attention mechanism provided by the invention can be applied to the scenario of extracting a target object shown in fig. 1. As shown in fig. 1, the scenario includes:
In consecutive video frames, an object of interest, such as a person or a vehicle, is always present, and this object of interest needs to be segmented and labeled.
Because the image group contains the same or similar objects, the image set can be treated as an image set of the same category and input into an attention-mechanism-based image co-segmentation device for co-segmentation processing, thereby extracting the object of interest, i.e. the target object. The co-segmentation device first finds, for each image in the image set, N images that are the same as or similar to it to form co-segmentation image pairs, then inputs the co-segmentation image pairs into a co-segmentation model, and finally obtains the segmentation result of the image.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a first embodiment of the method for image co-segmentation based on attention mechanism according to the present invention, and the execution subject of the method is an apparatus for image co-segmentation based on attention mechanism, which may be implemented by software and/or hardware. As shown in fig. 2, the image co-segmentation method based on attention mechanism may include the following steps:
s101, acquiring a group of images to be segmented, wherein the group of images to be segmented comprises a plurality of images.
S102, determining N images to be segmented of the target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented.
If every two images in the acquired group of images to be segmented were paired and then co-segmented, the system workload would be large. Therefore, for each target image, N images with high similarity to it can be determined in the group of images to be segmented as its images to be segmented, and each of them forms a co-segmentation image pair with the target image, where N can be a number greater than or equal to 2 and less than or equal to 8.
Optionally, a good co-segmentation effect can be obtained when N is equal to 6. For example, if the image group to be segmented contains 100 images, 600 co-segmentation image pairs are determined.
Optionally, determining N images to be segmented of the target image in the image group to be segmented includes:
and extracting the characteristics of each image in the image group to be segmented. The feature extraction may be performed on each image using an image metric network, or other feature extraction methods may be used, which is not limited herein. And then respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented. And selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as N images to be segmented. I.e. the N images that are most similar to the target image are selected.
By extracting the features of the target image, calculating the distances between them and the features of each image in the image group to be segmented, finding the N images most similar to the target image, and forming co-segmentation image pairs, a large number of co-segmentation image pairs are prevented from entering the co-segmentation network for computation, the computational complexity of the system is reduced, and the co-segmentation time of the image group is saved.
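A minimal sketch of this selection step is given below in Python; the Euclidean distance and the function name are illustrative assumptions, since the patent only requires per-image features and a pairwise distance measure.

```python
import torch

def select_similar_images(features: torch.Tensor, target_idx: int, n: int = 6):
    """features: (M, D) tensor with one feature vector per image in the group."""
    target = features[target_idx]                  # (D,) feature of the target image
    dists = torch.norm(features - target, dim=1)   # distance to every image in the group
    order = torch.argsort(dists)                   # closest first (index 0 is the target itself)
    return order[:n].tolist()                      # indices of the N most similar images
```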
S103, inputting the co-segmentation image pairs into a co-segmentation model, and carrying out the following processing on each co-segmentation image pair through the co-segmentation model: and respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image.
Each image of the pair is processed by the 5 convolution blocks of the VGG16 module, with a pooling layer connected between the convolution blocks to halve the feature map. This processing yields the semantic feature map of each image and achieves feature dimension reduction, data compression and a reduction of the parameter quantity.
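A minimal sketch of this feature extraction, assuming the convolutional part of torchvision's VGG16 stands in for the VGG16 module described here:

```python
import torch
import torchvision

# Convolutional part of VGG16: 5 convolution blocks separated by max-pooling layers.
backbone = torchvision.models.vgg16(weights=None).features

x = torch.randn(1, 3, 512, 512)   # one 512 x 512 RGB input image
fx = backbone(x)                  # semantic feature map of shape (1, 512, 16, 16)
```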
And S104, calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented which forms the image pair with the target image.
Wherein, the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented.
And performing operation on the semantic feature map of one image of each co-segmentation image pair to obtain the attention feature map of the image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image, namely the attention feature map represents the importance degree of the feature channel of the image. For a target image, N images to be segmented which are most similar to the target image are determined through step S102, and attention feature maps of the N images to be segmented are obtained through attention calculation on the N images to be segmented.
S105, determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image, and multiplying the prediction weight by the semantic characteristics of the target image to obtain the weight calibration of the target image.
And calculating the average value of the characteristics of the attention characteristic graphs of the N images to be segmented of the target image, namely fusing the characteristics of the attention characteristic graphs of the N images to be segmented, wherein the average value is the prediction weight of the attention characteristic graph of the target image. And multiplying the predicted weight by the semantic features of the target image to obtain the weight calibration of the target image.
Illustratively, when N is 6, for the target image x1 in the image group, the 6 images most suitable for co-segmentation with it, (x1, x2, x3, x4, x5, x6), are obtained, forming the co-segmentation image pairs (x1, x1), (x1, x2), (x1, x3), (x1, x4), (x1, x5), (x1, x6). The co-segmentation model then yields 6 attention feature maps att_x1, att_x2, ..., att_x6, from which the prediction weight att*_x1 of the attention feature map of the target image x1 is calculated. This can be realized by the following formula:

att*_x1 = (att_x1 + att_x2 + att_x3 + att_x4 + att_x5 + att_x6) / 6    formula (1)
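A short sketch of formula (1) and the subsequent weight calibration is given below; the tensor shapes and function name are assumptions for illustration.

```python
import torch

def calibrate_target(att_maps, fx_target):
    """att_maps: list of N attention feature maps, each (C, H, W); fx_target: (C, H, W)."""
    pred_weight = torch.stack(att_maps).mean(dim=0)  # formula (1): average of the N attention maps
    return pred_weight * fx_target                   # weight calibration of the target image
```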
and S106, carrying out up-sampling on the weight calibration of the target image to obtain the segmentation result of the target image.
Up-sampling the weight calibration of the target image means analysing the low-level features of the image and restoring the original information of the image; each pixel of the resulting map is then classified into two classes, i.e. distinguished as belonging to the target object or to the background area, to obtain the final segmentation result.
According to the image co-segmentation method based on the attention mechanism provided by this embodiment, the N most similar images in the image group to be segmented are determined to form co-segmentation image pairs, which prevents a large number of co-segmentation image pairs from entering the co-segmentation network for computation, reduces the computational complexity of the system and saves image co-segmentation time. The image pairs are input into a co-segmentation model in which the two similar images learn each other's important feature channels, and the segmentation result of the target image is obtained. Computing on the image pairs with the co-segmentation model can improve the efficiency of image co-segmentation and achieve a good segmentation effect.
Fig. 3 is a flowchart of a second embodiment of the image co-segmentation method based on the attention mechanism according to the present invention. On the basis of the embodiment shown in fig. 2, S104 (calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented that forms an image pair with the target image) and the determination of the average value of the features of the attention feature maps of the N images to be segmented to obtain the prediction weight of the attention feature map of the target image include the following steps:
s201, aiming at each co-segmentation image pair, performing feature compression processing on semantic feature maps according to feature channels of the semantic feature maps of two images of the co-segmentation image pair to obtain a first feature vector; performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map; and upsampling the weight of each characteristic channel to obtain a first characteristic map, wherein the size of the first characteristic map is the same as that of the semantic characteristic map.
Feature compression is performed on each two-dimensional feature map of the semantic feature map to obtain one real number, i.e. each feature channel of the semantic feature map is represented by a single real number that has a global receptive field and represents the average feature of the semantic feature map on that channel; this yields the first feature vector. Two full connections then produce the first weight of each feature channel of the semantic feature map, and up-sampling yields the first feature map with the same size as the semantic feature map.
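The channel-attention step described above can be sketched as follows; this is a simplified illustration in which the reduction ratio r and the module name are assumptions, since the description only specifies feature compression, two full connections and up-sampling to the size of the semantic feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels: int = 512, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, fx: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fx.shape
        v = fx.mean(dim=(2, 3))                           # feature compression: one real number per channel
        v = torch.sigmoid(self.fc2(F.relu(self.fc1(v))))  # two full connections, values mapped to (0, 1)
        return v.view(b, c, 1, 1).expand(b, c, h, w)      # channel weights broadcast to (B, C, H, W)
```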
S202, multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image.
And multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image, so that the importance degree of the feature channel of one image is used for influencing the other image, thereby obtaining a second feature map.
And S203, performing feature compression processing on the second feature map according to the feature channel of the second feature map to obtain a second feature vector.
Performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
s204, upsampling the second weight of each feature channel to obtain a third feature map, wherein the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair.
S205, acquiring the attention feature maps of the N images to be segmented of the target image obtained through calculation, determining the average value of the features of the attention feature maps of the N images to be segmented, and obtaining the prediction weight of the attention feature map of the target image.
Each image in the image group has N image pairs, and the other image to be segmented of each image pair generates an attention feature map. For the target image, the average value of the features of the N attention feature maps generated for its N images to be segmented is therefore calculated, which is the prediction weight of the attention feature map of the current image.
In this embodiment, for each image of a co-segmentation image pair, a first feature map with the same size as the semantic feature map is obtained, which represents the importance of the feature channels of the image's semantic feature map. The first feature map corresponding to one image of the pair is then multiplied with the semantic feature map of the other image to obtain the second feature map corresponding to each image, so that the channel importance of one image influences the other. A third feature map with the same size as the semantic feature map is then obtained by further computation. The attention feature maps of the N images to be segmented of the target image are collected, and the average value of their features is determined to obtain the prediction weight of the attention feature map of the target image. Since this prediction weight is the result of similar images mutually influencing the importance of each other's feature channels, the method and device can further improve the efficiency of image co-segmentation and obtain a good segmentation effect.
Fig. 4 is a flowchart of a third embodiment of the image co-segmentation method based on the attention mechanism provided in the present invention. The embodiments shown in fig. 2 and fig. 3 perform image co-segmentation with an already available co-segmentation model; in practical applications, before the method is implemented with the co-segmentation model, the method further includes constructing and training the co-segmentation model. As shown in fig. 4, constructing and training the co-segmentation model includes the following steps:
fig. 5 is a schematic structural diagram of an image co-segmentation method based on an attention mechanism according to the present invention, and a co-segmentation model constructed and trained according to the present invention is described with reference to fig. 5.
S301, preprocessing the training image set to obtain a labeling object of each image in the training image set, and obtaining a real result of each image according to the labeling object.
Before training the model, firstly selecting a training image set, and preprocessing the training image set to obtain a labeling object of each image in the training image set. And obtaining a real result of each image according to the labeling object, wherein the real result is a correct result of segmenting the labeling object and other background areas of the image.
Optionally, when the images in the training image set include a plurality of objects, one object is selected as the annotation object according to the sizes of the plurality of objects.
Optionally, the labeled object of each image is cut by using the minimum bounding box, and other regions without labeled objects are filled with the pixel points to obtain a binary image of the labeled object, wherein the binary image is a real result of the image.
The MSCOCO dataset is a categorized image dataset; 20 of its 80 categories are selected as the training image set, and the labeled object of each image is extracted according to the category of the object. When an image in the training image set contains several objects, the objects can be divided into three scales (large, medium and small); since co-segmentation needs objects with a larger foreground area, one large-scale object is selected as the labeled object for each semantic class of images. The labeled object is cropped with its minimum bounding box, the region outside the labeled object, i.e. the background of the image, is filled with the image pixel mean of the MSCOCO dataset, and the labeled object is set to white and the region outside it to black, so as to obtain the real result of the image.
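A minimal sketch of this preprocessing is given below; the per-channel mean is an assumed placeholder value, not the actual MSCOCO statistic.

```python
import numpy as np

DATASET_MEAN = np.array([124, 116, 104], dtype=np.uint8)  # assumed placeholder pixel mean

def make_ground_truth(image: np.ndarray, mask: np.ndarray):
    """image: (H, W, 3) uint8; mask: (H, W) boolean mask of the labeled object."""
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1   # minimum bounding box
    crop, crop_mask = image[y0:y1, x0:x1].copy(), mask[y0:y1, x0:x1]
    crop[~crop_mask] = DATASET_MEAN            # fill the background with the dataset pixel mean
    binary = crop_mask.astype(np.uint8) * 255  # labeled object white, background black (real result)
    return crop, binary
```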
S302, determining N images to be segmented of each image in the training image set, and forming N co-segmented image pairs of each image in the training image set.
By measuring the distance between each image and other images in the image set, the semantic similarity between the images can be obtained, so that N images which are most suitable for co-segmentation are found for each image, and N co-segmentation image pairs of each image in the training image set are formed.
Specifically, the image features of each image in the training image set are extracted, and in the training image set of each category, the similarity distance between each image feature and all the image features in the image set is calculated pairwise.
In one possible implementation, a single distance metric method may be used: the similarity distances are ranked from the most similar to the least similar, and the first N results are taken to determine the N images to be segmented for each image in the training image set.
In another possible implementation, M distance metric methods may be used to calculate the similarity distances. The similarity distance values calculated by each method are arranged from the most similar to the least similar into a matrix, giving M matrices, and the first N results of each matrix form a most-similar distance matrix. The closest images are then selected by voting: if a similarity distance result appears in all M most-similar distance matrices at the same time, the image corresponding to that distance is determined to be a co-segmentation image of the target image; then the similarity distances that appear simultaneously in M-1 of the most-similar distance matrices are searched, and the corresponding images are likewise determined to be co-segmentation images of the target image; and so on, taking the results with the highest similarity each time, until the number of returned results equals N.
This strategy yields, for each image, a matrix of its top N most similar images. With this metric matrix, every target image of each semantic class can be associated with the N images that are most suitable for co-segmentation with it.
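One way to realize this voting strategy is sketched below; the function boundaries and tie handling are assumptions.

```python
import numpy as np

def vote_top_n(distance_matrices, target_idx: int, n: int = 6):
    """distance_matrices: M arrays of shape (num_images, num_images); smaller value = more similar."""
    top_sets = [set(np.argsort(d[target_idx])[:n]) for d in distance_matrices]
    votes = {}
    for s in top_sets:                       # count in how many metrics each image is among the top N
        for idx in s:
            votes[idx] = votes.get(idx, 0) + 1
    ranked = sorted(votes, key=lambda i: -votes[i])   # images appearing in all M metrics come first
    return ranked[:n]                        # the N co-segmentation partners of the target image
```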
S303, inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image.
The input images x1 (C_input1, H_input1, W_input1) and x2 (C_input2, H_input2, W_input2) are processed with the 5 convolution operations of the VGG16 module, the convolution blocks being connected to each other by max-pooling layers; each max-pooling layer halves the feature map, so these operations achieve feature dimension reduction, data compression and parameter reduction, and yield the semantic feature maps fx1 (C1, H1, W1) and fx2 (C2, H2, W2) of the two images. For example, when two 512 × 512 RGB images x1 and x2 are input, the semantic feature maps fx1 and fx2 obtained by the above operations both have 512 feature channels, each of size 16 × 16.
S304, performing attention operation twice on semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
attention calculation, including: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic diagram, and the size of the first characteristic diagram is the same as that of the semantic characteristic diagram; multiplying a first feature map corresponding to one image in each co-segmentation image pair with a semantic feature map of the other image to obtain a second feature map corresponding to each image; and the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map obtained by the second attention operation.
The semantic feature maps fx1 and fx2 are compressed along their spatial dimensions (i.e. per feature channel of the semantic feature map): the two-dimensional feature map of each feature channel is compressed into one real number, giving feature maps fx1' and fx2' whose number of channels equals that of the input semantic feature maps. fx1' and fx2' characterize the global distribution of the responses over the feature channels and give the layers close to the input a global receptive field. After this compression of the image semantic feature maps, fx1' and fx2' are feature vectors of dimension (C, 1, 1). The compression of the semantic feature map of an image adopted in this embodiment can be realized by the following formula:

fx'(c) = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} fx(c, i, j)    formula (2)

wherein fx is the semantic feature map of the image, fx' is the feature vector after feature compression, H is the length of the feature map of each channel of fx, and W is the width of the feature map of each channel of fx.
The two feature vectors fx1' and fx2' then pass through two full connections. A full connection can be generated with the linear transformation g(x) = wx + b, where w and b are generated randomly at initialization and are then continuously learned during training. The weight of each feature channel is obtained through the two full connections, and all values are mapped to between 0 and 1 with a sigmoid function. The weight of each feature channel is then up-sampled to obtain the first feature maps att_x1" and att_x2", which are the same size as the semantic feature maps fx1 and fx2; the first feature map represents the weight of each pixel in the feature map. The full connection operation performed on the feature map and the calculation of the first feature map adopted in this embodiment can be realized by the following formulas:

fx" = σ(g(w, fx')) = σ(w2 · δ(w1 · fx'))    formula (3)

att_x" = upsample(fx")    formula (4)

wherein fx' is the input feature map of the full connections, fx" is the feature map obtained after the two full connections, the first full connection uses δ(w1 · fx'), the second full connection uses σ(g(w, fx')), and att_x" is the first feature map obtained by up-sampling fx".
The semantic feature map of the current image is multiplied by the first feature map of the other image of the co-segmentation image pair, so that after the two images have mutually learned the importance of each other's feature channels, a second feature map representing the weighting of the image feature map is obtained.
For example, the semantic feature maps fx1 and fx2 both have size 512 × 16 × 16, the feature vectors fx1' and fx2' obtained by feature compression have size 512 × 1 × 1, and the first feature maps att_x1" and att_x2" obtained from them each have size 512 × 16 × 16.

In this embodiment, the second feature maps fatt_x2 (C, H, W) and fatt_x1 (C, H, W) can be realized by the following formula:

fatt_x1 = fx1 ⊗ att_x2",  fatt_x2 = fx2 ⊗ att_x1"    formula (5)

wherein fx is the semantic feature map of one image of the co-segmentation image pair, att_x" is the first feature map of the other image, and ⊗ denotes element-wise multiplication.
The process of obtaining, from the semantic feature maps, the weight calibration of one image by the feature map of the other image is the attention calculation process. The computed first feature map represents the weights of the feature channels of one image, i.e. the importance of its feature channels. One attention operation is performed on the input image pair, and the first feature map of one image is applied to the other image, so that the co-segmentation model learns that, if two images are similar, their important feature channels are the same; in other words, the feature map of the foreground object of the other image is weighted once. A second attention operation can then be performed, so that the co-segmentation model learns another group of parameters to perceive the important feature channels of the weighted semantic feature maps. Through the two attention operations, the common area of the two images, i.e. the similar target objects in the two images, can be captured better, and a better co-segmentation effect is obtained.
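The chaining of the two attention operations across an image pair can be sketched as follows, reusing the ChannelAttention module sketched earlier; the exact module boundaries and parameter sharing are assumptions.

```python
import torch

def co_attention(fx1, fx2, att_op1, att_op2):
    """fx1, fx2: semantic feature maps (B, C, H, W); att_op1, att_op2: channel-attention modules."""
    # first attention operation: each image is re-weighted by the other's channel importance
    s1 = fx1 * att_op1(fx2)
    s2 = fx2 * att_op1(fx1)
    # second attention operation on the re-weighted ("second") feature maps
    cal1 = s1 * att_op2(s2)
    cal2 = s2 * att_op2(s1)
    return cal1, cal2                        # weight calibration of the two images
```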
S305, carrying out up-sampling on the weight calibration of the current image to obtain the segmentation result of the target image.
Up-sampling the weight calibration of the current image means analysing the low-level features of the image and restoring its original information. Specifically, each up-sampling first enlarges the feature map to twice its size and then applies a convolution with a 3 × 3 kernel, which turns the sparse feature map into a dense one, so that more information is learned and the restoration is closer to the real effect. After 5 rounds of up-sampling and convolution, a weighted feature map with the same size as the original input image is obtained, and finally softmax performs a two-class classification of every pixel in the feature map to obtain the final segmentation result.
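A rough sketch of such a decoder is given below; the intermediate channel widths are assumptions, and only the 2x up-sampling, the 3 × 3 kernels, the five repetitions and the final two-class softmax follow the description above.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, channels=(512, 256, 128, 64, 32, 16)):
        super().__init__()
        layers = []
        for in_c, out_c in zip(channels[:-1], channels[1:]):
            layers += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                       nn.Conv2d(in_c, out_c, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)                   # 5 rounds of 2x up-sampling + 3x3 conv
        self.classifier = nn.Conv2d(channels[-1], 2, kernel_size=1)

    def forward(self, calibrated):                           # (B, 512, 16, 16) weight calibration
        logits = self.classifier(self.body(calibrated))      # (B, 2, 512, 512)
        return logits.softmax(dim=1).argmax(dim=1)           # per-pixel two-class decision
```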
S306, calculating the loss between the segmentation result and the real result of the current image.
The loss between the segmentation result and the real result is calculated with the cross entropy, which can be realized by the following formula:

Cls_loss = -(1/n) Σ [ y_gt · log(y_p) + (1 - y_gt) · log(1 - y_p) ]    formula (6)

wherein Cls_loss is the loss between the segmentation result and the real result, y_p is the segmentation result, y_gt is the real result, and n is the number of rounds of the above processing performed on a group of images, each round also being called an epoch.
And S307, reversely transmitting the loss of the current image to the co-segmentation model.
Calculating loss between the obtained image segmentation result and the real result of the image, reversely transmitting the loss corresponding to the current image to the co-segmentation model, and updating parameters.
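A minimal sketch of this training step, assuming a model that outputs the foreground probability map and an externally chosen optimizer:

```python
import torch.nn.functional as F

def train_step(model, optimizer, pair, y_gt):
    y_p = model(pair)                          # predicted foreground probability map, values in (0, 1)
    loss = F.binary_cross_entropy(y_p, y_gt)   # cross entropy between prediction and real result
    optimizer.zero_grad()
    loss.backward()                            # propagate the loss back into the co-segmentation model
    optimizer.step()                           # update the model parameters
    return loss.item()
```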
Illustratively, in the experiment the co-segmentation model was trained on 40 million image pairs of training data for 2 epochs, using 8 Tesla P40 GPUs with a batch size of 64, for a total of 18 hours; the size of the final trained model is 600 MB.
Further, the trained co-segmentation model is tested.
First, a test data set is selected.
The test dataset includes both image classes that are present in the training dataset and image classes that are not present in the training dataset. For example, ImageNet is an image dataset whose images are manually labeled with objects; 40 classes of the ImageNet dataset are selected as the test dataset, of which 20 classes are the same as the 20 MSCOCO classes selected for the training dataset and 20 classes differ from the class labels of the MSCOCO classes selected for the training dataset.
The test dataset is then processed according to the method shown in fig. 2 or fig. 3, resulting in a co-segmentation result for each image in the test dataset.
For the evaluation of the test results, four evaluation indexes are selected: the intersection-over-union (IOU) of the real label and the prediction result, the precision of the foreground pixels, the recall of the foreground pixels, and the harmonic mean of precision and recall (f1 score). Precision is the proportion of correctly predicted pixels among the predicted foreground pixels, and recall is the proportion of correctly predicted foreground pixels among the real foreground pixels.
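These four indexes can be computed from a predicted binary mask and the ground-truth binary mask as sketched below (0/1 arrays of the same shape; the small epsilon only avoids division by zero).

```python
import numpy as np

def evaluate(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    iou = tp / (tp + fp + fn + eps)                    # intersection over union
    precision = tp / (tp + fp + eps)                   # correct pixels among predicted foreground
    recall = tp / (tp + fn + eps)                      # correct pixels among real foreground
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou, precision, recall, f1
```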
The results of the four evaluation indexes obtained by testing on the 40 ImageNet classes are as follows: the IOU is 0.5917, precision is 0.8246, recall is 0.6709, and the f1 score is 0.7398.
Fig. 6 is a schematic structural diagram of a first embodiment of an attention-based image co-segmentation apparatus provided in the present invention. As shown in fig. 6, the attention-based image co-segmentation apparatus of this embodiment includes: an acquisition module 11, a determination module 12 and a co-segmentation module 13.
The acquiring module 11 is configured to acquire a group of images to be segmented, where the group of images to be segmented includes a plurality of images;
the determining module 12 is configured to determine N images to be segmented of a target image in the image group to be segmented, and form a co-segmented image pair with each image to be segmented, where N is greater than or equal to 2 and less than or equal to 8;
the co-segmentation module 13 is configured to:
the following processing is performed on each co-segmented image pair by the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed by the target image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented;
determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image; multiplying the predicted weight by the semantic features of the target image to obtain the weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
Optionally, the determining module 12 is specifically configured to:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all images in the image group to be segmented;
and selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as N images to be segmented.
Optionally, the co-segmentation module 13 performs an operation on the semantic feature maps of the two images of each co-segmented image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, including:
for each co-segmentation image pair, performing feature compression processing on the semantic feature maps according to feature channels of the semantic feature maps of the two images of the co-segmentation image pair to obtain a first feature vector;
performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map;
the weight of each characteristic channel is up-sampled to obtain a first characteristic diagram, and the size of the first characteristic diagram is the same as that of the semantic characteristic diagram;
multiplying a first feature map corresponding to one image in each co-segmentation image pair with a semantic feature map of the other image to obtain a second feature map corresponding to each image;
according to the feature channel of the second feature map, performing feature compression processing on the second feature map to obtain a second feature vector;
performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
the second weight of each feature channel is up-sampled to obtain a third feature map, the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair;
the common segmentation module determines an average value of features of attention feature maps of N images to be segmented to obtain a prediction weight of the attention feature map of the target image, and the common segmentation module comprises the following steps:
acquiring attention feature maps of N images to be segmented of the target image obtained through calculation;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
Optionally, the attention-mechanism-based image co-segmentation device further includes a training module 14.
The training module 14 is configured to:
preprocessing a training image set to obtain a labeling object of each image in the training image set;
obtaining a real result of each image according to the labeling object;
determining N images to be segmented of each image in a training image set to form N co-segmented image pairs of each image in the training image set;
inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
attention calculation, including: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic diagram, and the size of the first characteristic diagram is the same as that of the semantic characteristic diagram; multiplying a first feature map corresponding to one image in each co-segmentation image pair with a semantic feature map of the other image to obtain a second feature map corresponding to each image; the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map of the second attention operation;
up-sampling the weight calibration of the current image to obtain a segmentation result of the current image;
calculating the loss between the segmentation result and the real result of the current image;
back-propagating the loss of the current image to the co-segmentation model.
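A single training step consistent with the description above might be organized as in the following sketch; the binary cross-entropy loss, the optimizer interface, and all variable names are assumptions made for illustration, since this disclosure does not fix the loss function.

```python
# Hypothetical training step; the loss function and optimizer are assumptions.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, current_img, paired_img, real_result):
    """current_img, paired_img: the two images of one co-segmentation image pair.
    real_result: binary ground-truth mask of the current image."""
    optimizer.zero_grad()
    # forward pass: the model returns the up-sampled segmentation of the current image
    seg = model(current_img, paired_img)
    # loss between the segmentation result and the real result of the current image
    loss = F.binary_cross_entropy_with_logits(seg, real_result)
    # back-propagate the loss of the current image into the co-segmentation model
    loss.backward()
    optimizer.step()
    return loss.item()
```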
Optionally, the training module 14 preprocessing the training image set to obtain the annotated object of each image in the training image set includes:
when an image in the training image set comprises a plurality of objects, selecting one of the objects as the annotated object according to the sizes of the objects;
obtaining the real result of each image according to the annotated object includes:
cropping the annotated object of each image with its minimum bounding box, and filling the regions that do not contain the annotated object with background pixel values to obtain a binary image of the annotated object, wherein the binary image is the real result of the image.
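The preprocessing of one training image can be sketched as follows; the assumption that regions outside the annotated object are filled with zero-valued background pixels, and the function name make_real_result, are illustrative choices rather than details stated in this disclosure.

```python
# Hypothetical preprocessing sketch; the background fill value (0) is an assumption.
import numpy as np

def make_real_result(image_shape, annotation_mask):
    """annotation_mask: boolean H x W array marking the annotated object.
    Returns the binary image ("real result"): the object region inside its
    minimum bounding box is kept, all other regions are filled as background."""
    ys, xs = np.nonzero(annotation_mask)
    y0, y1 = ys.min(), ys.max() + 1                  # minimum bounding box (rows)
    x0, x1 = xs.min(), xs.max() + 1                  # minimum bounding box (columns)
    result = np.zeros(image_shape[:2], dtype=np.uint8)   # background fill
    result[y0:y1, x0:x1] = annotation_mask[y0:y1, x0:x1].astype(np.uint8)
    return result
```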
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a block diagram of an image co-segmentation apparatus based on an attention mechanism according to the present invention. As shown in Fig. 7, the attention-mechanism-based image co-segmentation apparatus 300 includes: a memory 32 and a processor 31, wherein the memory 32 stores computer instructions, and the processor 31 executes the computer instructions to perform the method steps of the embodiments shown in Fig. 2 to Fig. 4 above.
It should be understood that, in the above embodiments, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The aforementioned memory may be a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk, or a solid-state disk. The steps of the methods disclosed in the embodiments of the present invention may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed.
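For completeness, the selection of the N images to be segmented (choosing, from the image group, the N images whose features are closest to those of the target image, as recited in claims 2 and 7 below) might be sketched as follows; the Euclidean distance metric, the pre-pooled feature vectors, and all names are assumptions, since the disclosure only requires selecting the N nearest images by feature distance.

```python
# Hypothetical sketch of the image-selection step; the distance metric is an assumption.
import torch

def select_nearest_images(target_feat, group_feats, n):
    """target_feat: feature vector of the target image, shape (D,).
    group_feats: feature vectors of all images in the group, shape (M, D).
    Returns the indices of the N images whose features are closest to the target."""
    dists = torch.norm(group_feats - target_feat.unsqueeze(0), dim=1)  # feature distances
    order = torch.argsort(dists)                                       # ascending distance
    order = order[dists[order] > 0]     # drop the target image itself if it is in the group
    return order[:n]
```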
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. An attention mechanism-based image co-segmentation method is characterized by comprising the following steps:
acquiring a group of images to be segmented, wherein the group of images to be segmented comprises a plurality of images;
determining N images to be segmented of a target image in the image group to be segmented, and forming a co-segmentation image pair from the target image and each of the images to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
inputting the co-segmentation image pairs into a co-segmentation model, and performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, wherein the attention feature map is a feature-channel weight map with the same size as the semantic feature map of the image to be segmented;
determining an average value of the features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image; and multiplying the prediction weight by the semantic features of the target image to obtain a weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
2. The method according to claim 1, wherein determining N images to be segmented of the target image in the group of images to be segmented comprises:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting, from the images of the image group to be segmented, the N images whose features are closest to the features of the target image as the N images to be segmented.
3. The method according to claim 1, wherein the calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image comprises:
for each co-segmentation image pair, performing feature compression on the semantic feature map of each of the two images of the co-segmentation image pair along its feature channels to obtain a first feature vector;
performing full connection twice on the first feature vector to obtain a first weight of each feature channel of the semantic feature map;
up-sampling the first weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map;
multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image;
performing feature compression on the second feature map along its feature channels to obtain a second feature vector;
performing full connection twice on the second feature vector to obtain a second weight of each feature channel of the second feature map;
up-sampling the second weight of each feature channel to obtain a third feature map, wherein the size of the third feature map is the same as that of the semantic feature map, and the third feature map is the attention feature map of the co-segmentation image pair;
wherein the determining an average value of the features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image comprises:
acquiring the attention feature maps, obtained by the above calculation, of the N images to be segmented of the target image;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
4. The method according to any one of claims 1-3, wherein, before the inputting of the co-segmentation image pairs into the co-segmentation model, the method further comprises:
preprocessing a training image set to obtain an annotated object of each image in the training image set;
obtaining a real result of each image according to the annotated object;
determining N images to be segmented of each image in the training image set, and forming N co-segmentation image pairs of each image in the training image set;
inputting each co-segmentation image pair of the training image set into the established co-segmentation model and performing the following processing: respectively extracting features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing an attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibrations of the two images of the co-segmentation image pair;
wherein the attention operation comprises: performing feature compression on the semantic feature map along its feature channels to obtain a feature vector; performing full connection twice on the feature vector to obtain a weight of each feature channel; up-sampling the weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map; multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image; and the second feature maps of the two images of each co-segmentation image pair obtained in the first attention operation serve as the semantic feature maps of the second attention operation;
up-sampling the weight calibration of the current image to obtain a segmentation result of the current image;
calculating a loss between the segmentation result of the current image and the real result;
and back-propagating the loss of the current image to the co-segmentation model.
5. The method according to claim 4, wherein the preprocessing of the training image set to obtain the annotated object of each image in the training image set comprises:
when an image in the training image set comprises a plurality of objects, selecting one of the objects as the annotated object according to the sizes of the objects;
and the obtaining of the real result of each image according to the annotated object comprises:
cropping the annotated object of each image with its minimum bounding box, and filling the regions that do not contain the annotated object with background pixel values to obtain a binary image of the annotated object, wherein the binary image is the real result of the image.
6. An attention-based image co-segmentation apparatus comprising:
the device comprises an acquisition module, a segmentation module and a segmentation module, wherein the acquisition module is used for acquiring a group of images to be segmented, and the group of images to be segmented comprises a plurality of images;
the determining module is used for determining N images to be segmented of the target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
the co-segmentation module is to:
performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, wherein the attention feature map is a feature-channel weight map with the same size as the semantic feature map of the image to be segmented;
determining an average value of the features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image; and multiplying the prediction weight by the semantic features of the target image to obtain a weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
7. The apparatus of claim 6, wherein the determining module is specifically configured to:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting, from the images of the image group to be segmented, the N images whose features are closest to the features of the target image as the N images to be segmented.
8. The apparatus according to claim 6, wherein the co-segmentation module performs an operation on semantic feature maps of two images of each co-segmented image pair to obtain an attention feature map of an image to be segmented of an image pair formed with the target image, and includes:
for each co-segmentation image pair, performing feature compression on the semantic feature map of each of the two images of the co-segmentation image pair along its feature channels to obtain a first feature vector;
performing full connection twice on the first feature vector to obtain a first weight of each feature channel of the semantic feature map;
up-sampling the first weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map;
multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image;
performing feature compression on the second feature map along its feature channels to obtain a second feature vector;
performing full connection twice on the second feature vector to obtain a second weight of each feature channel of the second feature map;
up-sampling the second weight of each feature channel to obtain a third feature map, wherein the size of the third feature map is the same as that of the semantic feature map, and the third feature map is the attention feature map of the co-segmentation image pair;
wherein the co-segmentation module determining an average value of the features of the attention feature maps of the N images to be segmented to obtain the prediction weight of the attention feature map of the target image comprises:
acquiring the attention feature maps, obtained by the above calculation, of the N images to be segmented of the target image;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
9. The apparatus according to any one of claims 6-8, further comprising a training module;
the training module is configured to:
preprocessing a training image set to obtain an annotated object of each image in the training image set; obtaining a real result of each image according to the annotated object;
determining N images to be segmented of each image in the training image set, and forming N co-segmentation image pairs of each image in the training image set;
inputting each co-segmentation image pair of the training image set into the established co-segmentation model and performing the following processing: respectively extracting features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing an attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibrations of the two images of the co-segmentation image pair;
wherein the attention operation comprises: performing feature compression on the semantic feature map along its feature channels to obtain a feature vector; performing full connection twice on the feature vector to obtain a weight of each feature channel; up-sampling the weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map; multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image; and the second feature maps of the two images of each co-segmentation image pair obtained in the first attention operation serve as the semantic feature maps of the second attention operation;
up-sampling the weight calibration of the current image to obtain a segmentation result of the current image;
calculating a loss between the segmentation result of the current image and the real result;
and back-propagating the loss of the current image to the co-segmentation model.
10. The apparatus according to claim 9, wherein the training module preprocessing the training image set to obtain the annotated object of each image in the training image set comprises:
when an image in the training image set comprises a plurality of objects, selecting one of the objects as the annotated object according to the sizes of the objects;
and the obtaining of the real result of each image according to the annotated object comprises:
cropping the annotated object of each image with its minimum bounding box, and filling the regions that do not contain the annotated object with background pixel values to obtain a binary image of the annotated object, wherein the binary image is the real result of the image.
11. An attention-based image co-segmentation apparatus comprising:
a memory for storing processor-executable instructions;
a processor, configured to execute the instructions to implement the method according to any one of claims 1-5.
12. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the attention-based image co-segmentation method as claimed in any one of claims 1 to 5.
CN201911147678.1A 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism Pending CN111179270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911147678.1A CN111179270A (en) 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911147678.1A CN111179270A (en) 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism

Publications (1)

Publication Number Publication Date
CN111179270A true CN111179270A (en) 2020-05-19

Family

ID=70647323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911147678.1A Pending CN111179270A (en) 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN111179270A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330059A1 (en) * 2016-05-11 2017-11-16 Xerox Corporation Joint object and object part detection using web supervision
CN110197213A (en) * 2019-05-21 2019-09-03 北京航空航天大学 Image matching method, device and equipment neural network based
CN110163878A (en) * 2019-05-28 2019-08-23 四川智盈科技有限公司 A kind of image, semantic dividing method based on dual multiple dimensioned attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HONG CHEN et al.: "Semantic Aware Attention Based Deep Object Co-segmentation", arXiv *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022032824A1 (en) * 2020-08-10 2022-02-17 中国科学院深圳先进技术研究院 Image segmentation method and apparatus, device, and storage medium
CN113239943A (en) * 2021-05-28 2021-08-10 北京航空航天大学 Three-dimensional component extraction and combination method and device based on component semantic graph
CN113627258A (en) * 2021-07-12 2021-11-09 河南理工大学 Apple leaf pathological detection method
CN113627258B (en) * 2021-07-12 2023-09-26 河南理工大学 Apple leaf pathology detection method
US20230131977A1 (en) * 2021-10-22 2023-04-27 The Boeing Company Method For Large Area Inspection

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN109960742B (en) Local information searching method and device
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN110222718A (en) The method and device of image procossing
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111144425B (en) Method and device for detecting shot screen picture, electronic equipment and storage medium
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN111814821A (en) Deep learning model establishing method, sample processing method and device
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN115272691A (en) Training method, recognition method and equipment for steel bar binding state detection model
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
CN112183303A (en) Transformer equipment image classification method and device, computer equipment and medium
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN111652246B (en) Image self-adaptive sparsization representation method and device based on deep learning
CN114445916A (en) Living body detection method, terminal device and storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111625672B (en) Image processing method, image processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination