CN111179270A - Image co-segmentation method and device based on attention mechanism


Info

Publication number
CN111179270A
Authority
CN
China
Prior art keywords
image
segmentation
images
feature map
feature
Prior art date
Legal status
Pending
Application number
CN201911147678.1A
Other languages
Chinese (zh)
Inventor
李甲
付程晗
赵一凡
赵沁平
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN201911147678.1A
Publication of CN111179270A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning

Abstract

The invention provides an attention mechanism-based image co-segmentation method and device. The method comprises the following steps: the N images most similar to a target image in an image group to be segmented are determined to form co-segmentation image pairs with it, which prevents a large number of co-segmentation image pairs from entering the co-segmentation network for computation, reduces the computational complexity of the system and saves co-segmentation time for the image group. Each image pair is then input into a co-segmentation model in which the two similar images learn each other's important feature channels, and the segmentation result of the target image is obtained. Computing on image pairs with the co-segmentation model improves the efficiency of image co-segmentation and achieves a good segmentation effect.

Description

Image co-segmentation method and device based on attention mechanism
Technical Field
The invention relates to an image processing technology, in particular to an image co-segmentation method and device based on an attention mechanism.
Background
Image segmentation is one of the hot topics of image processing and computer vision; it is a computer vision task that marks specified regions according to the content of an image, and it is the basis of image analysis and understanding as well as of image feature extraction and recognition. Image co-segmentation, also called collaborative image segmentation, refers to segmenting a common object (often also referred to as the foreground) from a set of images that all contain that object.
Current co-segmentation methods mainly fall into two groups: traditional machine learning methods, such as Markov random fields and conditional random fields, and neural-network-based co-segmentation methods. A typical neural-network-based co-segmentation method inputs a group of images to be co-segmented, obtains an initial co-segmentation result through the co-segmentation network, and obtains the final co-segmentation result through further processing.
However, existing neural-network-based co-segmentation methods have low segmentation efficiency and a poor segmentation effect.
Disclosure of Invention
The invention provides an attention mechanism-based image co-segmentation method and device, which are used for solving the problems of low image co-segmentation efficiency and poor segmentation effect.
In one aspect, the present invention provides an attention mechanism-based image co-segmentation method, including:
acquiring a group of images to be segmented, wherein the group of images to be segmented comprises a plurality of images;
determining N images to be segmented of a target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
inputting the co-segmentation image pairs into a co-segmentation model, and performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed by the target image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented;
determining an average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain a prediction weight of the attention characteristic graph of the target image; multiplying the predicted weight and the semantic features of the target image to obtain the weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
Optionally, determining N images to be segmented of the target image in the image group to be segmented includes:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as the N images to be segmented.
Optionally, the performing operation on the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image includes:
for each co-segmentation image pair, performing feature compression processing on semantic feature maps of two images of the co-segmentation image pair according to feature channels of the semantic feature maps to obtain a first feature vector;
performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map;
the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph;
multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image;
according to the feature channel of the second feature map, performing feature compression processing on the second feature map to obtain a second feature vector;
performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
the second weight of each feature channel is up-sampled to obtain a third feature map, the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair;
determining an average value of features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image, wherein the prediction weight comprises the following steps:
acquiring attention feature maps of N images to be segmented of the target image obtained through calculation;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
Optionally, before inputting the co-segmentation image pair into the co-segmentation model, the method further includes:
preprocessing a training image set to obtain a labeling object of each image in the training image set;
obtaining a real result of each image according to the labeling object;
determining N images to be segmented of each image in the training image set, and forming N co-segmented image pairs of each image in the training image set;
inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
the attention calculation includes: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph; multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image; the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map of the second attention operation;
carrying out up-sampling on the weight calibration of the current image to obtain a segmentation result of the target image;
calculating a loss between a segmentation result of the current image and the true result;
and reversely propagating the loss of the current image to the co-segmentation model.
Optionally, the preprocessing the training image set to obtain the labeled object of each image in the training image set includes:
when the images in the training image set comprise a plurality of objects, selecting one object as an annotation object according to the sizes of the objects;
the obtaining of the real result of each image according to the labeling object includes:
and cutting the labeled object of each image by using a minimum bounding box, and filling other areas without the labeled object by using pixel points to obtain a binary image of the labeled object, wherein the binary image is a real result of the image.
In another aspect, the present invention provides an attention-based image co-segmentation apparatus, including:
the device comprises an acquisition module, a segmentation module and a segmentation module, wherein the acquisition module is used for acquiring a group of images to be segmented, and the group of images to be segmented comprises a plurality of images;
the determining module is used for determining N images to be segmented of the target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
the co-segmentation module is to:
performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed by the target image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented;
determining an average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain a prediction weight of the attention characteristic graph of the target image; multiplying the predicted weight and the semantic features of the target image to obtain the weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
Optionally, the determining module is specifically configured to:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as the N images to be segmented.
Optionally, the co-segmentation module performs an operation on the semantic feature maps of the two images of each co-segmented image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, and the operation includes:
for each co-segmentation image pair, performing feature compression processing on semantic feature maps of two images of the co-segmentation image pair according to feature channels of the semantic feature maps to obtain a first feature vector;
performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map;
the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph;
multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image;
according to the feature channel of the second feature map, performing feature compression processing on the second feature map to obtain a second feature vector;
performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
the second weight of each feature channel is up-sampled to obtain a third feature map, the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair;
the common segmentation module determines an average value of features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image, and the common segmentation module comprises:
acquiring attention feature maps of N images to be segmented of the target image obtained through calculation;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
Optionally, the image co-segmentation apparatus based on attention mechanism further includes a training module:
the training module is configured to:
preprocessing a training image set to obtain a labeling object of each image in the training image set, and obtaining a real result of each image according to the labeling object;
determining N images to be segmented of each image in the training image set, and forming N co-segmented image pairs of each image in the training image set;
inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
the attention calculation includes: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic graph, and the size of the first characteristic graph is the same as that of the semantic characteristic graph; multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image; the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map of the second attention operation;
carrying out up-sampling on the weight calibration of the current image to obtain a segmentation result of the target image;
calculating a loss between a segmentation result of the current image and the true result;
and reversely propagating the loss of the current image to the co-segmentation model.
Optionally, the training module performs preprocessing on a training image set to obtain a labeled object of each image in the training image set, including:
when the images in the training image set comprise a plurality of objects, selecting one object as an annotation object according to the sizes of the objects;
the obtaining of the real result of each image according to the labeling object includes:
and cutting the labeled object of each image by using a minimum bounding box, and filling other areas without the labeled object by using pixel points to obtain a binary image of the labeled object, wherein the binary image is a real result of the image.
In yet another aspect, the present invention provides an attention-based image co-segmentation apparatus, comprising:
a memory for storing processor-executable instructions;
a processor for implementing the method described in the above aspect when executing the instructions.
In yet another aspect, the present invention provides a computer-readable storage medium having stored thereon computer-executable instructions for implementing a method as described in the above aspect when executed by a processor.
According to the image co-segmentation method based on the attention mechanism provided by the invention, the N most similar images in the image group to be segmented are determined to form co-segmentation image pairs, which prevents a large number of co-segmentation image pairs from entering the co-segmentation network for computation, reduces the computational complexity of the system and saves image co-segmentation time. The image pairs are input into a co-segmentation model in which the two similar images learn each other's important feature channels, and the segmentation result of the target image is obtained. Computing on the image pairs with the co-segmentation model can improve the efficiency of image co-segmentation and achieve a good segmentation effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a diagram of a scene for target object extraction for a set of images provided by the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of an image co-segmentation method based on attention mechanism according to the present invention;
FIG. 3 is a flowchart illustrating a second embodiment of an image co-segmentation method based on attention mechanism according to the present invention;
FIG. 4 is a flowchart of a third embodiment of an image co-segmentation method based on attention mechanism according to the present invention;
FIG. 5 is a schematic structural diagram of an image co-segmentation method based on an attention mechanism according to the present invention;
FIG. 6 is a schematic structural diagram of a first embodiment of an image co-segmentation apparatus based on an attention mechanism according to the present invention;
fig. 7 is a block diagram of an image co-segmentation apparatus based on an attention mechanism according to the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
The invention provides an attention mechanism-based image co-segmentation method, which can improve the efficiency of image co-segmentation and achieve a better co-segmentation effect.
Fig. 1 is a scene diagram of extracting a target object from a set of images according to the present invention. The image co-segmentation method based on the attention mechanism provided by the invention can be applied to the scenario of extracting a target object shown in fig. 1. As shown in fig. 1, the scenario includes:
In consecutive video frames, an object of interest, such as a person or a vehicle, is always present, and this object of interest needs to be segmented and labeled.
Because the image group contains the same or similar objects, the image set can be treated as an image set of the same category and input into an attention-mechanism-based image co-segmentation device for co-segmentation processing, thereby extracting the object of interest, i.e. the target object. The co-segmentation device first finds, for each image in the image set, N images that are the same as or similar to it to form co-segmentation image pairs, then inputs the co-segmentation image pairs into a co-segmentation model, and finally obtains the segmentation result of the image.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart of a first embodiment of the method for image co-segmentation based on attention mechanism according to the present invention, and the execution subject of the method is an apparatus for image co-segmentation based on attention mechanism, which may be implemented by software and/or hardware. As shown in fig. 2, the image co-segmentation method based on attention mechanism may include the following steps:
s101, acquiring a group of images to be segmented, wherein the group of images to be segmented comprises a plurality of images.
S102, determining N images to be segmented of the target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented.
If every two images in the acquired group of images to be segmented were paired and then co-segmented, the system workload would be large. Therefore, for each target image, N images with high similarity to it can be determined in the group of images to be segmented as its images to be segmented, and each of them forms a co-segmentation image pair with the target image, where N can be a number greater than or equal to 2 and less than or equal to 8.
Optionally, a good co-segmentation effect can be obtained when N is equal to 6. For example, if the image group to be segmented contains 100 images, 600 co-segmentation image pairs are determined.
Optionally, determining N images to be segmented of the target image in the image group to be segmented includes:
and extracting the characteristics of each image in the image group to be segmented. The feature extraction may be performed on each image using an image metric network, or other feature extraction methods may be used, which is not limited herein. And then respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented. And selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as N images to be segmented. I.e. the N images that are most similar to the target image are selected.
By extracting the features of the target image, calculating the distances between them and the features of each image in the image group to be segmented, finding the N images most similar to the target image, and forming co-segmentation image pairs, a large number of co-segmentation image pairs are prevented from entering the co-segmentation network for computation, the computational complexity of the system is reduced, and the co-segmentation time of the image group is saved.
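A minimal sketch of this selection step is given below in Python; the Euclidean distance and the function name are illustrative assumptions, since the patent only requires per-image features and a pairwise distance measure.

```python
import torch

def select_similar_images(features: torch.Tensor, target_idx: int, n: int = 6):
    """features: (M, D) tensor with one feature vector per image in the group."""
    target = features[target_idx]                  # (D,) feature of the target image
    dists = torch.norm(features - target, dim=1)   # distance to every image in the group
    order = torch.argsort(dists)                   # closest first (index 0 is the target itself)
    return order[:n].tolist()                      # indices of the N most similar images
```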
S103, inputting the co-segmentation image pairs into a co-segmentation model, and carrying out the following processing on each co-segmentation image pair through the co-segmentation model: and respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image.
Each image of the pair is processed by the 5 convolution blocks of the VGG16 module, with a pooling layer connected between the convolution blocks to halve the feature map. This processing yields the semantic feature map of each image and achieves feature dimension reduction, data compression and a reduction of the parameter quantity.
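A minimal sketch of this feature extraction, assuming the convolutional part of torchvision's VGG16 stands in for the VGG16 module described here:

```python
import torch
import torchvision

# Convolutional part of VGG16: 5 convolution blocks separated by max-pooling layers.
backbone = torchvision.models.vgg16(weights=None).features

x = torch.randn(1, 3, 512, 512)   # one 512 x 512 RGB input image
fx = backbone(x)                  # semantic feature map of shape (1, 512, 16, 16)
```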
And S104, calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented which forms the image pair with the target image.
Wherein, the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented.
And performing operation on the semantic feature map of one image of each co-segmentation image pair to obtain the attention feature map of the image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image, namely the attention feature map represents the importance degree of the feature channel of the image. For a target image, N images to be segmented which are most similar to the target image are determined through step S102, and attention feature maps of the N images to be segmented are obtained through attention calculation on the N images to be segmented.
S105, determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image, and multiplying the prediction weight by the semantic characteristics of the target image to obtain the weight calibration of the target image.
And calculating the average value of the characteristics of the attention characteristic graphs of the N images to be segmented of the target image, namely fusing the characteristics of the attention characteristic graphs of the N images to be segmented, wherein the average value is the prediction weight of the attention characteristic graph of the target image. And multiplying the predicted weight by the semantic features of the target image to obtain the weight calibration of the target image.
Illustratively, when N is 6, for the target image x1 in the image group, the 6 images most suitable for co-segmentation with it, (x1, x2, x3, x4, x5, x6), are obtained, forming the co-segmentation image pairs (x1, x1), (x1, x2), (x1, x3), (x1, x4), (x1, x5), (x1, x6). The co-segmentation model then yields 6 attention feature maps att_x1, att_x2, ..., att_x6, from which the prediction weight att*_x1 of the attention feature map of the target image x1 is calculated. This can be realized by the following formula:

att*_x1 = (att_x1 + att_x2 + att_x3 + att_x4 + att_x5 + att_x6) / 6    formula (1)
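A short sketch of formula (1) and the subsequent weight calibration is given below; the tensor shapes and function name are assumptions for illustration.

```python
import torch

def calibrate_target(att_maps, fx_target):
    """att_maps: list of N attention feature maps, each (C, H, W); fx_target: (C, H, W)."""
    pred_weight = torch.stack(att_maps).mean(dim=0)  # formula (1): average of the N attention maps
    return pred_weight * fx_target                   # weight calibration of the target image
```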
and S106, carrying out up-sampling on the weight calibration of the target image to obtain the segmentation result of the target image.
Up-sampling the weight calibration of the target image means analysing the low-level features of the image and restoring the original information of the image; each pixel of the resulting map is then classified into two classes, i.e. distinguished as belonging to the target object or to the background area, to obtain the final segmentation result.
According to the image co-segmentation method based on the attention mechanism provided by this embodiment, the N most similar images in the image group to be segmented are determined to form co-segmentation image pairs, which prevents a large number of co-segmentation image pairs from entering the co-segmentation network for computation, reduces the computational complexity of the system and saves image co-segmentation time. The image pairs are input into a co-segmentation model in which the two similar images learn each other's important feature channels, and the segmentation result of the target image is obtained. Computing on the image pairs with the co-segmentation model can improve the efficiency of image co-segmentation and achieve a good segmentation effect.
Fig. 3 is a flowchart of a second embodiment of the image co-segmentation method based on the attention mechanism according to the present invention. On the basis of the embodiment shown in fig. 2, S104 (calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented that forms an image pair with the target image) and the determination of the average value of the features of the attention feature maps of the N images to be segmented to obtain the prediction weight of the attention feature map of the target image include the following steps:
s201, aiming at each co-segmentation image pair, performing feature compression processing on semantic feature maps according to feature channels of the semantic feature maps of two images of the co-segmentation image pair to obtain a first feature vector; performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map; and upsampling the weight of each characteristic channel to obtain a first characteristic map, wherein the size of the first characteristic map is the same as that of the semantic characteristic map.
Feature compression is performed on each two-dimensional feature map of the semantic feature map to obtain one real number, i.e. each feature channel of the semantic feature map is represented by a single real number that has a global receptive field and represents the average feature of the semantic feature map on that channel; this yields the first feature vector. Two full connections then produce the first weight of each feature channel of the semantic feature map, and up-sampling yields the first feature map with the same size as the semantic feature map.
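The channel-attention step described above can be sketched as follows; this is a simplified illustration in which the reduction ratio r and the module name are assumptions, since the description only specifies feature compression, two full connections and up-sampling to the size of the semantic feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels: int = 512, r: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)
        self.fc2 = nn.Linear(channels // r, channels)

    def forward(self, fx: torch.Tensor) -> torch.Tensor:
        b, c, h, w = fx.shape
        v = fx.mean(dim=(2, 3))                           # feature compression: one real number per channel
        v = torch.sigmoid(self.fc2(F.relu(self.fc1(v))))  # two full connections, values mapped to (0, 1)
        return v.view(b, c, 1, 1).expand(b, c, h, w)      # channel weights broadcast to (B, C, H, W)
```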
S202, multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image to obtain a second feature map corresponding to each image.
And multiplying the first feature map corresponding to one image in each co-segmentation image pair with the semantic feature map of the other image, so that the importance degree of the feature channel of one image is used for influencing the other image, thereby obtaining a second feature map.
And S203, performing feature compression processing on the second feature map according to the feature channel of the second feature map to obtain a second feature vector.
Performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
s204, upsampling the second weight of each feature channel to obtain a third feature map, wherein the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair.
S205, acquiring the attention feature maps of the N images to be segmented of the target image obtained through calculation, determining the average value of the features of the attention feature maps of the N images to be segmented, and obtaining the prediction weight of the attention feature map of the target image.
Each image in the image group has N image pairs, and the other image to be segmented of each image pair generates an attention feature map. For the target image, the average value of the features of the N attention feature maps generated for its N images to be segmented is therefore calculated, which is the prediction weight of the attention feature map of the current image.
In this embodiment, for each image of a co-segmentation image pair, a first feature map with the same size as the semantic feature map is obtained, which represents the importance of the feature channels of the image's semantic feature map. The first feature map corresponding to one image of the pair is then multiplied with the semantic feature map of the other image to obtain the second feature map corresponding to each image, so that the channel importance of one image influences the other. A third feature map with the same size as the semantic feature map is then obtained by further computation. The attention feature maps of the N images to be segmented of the target image are collected, and the average value of their features is determined to obtain the prediction weight of the attention feature map of the target image. Since this prediction weight is the result of similar images mutually influencing the importance of each other's feature channels, the method and device can further improve the efficiency of image co-segmentation and obtain a good segmentation effect.
Fig. 4 is a flowchart of a third embodiment of the image co-segmentation method based on the attention mechanism provided in the present invention. The embodiments shown in fig. 2 and fig. 3 perform image co-segmentation with an already available co-segmentation model; in practical applications, before the method is implemented with the co-segmentation model, the method further includes constructing and training the co-segmentation model. As shown in fig. 4, constructing and training the co-segmentation model includes the following steps:
fig. 5 is a schematic structural diagram of an image co-segmentation method based on an attention mechanism according to the present invention, and a co-segmentation model constructed and trained according to the present invention is described with reference to fig. 5.
S301, preprocessing the training image set to obtain a labeling object of each image in the training image set, and obtaining a real result of each image according to the labeling object.
Before training the model, firstly selecting a training image set, and preprocessing the training image set to obtain a labeling object of each image in the training image set. And obtaining a real result of each image according to the labeling object, wherein the real result is a correct result of segmenting the labeling object and other background areas of the image.
Optionally, when the images in the training image set include a plurality of objects, one object is selected as the annotation object according to the sizes of the plurality of objects.
Optionally, the labeled object of each image is cut by using the minimum bounding box, and other regions without labeled objects are filled with the pixel points to obtain a binary image of the labeled object, wherein the binary image is a real result of the image.
The MSCOCO dataset is a categorized image dataset; 20 of its 80 categories are selected as the training image set, and the labeled object of each image is extracted according to the category of the object. When an image in the training image set contains several objects, the objects can be divided into three scales (large, medium and small); since co-segmentation needs objects with a larger foreground area, one large-scale object is selected as the labeled object for each semantic class of images. The labeled object is cropped with its minimum bounding box, the region outside the labeled object, i.e. the background of the image, is filled with the image pixel mean of the MSCOCO dataset, and the labeled object is set to white and the region outside it to black, so as to obtain the real result of the image.
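A minimal sketch of this preprocessing is given below; the per-channel mean is an assumed placeholder value, not the actual MSCOCO statistic.

```python
import numpy as np

DATASET_MEAN = np.array([124, 116, 104], dtype=np.uint8)  # assumed placeholder pixel mean

def make_ground_truth(image: np.ndarray, mask: np.ndarray):
    """image: (H, W, 3) uint8; mask: (H, W) boolean mask of the labeled object."""
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1   # minimum bounding box
    crop, crop_mask = image[y0:y1, x0:x1].copy(), mask[y0:y1, x0:x1]
    crop[~crop_mask] = DATASET_MEAN            # fill the background with the dataset pixel mean
    binary = crop_mask.astype(np.uint8) * 255  # labeled object white, background black (real result)
    return crop, binary
```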
S302, determining N images to be segmented of each image in the training image set, and forming N co-segmented image pairs of each image in the training image set.
By measuring the distance between each image and other images in the image set, the semantic similarity between the images can be obtained, so that N images which are most suitable for co-segmentation are found for each image, and N co-segmentation image pairs of each image in the training image set are formed.
Specifically, the image features of each image in the training image set are extracted, and in the training image set of each category, the similarity distance between each image feature and all the image features in the image set is calculated pairwise.
In one possible implementation, a single distance metric method may be used: the similarity distances are ranked from the most similar to the least similar, and the first N results are taken to determine the N images to be segmented for each image in the training image set.
In another possible implementation, M distance metric methods may be used to calculate the similarity distances. The similarity distance values calculated by each method are arranged from the most similar to the least similar into a matrix, giving M matrices, and the first N results of each matrix form a most-similar distance matrix. The closest images are then selected by voting: if a similarity distance result appears in all M most-similar distance matrices at the same time, the image corresponding to that distance is determined to be a co-segmentation image of the target image; then the similarity distances that appear simultaneously in M-1 of the most-similar distance matrices are searched, and the corresponding images are likewise determined to be co-segmentation images of the target image; and so on, taking the results with the highest similarity each time, until the number of returned results equals N.
This strategy yields, for each image, a matrix of its top N most similar images. With this metric matrix, every target image of each semantic class can be associated with the N images that are most suitable for co-segmentation with it.
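One way to realize this voting strategy is sketched below; the function boundaries and tie handling are assumptions.

```python
import numpy as np

def vote_top_n(distance_matrices, target_idx: int, n: int = 6):
    """distance_matrices: M arrays of shape (num_images, num_images); smaller value = more similar."""
    top_sets = [set(np.argsort(d[target_idx])[:n]) for d in distance_matrices]
    votes = {}
    for s in top_sets:                       # count in how many metrics each image is among the top N
        for idx in s:
            votes[idx] = votes.get(idx, 0) + 1
    ranked = sorted(votes, key=lambda i: -votes[i])   # images appearing in all M metrics come first
    return ranked[:n]                        # the N co-segmentation partners of the target image
```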
S303, inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image.
The input images x1 (C_input1, H_input1, W_input1) and x2 (C_input2, H_input2, W_input2) are processed with the 5 convolution operations of the VGG16 module, the convolution blocks being connected to each other by max-pooling layers; each max-pooling layer halves the feature map, so these operations achieve feature dimension reduction, data compression and parameter reduction, and yield the semantic feature maps fx1 (C1, H1, W1) and fx2 (C2, H2, W2) of the two images. For example, when two 512 × 512 RGB images x1 and x2 are input, the semantic feature maps fx1 and fx2 obtained by the above operations both have 512 feature channels, each of size 16 × 16.
S304, performing attention operation twice on semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
attention calculation, including: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic diagram, and the size of the first characteristic diagram is the same as that of the semantic characteristic diagram; multiplying a first feature map corresponding to one image in each co-segmentation image pair with a semantic feature map of the other image to obtain a second feature map corresponding to each image; and the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map obtained by the second attention operation.
The semantic feature maps fx1 and fx2 are compressed along their spatial dimensions (i.e. per feature channel of the semantic feature map): the two-dimensional feature map of each feature channel is compressed into one real number, giving feature maps fx1' and fx2' whose number of channels equals that of the input semantic feature maps. fx1' and fx2' characterize the global distribution of the responses over the feature channels and give the layers close to the input a global receptive field. After this compression of the image semantic feature maps, fx1' and fx2' are feature vectors of dimension (C, 1, 1). The compression of the semantic feature map of an image adopted in this embodiment can be realized by the following formula:

fx'(c) = (1 / (H × W)) · Σ_{i=1..H} Σ_{j=1..W} fx(c, i, j)    formula (2)

wherein fx is the semantic feature map of the image, fx' is the feature vector after feature compression, H is the length of the feature map of each channel of fx, and W is the width of the feature map of each channel of fx.
The two feature vectors fx1' and fx2' then pass through two full connections. A full connection can be generated with the linear transformation g(x) = wx + b, where w and b are generated randomly at initialization and are then continuously learned during training. The weight of each feature channel is obtained through the two full connections, and all values are mapped to between 0 and 1 with a sigmoid function. The weight of each feature channel is then up-sampled to obtain the first feature maps att_x1" and att_x2", which are the same size as the semantic feature maps fx1 and fx2; the first feature map represents the weight of each pixel in the feature map. The full connection operation performed on the feature map and the calculation of the first feature map adopted in this embodiment can be realized by the following formulas:

fx" = σ(g(w, fx')) = σ(w2 · δ(w1 · fx'))    formula (3)

att_x" = upsample(fx")    formula (4)

wherein fx' is the input feature map of the full connections, fx" is the feature map obtained after the two full connections, the first full connection uses δ(w1 · fx'), the second full connection uses σ(g(w, fx')), and att_x" is the first feature map obtained by up-sampling fx".
The semantic feature map of the current image is multiplied by the first feature map of the other image of the co-segmentation image pair, so that after the two images have mutually learned the importance of each other's feature channels, a second feature map representing the weighting of the image feature map is obtained.
For example, the semantic feature maps fx1 and fx2 both have size 512 × 16 × 16, the feature vectors fx1' and fx2' obtained by feature compression have size 512 × 1 × 1, and the first feature maps att_x1" and att_x2" obtained from them each have size 512 × 16 × 16.

In this embodiment, the second feature maps fatt_x2 (C, H, W) and fatt_x1 (C, H, W) can be realized by the following formula:

fatt_x1 = fx1 ⊗ att_x2",  fatt_x2 = fx2 ⊗ att_x1"    formula (5)

wherein fx is the semantic feature map of one image of the co-segmentation image pair, att_x" is the first feature map of the other image, and ⊗ denotes element-wise multiplication.
The process of obtaining, from the semantic feature maps, the weight calibration of one image by the feature map of the other image is the attention calculation process. The computed first feature map represents the weights of the feature channels of one image, i.e. the importance of its feature channels. One attention operation is performed on the input image pair, and the first feature map of one image is applied to the other image, so that the co-segmentation model learns that, if two images are similar, their important feature channels are the same; in other words, the feature map of the foreground object of the other image is weighted once. A second attention operation can then be performed, so that the co-segmentation model learns another group of parameters to perceive the important feature channels of the weighted semantic feature maps. Through the two attention operations, the common area of the two images, i.e. the similar target objects in the two images, can be captured better, and a better co-segmentation effect is obtained.
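The chaining of the two attention operations across an image pair can be sketched as follows, reusing the ChannelAttention module sketched earlier; the exact module boundaries and parameter sharing are assumptions.

```python
import torch

def co_attention(fx1, fx2, att_op1, att_op2):
    """fx1, fx2: semantic feature maps (B, C, H, W); att_op1, att_op2: channel-attention modules."""
    # first attention operation: each image is re-weighted by the other's channel importance
    s1 = fx1 * att_op1(fx2)
    s2 = fx2 * att_op1(fx1)
    # second attention operation on the re-weighted ("second") feature maps
    cal1 = s1 * att_op2(s2)
    cal2 = s2 * att_op2(s1)
    return cal1, cal2                        # weight calibration of the two images
```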
S305, carrying out up-sampling on the weight calibration of the current image to obtain the segmentation result of the target image.
Up-sampling the weight calibration of the current image means analysing the low-level features of the image and restoring its original information. Specifically, each up-sampling first enlarges the feature map to twice its size and then applies a convolution with a 3 × 3 kernel, which turns the sparse feature map into a dense one, so that more information is learned and the restoration is closer to the real effect. After 5 rounds of up-sampling and convolution, a weighted feature map with the same size as the original input image is obtained, and finally softmax performs a two-class classification of every pixel in the feature map to obtain the final segmentation result.
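A rough sketch of such a decoder is given below; the intermediate channel widths are assumptions, and only the 2x up-sampling, the 3 × 3 kernels, the five repetitions and the final two-class softmax follow the description above.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, channels=(512, 256, 128, 64, 32, 16)):
        super().__init__()
        layers = []
        for in_c, out_c in zip(channels[:-1], channels[1:]):
            layers += [nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
                       nn.Conv2d(in_c, out_c, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)                   # 5 rounds of 2x up-sampling + 3x3 conv
        self.classifier = nn.Conv2d(channels[-1], 2, kernel_size=1)

    def forward(self, calibrated):                           # (B, 512, 16, 16) weight calibration
        logits = self.classifier(self.body(calibrated))      # (B, 2, 512, 512)
        return logits.softmax(dim=1).argmax(dim=1)           # per-pixel two-class decision
```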
S306, calculating the loss between the segmentation result and the real result of the current image.
The loss between the segmentation result and the real result is calculated with the cross entropy, which can be realized by the following formula:

Cls_loss = -(1/n) Σ [ y_gt · log(y_p) + (1 - y_gt) · log(1 - y_p) ]    formula (6)

wherein Cls_loss is the loss between the segmentation result and the real result, y_p is the segmentation result, y_gt is the real result, and n is the number of rounds of the above processing performed on a group of images, each round also being called an epoch.
And S307, reversely transmitting the loss of the current image to the co-segmentation model.
Calculating loss between the obtained image segmentation result and the real result of the image, reversely transmitting the loss corresponding to the current image to the co-segmentation model, and updating parameters.
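A minimal sketch of this training step, assuming a model that outputs the foreground probability map and an externally chosen optimizer:

```python
import torch.nn.functional as F

def train_step(model, optimizer, pair, y_gt):
    y_p = model(pair)                          # predicted foreground probability map, values in (0, 1)
    loss = F.binary_cross_entropy(y_p, y_gt)   # cross entropy between prediction and real result
    optimizer.zero_grad()
    loss.backward()                            # propagate the loss back into the co-segmentation model
    optimizer.step()                           # update the model parameters
    return loss.item()
```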
Illustratively, in the experiment the co-segmentation model was trained on 40 million image pairs of training data for 2 epochs, using 8 Tesla P40 GPUs with a batch size of 64, for a total of 18 hours; the size of the final trained model is 600 MB.
Further, the trained co-segmentation model is tested.
First, a test data set is selected.
The test dataset includes both image classes that are present in the training dataset and image classes that are not present in the training dataset. For example, ImageNet is an image dataset whose images are manually labeled with objects; 40 classes of the ImageNet dataset are selected as the test dataset, of which 20 classes are the same as the 20 MSCOCO classes selected for the training dataset and 20 classes differ from the class labels of the MSCOCO classes selected for the training dataset.
The test dataset is then processed according to the method shown in fig. 2 or fig. 3, resulting in a co-segmentation result for each image in the test dataset.
For the evaluation of the test results, four evaluation indexes are selected: the intersection-over-union (IOU) of the real label and the prediction result, the precision of the foreground pixels, the recall of the foreground pixels, and the harmonic mean of precision and recall (f1 score). Precision is the proportion of correctly predicted pixels among the predicted foreground pixels, and recall is the proportion of correctly predicted foreground pixels among the real foreground pixels.
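These four indexes can be computed from a predicted binary mask and the ground-truth binary mask as sketched below (0/1 arrays of the same shape; the small epsilon only avoids division by zero).

```python
import numpy as np

def evaluate(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    iou = tp / (tp + fp + fn + eps)                    # intersection over union
    precision = tp / (tp + fp + eps)                   # correct pixels among predicted foreground
    recall = tp / (tp + fn + eps)                      # correct pixels among real foreground
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou, precision, recall, f1
```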
The results of the four evaluation indexes obtained by testing on the 40 ImageNet classes are as follows: the IOU is 0.5917, precision is 0.8246, recall is 0.6709, and the f1 score is 0.7398.
Fig. 6 is a schematic structural diagram of a first embodiment of an attention-based image co-segmentation apparatus provided in the present invention. As shown in fig. 6, the attention-based image co-segmentation apparatus of this embodiment includes: an acquisition module 11, a determination module 12 and a co-segmentation module 13.
The acquiring module 11 is configured to acquire a group of images to be segmented, where the group of images to be segmented includes a plurality of images;
the determining module 12 is configured to determine N images to be segmented of a target image in the image group to be segmented, and form a co-segmented image pair with each image to be segmented, where N is greater than or equal to 2 and less than or equal to 8;
the co-segmentation module 13 is configured to:
the following processing is performed on each co-segmented image pair by the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed by the target image, wherein the attention feature map is the weight of a feature channel with the same size as the semantic feature map of the image to be segmented;
determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image; multiplying the predicted weight by the semantic features of the target image to obtain the weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
Optionally, the determining module 12 is specifically configured to:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all images in the image group to be segmented;
and selecting N images closest to the characteristics of the target image from the images of the image group to be segmented as N images to be segmented.
Optionally, the co-segmentation module 13 performs an operation on the semantic feature maps of the two images of each co-segmented image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, including:
for each co-segmentation image pair, performing feature compression processing on the semantic feature maps according to feature channels of the semantic feature maps of the two images of the co-segmentation image pair to obtain a first feature vector;
performing full connection on the feature vectors twice to obtain a first weight of each feature channel of the semantic feature map;
the weight of each characteristic channel is up-sampled to obtain a first characteristic diagram, and the size of the first characteristic diagram is the same as that of the semantic characteristic diagram;
multiplying a first feature map corresponding to one image in each co-segmentation image pair with a semantic feature map of the other image to obtain a second feature map corresponding to each image;
according to the feature channel of the second feature map, performing feature compression processing on the second feature map to obtain a second feature vector;
performing full connection on the second feature vector twice to obtain a second weight of each feature channel of the second feature map;
the second weight of each feature channel is up-sampled to obtain a third feature map, the size of the third feature map is the same as that of the semantic feature map, and the third feature map is an attention feature map of the co-segmentation image pair;
the common segmentation module determines an average value of features of attention feature maps of N images to be segmented to obtain a prediction weight of the attention feature map of the target image, and the common segmentation module comprises the following steps:
acquiring attention feature maps of N images to be segmented of the target image obtained through calculation;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
Optionally, the attention-mechanism-based image co-segmentation device further includes a training module 14.
The training module 14 is configured to:
preprocessing a training image set to obtain a labeling object of each image in the training image set;
obtaining a real result of each image according to the labeling object;
determining N images to be segmented of each image in a training image set to form N co-segmented image pairs of each image in the training image set;
inputting each co-segmentation image pair in the training image set into the established co-segmentation model and performing the following processing: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibration of the two images of the co-segmentation image pair;
attention calculation, including: according to the feature channel of the semantic feature map, performing feature compression processing on the semantic feature map to obtain a feature vector; performing full connection on the feature vectors twice to obtain the weight of each feature channel; the weight of each characteristic channel is up-sampled to obtain a first characteristic diagram, and the size of the first characteristic diagram is the same as that of the semantic characteristic diagram; multiplying a first feature map corresponding to one image in each co-segmentation image pair with a semantic feature map of the other image to obtain a second feature map corresponding to each image; the second feature map corresponding to the two images of each co-segmentation image pair obtained by the first attention operation is a semantic feature map of the second attention operation;
up-sampling the weight calibration of the current image to obtain a segmentation result of the current image;
calculating the loss between the segmentation result and the real result of the current image;
back-propagating the loss of the current image to the co-segmentation model.
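A single training step consistent with the description above might be organized as in the following sketch; the binary cross-entropy loss, the optimizer interface, and all variable names are assumptions made for illustration, since this disclosure does not fix the loss function.

```python
# Hypothetical training step; the loss function and optimizer are assumptions.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, current_img, paired_img, real_result):
    """current_img, paired_img: the two images of one co-segmentation image pair.
    real_result: binary ground-truth mask of the current image."""
    optimizer.zero_grad()
    # forward pass: the model returns the up-sampled segmentation of the current image
    seg = model(current_img, paired_img)
    # loss between the segmentation result and the real result of the current image
    loss = F.binary_cross_entropy_with_logits(seg, real_result)
    # back-propagate the loss of the current image into the co-segmentation model
    loss.backward()
    optimizer.step()
    return loss.item()
```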
Optionally, the training module 14 preprocessing the training image set to obtain the annotated object of each image in the training image set includes:
when an image in the training image set comprises a plurality of objects, selecting one of the objects as the annotated object according to the sizes of the objects;
obtaining the real result of each image according to the annotated object includes:
cropping the annotated object of each image with its minimum bounding box, and filling the regions that do not contain the annotated object with background pixel values to obtain a binary image of the annotated object, wherein the binary image is the real result of the image.
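The preprocessing of one training image can be sketched as follows; the assumption that regions outside the annotated object are filled with zero-valued background pixels, and the function name make_real_result, are illustrative choices rather than details stated in this disclosure.

```python
# Hypothetical preprocessing sketch; the background fill value (0) is an assumption.
import numpy as np

def make_real_result(image_shape, annotation_mask):
    """annotation_mask: boolean H x W array marking the annotated object.
    Returns the binary image ("real result"): the object region inside its
    minimum bounding box is kept, all other regions are filled as background."""
    ys, xs = np.nonzero(annotation_mask)
    y0, y1 = ys.min(), ys.max() + 1                  # minimum bounding box (rows)
    x0, x1 = xs.min(), xs.max() + 1                  # minimum bounding box (columns)
    result = np.zeros(image_shape[:2], dtype=np.uint8)   # background fill
    result[y0:y1, x0:x1] = annotation_mask[y0:y1, x0:x1].astype(np.uint8)
    return result
```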
The apparatus of this embodiment may be configured to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a block diagram of an image co-segmentation apparatus based on an attention mechanism according to the present invention. As shown in Fig. 7, the attention-mechanism-based image co-segmentation apparatus 300 includes: a memory 32 and a processor 31, wherein the memory 32 stores computer instructions, and the processor 31 executes the computer instructions to perform the method steps of the embodiments shown in Fig. 2 to Fig. 4 above.
It should be understood that, in the above embodiments, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The aforementioned memory may be a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk, or a solid-state disk. The steps of the methods disclosed in the embodiments of the present invention may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor.
Those of ordinary skill in the art will understand that all or a portion of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when the program is executed, the steps of the above method embodiments are performed.
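For completeness, the selection of the N images to be segmented (choosing, from the image group, the N images whose features are closest to those of the target image, as recited in claims 2 and 7 below) might be sketched as follows; the Euclidean distance metric, the pre-pooled feature vectors, and all names are assumptions, since the disclosure only requires selecting the N nearest images by feature distance.

```python
# Hypothetical sketch of the image-selection step; the distance metric is an assumption.
import torch

def select_nearest_images(target_feat, group_feats, n):
    """target_feat: feature vector of the target image, shape (D,).
    group_feats: feature vectors of all images in the group, shape (M, D).
    Returns the indices of the N images whose features are closest to the target."""
    dists = torch.norm(group_feats - target_feat.unsqueeze(0), dim=1)  # feature distances
    order = torch.argsort(dists)                                       # ascending distance
    order = order[dists[order] > 0]     # drop the target image itself if it is in the group
    return order[:n]
```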
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. An attention mechanism-based image co-segmentation method is characterized by comprising the following steps:
acquiring a group of images to be segmented, wherein the group of images to be segmented comprises a plurality of images;
determining N images to be segmented of a target image in the image group to be segmented, and forming a co-segmentation image pair from the target image and each of the images to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
inputting the co-segmentation image pairs into a co-segmentation model, and performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, wherein the attention feature map is a feature-channel weight map with the same size as the semantic feature map of the image to be segmented;
determining an average value of the features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image; and multiplying the prediction weight by the semantic features of the target image to obtain a weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
2. The method according to claim 1, wherein determining N images to be segmented of the target image in the group of images to be segmented comprises:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting, from the images of the image group to be segmented, the N images whose features are closest to the features of the target image as the N images to be segmented.
3. The method according to claim 1, wherein the calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image comprises:
for each co-segmentation image pair, performing feature compression on the semantic feature map of each of the two images of the co-segmentation image pair along its feature channels to obtain a first feature vector;
performing full connection twice on the first feature vector to obtain a first weight of each feature channel of the semantic feature map;
up-sampling the first weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map;
multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image;
performing feature compression on the second feature map along its feature channels to obtain a second feature vector;
performing full connection twice on the second feature vector to obtain a second weight of each feature channel of the second feature map;
up-sampling the second weight of each feature channel to obtain a third feature map, wherein the size of the third feature map is the same as that of the semantic feature map, and the third feature map is the attention feature map of the co-segmentation image pair;
wherein the determining an average value of the features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image comprises:
acquiring the attention feature maps, obtained by the above calculation, of the N images to be segmented of the target image;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
4. The method according to any one of claims 1-3, wherein, before the inputting of the co-segmentation image pairs into the co-segmentation model, the method further comprises:
preprocessing a training image set to obtain an annotated object of each image in the training image set;
obtaining a real result of each image according to the annotated object;
determining N images to be segmented of each image in the training image set, and forming N co-segmentation image pairs of each image in the training image set;
inputting each co-segmentation image pair of the training image set into the established co-segmentation model and performing the following processing: respectively extracting features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing an attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibrations of the two images of the co-segmentation image pair;
wherein the attention operation comprises: performing feature compression on the semantic feature map along its feature channels to obtain a feature vector; performing full connection twice on the feature vector to obtain a weight of each feature channel; up-sampling the weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map; multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image; and the second feature maps of the two images of each co-segmentation image pair obtained in the first attention operation serve as the semantic feature maps of the second attention operation;
up-sampling the weight calibration of the current image to obtain a segmentation result of the current image;
calculating a loss between the segmentation result of the current image and the real result;
and back-propagating the loss of the current image to the co-segmentation model.
5. The method according to claim 4, wherein the preprocessing of the training image set to obtain the annotated object of each image in the training image set comprises:
when an image in the training image set comprises a plurality of objects, selecting one of the objects as the annotated object according to the sizes of the objects;
and the obtaining of the real result of each image according to the annotated object comprises:
cropping the annotated object of each image with its minimum bounding box, and filling the regions that do not contain the annotated object with background pixel values to obtain a binary image of the annotated object, wherein the binary image is the real result of the image.
6. An attention-based image co-segmentation apparatus comprising:
the device comprises an acquisition module, a segmentation module and a segmentation module, wherein the acquisition module is used for acquiring a group of images to be segmented, and the group of images to be segmented comprises a plurality of images;
the determining module is used for determining N images to be segmented of the target image in the image group to be segmented, and forming a co-segmentation image pair by the target image and each image to be segmented, wherein N is greater than or equal to 2 and less than or equal to 8;
the co-segmentation module is to:
performing the following processing on each co-segmentation image pair through the co-segmentation model: respectively extracting the features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image;
calculating the semantic feature maps of the two images of each co-segmentation image pair to obtain the attention feature map of the image to be segmented of the image pair formed with the target image, wherein the attention feature map is a feature-channel weight map with the same size as the semantic feature map of the image to be segmented;
determining an average value of the features of the attention feature maps of the N images to be segmented to obtain a prediction weight of the attention feature map of the target image; and multiplying the prediction weight by the semantic features of the target image to obtain a weight calibration of the target image;
and carrying out up-sampling on the weight calibration of the target image to obtain a segmentation result of the target image.
7. The apparatus of claim 6, wherein the determining module is specifically configured to:
extracting the characteristics of each image in the image group to be segmented;
respectively calculating the distances between the features of the target image and the features of all the images in the image group to be segmented;
and selecting, from the images of the image group to be segmented, the N images whose features are closest to the features of the target image as the N images to be segmented.
8. The apparatus according to claim 6, wherein the co-segmentation module performs an operation on semantic feature maps of two images of each co-segmented image pair to obtain an attention feature map of an image to be segmented of an image pair formed with the target image, and includes:
for each co-segmentation image pair, performing feature compression on the semantic feature map of each of the two images of the co-segmentation image pair along its feature channels to obtain a first feature vector;
performing full connection twice on the first feature vector to obtain a first weight of each feature channel of the semantic feature map;
up-sampling the first weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map;
multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image;
performing feature compression on the second feature map along its feature channels to obtain a second feature vector;
performing full connection twice on the second feature vector to obtain a second weight of each feature channel of the second feature map;
up-sampling the second weight of each feature channel to obtain a third feature map, wherein the size of the third feature map is the same as that of the semantic feature map, and the third feature map is the attention feature map of the co-segmentation image pair;
wherein the co-segmentation module determining an average value of the features of the attention feature maps of the N images to be segmented to obtain the prediction weight of the attention feature map of the target image comprises:
acquiring the attention feature maps, obtained by the above calculation, of the N images to be segmented of the target image;
and determining the average value of the characteristics of the attention characteristic graphs of the N images to be segmented to obtain the prediction weight of the attention characteristic graph of the target image.
9. The apparatus according to any one of claims 6-8, further comprising a training module;
the training module is configured to:
preprocessing a training image set to obtain an annotated object of each image in the training image set; obtaining a real result of each image according to the annotated object;
determining N images to be segmented of each image in the training image set, and forming N co-segmentation image pairs of each image in the training image set;
inputting each co-segmentation image pair of the training image set into the established co-segmentation model and performing the following processing: respectively extracting features of the two images of each co-segmentation image pair to obtain a semantic feature map of each image, and performing an attention operation twice on the semantic feature maps of the two images of each co-segmentation image pair to obtain weight calibrations of the two images of the co-segmentation image pair;
wherein the attention operation comprises: performing feature compression on the semantic feature map along its feature channels to obtain a feature vector; performing full connection twice on the feature vector to obtain a weight of each feature channel; up-sampling the weight of each feature channel to obtain a first feature map, and the size of the first feature map is the same as that of the semantic feature map; multiplying the first feature map corresponding to one image in each co-segmentation image pair by the semantic feature map of the other image to obtain a second feature map corresponding to each image; and the second feature maps of the two images of each co-segmentation image pair obtained in the first attention operation serve as the semantic feature maps of the second attention operation;
up-sampling the weight calibration of the current image to obtain a segmentation result of the current image;
calculating a loss between the segmentation result of the current image and the real result;
and back-propagating the loss of the current image to the co-segmentation model.
10. The apparatus according to claim 9, wherein the training module preprocessing the training image set to obtain the annotated object of each image in the training image set comprises:
when an image in the training image set comprises a plurality of objects, selecting one of the objects as the annotated object according to the sizes of the objects;
and the obtaining of the real result of each image according to the annotated object comprises:
cropping the annotated object of each image with its minimum bounding box, and filling the regions that do not contain the annotated object with background pixel values to obtain a binary image of the annotated object, wherein the binary image is the real result of the image.
11. An attention-based image co-segmentation apparatus comprising:
a memory for storing processor-executable instructions;
a processor, configured to execute the instructions to implement the method according to any one of claims 1-5.
12. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the attention-based image co-segmentation method as claimed in any one of claims 1 to 5.
CN201911147678.1A 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism Pending CN111179270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911147678.1A CN111179270A (en) 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911147678.1A CN111179270A (en) 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism

Publications (1)

Publication Number Publication Date
CN111179270A true CN111179270A (en) 2020-05-19

Family

ID=70647323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911147678.1A Pending CN111179270A (en) 2019-11-21 2019-11-21 Image co-segmentation method and device based on attention mechanism

Country Status (1)

Country Link
CN (1) CN111179270A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170330059A1 (en) * 2016-05-11 2017-11-16 Xerox Corporation Joint object and object part detection using web supervision
CN110197213A (en) * 2019-05-21 2019-09-03 北京航空航天大学 Image matching method, device and equipment neural network based
CN110163878A (en) * 2019-05-28 2019-08-23 四川智盈科技有限公司 A kind of image, semantic dividing method based on dual multiple dimensioned attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HONG CHEN et al.: "Semantic Aware Attention Based Deep Object Co-segmentation", arXiv *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022032824A1 (en) * 2020-08-10 2022-02-17 中国科学院深圳先进技术研究院 Image segmentation method and apparatus, device, and storage medium
CN113239943A (en) * 2021-05-28 2021-08-10 北京航空航天大学 Three-dimensional component extraction and combination method and device based on component semantic graph
CN113627258A (en) * 2021-07-12 2021-11-09 河南理工大学 Apple leaf pathological detection method
CN113627258B (en) * 2021-07-12 2023-09-26 河南理工大学 Apple leaf pathology detection method
US20230131977A1 (en) * 2021-10-22 2023-04-27 The Boeing Company Method For Large Area Inspection

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
WO2020228446A1 (en) Model training method and apparatus, and terminal and storage medium
CN109960742B (en) Local information searching method and device
CN108427927B (en) Object re-recognition method and apparatus, electronic device, program, and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN110222718A (en) The method and device of image procossing
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN111144425B (en) Method and device for detecting shot screen picture, electronic equipment and storage medium
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN111814821A (en) Deep learning model establishing method, sample processing method and device
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN115272691A (en) Training method, recognition method and equipment for steel bar binding state detection model
CN112668675B (en) Image processing method and device, computer equipment and storage medium
CN116503399B (en) Insulator pollution flashover detection method based on YOLO-AFPS
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
CN112183303A (en) Transformer equipment image classification method and device, computer equipment and medium
CN114155388B (en) Image recognition method and device, computer equipment and storage medium
CN111652246B (en) Image self-adaptive sparsization representation method and device based on deep learning
CN114445916A (en) Living body detection method, terminal device and storage medium
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN111625672B (en) Image processing method, image processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination