CN117788826A - Image pair collaborative segmentation method and system based on pyramid depth convolution and electronic equipment - Google Patents


Info

Publication number
CN117788826A
CN117788826A (application CN202311851468.7A)
Authority
CN
China
Prior art keywords
convolution
pyramid
feature
image pair
mapping
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311851468.7A
Other languages
Chinese (zh)
Inventor
陈加
袁科
卫阳钰
庞世燕
董石
田元
刘智
罗恒
肖克江
蔡欣芷
陈迪
童名文
左明章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central China Normal University
Original Assignee
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central China Normal University filed Critical Central China Normal University
Priority to CN202311851468.7A priority Critical patent/CN117788826A/en
Publication of CN117788826A publication Critical patent/CN117788826A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image pair collaborative segmentation method, system and electronic device based on pyramid depth convolution, comprising the following steps: S1, acquire an image pair, preprocess it and unify the sizes; S2, input the image pair into a twin encoder for high-level semantic feature extraction to obtain high-level semantic feature maps; S3, apply pooling operations to the high-level semantic feature maps to construct a feature pyramid and obtain the corresponding enhanced maps; S4, concatenate the feature maps f_A and f_B with the corresponding enhanced maps to obtain fused feature maps, and generate common-object probability maps by twin-decoder upsampling; S5, use upsampling to obtain segmentation masks with the same resolution as the original images. The invention can alleviate the spatial multi-scale problem to a certain extent, reduce computational complexity, and simultaneously highlight the position information of objects with the same semantics.

Description

Image pair collaborative segmentation method and system based on pyramid depth convolution and electronic equipment
Technical Field
The invention belongs to the technical field of image segmentation, and particularly relates to an image pair collaborative segmentation method, system and electronic equipment based on pyramid depth convolution.
Background
With the improvement of computer performance, deep learning has developed rapidly, and much research has abandoned hand-crafted features entirely. A network architecture using a twin encoder-decoder was designed to address the image co-segmentation task. In the encoder, this architecture uses a convolutional neural network to obtain feature maps of the images, and introduces a cross-correlation layer for feature matching to obtain correspondence maps. The decoder concatenates the feature maps and correspondence maps for upsampling until the original image size is restored, yielding the co-segmentation result. One method using this architecture introduces a semantic-aware attention mechanism with three structures: a channel attention module, a hybrid channel attention module, and a channel-spatial attention module. The attention mechanism focuses on high-response feature channels and suppresses low-response feature channels.
Deep-learning-based image collaborative segmentation needs segmentation cues as the basis for obtaining co-segmentation results. Segmentation cues are the basis for identifying the objects to be segmented from an image, and include object cues and related cues. Object cues refer to the "object" features within a single image, while related cues are the features of the co-segmented object shared across multiple images. Unlike traditional semantic segmentation, which only requires feature extraction of objects contained in a single image, image collaborative segmentation assumes that the image pair to be segmented contains at least one object with the same semantics; this shared-semantics constraint is cue information that is difficult to exploit. How to use this cue information to construct a simple and efficient model for collaborative feature extraction and related-cue extraction is the key problem that the image collaborative segmentation task most needs to solve.
In practice, many images contain occlusion relations, background clutter, or small foreground targets. Such images pose a great challenge for image-understanding tasks, especially for collaborative segmentation, which relies only on the semantic information shared between images. Deep learning methods can obtain high-level semantic features, but there is still large room for optimization in the accuracy and inference speed of the results. How to prevent complex images from degrading accuracy, and thereby improve the overall effect of image collaborative segmentation, is also one of the main problems to be solved in this field.
Image co-segmentation requires processing two or more images, which leads to larger network architectures. Some approaches compress or simplify the network to mitigate this, but at the cost of reduced performance. How to reasonably optimize the network model and achieve a good balance between performance and resources is also one of the problems to be solved in the field of image collaborative segmentation. The prior art in this field lacks the capability to mine the semantic information shared between image pairs with high precision, which may lead to inaccurate segmentation results, or even over- or under-segmentation. The prior art also does not handle images containing occlusion, background clutter, or small foreground targets well enough. Finally, the prior art generally only accepts fixed-size input, while in practical applications it is difficult to ensure that the images to be processed all share the same size; this limits flexibility and makes real-world deployment difficult.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a pyramid depth convolution-based image pair collaborative segmentation method, system and electronic device, which can alleviate the spatial multi-scale problem to a certain extent, and then perform semantic feature correlation calculation per channel through depthwise convolution, highlighting the position information of objects with the same semantics while reducing computational complexity.
In order to achieve the above purpose, the technical scheme of the method of the invention is as follows:
an image pair collaborative segmentation method based on pyramid depth convolution comprises the following steps:
S1, acquire an image pair I_A and I_B, preprocess them and unify their sizes;
S2, input I_A and I_B into a twin encoder for high-level semantic feature extraction to obtain high-level semantic feature maps f_A and f_B;
S3, apply pooling operations to f_A and f_B to construct feature pyramids and obtain the corresponding enhanced maps F_A and F_B;
S4, concatenate the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, and generate common-object probability maps P_A and P_B by twin-decoder upsampling;
S5, use upsampling to obtain segmentation masks M_A and M_B with the same resolution as the original images.
Further, the twin encoder in step S2 has a plurality of convolution layers and pooling layers, and may use different backbones. In one example, the network structure and pre-trained weights of VGG16 are adopted (except for the last convolution layer), and a ReLU activation function is added after each convolution layer.
Further, in step S3, pooling layers with different strides are applied to the high-level semantic features f_A and f_B to construct three-layer multi-scale feature pyramids f_A' and f_B'.
Further, each channel of the i-th layer of the pyramid feature f_B' acts as a convolution kernel of a depthwise convolution over the corresponding channel of f_A, yielding a feature map with i×c channels; this i×c-channel feature map is then added pixel-wise along the channels and averaged, after which a 1×1 convolution is applied to obtain a feature map with c channels. The output is denoted F_A, with the mathematical description:
F_A = Conv_1x1( (1/m) Σ_{i=1}^{m} L_2(f_B'^(i)) ⊛ f_A )
where L_2(·) denotes L_2 normalization, ⊛ denotes the depthwise convolution, m is the number of pyramid layers, and Conv_1x1(·) denotes a 1×1 convolution layer with a ReLU activation function.
Further, each channel of the i-th layer of the pyramid feature f_A' acts as a convolution kernel of the depthwise convolution over the corresponding channel of f_B, yielding a feature map with i×c channels; this i×c-channel feature map is then added pixel-wise along the channels and averaged, after which a 1×1 convolution is applied to obtain a feature map with c channels. The output is denoted F_B, with the mathematical description:
F_B = Conv_1x1( (1/m) Σ_{i=1}^{m} L_2(f_A'^(i)) ⊛ f_B )
where L_2(·) denotes L_2 normalization, ⊛ denotes the depthwise convolution, m is the number of pyramid layers, and Conv_1x1(·) denotes a 1×1 convolution layer with a ReLU activation function.
Further, the twin decoder in step S4 is used to extract the foreground masks of the common object of the two input images. First, the enhanced maps F_A and F_B from step S3 are connected with the corresponding high-level semantic feature maps f_B and f_A along the channel dimension to obtain feature maps F_AB and F_BA with twice as many channels as the original features. A convolution layer is then used to fuse the input feature maps while reducing the number of channels of the high-level semantic feature maps. The fused 256-channel feature map is then upsampled to the output size of the second convolution block of the encoding stage, after which the low-level feature map output by the second convolution block of the twin encoder is connected to the previously upsampled high-level semantic feature map. Finally, the twin decoder uses two 3×3 convolution layers and one 1×1 convolution layer without an activation function to obtain two probability maps P_A and P_B.
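One such decoding step can be sketched as follows. This is a NumPy illustration only: the channel counts, the choice of a 1×1 convolution for the fusion layer, and nearest-neighbour upsampling are assumptions made for brevity, not the patent's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

def upsample2(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) tensor."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decode_step(F_enh, f_cross, w_fuse):
    """Concatenate along channels (2c total), fuse with a 1x1 conv + ReLU
    that reduces the channel count, then upsample by 2."""
    x = np.concatenate([F_enh, f_cross], axis=0)              # (2c, H, W)
    x = np.maximum(np.einsum('oc,chw->ohw', w_fuse, x), 0.0)  # 1x1 conv + ReLU
    return upsample2(x)

c, h = 8, 4
F_A, f_b = rng.random((c, h, h)), rng.random((c, h, h))  # enhanced map + cross feature map
w_fuse = rng.standard_normal((c // 2, 2 * c)) * 0.1      # 16 -> 4 channels
out = decode_step(F_A, f_b, w_fuse)                      # shape (4, 8, 8)
```

In the real network this concatenate-fuse-upsample pattern is repeated until the skip connection from the encoder's second convolution block is merged in.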
In a second aspect, the present invention provides an image pair collaborative segmentation system based on pyramid depth convolution, including:
a preprocessing module, for acquiring an image pair I_A and I_B and preprocessing them to unify their sizes;
a twin encoding module, for inputting I_A and I_B into a twin encoder for high-level semantic feature extraction to obtain high-level semantic feature maps f_A and f_B;
a feature pyramid module, for applying pooling operations to f_A and f_B to construct feature pyramids and obtain the corresponding enhanced maps F_A and F_B;
a twin decoding module, for concatenating the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, and generating common-object probability maps P_A and P_B by twin-decoder upsampling;
a segmentation mask acquisition module, for using upsampling to obtain segmentation masks M_A and M_B with the same resolution as the original images.
The image pair collaborative segmentation system based on pyramid depth convolution is used for executing the steps in the image pair collaborative segmentation method based on pyramid depth convolution.
In a third aspect, the present invention provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the pyramid depth convolution-based image pair co-segmentation method when executing the program.
Compared with the prior art, the invention has the following beneficial effects:
the pyramid depth convolution module is designed, the position of the high response characteristic in the channel is highlighted, the problem that the high-precision mining capability of the same semantic information between the image pairs is insufficient in the prior art is solved, and the performance of collaborative segmentation of the images is improved.
The twin encoder of the method can process two images of different resolutions during testing and actual segmentation, making the method more applicable in practice than the prior art.
Drawings
Fig. 1 is a detailed schematic diagram of a network structure according to the present invention.
Fig. 2 is a schematic diagram of a pyramid depth convolution module according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which are obtained by a person of ordinary skill in the art without making any inventive effort, are within the scope of the present invention based on the embodiments of the present invention.
Example 1
The embodiment provides an image pair collaborative segmentation method based on pyramid depth convolution, which is characterized by comprising the following steps:
S1, acquire an image pair I_A and I_B, preprocess them and unify their sizes;
S2, input I_A and I_B into a twin encoder for high-level semantic feature extraction to obtain high-level semantic feature maps f_A and f_B;
S3, apply pooling operations to f_A and f_B to construct feature pyramids and obtain the corresponding enhanced maps F_A and F_B;
S4, concatenate the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, and generate common-object probability maps P_A and P_B by twin-decoder upsampling;
S5, use upsampling to obtain segmentation masks M_A and M_B with the same resolution as the original images.
In the following, with reference to specific embodiments, the invention designs an image pair collaborative segmentation method based on pyramid depth convolution. The main innovation is the Pyramid Depth Convolution (PDC) module, which first applies pooling operations to the input high-level semantic features to construct a feature pyramid. The pooling operations obtain multi-scale features with different receptive fields without introducing additional parameters, so the method can alleviate the spatial multi-scale problem to a certain extent; semantic feature correlation is then computed per channel by depthwise convolution, reducing computational complexity while highlighting the position information of objects with the same semantics.
The method mainly explores the correlation of the same semantic object in the image pair from the perspective of feature channels. Since different channels reflect different features, the method tries to realize interaction among the channels with a simple operation so as to highlight the positions of high-response features within the channels, thereby improving segmentation performance. The network framework is shown in figure 1; the method, implemented end to end, can segment foreground objects with the same semantics from the two input images. In training, the network input is an image pair I_A and I_B, which are uniformly resized to 512×512 after preprocessing. I_A and I_B are processed by the twin encoder to obtain the corresponding high-level semantic feature maps f_A and f_B. f_A and f_B are then the input of the PDC module, the pooling-pyramid depth convolution module proposed by the method, whose outputs are F_A and F_B. Then F_A with f_B, and F_B with f_A, are respectively concatenated into the high-level feature maps F_AB and F_BA as input to the twin decoder, yielding the segmentation masks M_A and M_B.
The network structure of the method mainly comprises three parts: (1) A twin encoder. Its input is the image pair I_A and I_B; its output is the high-level semantic feature maps f_A and f_B obtained through the backbone network. (2) The PDC module. The pyramid depth convolution module designed by the method processes the high-level semantic feature maps f_A and f_B to obtain the corresponding enhanced maps F_A and F_B. (3) A twin decoder. The decoder first concatenates the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, then generates the common-object probability maps P_A and P_B by twin-decoder upsampling.
The method completes the image co-segmentation flow as follows:
1. Extract features from the image pair with the twin encoder.
2. Input the features extracted by the twin encoder into the PDC module, which performs feature interaction, converting the independent feature maps into enhanced features containing mutual information.
3. Input the enhanced features into the twin decoder, which decodes the enhanced features obtained from the PDC module to obtain the final segmentation masks.
Encoding module. The method uses a twin encoder structure as the encoding module; the twin encoder simultaneously extracts the high-level semantic features of the two images I_A and I_B. The encoder network has 14 convolution layers and 4 pooling layers; except for the last convolution layer, the parameters of the first 13 convolution layers are consistent with VGG16, and a ReLU activation function is added after each convolution layer. During training, all convolutions are 3×3 with stride 1. The twin encoder's input during training is two 512×512 RGB images, and its output is two 512-channel feature maps f_A and f_B of size 32×32. Note that in testing and actual segmentation tasks the method can process two images of different resolutions; the fixed input size during training results from the data augmentation used.
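The "twin" (Siamese) property above means one set of weights is applied to both images of the pair. The sketch below illustrates this with a tiny NumPy stand-in for the VGG16 backbone; the layer sizes, the `conv3x3`/`maxpool2` helpers, and the 16×16 inputs are illustrative assumptions, not the patent's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv3x3(x, w):
    """'Same' 3x3 convolution, stride 1: x is (Cin, H, W), w is (Cout, Cin, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(h):
        for j in range(wd):
            # contract the (Cin, 3, 3) patch against every output filter
            out[:, i, j] = np.tensordot(w, xp[:, i:i + 3, j:j + 3], axes=3)
    return out

def maxpool2(x):
    """2x2 max pooling with stride 2 on (C, H, W)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def encode(img, weights):
    """Shared encoder: (conv3x3 -> ReLU -> 2x2 max-pool) per stage."""
    x = img
    for w in weights:
        x = np.maximum(conv3x3(x, w), 0.0)  # ReLU after each conv, as in the text
        x = maxpool2(x)
    return x

# The *same* weight list is applied to both images: that is the twin property.
weights = [rng.standard_normal((4, 3, 3, 3)) * 0.1,
           rng.standard_normal((8, 4, 3, 3)) * 0.1]
img_a, img_b = rng.random((3, 16, 16)), rng.random((3, 16, 16))
f_a, f_b = encode(img_a, weights), encode(img_b, weights)
```

Because both branches share `weights`, the same semantic content in either image maps to the same feature channels, which is what the later correlation step relies on.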
The second part of the method is the pyramid depth convolution module, whose main purpose is to calculate the correlation between the high-level semantic features f_A and f_B. The input of the module is the high-level semantic feature maps f_A and f_B output by the encoder. First, the pyramid features corresponding to f_A and f_B are constructed: the method selects 3×3 pooling layers with strides of 1, 2 and 3 respectively to process the high-level semantic features f_A and f_B, constructing three-layer feature pyramids f_A' and f_B'. A schematic diagram of the PDC module is shown in figure 2.
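A minimal sketch of this pyramid construction in plain NumPy, assuming average pooling (the patent does not state which pooling operation is used) with the 3×3 windows and strides 1, 2 and 3 described above, plus an assumed padding of 1:

```python
import numpy as np

def avg_pool2d(x, k, stride, pad):
    """Average-pool each channel of x (C, H, W) with a k x k window."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (h + 2 * pad - k) // stride + 1
    ow = (w + 2 * pad - k) // stride + 1
    out = np.empty((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[:, i, j] = xp[:, i * stride:i * stride + k,
                              j * stride:j * stride + k].mean(axis=(1, 2))
    return out

def build_pyramid(f, k=3, strides=(1, 2, 3), pad=1):
    """Three-layer multi-scale pyramid f' from one feature map (C, H, W)."""
    return [avg_pool2d(f, k, s, pad) for s in strides]

f_a = np.random.rand(8, 32, 32)   # stand-in for the 512-channel encoder output f_A
pyr_a = build_pyramid(f_a)        # layers of size 32x32, 16x16 and 11x11
```

No parameters are introduced by pooling, which is exactly the point made above: the three layers give three receptive-field sizes for free.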
the following is characterized by pyramid f B The 'ith layer' describes the computation of the depth convolution by way of example. By pyramid feature f B Each channel of the ith layer of' acts as a convolution kernel for the depth convolution for f A And carrying out convolution operation on each corresponding channel. Due to f B ' is the pyramid of the i-th layer, the depth convolution does not change the number of channels, and thus a feature map with the number of channels ixc can be obtained. Then, the characteristic map with the channel number of ixc is averaged by adding the channels pixel by pixel, and then a 1x1 convolution is connected to obtain the characteristic map with the channel number of c. The output result is recorded as: f (F) A . The mathematical description is as follows:
F_A = Conv_1x1( (1/m) Σ_{i=1}^{m} L_2(f_B'^(i)) ⊛ f_A )
where L_2(·) denotes L_2 normalization, ⊛ denotes the depthwise convolution, m is the number of pyramid layers, and Conv_1x1(·) denotes a 1×1 convolution layer with a ReLU activation function. F_B is obtained in the same way:
F_B = Conv_1x1( (1/m) Σ_{i=1}^{m} L_2(f_A'^(i)) ⊛ f_B )
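The per-channel correlation just described can be sketched in NumPy as follows. This is an illustration under stated assumptions: 'same'-size padding for every kernel, the pixel-wise average taken over the m pyramid layers, and tiny random tensors standing in for real features (the actual module operates on 512-channel 32×32 maps).

```python
import numpy as np

rng = np.random.default_rng(1)

def l2n(k):
    """L2-normalise one convolution kernel."""
    return k / (np.linalg.norm(k) + 1e-8)

def corr_same(x, k):
    """2-D cross-correlation of x with kernel k, 'same'-size output."""
    kh, kw = k.shape
    xp = np.pad(x, (((kh - 1) // 2, kh // 2), ((kw - 1) // 2, kw // 2)))
    out = np.empty(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * k).sum()
    return out

def pdc(f_a, pyr_b, w1x1):
    """F_A: depthwise-correlate each channel of f_A with the matching channel
    of every pyramid layer of f_B', average the responses pixel-wise over the
    m layers, then apply a 1x1 convolution with ReLU (c channels in and out)."""
    c = f_a.shape[0]
    acc = np.zeros_like(f_a)
    for layer in pyr_b:  # m = len(pyr_b) pyramid layers used as depthwise kernels
        acc += np.stack([corr_same(f_a[ch], l2n(layer[ch])) for ch in range(c)])
    acc /= len(pyr_b)
    return np.maximum(np.einsum('oc,chw->ohw', w1x1, acc), 0.0)

c, h = 4, 8
f_a = rng.random((c, h, h))
pyr_b = [rng.random((c, s, s)) for s in (8, 4, 3)]  # stand-ins for pooled layers of f_B'
w1x1 = rng.standard_normal((c, c)) * 0.1
F_A = pdc(f_a, pyr_b, w1x1)
```

Note that channels never mix inside `corr_same`: interaction across channels happens only in the final 1×1 convolution, which is what keeps the operation cheap.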
The PDC module pools the high-level semantic feature maps to construct pyramid features of different sizes, then uses the feature pyramid as convolution kernels to complete the depthwise convolution and obtain the cross-correlation information between the two images. The depthwise convolution operates per channel, and the channels of a single feature map do not interfere with each other before or after the operation; the PDC module therefore adopts it to convolve within each channel of the high-level semantic feature map. When convolution kernels of different sizes are applied to the same channel, a common semantic object with high response is further activated, and because the PDC module uses the feature pyramid as three depthwise convolution kernels of different sizes, the operation alleviates the multi-scale problem of objects to a certain extent.
The reason for depthwise-convolving one image's features with the other image's feature map as kernels is that, as shown in the paper "Different channels represent different semantics", different feature channels represent different semantic features.
The reason for constructing pyramid features of different sizes in the PDC module is that the same semantic objects in an image pair usually appear at arbitrary sizes and positions, so using a single-size feature map as the convolution kernel would limit the acquisition of related features; the method uses pyramid features of three sizes to match semantic objects of different sizes and positions.
The decoder adopted in the method is also a twin structure with shared parameters. The decoding module extracts the foreground masks of the common object of the two input images. The PDC results F_A and F_B are connected with the corresponding high-level semantic feature maps f_B and f_A along the channel dimension to obtain 1024-channel feature maps F_AB and F_BA. A convolution layer is then used to fuse the input feature maps into two 256-channel high-level semantic feature maps. The fused 256-channel feature map is then upsampled to the output size of the second convolution block of the encoding stage, after which the low-level feature map output by the encoder's second convolution block is concatenated with the previously upsampled high-level semantic feature map. Low-level features generally contain richer visual information, and adding them appropriately in the decoder effectively improves segmentation. The decoder then uses two 3×3 convolution layers and one 1×1 convolution layer without an activation function to obtain two probability maps P_A and P_B. Finally, upsampling is adopted to obtain segmentation masks M_A and M_B with the same resolution as the original images. Image co-segmentation only needs to separate foreground from background, so it is treated as a binary image-labeling problem, and the standard cross-entropy loss function is used to train the network.
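Since the task is treated as binary labeling trained with the standard cross-entropy loss, the pixel-wise objective can be written as a numerically stable binary cross-entropy on the decoder's raw logits. This is a hedged sketch; the patent does not give the exact loss implementation.

```python
import numpy as np

def bce_with_logits(logits, target):
    """Mean binary cross-entropy computed stably from raw logits z and
    labels y in {0, 1}: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return float(np.mean(np.maximum(logits, 0.0) - logits * target
                         + np.log1p(np.exp(-np.abs(logits)))))

# Zero logits are maximally uncertain: the loss is log(2) for any target.
loss = bce_with_logits(np.zeros((2, 4, 4)), np.ones((2, 4, 4)))
```

The `max/log1p` rearrangement avoids overflow for large-magnitude logits, which matters when the loss is applied to unnormalized decoder outputs.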
Example 2
The embodiment provides an image pair collaborative segmentation system based on pyramid depth convolution, which comprises the following steps:
a preprocessing module, for acquiring an image pair I_A and I_B and preprocessing them to unify their sizes;
a twin encoding module, for inputting I_A and I_B into a twin encoder for high-level semantic feature extraction to obtain high-level semantic feature maps f_A and f_B;
a feature pyramid module, for applying pooling operations to f_A and f_B to construct feature pyramids and obtain the corresponding enhanced maps F_A and F_B;
a twin decoding module, for concatenating the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, and generating common-object probability maps P_A and P_B by twin-decoder upsampling;
a segmentation mask acquisition module, for using upsampling to obtain segmentation masks M_A and M_B with the same resolution as the original images.
The image pair collaborative segmentation system based on pyramid depth convolution is used for executing the steps in the image pair collaborative segmentation method based on pyramid depth convolution.
Example 3
The embodiment provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor realizes the image pair collaborative segmentation method based on pyramid depth convolution when executing the program.
The present invention is not limited to the above embodiments; any modifications, equivalents, and improvements made within the spirit and principle of the invention shall fall within its scope. While preferred embodiments of the present invention have been described, additional variations and modifications may occur to those skilled in the art once they learn of the basic inventive concepts. The appended claims are therefore intended to be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims and the equivalents thereof, the present invention is intended to include such modifications and variations.
Other parts not described in detail are prior art.

Claims (8)

1. The image pair collaborative segmentation method based on pyramid depth convolution is characterized by comprising the following steps of:
S1, acquire an image pair I_A and I_B, preprocess them and unify their sizes;
S2, input I_A and I_B into a twin encoder for high-level semantic feature extraction to obtain high-level semantic feature maps f_A and f_B;
S3, apply pooling operations to f_A and f_B to construct feature pyramids and obtain the corresponding enhanced maps F_A and F_B;
S4, concatenate the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, and generate common-object probability maps P_A and P_B by twin-decoder upsampling;
S5, use upsampling to obtain segmentation masks M_A and M_B with the same resolution as the original images.
2. The image pair collaborative segmentation method based on pyramid depth convolution according to claim 1, wherein the twin encoder in step S2 is provided with a plurality of convolution layers and pooling layers, and the twin encoder adopts different backbones.
3. The image pair collaborative segmentation method based on pyramid depth convolution according to claim 1, characterized in that in step S3, pooling layers with different strides are applied to the high-level semantic features f_A and f_B to construct three-layer multi-scale feature pyramids f_A' and f_B'.
4. The image pair collaborative segmentation method based on pyramid depth convolution according to claim 3, characterized in that each channel of the i-th layer of the pyramid feature f_B' acts as a convolution kernel of the depthwise convolution over the corresponding channel of f_A, yielding a feature map with i×c channels; this i×c-channel feature map is then added pixel-wise along the channels and averaged, after which a 1×1 convolution is applied to obtain a feature map with c channels; the output is denoted F_A, with the mathematical description:
F_A = Conv_1x1( (1/m) Σ_{i=1}^{m} L_2(f_B'^(i)) ⊛ f_A )
where L_2(·) denotes L_2 normalization, ⊛ denotes the depthwise convolution, m is the number of pyramid layers, and Conv_1x1(·) denotes a 1×1 convolution layer with a ReLU activation function.
5. The image pair collaborative segmentation method based on pyramid depth convolution according to claim 4, characterized in that each channel of the i-th layer of the pyramid feature f_A' acts as a convolution kernel of the depthwise convolution over the corresponding channel of f_B, yielding a feature map with i×c channels; this i×c-channel feature map is then added pixel-wise along the channels and averaged, after which a 1×1 convolution is applied to obtain a feature map with c channels; the output is denoted F_B, with the mathematical description:
F_B = Conv_1x1( (1/m) Σ_{i=1}^{m} L_2(f_A'^(i)) ⊛ f_B )
where L_2(·) denotes L_2 normalization, ⊛ denotes the depthwise convolution, m is the number of pyramid layers, and Conv_1x1(·) denotes a 1×1 convolution layer with a ReLU activation function.
6. The image pair collaborative segmentation method based on pyramid depth convolution according to claim 3, characterized in that in step S4 the twin decoder is used to extract the foreground masks of the common object of the two input images: first, the enhanced maps F_A and F_B of step S3 are connected with the corresponding high-level semantic feature maps f_B and f_A along the channel dimension to obtain feature maps F_AB and F_BA with twice as many channels as the original features; convolution layers are then used to fuse the input feature maps while reducing the number of channels of the two high-level semantic feature maps; the fused feature map is then upsampled to the output size of the second convolution block of the encoding stage, after which the low-level feature map output by the second convolution block of the twin encoder is connected to the previously upsampled high-level semantic feature map; the twin decoder then uses two 3×3 convolution layers and one 1×1 convolution layer without an activation function to obtain two probability maps P_A and P_B.
7. An image pair collaborative segmentation system based on pyramid depth convolution, comprising:
a preprocessing module for acquiring an image pair I_A and I_B, preprocessing the image pair I_A and I_B, and unifying their sizes;
a twin encoding module for inputting the image pair I_A and I_B into a twin encoder for high-level semantic feature extraction, obtaining high-level semantic feature maps f_A and f_B;
a feature pyramid module for constructing feature pyramids from the high-level semantic feature maps f_A and f_B using pooling operations, obtaining the corresponding enhanced maps F_A and F_B;
a twin decoding module for cascading the feature maps f_A and f_B with the corresponding enhanced maps F_A and F_B to obtain feature maps F_AB and F_BA, and generating common object probability maps P_A and P_B using the upsampling of a twin decoder;
a segmentation mask acquisition module for segmenting, using upsampling, segmentation masks M_A and M_B with the same resolution as the original images;
wherein the image pair collaborative segmentation system based on pyramid depth convolution is configured to perform the steps in the image pair collaborative segmentation method based on pyramid depth convolution according to any one of claims 1-6.
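The five claimed modules compose a straightforward pipeline, which can be sketched with placeholder callables. All class and parameter names below are hypothetical; the lambdas in the usage example only stand in for the trained encoder, pyramid, decoder, and upsampling modules.

```python
import numpy as np

class CoSegmentationPipeline:
    """Illustrative layout of the claimed system: shared-weight (twin) encoder,
    feature pyramid enhancement, twin decoder, and final mask upsampling."""

    def __init__(self, encoder, pyramid, decoder, upsampler):
        self.encoder, self.pyramid = encoder, pyramid
        self.decoder, self.upsampler = decoder, upsampler

    def run(self, img_a, img_b):
        # Twin encoding: the same encoder weights process both images.
        f_a, f_b = self.encoder(img_a), self.encoder(img_b)
        # Pyramid enhancement of each high-level feature map.
        F_a, F_b = self.pyramid(f_a, f_b), self.pyramid(f_b, f_a)
        # Twin decoding: each enhanced map is paired with the other
        # image's high-level features along the channel dimension.
        p_a = self.decoder(np.concatenate([F_a, f_b], axis=0))
        p_b = self.decoder(np.concatenate([F_b, f_a], axis=0))
        # Upsample the probability maps to input resolution.
        return self.upsampler(p_a), self.upsampler(p_b)
```

Because the two branches share weights, swapping the input order only swaps the two output masks, which is the symmetry a co-segmentation pipeline is expected to have.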
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the image pair collaborative segmentation method based on pyramid depth convolution according to any one of claims 1-6.
CN202311851468.7A 2023-12-28 2023-12-28 Image pair collaborative segmentation method and system based on pyramid depth convolution and electronic equipment Pending CN117788826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311851468.7A CN117788826A (en) 2023-12-28 2023-12-28 Image pair collaborative segmentation method and system based on pyramid depth convolution and electronic equipment


Publications (1)

Publication Number Publication Date
CN117788826A true CN117788826A (en) 2024-03-29

Family

ID=90386926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311851468.7A Pending CN117788826A (en) 2023-12-28 2023-12-28 Image pair collaborative segmentation method and system based on pyramid depth convolution and electronic equipment

Country Status (1)

Country Link
CN (1) CN117788826A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination