CN112686913A

CN112686913A - Object boundary detection and object segmentation model based on boundary attention consistency

Info

Publication number: CN112686913A
Application number: CN202110028596.6A
Authority: CN
Inventors: 李冬辉; 刘欣宇; 梁宁一; 高龙
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-01-11
Filing date: 2021-01-11
Publication date: 2021-04-20
Anticipated expiration: 2041-01-11
Also published as: CN112686913B

Abstract

The invention relates to a target boundary detection and target segmentation model based on boundary attention consistency, which is mainly technically characterized in that: the method comprises two pix2pix models which are cascaded together, wherein each pix2pix model consists of a generator, a discriminator and a loss function, the first pix2pix model is a target boundary detection model (OBD model), a detection result of the first pix2pix model is superposed on an original image and used as the input of the second pix2pix model, and the second pix2pix model is a target segmentation model. The method is reasonable in design, and the boundary attention consistency is introduced into the target boundary detection model to enhance the attention to the target boundary, so that the accurate target boundary is detected, and a more accurate target segmentation result is realized.

Description

Object boundary detection and object segmentation model based on boundary attention consistency

Technical Field

The invention belongs to the technical field of computer vision, and relates to a target boundary detection and target segmentation model, in particular to a target boundary detection and target segmentation model based on boundary attention consistency.

Background

In the technology and vision technology, accurate object segmentation is different from salient object detection for various objects, and the specific object is required to be segmented from the background with higher precision. For example, it is applied to portrait segmentation of scene change tasks and organ segmentation before medical diagnosis. Although deep neural networks have significantly improved the performance of object segmentation, accurate segmentation in complex scenes remains very difficult due to background interference.

Through research on the problem of non-ideal segmentation at the boundary, the problem is found to be mostly present in an area where the target boundary is not obvious. This is because the local difference between the target and the background is so small that the model cannot distinguish between the two based on the extracted features. One possible solution is to raise boundary awareness by treating Object Boundary Detection (OBD) as a task for object segmentation. However, OBD does not draw sufficient attention in existing object segmentation models, since the object boundary only occupies a very small part of the whole image and its contribution to the improvement of object segmentation performance in the per-pixel loss function is small.

In the existing object segmentation model, the OBD is only used as a simple sub-network in the existing object segmentation model, and only the initial image and the real object boundary image are used for training. Such sub-networks are prone to overfitting and inaccurate OBD results due to the small proportion of target boundary pixels and lack of supervision over the middle of the model. Therefore, directing attention to target boundaries by monitoring the middle layer of the OBD model helps to improve accuracy of OBD. Investigations have found that most of the excellent attention mechanisms are based on Class Activation Maps (CAMs). CAM is an effective way to enhance the attention of tag-related areas through image classification.

However, due to the poor surveillance of image-level classification, the attention gained using CAM is still difficult to accurately fall on the label-related area. Therefore, researchers have proposed a kind of attention consistency (TAC) under spatial transformation to further constrain the attention area. TAC means: in image classification, if the input image undergoes spatial transformation, the attention area should follow the same transformation. Spatial transformations typically include rotation, flipping, cropping, and the like. However, TAC is to increase attention to the tag-related region by requiring attention consistency of the input image under indirect transformation, and experiments prove that there is a significant difference in consistency obtained under different transformations or combinations. In other words, to obtain good consistency, a lot of experimentation is required to find a suitable transformation, and thus the consistency obtained under indirect transformation is limited.

In summary, how to improve the accuracy of target boundary detection and target segmentation is a problem that needs to be solved urgently at present.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a target boundary detection and target segmentation model based on boundary attention consistency, and solves the problem that target boundary detection and target segmentation are inaccurate.

The technical problem to be solved by the invention is realized by adopting the following technical scheme:

a target boundary detection and target segmentation model based on boundary attention consistency comprises two pix2pix models, each pix2pix model is composed of a generator, a discriminator and a loss function, the two pix2pix models are cascaded together, the first pix2pix model is an OBD model and used for detecting a target boundary, a detection result of the OBD model is superposed on an original image and used as an input of the second pix2pix model, and the second pix2pix model is a target segmentation model and used for generating a target segmentation result;

the OBD model generator comprises a twin network, an attention module and a decoder, wherein the twin network shares all parameters and takes an initial image A and an OBD detection result G (A) thereof as two inputs, and a feature map F corresponding to two branches is obtained through down-sampling and a residual block_AAnd feature map F_G(A)Feature map F_AAnd feature map F_G(A)After pooling by a global average pooling layer GAP and a global maximum pooling layer GMP, sending the obtained product to a full-link layer with W as a weight for classification; the attention module calculates classification values by weighting the pooled feature maps, extracts initial feature maps by linearly combining the feature maps by way of channel-by-channel multiplication and summing them along the dimension of the combined feature mapsAn attention map M (A) and an attention map M (G (A)) of the initial image A and the OBD detection result G (A), wherein the classification loss and the consistency loss of the attention module jointly guide an encoder of the OBD model to extract target boundary characteristics and transfer the target boundary characteristics to a decoder to generate an OBD detection result;

the structure of a discriminator of the OBD model is the same as that of a discriminator in a conventional pix2pix model;

the loss function of the OBD model comprises a penalty function L for generating a true target boundary image_advLoss function L1 for maintaining stable generation_GClassification loss function of auxiliary classifier

And boundary attention consistency loss function L_att；

The generator of the target segmentation model adopts the same structure as a conventional pix2pix model, and trains the model by using an image with a target boundary subjected to enhancement;

the structure of a discriminator of the target segmentation model is the same as that of a discriminator in a conventional pix2pix model;

the loss function of the target segmentation model comprises a countervailing loss function L_adv2And a loss function L1_G2And adopting least square GAN as an optimization objective function.

Further, in the attention module of the OBD model, the object boundary is treated as a class attribute, and the initial image and the object boundary image are of the same class.

Further, in the attention module of the OBD model, the attention maps M (a) and M (g (a)) of the initial image and the transformed image are equal at the same OBD transform.

Further, the penalty function L_advAnd a penalty function L_adv2Respectively expressed as:

L_adv＝E_x～A[log(1-D(x,G(x)))²]+E_x～A,y～B[log(D(x,y))²]

wherein G, G2 and D, D2 are the generator and the arbiter of two pix2pix models, respectively;

the loss function L1_GLoss function L1_G2Respectively expressed as:

L1_G＝E_x～A,y～B[||G(x)-y||₁]

classification loss function of auxiliary classifier of the OBD model

Is represented as;

wherein c is_gIs an auxiliary classifier of the generator, and adopts a cross entropy classification loss function;

boundary attention consistency loss function L of auxiliary classifier of OBD model_attIs represented as;

_att＝E_x～A[||G(M(x))-M(G(x))||₁]

where M (x) represents the attention map of image x in the a domain, g (x) and M (g (x)) represent the generated image and its attention map;

the above-mentioned loss function is integrated into two optimized objective functions to train the pix2pix model:

whereinα₁＝1,α₂＝1000,α₃＝10,α₄＝10,β＝10。

The invention has the advantages and positive effects that:

according to the method, two pix2pix image translation models are cascaded together, the first pix2pix model is used for detecting a target boundary (OBD), a detection result is superposed on an original image and used as an input of the second pix2pix model, and the second pix2pix model generates a target segmentation result. The boundary attention consistency is introduced into the target boundary detection model to enhance the attention to the target boundary, so that the accurate target boundary is detected, and a more accurate target segmentation result is realized.

Drawings

FIG. 1 is a schematic diagram of boundary attention consistency under OBD transform;

FIG. 2 is a schematic diagram of the structure of a generator of an OBD model;

fig. 3 is a diagram illustrating the segmentation result on the PFCN data set according to the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings.

The design idea of the invention is as follows:

aiming at the problem that the current target segmentation model has poor segmentation effect in a local area with similar target and background, the method can be very helpful to solve the problem by enhancing the target boundary. The method treats target segmentation as a two-stage task and is realized by utilizing two cascaded pix2pix image translation models. The first pix2pix image translation model is used for detecting the target boundary of the initial image independently, and the second pix2pix image translation model is used for completing target segmentation on the image with the target boundary enhanced. The invention focuses on the improvement of the object segmentation performance by the Object Boundary Detection (OBD) in the first stage.

Generally, whether the attention area is reasonable may reflect the performance of the model. In the case of OBD, the objective is to map the initial image and the target boundary image to the same distribution. Hence, the target boundary is certainly the most reasonable attention area, since it is a direct proof that the source domain (initial image) and the target domain (target boundary image) have the same distribution and OBD results. The more the attention is focused on the object boundary, the better the OBD model performance will be.

One straightforward way to increase the attention of a desired area is to implement full supervision of the attention map. However, if full supervision is used on the level of attention in the model, the model is difficult to learn well, i.e. under-fit, under-fit under high demand and low complexity conditions. Another possible solution is to implement weak supervision on the attention map by using image-level classification. However, the results using this approach show that attention cannot be accurately localized on the label-related area, i.e. an overfitting occurs. Therefore, the patent application adopts the following two measures to be applied to the attention map, thereby improving the attention to the target boundary.

(1) The CAM was introduced in the middle of the OBD generator as an attention module. In the attention module, the target boundary is treated as a class attribute, and the source domain (initial image) and the target domain (target boundary image) are classified into the same class. (2) The attention area is guided directly using the Boundary Attention Consistency (BAC) under Object Boundary Detection (OBD) transformation. BAC requires that when an initial image is converted to an object boundary map by OBD, as shown in the first line of fig. 1, its attention map should also be the attention map of the object boundary image under the same OBD conversion, as shown in the last line of fig. 1. To evaluate BAC, the same OBD transform needs to be applied on the attention map of the initial image. However, unlike simple transitions such as flipping or rotating, there are difficulties in implementing OBD transitions on an attention map. To solve this problem, the present invention uses the OBD model itself to perform the conversion, and obtains its output as the conversion result by inputting the attention of the original image again into the OBD. The generator of the OBD is shown in fig. 2.

For convenience of explanation, the meanings of the following symbols are explained first: a and B respectively represent an initial image and a real target boundary image of a training OBD model, A₂Representing an enhanced image of the boundary of the object obtained by superimposing A and B, B₂Representing the true target segmentation result, A₂And B₂For training of the object segmentation model in the second stage. c. C_gIs an auxiliary classifier in an OBD generator.

The target boundary detection and target segmentation model is formed by cascading two pix2pix image translation models, wherein the first pix2pix model is used for detecting a target boundary (OBD), a detection result is superposed on an original image and used as the input of the second pix2pix image translation model, and the second pix2pix image translation model generates a target segmentation result. The boundary attention consistency is introduced into the target boundary detection model to enhance the attention to the target boundary, so that the accurate target boundary is detected. The conventional pix2pix model consists of a generator, an arbiter, an impedance loss and an L1 loss. The invention focuses on improving the generator part of the OBD model, and the structure of the discriminator of the OBD and the second pix2pix is the same as that of the conventional pix2 pix. The OBD generator receives an original image to be segmented and generates a corresponding target boundary detection result; the arbiter of the OBD receives the generated OBD result and the correct object boundary and tries to distinguish the difference between them to push the generator to generate a true OBD result. A generator of the second pix2pix receives the superposed image of the OBD result and the original image and generates a target segmentation result; the discriminator receives the generated target segmentation result and the real target segmentation result, and tries to distinguish the difference between the two results so as to drive the generator to generate an accurate target segmentation result. The loss functions of the two models comprise the confrontation loss and the L1 loss of the pix2pix image translation model, and in addition, the classification loss and the boundary attention consistency loss of the attention module are added into the generator of the OBD model. The generator structure of the OBD model is shown in fig. 2, with the discriminator and the second pix2pix being identical to the conventional pix2pix structure. The various components of the model are described in detail below.

The following describes the respective parts of the present invention:

(1) the generator for detecting the object boundary is shown in FIG. 2 and comprises a twin network and attention module c_gAnd a decoder. The twin network shares all parameters and takes the initial image a and its OBD result G (a) as two inputs, G (-) representing the process of generation or OBD transformation. Obtaining characteristic graphs corresponding to the two branches through down sampling and residual blockIs shown as F_AAnd F_G(A). GAP and GMP then pool the profiles and feed them into the global connectivity layer with W as weight for classification, attention module c_gThe classification value is calculated by weighting the pooled feature maps. To enhance attention to the target boundary, attention module c_gThe two branches are classified into the same class.

At the same time, attention module c_gBy multiplying the linear combined feature maps path by path and summing them along the dimension of the combined feature maps, attention maps of a and g (a), denoted M (a) and M (g (a)), respectively, are extracted, where M (·) represents the process of computing the attention maps using CAM. According to the requirement of consistency, the attention maps M (a) and M (g (a)) of the initial image and the transformed image should be equal under the same OBD transform, which can be expressed as:

G(M(A))＝M(G(A)) (1)

the classification loss and the consistency loss of the attention module together guide an encoder of the OBD model to extract target boundary features and pass them to a decoder to generate an OBD result.

It should be noted that: the inputs to the twin network of the present invention are a and g (a) rather than the two domains a and B for the following reasons: first, as the number of training times increases, g (a) becomes B, and the generation process may also implement OBD transformation. Secondly, g (a) instead of B helps to perform OBD transformation with the model itself trying to focus on attention. Finally, the generation process can also be seen as an extension of the spatial transformation in the TAC, so it is also reasonable to maintain attention consistency under the generation transformation.

To achieve consistency in equation 1, two sides of the equation are obtained along the two branches in FIG. 2. The first branch is shown as a solid line, and an attention map M (A) is obtained by taking A as an input; then, m (a) is re-input and its output G (m (a)) is obtained to represent the OBD conversion result for the attention map. The other branch follows a dotted line and represents the attention map of the OBD transform result with the fed back g (a) as input and obtaining its attention map M (g (a)). Finally, G (M (a)) and M (G (a)) were used to assess consistency.

(2) A generator of a target segmentation model, which is trained using the same structure as conventional pix2pix, but with enhanced images of the target boundaries.

(3) A discriminator: the structure of the discriminators in the two pix2pix models is the same as in the conventional pix2 pix. The discriminator receives the source-false-target and source-true-target pairs of fields, respectively, and tries to distinguish them to direct the generator to produce the true target field.

(4) Loss function

For the first pix2pix, i.e. the OBD model of the invention, the objective function consists of four parts. Wherein the penalty function L is resisted_advFor generating true object boundary images, L1_GThe loss is used to maintain a stable generation,

and L_attRepresenting the classification penalty of the auxiliary classifier and the loss of boundary attention consistency, respectively.

For the second pix2pix, the object segmentation model, whose objective function is the same as the conventional pix2pix, including countering the loss L_adv2And L1_G2And (4) loss. To maintain stable training, least squares GAN is used as the optimization objective function.

Wherein the penalty function L is resisted_adv、L_adv2Comprises the following steps: the penalty of two pix2pix models is used to match the distribution of the source domain image to the target domain image:

L_adv＝E_x～A[log(1-D(x,G(x)))²]+E_x～A,y～B[log(D(x,y))²] (2)

where G, G2 and D, D2 are the generators and discriminators of the two pix2pix models, respectively.

L1 loss function: as with the conventional pix2pix model, the L1 penalty is applied in the generator to avoid model collapse and ensure stable generation, with the L1 penalty for the two pix2pix models as follows:

L1_G＝E_x～A,y～B[||G(x)-y||₁] (4)

classification loss function of CAM: in order to enhance the attention to the target boundary, the image x of the CAM classification a domain of the OBD model and its OBD result g (x) are of the same class. The classification loss of CAM is as follows:

wherein c is_gIs an auxiliary classifier of the generator, and adopts a cross entropy classification loss function.

Loss of target consistency of attention: according to the definition of consistency, if an initial image is transformed through OBD into an object boundary map, its attention is intended to have the same OBD transformation. The loss of consistency is defined using the absolute deviation as follows:

_att＝E_x～A[||G(M(x))-M(G(x))||₁] (7)

where M (x) represents the attention map of image x in the a domain, and g (x) and M (g (x)) represent the generated image and its attention map. The consistency of the present invention is a strong constraint on the attention of the target boundary.

The complete objective function: the above-mentioned loss function is integrated into two optimized objective functions to train the pix2pix model:

wherein alpha is₁＝1,α₂＝1000,α₃＝10,α₄＝10,β＝10。

It should be emphasized that the embodiments described herein are illustrative rather than restrictive, and thus the present invention is not limited to the embodiments described in the detailed description, but also includes other embodiments that can be derived from the technical solutions of the present invention by those skilled in the art.

Claims

1. A target boundary detection and target segmentation model based on boundary attention consistency comprises two pix2pix models, wherein each pix2pix model consists of a generator, a discriminator and a loss function, and is characterized in that: the two pix2pix models are cascaded together, the first pix2pix model is an OBD model, a detection result of the OBD model is superposed on an original image and used as an input of a second pix2pix model, and the second pix2pix model is a target segmentation model;

the OBD model generator comprises a twin network, an attention module and a decoder, wherein the twin network shares all parameters and takes an initial image A and an OBD detection result G (A) thereof as two inputs, and a feature map F corresponding to two branches is obtained through down-sampling and a residual block_AAnd feature map F_G(A)Feature map F_AAnd feature map F_G(A)After pooling by a global average pooling layer GAP and a global maximum pooling layer GMP, sending the obtained product to a full-link layer with W as a weight for classification; the attention module calculates classification values by weighting the pooled feature maps, extracts an attention map M (A) and an attention map M (G (A)) of the initial image A and the OBD detection result G (A) by linearly combining the feature maps by channel-by-channel multiplication and summing them along the dimension of the combined feature maps, and the classification loss and the consistency loss of the attention module together guide an encoder of the OBD model to extract target boundary features and transfer them to a decoder to generate an OBD detection result;

the loss function of the OBD model comprises a penalty function L for generating a true target boundary image_advLoss function L1 for maintaining stable generation_GClassification loss function of auxiliary classifierNumber of

And boundary attention consistency loss function L_att；

2. The boundary consistency-of-attention based object boundary detection and object segmentation model of claim 1, wherein: in the attention module of the OBD model, the object boundary is considered as a class attribute, and the initial image and the object boundary image are of the same class.

3. The boundary consistency-of-attention based object boundary detection and object segmentation model of claim 1, wherein: in the attention module of the OBD model, the attention maps M (a) and M (g (a)) of the initial image and the transformed image are equal under the same OBD transform.

4. The boundary consistency-of-attention based object boundary detection and object segmentation model of claim 1, wherein:

the penalty function L_advAnd a penalty function L_adv2Respectively expressed as: