CN114972313A - Image segmentation network pre-training method and device - Google Patents

Image segmentation network pre-training method and device

Info

Publication number
CN114972313A
CN114972313A (application CN202210710807.9A; granted as CN114972313B)
Authority
CN
China
Prior art keywords
training
images
semantic
channel
network branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210710807.9A
Other languages
Chinese (zh)
Other versions
CN114972313B (en)
Inventor
刘博
王瑜
周付根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202210710807.9A priority Critical patent/CN114972313B/en
Publication of CN114972313A publication Critical patent/CN114972313A/en
Application granted granted Critical
Publication of CN114972313B publication Critical patent/CN114972313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0012Biomedical image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30004Biomedical image processing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Radiology & Medical Imaging (AREA)
  • Quality & Reliability (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image segmentation network pre-training method and apparatus. The method comprises: acquiring a set S of unlabeled images; cropping and enhancing the images in S to form a set T; randomly masking pixel regions of the images in T and feeding them into a first network branch for training, where the first network branch comprises a first semantic encoder and a first semantic decoder; and, after random pixel modification, feeding the images in T into a first channel and a second channel of a second network branch for training, where the second network branch likewise comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function. The proposed scheme dilutes useless features learned in the pre-training stage and alleviates the heavy workload of downstream-task training in existing self-supervised learning.

Description

Image segmentation network pre-training method and device
Technical Field
The invention relates to image information processing technology, and in particular to an image segmentation network pre-training method and apparatus, applicable to fields such as medical image processing and autonomous driving.
Background
Medical image segmentation, including the segmentation of lesions and organs in images, is of great importance in clinical diagnosis and treatment planning. In recent years, deep-learning-based medical image segmentation methods such as UNet and UNet++ have achieved strong segmentation results, but these methods require large amounts of annotated data for training, and annotation quality also affects the trained model. In three-dimensional medical image processing, annotating images is difficult and costly, while unlabeled medical images are comparatively easy to acquire. Self-supervised learning has therefore developed rapidly, mining the information in unlabeled images through pretext tasks such as generative reconstruction, rotation prediction, and contrastive learning over constructed positive and negative sample pairs.
However, existing self-supervised methods focus on the encoder's information-mining ability in the pre-training stage; when transferring to a downstream image segmentation task, a separate detection head or segmentation head must be randomly initialized and retrained, which increases the workload of downstream-task training.
Disclosure of Invention
In view of this, the invention provides an image segmentation network pre-training method that introduces a mask learning strategy on top of a conventional contrastive learning strategy, obtaining more effective features in the pre-training stage and thereby mitigating the shortcomings of the prior art.
In a first aspect, the invention provides an image segmentation network pre-training method, comprising: acquiring a set S of unlabeled images, wherein the unlabeled images contain no annotation of a target region; cropping and enhancing the images in the set S to form a set T; randomly masking pixel regions of the images in the set T and feeding them into a first network branch for training, wherein the first network branch comprises a first semantic encoder and a first semantic decoder, and the training of the first network branch is supervised through an image pixel-value constraint; after random pixel modification, feeding the images in the set T into a first channel and a second channel of a second network branch for training, wherein the images entering the first channel and the second channel undergo different types of random pixel modification, the second network branch likewise comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function; after the training of the first network branch and the second network branch is completed, obtaining a first image segmentation network composed of the second semantic encoder and the second semantic decoder; acquiring a set R of annotated images, wherein the annotated images contain annotations of target regions; and feeding the images in the set R into the first image segmentation network for training to obtain a second image segmentation network.
In one embodiment, image enhancement includes normalization of image pixel values.
In one embodiment, the random modification of the pixels includes randomly discarding pixel values or randomly scrambling pixel values.
In one embodiment, constraining the training of the second network branch through the contrastive loss function comprises: constructing positive and negative sample pairs from the image features passing through the first channel and the second channel for contrastive learning.
In one embodiment, the first channel and the second channel adopt different parameter updating modes, and the parameter updating modes comprise gradient updating and momentum updating.
In one embodiment, the parameters of the first semantic encoder or of the second semantic encoder are updated by gradient descent on a weighted sum of the loss functions of the first channel and the second channel.
In one embodiment, the training of the first network branch and the second network branch employs a weighted loss function constraint.
In a second aspect, the invention provides an image segmentation network pre-training apparatus, comprising: a sample acquisition module for acquiring a set S of unlabeled images and a set R of annotated images, wherein the unlabeled images contain no annotation of a target region and the annotated images contain annotations of target regions; a first training module for cropping and enhancing the images in the set S to form a set T, randomly masking pixel regions of the images in the set T and feeding them into a first network branch for training, wherein the first network branch comprises a first semantic encoder and a first semantic decoder and the training of the first network branch is supervised through an image pixel-value constraint, and for feeding the images in the set T, after random pixel modification, into a first channel and a second channel of a second network branch for training, wherein the images entering the first channel and the second channel undergo different types of random pixel modification, the second network branch likewise comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function, such that after the training of both branches is completed a first image segmentation network composed of the second semantic encoder and the second semantic decoder is obtained; and a second training module for feeding the images in the set R into the first image segmentation network for training to obtain a second image segmentation network.
The invention has the following beneficial effects:
By combining the mask learning strategy of the first network branch with the contrastive learning strategy of the second network branch, the technical scheme mines the data features of unlabeled images and dilutes the useless features learned in the pre-training stage, thereby alleviating the heavy workload of downstream-task training in existing self-supervised learning.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
To illustrate the embodiments of the invention or prior-art solutions more clearly, the drawings used in their description are briefly introduced below. The drawings described below show embodiments of the invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a pre-training method for an image segmentation network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the logical structure of a first network branch and a second network branch according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an image segmentation network pre-training device according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments clearer, the technical solutions of the invention are described below completely with reference to the drawings. The described embodiments are some, not all, of the embodiments of the invention.
Embodiment one:
fig. 1 is a schematic flow chart of an image segmentation network pre-training method according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following four steps.
Step S101: acquire a set S of unlabeled images, where the unlabeled images contain no annotation of a target region. The unlabeled images may be two-dimensional or three-dimensional images acquired by a medical imaging device.
Step S102: train the first network branch and the second network branch. Specifically, the images in the set S are cropped and enhanced to form a set T; the pixel regions of the images in the set T are randomly masked and fed into a first network branch for training, where the first network branch comprises a first semantic encoder and a first semantic decoder and its training is supervised through an image pixel-value constraint. After random pixel modification, the images in the set T are also fed into a first channel and a second channel of a second network branch for training; the images entering the first channel and the second channel undergo different types of random pixel modification, the second network branch likewise comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function. After the training of both branches is completed, a first image segmentation network composed of the second semantic encoder and the second semantic decoder is obtained.
Illustratively, the cropping and enhancement of images in the set S may employ the Transform methods of the MONAI open-source library, for example cropping an input three-dimensional image into fixed-size 96 × 96 × 96 blocks for subsequent network training. Image enhancement can reduce image noise and clean the image data.
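The cropping and normalization steps can be sketched as follows. This is a minimal NumPy illustration of the described preprocessing, not the actual MONAI Transform pipeline; the function names and the window values are assumptions for illustration.

```python
import numpy as np

def random_crop_3d(volume, size=(96, 96, 96), rng=None):
    """Randomly crop a fixed-size 96x96x96 block from a 3D volume,
    mirroring the MONAI-style preprocessing described above."""
    rng = rng if rng is not None else np.random.default_rng()
    starts = [rng.integers(0, d - s + 1) for d, s in zip(volume.shape, size)]
    slices = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return volume[slices]

def normalize_intensity(volume, window=(-1000.0, 1000.0)):
    """Window and rescale pixel values to [0, 1] -- one common form of the
    normalization mentioned in the text (window bounds are illustrative)."""
    lo, hi = window
    v = np.clip(volume, lo, hi)
    return (v - lo) / (hi - lo)
```

In practice a library transform pipeline would chain these operations with further augmentations; the sketch only shows the two steps named in the text.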
For example, to randomly mask the pixel regions of the images in the set T, a 96 × 96 × 96 block is further divided into smaller blocks of size 16 × 16 × 16, and a certain proportion of these 16 × 16 × 16 blocks is randomly masked.
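The patch-masking step above can be sketched as follows; the masking ratio is an assumed value, since the text only specifies "a certain proportion".

```python
import numpy as np

def random_mask_patches(block, patch=16, mask_ratio=0.6, rng=None):
    """Split a 96^3 block into non-overlapping 16^3 patches and zero out a
    random fraction of them. Returns the masked block and the boolean mask
    (True = masked) over the flattened patch grid."""
    rng = rng if rng is not None else np.random.default_rng()
    g = block.shape[0] // patch          # patches per axis: 96 // 16 = 6
    n = g ** 3                           # total patches: 216
    masked_ids = rng.choice(n, size=int(n * mask_ratio), replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked_ids] = True
    out = block.copy()
    for idx in masked_ids:
        z, y, x = np.unravel_index(idx, (g, g, g))
        out[z*patch:(z+1)*patch, y*patch:(y+1)*patch, x*patch:(x+1)*patch] = 0.0
    return out, mask
```

The first network branch is then trained to reconstruct the original pixel values of the masked regions, which is what the pixel-value supervision above constrains.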
Fig. 2 is a schematic diagram of logical structures of a first network branch and a second network branch according to an embodiment of the present invention.
Referring to fig. 2, a first network branch 10 comprises a first semantic encoder 11 and a first semantic decoder 12. Illustratively, the first semantic encoder 11 may be a ViT (Vision Transformer) encoder and the first semantic decoder 12 a Transformer-architecture decoder; training the first network branch 10 yields a ViT encoder with strong feature-expression capability.
As shown in fig. 2, the second network branch 20 comprises a first channel 23 and a second channel 24, and the images entering the first channel 23 and the second channel 24 undergo different types of random pixel modification. It should be noted that the second network branch 20 is a contrastive learning branch: applying different types of random pixel modification in the two channels is what enables contrastive training. Since the second semantic encoder 21 shares the parameters of the first semantic encoder 11, the first network branch 10 and the second network branch 20 jointly train the second semantic encoder 21.
In one embodiment, image enhancement includes normalization of image pixel values. The normalization may window the image pixels, increase pixel contrast, or reduce the dispersion of the pixel-value distribution.
In one embodiment, the random modification of the pixels includes randomly discarding pixel values or randomly scrambling pixel values. Illustratively, the image fed into the first channel 23 may be an image subjected to random discarding of pixel values, and the image fed into the second channel 24 may be an image subjected to random scrambling of pixel values.
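A minimal sketch of the two kinds of random pixel modification, assuming a simple global discard for the first channel and a patch-wise scramble for the second (the exact granularity of the scrambling is not specified in the text):

```python
import numpy as np

def random_discard(image, drop_ratio=0.25, rng=None):
    """Set a random fraction of pixel values to zero (first-channel view).
    drop_ratio is an assumed value."""
    rng = rng if rng is not None else np.random.default_rng()
    out = image.copy()
    flat = out.reshape(-1)
    idx = rng.choice(flat.size, size=int(flat.size * drop_ratio), replace=False)
    flat[idx] = 0.0
    return out

def random_scramble(image, patch=16, rng=None):
    """Shuffle pixel values within patch-sized slabs (second-channel view):
    global intensity statistics survive, but local structure is destroyed."""
    rng = rng if rng is not None else np.random.default_rng()
    out = image.copy()
    for start in range(0, out.shape[0], patch):
        slab = out[start:start + patch]
        flat = slab.reshape(-1)
        rng.shuffle(flat)                 # in-place shuffle of the slab view
        out[start:start + patch] = flat.reshape(slab.shape)
    return out
```

Because the two views corrupt the same underlying image differently, the encoder must learn features that are invariant to both corruptions.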
In one embodiment, constraining the training of the second network branch 20 through the contrastive loss function comprises: constructing positive and negative sample pairs from the image features passing through the first channel 23 and the second channel 24 for contrastive learning. Illustratively, although the input images of the first channel 23 and the second channel 24 differ (they undergo different types of random pixel modification), after encoding by the second semantic encoder 21 the output feature maps should have similar features at the same positions. Features at the same position are therefore treated as positive pairs and features at different positions as negative pairs, determined by the cosine similarity criterion sim(q, k) = max(cos(q, k), 0), where q and k are obtained from features on the feature maps through a linear mapping layer. The training of the second network branch 20 may be constrained with a pixel-level InfoNCE loss function.
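The truncated cosine similarity and a pixel-level InfoNCE loss over two flattened feature maps might look like the following sketch; the temperature value and the loop-based formulation are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def sim(q, k):
    """Truncated cosine similarity from the text: sim(q, k) = max(cos(q, k), 0)."""
    cos = np.dot(q, k) / (np.linalg.norm(q) * np.linalg.norm(k) + 1e-8)
    return max(cos, 0.0)

def pixel_infonce(feat_a, feat_b, tau=0.1):
    """Pixel-level InfoNCE over two (N, D) feature maps flattened to N positions.
    Same position across channels -> positive pair; different positions ->
    negatives. tau is an assumed temperature; the patent gives no value."""
    n = feat_a.shape[0]
    losses = []
    for i in range(n):
        logits = np.array([sim(feat_a[i], feat_b[j]) / tau for j in range(n)])
        # softmax cross-entropy with the positive at index i
        log_prob = logits[i] - np.log(np.exp(logits).sum())
        losses.append(-log_prob)
    return float(np.mean(losses))
```

The loss is small when corresponding positions agree and large when they do not, which is exactly the constraint the contrastive branch imposes.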
In one embodiment, the first channel 23 and the second channel 24 use different parameter update methods, including gradient update and momentum update. Momentum update mimics the inertia of a moving object: each update partly retains the previous update direction and fine-tunes the final direction with the currently computed gradient. Momentum update therefore adds stability, speeds up network training, and helps escape local optima.
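The two update modes can be illustrated schematically on plain parameter lists; the learning rate and momentum coefficient are assumed values.

```python
def gradient_update(params, grads, lr=1e-3):
    """Plain gradient step, as may be used for one channel."""
    return [p - lr * g for p, g in zip(params, grads)]

def momentum_update(params, grads, velocity, lr=1e-3, m=0.9):
    """Momentum step, as may be used for the other channel: the previous
    update direction (velocity) is partly retained and fine-tuned by the
    current gradient. Returns (new_params, new_velocity)."""
    new_v = [m * v - lr * g for v, g in zip(velocity, grads)]
    new_p = [p + v for p, v in zip(params, new_v)]
    return new_p, new_v
```

With zero initial velocity the first momentum step equals a plain gradient step; thereafter the retained velocity smooths the trajectory, which is the stability effect described above.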
In one embodiment, the parameters of the first semantic encoder 11 or of the second semantic encoder 21 are updated by gradient descent on a weighted sum of the loss functions of the first channel 23 and the second channel 24.
In one embodiment, the training of the first network branch 10 and the second network branch 20 uses a weighted loss function constraint. Illustratively, the weighted loss is L_total = L_PCL + α·L_MIM, where L_total is the total loss of the first network branch 10 and the second network branch 20, L_MIM is the loss of the first network branch 10 (the mask learning branch), and L_PCL is the loss of the second network branch 20 (the contrastive learning branch).
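The weighted objective can be written directly in code; the value of the balancing weight alpha is left unspecified by the patent, so it appears here as a parameter.

```python
def total_loss(l_pcl, l_mim, alpha=1.0):
    """Weighted pre-training objective: L_total = L_PCL + alpha * L_MIM.
    l_pcl: contrastive (second-branch) loss; l_mim: mask-reconstruction
    (first-branch) loss; alpha: balancing weight (unspecified in the text)."""
    return l_pcl + alpha * l_mim
```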
Step S103: acquire a set R of annotated images, where the annotated images contain annotations of target regions. It should be noted that the number of elements in the set R is much smaller than that in the set S: a large amount of unlabeled data trains the first and second network branches, while a small amount of labeled data trains the first image segmentation network. In engineering practice, the training of the first and second network branches is called the pre-training stage, and the training of the first image segmentation network is called downstream-task training.
Step S104: the first image segmentation network is trained. Specifically, the images in the set R are sent to the first image segmentation network for training, and a second image segmentation network is obtained.
In an alternative embodiment, the second image segmentation network further comprises a fully connected layer.
Embodiment two:
the embodiment of the present invention provides an image segmentation network pre-training device, which is mainly used for executing the image segmentation network pre-training method provided in the above-mentioned content of the embodiment of the present invention, and the image segmentation network pre-training device provided in the embodiment of the present invention is specifically described below.
Fig. 3 is a schematic structural diagram of an image segmentation network pre-training apparatus according to an embodiment of the present invention. As shown in fig. 3, the image segmentation network pre-training apparatus 200 includes the following modules:
a sample obtaining module 201, configured to obtain a set S without an annotated image, and obtain a set R with an annotated image; the unmarked image does not contain the mark of the target area, and the marked image contains the mark of the target area.
The first training module 202 is configured to crop and enhance the images in the set S to form a set T, to randomly mask pixel regions of the images in the set T and feed them into a first network branch for training, where the first network branch comprises a first semantic encoder and a first semantic decoder and its training is supervised through an image pixel-value constraint, and to feed the images in the set T, after random pixel modification, into a first channel and a second channel of a second network branch for training, where the images entering the first channel and the second channel undergo different types of random pixel modification, the second network branch likewise comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function. After the training of both branches is completed, a first image segmentation network composed of the second semantic encoder and the second semantic decoder is obtained.
The second training module 203 is configured to feed the images in the set R into the first image segmentation network for training, obtaining a second image segmentation network.
Finally, it should be noted that the above embodiments are only specific embodiments of the invention, intended to illustrate rather than limit its technical solutions, and the protection scope of the invention is not limited to them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that the described technical solutions may still be modified, or some of their technical features replaced by equivalents, without departing from the spirit and scope of the embodiments of the invention; such modifications and substitutions are covered. The protection scope of the invention is therefore defined by the claims.

Claims (8)

1. An image segmentation network pre-training method is characterized by comprising the following steps:
acquiring a set S of non-annotated images, wherein the non-annotated images do not contain marks of target areas;
cropping and enhancing the images in the set S to form a set T, randomly masking pixel regions of the images in the set T and feeding them into a first network branch for training, wherein the first network branch comprises a first semantic encoder and a first semantic decoder, and the training of the first network branch is supervised through an image pixel-value constraint;
after random pixel modification, feeding the images in the set T into a first channel and a second channel of a second network branch for training, wherein the images entering the first channel and the second channel undergo different types of random pixel modification, the second network branch further comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function;
after the training of the first network branch and the second network branch is finished, a first image segmentation network consisting of the second semantic encoder and the second semantic decoder is obtained;
acquiring a set R of marked images, wherein the marked images contain marks of target areas;
and feeding the images in the set R into the first image segmentation network for training to obtain a second image segmentation network.
2. The method of claim 1, wherein the image enhancement comprises normalization of image pixel values.
3. The method of claim 1, wherein the random modification of the pixels comprises randomly discarding pixel values or randomly scrambling pixel values.
4. The method of claim 1, wherein the step of constraining the training of the second network branch by the contrast loss function comprises:
and constructing positive and negative sample pairs by using the image pixels passing through the first channel and the second channel for comparison learning.
5. The method of claim 1, wherein the first channel and the second channel employ different parameter update modes, the parameter update modes comprising a gradient update and a momentum update.
6. The method of claim 1, wherein the parameters of the first semantic encoder or the parameters of the second semantic encoder are updated by gradient descent on a weighted sum of the loss functions of the first channel and the second channel.
7. The method according to any of claims 1 to 6, wherein the training of the first and second network branches employs a weighted loss function constraint.
8. An image segmentation network pre-training device, comprising:
the system comprises a sample acquisition module, a storage module and a display module, wherein the sample acquisition module is used for acquiring a set S of unlabeled images and acquiring a set R of labeled images; the image without the annotation does not contain a mark of a target area, and the image with the annotation contains a mark of the target area;
the first training module is used for cropping and enhancing the images in the set S to form a set T, randomly masking pixel regions of the images in the set T and feeding them into a first network branch for training, wherein the first network branch comprises a first semantic encoder and a first semantic decoder, and the training of the first network branch is supervised through an image pixel-value constraint; and for feeding the images in the set T, after random pixel modification, into a first channel and a second channel of a second network branch for training, wherein the images entering the first channel and the second channel undergo different types of random pixel modification, the second network branch further comprises a second semantic encoder and a second semantic decoder, the second semantic encoder shares the parameters of the first semantic encoder, and the training of the second network branch is constrained through a contrastive loss function; after the training of the first network branch and the second network branch is completed, a first image segmentation network composed of the second semantic encoder and the second semantic decoder is obtained;
and the second training module is used for sending the images in the set R into the first image segmentation network for training to obtain a second image segmentation network.
CN202210710807.9A 2022-06-22 2022-06-22 Image segmentation network pre-training method and device Active CN114972313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210710807.9A CN114972313B (en) 2022-06-22 2022-06-22 Image segmentation network pre-training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210710807.9A CN114972313B (en) 2022-06-22 2022-06-22 Image segmentation network pre-training method and device

Publications (2)

Publication Number Publication Date
CN114972313A true CN114972313A (en) 2022-08-30
CN114972313B CN114972313B (en) 2024-04-19

Family

ID=82964596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210710807.9A Active CN114972313B (en) 2022-06-22 2022-06-22 Image segmentation network pre-training method and device

Country Status (1)

Country Link
CN (1) CN114972313B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109655A (en) * 2023-01-16 2023-05-12 阿里巴巴(中国)有限公司 Image encoder processing method and device and image segmentation method
CN116152577A (en) * 2023-04-19 2023-05-23 深圳须弥云图空间科技有限公司 Image classification method and device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114773A1 (en) * 2017-10-13 2019-04-18 Beijing Curacloud Technology Co., Ltd. Systems and methods for cross-modality image segmentation
WO2020142077A1 (en) * 2018-12-31 2020-07-09 Didi Research America, Llc Method and system for semantic segmentation involving multi-task convolutional neural network
CN112308860A (en) * 2020-10-28 2021-02-02 西北工业大学 Earth observation image semantic segmentation method based on self-supervision learning
US20210174513A1 (en) * 2019-12-09 2021-06-10 Naver Corporation Method and apparatus for semantic segmentation and depth completion using a convolutional neural network
CN113011427A (en) * 2021-03-17 2021-06-22 中南大学 Remote sensing image semantic segmentation method based on self-supervision contrast learning
CN113706564A (en) * 2021-09-23 2021-11-26 苏州大学 Meibomian gland segmentation network training method and device based on multiple supervision modes
US20210397966A1 (en) * 2020-06-18 2021-12-23 Shanghai United Imaging Intelligence Co., Ltd. Systems and methods for image segmentation
CN114283329A (en) * 2021-11-16 2022-04-05 华能盐城大丰新能源发电有限责任公司 Semi-supervised remote sensing image semantic segmentation method and equipment based on strong transformation
CN114283285A (en) * 2021-11-17 2022-04-05 华能盐城大丰新能源发电有限责任公司 Cross consistency self-training remote sensing image semantic segmentation network training method and device
CN114299380A (en) * 2021-11-16 2022-04-08 中国华能集团清洁能源技术研究院有限公司 Remote sensing image semantic segmentation model training method and device for contrast consistency learning
US20220156592A1 (en) * 2020-11-16 2022-05-19 Salesforce.Com, Inc. Systems and methods for contrastive attention-supervised tuning
CN114529900A (en) * 2022-02-14 2022-05-24 上海交通大学 Semi-supervised domain adaptive semantic segmentation method and system based on feature prototype
CN114565812A (en) * 2022-03-01 2022-05-31 北京地平线机器人技术研发有限公司 Training method and device of semantic segmentation model and semantic segmentation method of image

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116109655A (en) * 2023-01-16 2023-05-12 阿里巴巴(中国)有限公司 Image encoder processing method and device and image segmentation method
CN116109655B (en) * 2023-01-16 2024-06-25 阿里巴巴(中国)有限公司 Image encoder processing method and device and image segmentation method
CN116152577A (en) * 2023-04-19 2023-05-23 深圳须弥云图空间科技有限公司 Image classification method and device
CN116152577B (en) * 2023-04-19 2023-08-29 深圳须弥云图空间科技有限公司 Image classification method and device

Also Published As

Publication number Publication date
CN114972313B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
CN110097131B (en) Semi-supervised medical image segmentation method based on countermeasure cooperative training
Sixt et al. Rendergan: Generating realistic labeled data
Bansal et al. Recycle-gan: Unsupervised video retargeting
CN114972313A (en) Image segmentation network pre-training method and device
CN112308860A (en) Earth observation image semantic segmentation method based on self-supervision learning
Anvari et al. Dehaze-GLCGAN: unpaired single image de-hazing via adversarial training
CN113344932B (en) Semi-supervised single-target video segmentation method
Xiao et al. Single image dehazing based on learning of haze layers
CN111382647B (en) Picture processing method, device, equipment and storage medium
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN115880720A (en) Non-labeling scene self-adaptive human body posture and shape estimation method based on confidence degree sharing
CN113280820B (en) Orchard visual navigation path extraction method and system based on neural network
CN115761574A (en) Weak surveillance video target segmentation method and device based on frame labeling
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
CN113177957B (en) Cell image segmentation method and device, electronic equipment and storage medium
CN113052759B (en) Scene complex text image editing method based on MASK and automatic encoder
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN116342377A (en) Self-adaptive generation method and system for camouflage target image in degraded scene
Yang et al. Cervical nuclei segmentation in whole slide histopathology images using convolution neural network
Das et al. Object Detection on Scene Images: A Novel Approach
Xue et al. An end-to-end multi-resolution feature fusion defogging network
Zhou et al. Underwater occluded object recognition with two-stage image reconstruction strategy
Wang et al. Physical-property guided end-to-end interactive image dehazing network
CN116543162B (en) Image segmentation method and system based on feature difference and context awareness consistency
CN117274723B (en) Target identification method, system, medium and equipment for power transmission inspection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant