CN116739075A - Unsupervised pre-training method of neural network for image processing - Google Patents
Unsupervised pre-training method of neural network for image processing
- Publication number: CN116739075A (application CN202310656829.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- neural network
- loss
- input
- stage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06N3/088 — Non-supervised learning, e.g. competitive learning
- G06N3/048 — Activation functions
- G06N3/0499 — Feedforward networks
- G06V10/40 — Extraction of image or video features
- G06V10/764 — Image or video recognition using classification, e.g. of video objects
- G06V10/82 — Image or video recognition using neural networks
Abstract
The invention relates to the technical field of unsupervised learning of neural networks, and in particular to an unsupervised pre-training method for a neural network for image processing, comprising the following steps: first divide an image into patches and apply a masking operation; then compute a perceptual loss, a contrastive loss, and a reconstruction loss; finally train with these losses. After training, an input image is processed by the trained model to obtain a class feature vector and a reconstructed image vector. The perceptual loss measures the influence of the masking operation on the neural network, the contrastive loss makes the learned features more discriminative, and the reconstruction loss teaches the network how to abstract an image into features while reducing the information lost during abstraction, thereby improving the network's ability to extract image features.
Description
Technical Field
The invention relates to the technical field of unsupervised learning of neural networks, in particular to an unsupervised pre-training method of a neural network for image processing.
Background
As neural networks have developed, machine learning's appetite for data has grown, but building dataset labels is a time-consuming and laborious task. In particular, datasets are now scaled in the billions, and manually tagging them would take a near-astronomical amount of time. Thus, to alleviate this data hunger, unsupervised learning approaches can be employed.
Common unsupervised learning algorithms are classified into clustering, dimensionality reduction, and self-supervised learning. Clustering is among the earliest unsupervised algorithms; it partitions elements by minimizing intra-class distances and maximizing inter-class distances, a problem that is NP-hard. Existing clustering methods converge quickly and give good results when the elements are few in number and low in dimension, but the cost of clustering becomes great when facing high-dimensional features. Dimensionality-reduction methods map high-dimensional data to a low-dimensional space through some mapping while preserving the distance relations of the original data, but they cannot capture abstract links between data points.
Self-supervised learning uses auxiliary tasks to mine supervision signals from the data itself, trains a neural network with the constructed supervision, and extracts the features required by downstream tasks. It can be divided into two main directions: generation-based and discrimination-based. One of the first generation-based architectures is the autoencoder. The data is first fed into an encoder that makes the neural network learn its features, known as encoding; the learned features are then used to reconstruct the original input data with a decoder, known as decoding. The goal of the autoencoder is to make the reconstructed data differ as little as possible from the input data; the encoder is then the desired feature extractor. The denoising autoencoder later proposed obtaining a more general feature-extraction capability by "zeroing out" certain entries. The masked autoencoder is inspired by denoising autoencoders and by BERT in the natural-language domain; it was found that constructing noise with a large proportion (about 75%) of 16 x 16 masks forces the network to learn higher-order semantic information. The other main direction is discrimination-based. CPC uses the InfoNCE loss to build an autoregressive model that predicts latent-space features by contrast, inspiring the direction of obtaining supervision by comparing differences between samples. SimCLR uses a Siamese (twin) network to generate two different augmented views of the same image; the two views are taken as positive samples, and the augmented views of the other images in the batch as negative samples, yielding contrastive information.
Existing self-supervised learning methods still suffer from insufficient generalization ability, and in particular from insufficient image-feature-extraction ability when processing images at large scale.
Disclosure of Invention
The invention aims to provide an unsupervised pre-training method for a neural network for image processing, which reduces the loss of image features during abstraction through a perceptual loss, a contrastive loss, and a reconstruction loss, and solves the technical problem of insufficient image-feature-extraction capability in existing unsupervised pre-training methods.
To achieve the above object, the present invention provides an unsupervised pretraining method for a neural network for image processing, comprising the steps of:
step 1: introducing a dataset having a plurality of classes of sample images;
step 2: performing a masking operation on the images of the input dataset to obtain an original dataset and a masked dataset of the images, respectively;
step 3: dividing the neural network into a plurality of stages, using a plurality of vision transformers as a backbone network in each stage, inputting the original dataset and the masked dataset respectively, and recording the difference between the two outputs of each stage as a perceptual loss;
step 4: at the last layer of the neural network, obtaining the vision transformer output and dividing it into a class token and image tokens; for the class token, calculating the difference between the network output for the masked input of an image and the network outputs for the original input of that image and of other images, recorded as a contrastive loss; for the image tokens, calculating the difference between the network output for the masked input and the pixel values of the original image, recorded as a reconstruction loss;
step 5: training the neural network using the perceptual loss, the contrastive loss, and the reconstruction loss together as a total loss function;
step 6: after training, the model takes an image as input and outputs a class feature vector and a reconstructed image vector.
Optionally, the process of masking the images input into the dataset comprises the following steps:
Define an image batch as $X \in \mathbb{R}^{B \times H \times W \times C}$, where B is the number of input images per batch and H, W, C are the height, width, and channel dimensions of the image, respectively. First divide each image into patches of size $P \times P$, and define the set of patches $Patches$, with $N = HW/P^2$ patches per image. Input the patches into a linear layer of the neural network to obtain their vector-form embeddings, and concatenate a randomly initialized class token $V_{CLS}$, obtaining the token set $T$, where D is the feature dimension, expressed as:
$T = \mathrm{Concat}(Patches, V_{CLS})$
Set a mask rate $m_r \in [0, 1]$ and construct a mask $M \in \{0, 1\}^{B \times N}$ satisfying $(\sum_{M[i]=1} 1)/N \approx m_r$. Construct a randomly initialized mask vector $V_{mask}$ and thereby build the masked input from the mask, formulating the operation as:
$T_{mask}[b, i] = V_{mask}$ if $M[b, i] = 1$, else $T[b, i]$
where $M[b, i]$ denotes the mask flag of mask M for the i-th patch token, and a value of 1 means that this patch token is masked (overwritten by the mask vector).
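As an illustrative sketch (not part of the patent text), the patch splitting and masking above can be written in NumPy. The function names, the row-major patch ordering, and the uniform random choice of masked positions are assumptions; the patent only fixes the mask rate constraint, not the sampling scheme.

```python
import numpy as np

def patchify(images, patch):
    # images: (B, H, W, C) -> (B, N, patch*patch*C) with N = (H//patch)*(W//patch)
    B, H, W, C = images.shape
    gh, gw = H // patch, W // patch
    x = images.reshape(B, gh, patch, gw, patch, C)
    x = x.transpose(0, 1, 3, 2, 4, 5).reshape(B, gh * gw, patch * patch * C)
    return x

def apply_mask(tokens, mask_rate, rng, mask_vector):
    # Replace ~mask_rate of the N patch tokens of each image with a shared
    # (here: fixed) mask vector; returns the masked tokens and the mask M.
    B, N, D = tokens.shape
    n_mask = int(round(N * mask_rate))
    masked = tokens.copy()
    mask = np.zeros((B, N), dtype=bool)
    for b in range(B):
        idx = rng.choice(N, size=n_mask, replace=False)  # uniform sampling (assumption)
        mask[b, idx] = True
        masked[b, idx] = mask_vector
    return masked, mask
```

In practice the mask vector would be a learnable parameter; here it is a plain array for illustration.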
Optionally, the neural network is divided into a plurality of stages, and the difference between the two outputs of each stage is computed and recorded as the perceptual loss, comprising the following steps:
step 3.1: dividing the neural network f into n stages, with stage i denoted $\mathrm{Stage}_i(X)$:
$f_j(X) = \mathrm{Stage}_j \odot \mathrm{Stage}_{j-1} \odot \cdots \odot \mathrm{Stage}_1(X)$
where "$\odot$" denotes function composition;
step 3.2: each stage $\mathrm{Stage}_i(X)$ contains $\lambda_i$ vision transformers, whose flow is expressed as:
$X'^{(l)} = X^{(l)} + \mathrm{MSA}(\mathrm{LN}(X^{(l)}))$
$X^{(l+1)} = X'^{(l)} + \mathrm{FFN}(\mathrm{LN}(X'^{(l)}))$
where l denotes the l-th layer of the network, LN denotes layer normalization, MSA the multi-head self-attention mechanism, and FFN the feed-forward network;
the flow of MSA is formulated as:
$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{SelfAttention}^{(1)}(X), \ldots, \mathrm{SelfAttention}^{(N_h)}(X))$
where Concat refers to the concatenation operation and $N_h$ to the number of attention heads, i.e. the $N_h$ attention outputs are concatenated; the attention of the h-th head is defined as:
$\mathrm{SelfAttention}^{(h)}(X) := [\phi^{(h)}(X)]\,V^{(h)}$
where $\phi^{(h)}(X)$ is a function that provides spatial attention based on the content of the input data and serves to aggregate $V^{(h)}$; it is defined as:
$\phi^{(h)}(X) := \mathrm{softmax}\!\big(X W_Q^{(h)} (X W_K^{(h)})^{T} / \tau_\phi\big), \qquad V^{(h)} = X W_V^{(h)}$
where $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ are linear projection matrices and $\tau_\phi$ is a temperature parameter;
the flow of FFN is described as:
$\mathrm{FFN}(X) = \sigma(X W_1) W_2$
where $W_1, W_2$ are linear projection matrices and σ is an activation function;
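The per-stage transformer block above can be sketched in NumPy. This is a minimal single-example (unbatched) illustration; the weight shapes, the output projection `Wo` after head concatenation, and the tanh approximation of GeLU are assumptions not fixed by the patent text.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # LN: normalize each token over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_head(x, Wq, Wk, Wv, tau):
    # phi(X) = softmax(Q K^T / tau) provides spatial attention and aggregates V.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / tau) @ v

def vit_block(x, heads, Wo, W1, W2, tau):
    # Pre-LN block: X' = X + MSA(LN(X)); X_next = X' + FFN(LN(X')).
    msa = np.concatenate([attention_head(layer_norm(x), Wq, Wk, Wv, tau)
                          for Wq, Wk, Wv in heads], axis=-1) @ Wo
    x = x + msa
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))
    return x + gelu(layer_norm(x) @ W1) @ W2  # FFN(X) = sigma(X W1) W2
```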
step 3.3: calculating the perceptual loss, evaluated only over the masked area:
$\mathcal{L}_{perc} = \sum_{j} \xi_j \cdot \dfrac{1}{\sum_{b,i} M[b,i]} \sum_{b,\,i:\,M[b,i]=1} \big\| f_j(T)[b,i] - f_j(T_{mask})[b,i] \big\|$
where $\xi_j$ is a hyperparameter coefficient weighting the perceptual loss of each stage, with $\xi_j < \xi_{j+1}$, and $T[b, i]$ denotes the i-th patch token of the b-th image.
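The stage-wise perceptual loss might be sketched as follows in NumPy. The L1 distance and the mean reduction over masked positions are assumptions where the patent's exact norm is not recoverable from the garbled formula image.

```python
import numpy as np

def perceptual_loss(stage_outs_orig, stage_outs_masked, mask, xis):
    # Sum over stages j of xi_j times the mean discrepancy, at masked
    # positions only, between the stage outputs for the original input
    # and for the masked input. xis should be increasing (xi_j < xi_{j+1}).
    total = 0.0
    for xi, t_orig, t_masked in zip(xis, stage_outs_orig, stage_outs_masked):
        diff = np.abs(t_orig - t_masked)   # (B, N, D) per-token discrepancy
        total += xi * diff[mask].mean()    # restrict to masked patches
    return total
```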
Optionally, the output of the network fed the original image is re-divided into image tokens $T_{img}$ and a class token $T_{CLS}$; likewise, the output of the network fed the masked image is divided into $\hat{T}_{img}$ and $\hat{T}_{CLS}$.
Construct a similarity function sim(·) to measure the similarity between class tokens, with the formula:
$\mathrm{sim}(a, b) = a^{T} b$
A cross-entropy function is used as the contrastive loss:
$\mathcal{L}_{con} = -\dfrac{1}{B} \sum_{b=1}^{B} \log \dfrac{\exp(\mathrm{sim}(\hat{T}_{CLS}[b], T_{CLS}[b]) / \tau)}{\sum_{b'=1}^{B} \exp(\mathrm{sim}(\hat{T}_{CLS}[b], T_{CLS}[b']) / \tau)}$
where $T_{CLS}[b]$ and $\hat{T}_{CLS}[b]$ refer to the class tokens corresponding to the b-th image of the input data, and τ is a temperature parameter for controlling the inter-class distance.
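An InfoNCE-style reading of this contrastive loss, sketched in NumPy: for image b the positive is its own original class token, and the negatives are the other images' original class tokens. The log-sum-exp stabilization is an implementation detail, not in the patent.

```python
import numpy as np

def contrastive_loss(cls_masked, cls_orig, tau):
    # cls_masked, cls_orig: (B, D) class tokens; sim(a, b) = a^T b.
    logits = (cls_masked @ cls_orig.T) / tau         # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_prob).mean()                 # cross entropy on the diagonal
```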
The L1 distance is used to calculate the reconstruction loss over the masked region:
$\mathcal{L}_{rec} = \dfrac{1}{\sum_{b,i} M[b,i]} \sum_{b,\,i:\,M[b,i]=1} \big| \hat{T}_{img}[b,i] - Patches[b,i] \big|$
where $Patches[b, i]$ and $\hat{T}_{img}[b, i]$ refer to the i-th patch of the b-th image in the original image and in the output, respectively.
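The masked L1 reconstruction loss can be sketched likewise in NumPy; averaging over all elements of the masked patches is an assumption about the reduction, which the garbled formula does not fix.

```python
import numpy as np

def reconstruction_loss(pred_patches, orig_patches, mask):
    # L1 distance between reconstructed patch tokens and the original
    # pixel patches, computed over masked positions only.
    return np.abs(pred_patches[mask] - orig_patches[mask]).mean()
```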
Optionally, the total loss function is:
$\mathcal{L} = \mathcal{L}_{perc} + \beta \mathcal{L}_{con} + \gamma \mathcal{L}_{rec}$
where ξ, β, and γ are hyperparameter coefficients; ξ refers to the set of per-stage perceptual-loss weights $\xi_j$ applied inside $\mathcal{L}_{perc}$.
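Combining the three terms is then a weighted sum; in this sketch the β and γ defaults are placeholders, not values taken from the patent.

```python
def total_loss(l_perc, l_con, l_rec, beta=1.0, gamma=1.0):
    # L = L_perc (already xi-weighted per stage) + beta * L_con + gamma * L_rec
    return l_perc + beta * l_con + gamma * l_rec
```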
In summary, the invention provides an unsupervised pre-training method for a neural network for image processing, comprising: first dividing an image into patches and applying a masking operation; then computing a perceptual loss, a contrastive loss, and a reconstruction loss; and finally training with these losses. After training, an input image is processed by the trained model to obtain a class feature vector and a reconstructed image vector. The perceptual loss measures the influence of the masking operation on the neural network, the contrastive loss makes the learned features more discriminative, and the reconstruction loss teaches the network how to abstract an image into features while reducing the information lost during abstraction, thereby improving the network's ability to extract image features.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of the unsupervised pre-training method of the neural network for image processing of the present invention.
Fig. 2 is an original image data diagram of a pre-training input of an embodiment of the present invention.
Fig. 3 is a schematic diagram showing the effect of masking operation according to an embodiment of the present invention.
Fig. 4 is a training effect diagram of a training network in accordance with an embodiment of the present invention.
Fig. 5 is a schematic diagram of network output without masking operation according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The invention provides an embodiment of an unsupervised pre-training method for a neural network for image classification, comprising the following steps:
s1: introducing a dataset having a plurality of types of sample images;
s2: performing masking operation on the image input of the input data set to respectively obtain a raw data set and a masked data set of the image;
s3: dividing the neural network into a plurality of stages, using a plurality of vision converters as a backbone network in each stage, respectively inputting an original data set and a masked data set, checking the difference output by the two stages, and recording as a perception loss;
s4: at the last layer of the neural network, the output of the visual transducer is obtained and divided into a classification unit and an image unit; for the classification unit, calculating the difference between mask input of the image and neural network output of original input of the image and original input of other images, and recording the difference as contrast loss; for an image unit, calculating the difference between the output of the neural network input by the image mask and the pixel value of the original image, and recording the difference as reconstruction loss;
s5: training the neural network by using the perceived loss, the contrast loss and the reconstruction loss together as a total loss function;
s6: after training, the model inputs the image and outputs the category feature vector and the reconstructed image vector.
The detailed flow of steps is shown in figure 1.
Further, the following describes the steps of the present invention in connection with the specific embodiments:
In step S1, the image dataset is ImageNet-1K, which contains more than 1.4 million images across 1000 image categories.
The steps of masking the images in step S2 are:
2.1 Define an image batch as $X \in \mathbb{R}^{B \times H \times W \times C}$, where B is the number of input images per batch and H, W, C are the height, width, and channel dimensions of the image, respectively. First divide each image into patches, setting the patch size to P×P = 16×16, and define the set of patches $Patches$. Input the patches into a linear layer of the neural network to obtain their vector-form embeddings, and concatenate a randomly initialized class token $V_{CLS}$, obtaining the token set $T$, where D is the feature dimension. Described by the formula as:
$T = \mathrm{Concat}(Patches, V_{CLS})$   (1)
2.2 Set a mask rate $m_r = 0.75$. Construct a mask $M \in \{0, 1\}^{B \times N}$ satisfying $m_r \approx (\sum_{M[i]=1} 1)/N$. Construct a randomly initialized mask vector $V_{mask}$, thereby building the masked input from the mask. This operation is described by the formula:
$T_{mask}[b, i] = V_{mask}$ if $M[b, i] = 1$, else $T[b, i]$   (2)
where $M[b, i]$ refers to the mask flag of mask M corresponding to the i-th patch token; a value of 1 indicates that this patch token is masked, i.e. the mask vector overwrites the original feature vector.
In step S3, the steps of checking the output difference of each stage and recording it as the perceptual loss are:
3.1 Divide the neural network f into n stages, with stage i denoted $S_i(X)$:
$f_j(X) = S_j(S_{j-1}(\cdots S_1(X)))$   (3)
3.2 Each stage $S_i(X)$ contains $\lambda_i$ vision transformers, set as $\{\lambda_i\} = \{2, 2, 6, 2\}$. The vision transformer flow is formulated as:
$X'^{(l)} = X^{(l)} + \mathrm{MSA}(\mathrm{LN}(X^{(l)})), \qquad X^{(l+1)} = X'^{(l)} + \mathrm{FFN}(\mathrm{LN}(X'^{(l)}))$   (4)
where l denotes the l-th layer of the network, LN refers to layer normalization, MSA to the multi-head self-attention mechanism, and FFN to the feed-forward network.
The flow of MSA is formulated as:
$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{SelfAttention}^{(1)}(X), \ldots, \mathrm{SelfAttention}^{(N_h)}(X))$   (5)
where Concat refers to the concatenation operation and $N_h$ to the number of attention heads, i.e. the $N_h$ attention outputs are concatenated. The attention of the h-th head is defined as:
$\mathrm{SelfAttention}^{(h)}(X) := [\phi^{(h)}(X)]\,V^{(h)}$   (6)
where $\phi^{(h)}(X)$ is a function that provides spatial attention based on the content of the input data; its function is to aggregate $V^{(h)}$. It is defined as:
$\phi^{(h)}(X) := \mathrm{softmax}\!\big(X W_Q^{(h)} (X W_K^{(h)})^{T} / \tau_\phi\big), \qquad V^{(h)} = X W_V^{(h)}$   (7)
where $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ are linear projection matrices and $\tau_\phi$ is a temperature parameter.
The flow of FFN is described as:
$\mathrm{FFN}(X) = \sigma(X W_1) W_2$   (8)
where $W_1, W_2$ are linear projection matrices and σ is the GeLU activation function.
3.3 Compute the perceptual loss, evaluated only over the masked area:
$\mathcal{L}_{perc} = \sum_{j} \xi_j \cdot \dfrac{1}{\sum_{b,i} M[b,i]} \sum_{b,\,i:\,M[b,i]=1} \big\| f_j(T)[b,i] - f_j(T_{mask})[b,i] \big\|$   (9)
where $\xi_j$ is a hyperparameter coefficient weighting the perceptual loss of each stage, with $\xi_j < \xi_{j+1}$, and $T[b, i]$ denotes the i-th patch token of the b-th image.
The steps of calculating the contrastive loss and the reconstruction loss in step S4 are:
4.1 Re-divide the output of the network fed the original image into image tokens $T_{img}$ and a class token $T_{CLS}$; likewise divide the output of the network fed the masked image into $\hat{T}_{img}$ and $\hat{T}_{CLS}$.   (10)
4.2 Construct a similarity function sim(·) to measure similarity between class tokens:
$\mathrm{sim}(a, b) = a^{T} b$   (11)
A cross-entropy function is used as the contrastive loss:
$\mathcal{L}_{con} = -\dfrac{1}{B} \sum_{b=1}^{B} \log \dfrac{\exp(\mathrm{sim}(\hat{T}_{CLS}[b], T_{CLS}[b]) / \tau)}{\sum_{b'=1}^{B} \exp(\mathrm{sim}(\hat{T}_{CLS}[b], T_{CLS}[b']) / \tau)}$   (12)
where $T_{CLS}[b]$ and $\hat{T}_{CLS}[b]$ refer to the class tokens corresponding to the b-th image of the input data, and τ is a temperature parameter for controlling the inter-class distance.
4.3 Use the L1 distance to calculate the reconstruction loss over the masked region:
$\mathcal{L}_{rec} = \dfrac{1}{\sum_{b,i} M[b,i]} \sum_{b,\,i:\,M[b,i]=1} \big| \hat{T}_{img}[b,i] - Patches[b,i] \big|$   (13)
where $Patches[b, i]$ and $\hat{T}_{img}[b, i]$ refer to the i-th patch of the b-th image in the original image and in the output, respectively.
The total loss in step S5 is calculated as:
$\mathcal{L} = \mathcal{L}_{perc} + \beta \mathcal{L}_{con} + \gamma \mathcal{L}_{rec}$   (14)
where ξ, β, and γ are hyperparameter coefficients; ξ refers to the set of per-stage weights $\xi_j$ appearing in formula (9).
The neural network is trained with existing neural network training tools for a suitable number of epochs.
The downstream-task fine-tuning process after training in step S6 is:
6.1 Input a batch of image data into the neural network $f_n$, obtain the model's final vision transformer output, and divide it into the class token and the image tokens. The class token is passed through a linear layer and an activation function to produce a one-hot-style output $\hat{P}$.
6.2 Train the entire network using the cross-entropy function as the classification loss, with existing neural network training tools, for a suitable number of epochs.
Finally, the process of performing a classification task with the network is as follows: compute the category of each image, described as:
$\mathrm{Class}[b] = \arg\max_i \hat{P}[b, i]$
where Class is a series of positive integers indicating the most likely class index that the neural network assigns to each image in the batch, and $\hat{P}[b, i]$ denotes the one-hot-style predicted probability that the network assigns image b to the i-th class.
Further, an embodiment of the present invention is provided to assist the explanation; the effect of the pre-training is shown in Figs. 2 to 5. Specifically, Fig. 2 illustrates the input image data, Fig. 3 the masking operation applied by the network, Fig. 4 the training effect of the pre-training network, and Fig. 5 the network output without the masking operation. As can be seen from Fig. 4, the invention enables the network to extract higher-level semantics and gives it a degree of reasoning capability. As can be seen from Fig. 5, the invention allows the neural network to retain much of the structural and color information of the original image, which helps in training downstream tasks.
The above disclosure is only a preferred embodiment of the present invention, and it should be understood that the scope of the invention is not limited thereto, and those skilled in the art will appreciate that all or part of the procedures described above can be performed according to the equivalent changes of the claims, and still fall within the scope of the present invention.
Claims (5)
1. An unsupervised pre-training method for a neural network for image processing, comprising the following steps:
step 1: introducing a dataset having a plurality of classes of sample images;
step 2: performing a masking operation on the images of the input dataset to obtain an original dataset and a masked dataset of the images, respectively;
step 3: dividing the neural network into a plurality of stages, using a plurality of vision transformers as a backbone network in each stage, inputting the original dataset and the masked dataset respectively, and recording the difference between the two outputs of each stage as a perceptual loss;
step 4: at the last layer of the neural network, obtaining the vision transformer output and dividing it into a class token and image tokens; for the class token, calculating the difference between the network output for the masked input of an image and the network outputs for the original input of that image and of other images, recorded as a contrastive loss; for the image tokens, calculating the difference between the network output for the masked input and the pixel values of the original image, recorded as a reconstruction loss;
step 5: training the neural network using the perceptual loss, the contrastive loss, and the reconstruction loss together as a total loss function;
step 6: after training, the model takes an image as input and outputs a class feature vector and a reconstructed image vector.
2. The unsupervised pre-training method for an image-processing neural network according to claim 1, wherein the process of masking the images input into the dataset comprises the following steps:
defining an image batch as $X \in \mathbb{R}^{B \times H \times W \times C}$, where B is the number of input images per batch and H, W, C are the height, width, and channel dimensions of the image, respectively; first dividing each image into patches of size $P \times P$, and defining the set of patches $Patches$, with $N = HW/P^2$ patches per image; inputting the patches into a linear layer of the neural network to obtain their vector-form embeddings, and concatenating a randomly initialized class token $V_{CLS}$, obtaining the token set $T$, where D is the feature dimension, expressed as:
$T = \mathrm{Concat}(Patches, V_{CLS})$
setting a mask rate $m_r \in [0, 1]$ and constructing a mask $M \in \{0, 1\}^{B \times N}$ satisfying $(\sum_{M[i]=1} 1)/N \approx m_r$; constructing a randomly initialized mask vector $V_{mask}$ and thereby building the masked input from the mask, formulating the operation as:
$T_{mask}[b, i] = V_{mask}$ if $M[b, i] = 1$, else $T[b, i]$
where $M[b, i]$ denotes the mask flag of mask M for the i-th patch token, and a value of 1 means that this patch token is masked.
3. The unsupervised pre-training method for an image-processing neural network according to claim 2, wherein the process of dividing the neural network into a plurality of stages, computing the difference between the two outputs of each stage, and recording it as the perceptual loss comprises the following steps:
step 3.1: dividing the neural network f into n stages, with stage i denoted $\mathrm{Stage}_i(X)$:
$f_j(X) = \mathrm{Stage}_j \odot \mathrm{Stage}_{j-1} \odot \cdots \odot \mathrm{Stage}_1(X)$
where "$\odot$" denotes function composition;
step 3.2: each stage $\mathrm{Stage}_i(X)$ contains $\lambda_i$ vision transformers, whose flow is expressed as:
$X'^{(l)} = X^{(l)} + \mathrm{MSA}(\mathrm{LN}(X^{(l)}))$
$X^{(l+1)} = X'^{(l)} + \mathrm{FFN}(\mathrm{LN}(X'^{(l)}))$
where l denotes the l-th layer of the network, LN denotes layer normalization, MSA the multi-head self-attention mechanism, and FFN the feed-forward network,
the flow of MSA being formulated as:
$\mathrm{MSA}(X) = \mathrm{Concat}(\mathrm{SelfAttention}^{(1)}(X), \ldots, \mathrm{SelfAttention}^{(N_h)}(X))$
where Concat refers to the concatenation operation and $N_h$ to the number of attention heads, i.e. the $N_h$ attention outputs are concatenated, the attention of the h-th head being defined as:
$\mathrm{SelfAttention}^{(h)}(X) := [\phi^{(h)}(X)]\,V^{(h)}$
where $\phi^{(h)}(X)$ is a function that provides spatial attention based on the content of the input data and serves to aggregate $V^{(h)}$, defined as:
$\phi^{(h)}(X) := \mathrm{softmax}\!\big(X W_Q^{(h)} (X W_K^{(h)})^{T} / \tau_\phi\big), \qquad V^{(h)} = X W_V^{(h)}$
where $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ are linear projection matrices and $\tau_\phi$ is a temperature parameter,
the flow of FFN being described as:
$\mathrm{FFN}(X) = \sigma(X W_1) W_2$
where $W_1, W_2$ are linear projection matrices and σ is an activation function;
step 3.3: calculating the perceptual loss, evaluated only over the masked area:
$\mathcal{L}_{perc} = \sum_{j} \xi_j \cdot \dfrac{1}{\sum_{b,i} M[b,i]} \sum_{b,\,i:\,M[b,i]=1} \big\| f_j(T)[b,i] - f_j(T_{mask})[b,i] \big\|$
where $\xi_j$ is a hyperparameter coefficient weighting the perceptual loss of each stage, with $\xi_j < \xi_{j+1}$, and $T[b, i]$ denotes the i-th patch token of the b-th image.
4. The unsupervised pre-training method for an image-processing neural network according to claim 3, wherein the specific implementation of step 4 comprises the following steps:
re-dividing the output of the network fed the original image into image tokens $T_{img}$ and a class token $T_{CLS}$, and likewise dividing the output of the network fed the masked image into $\hat{T}_{img}$ and $\hat{T}_{CLS}$;
constructing a similarity function sim(·) to measure the similarity between class tokens, with the formula:
$\mathrm{sim}(a, b) = a^{T} b$
a cross-entropy function being used as the contrastive loss:
$\mathcal{L}_{con} = -\dfrac{1}{B} \sum_{b=1}^{B} \log \dfrac{\exp(\mathrm{sim}(\hat{T}_{CLS}[b], T_{CLS}[b]) / \tau)}{\sum_{b'=1}^{B} \exp(\mathrm{sim}(\hat{T}_{CLS}[b], T_{CLS}[b']) / \tau)}$
where $T_{CLS}[b]$ and $\hat{T}_{CLS}[b]$ correspond to the b-th image of the input data, and τ is a temperature parameter for controlling the inter-class distance;
the L1 distance being used to calculate the reconstruction loss over the masked region:
$\mathcal{L}_{rec} = \dfrac{1}{\sum_{b,i} M[b,i]} \sum_{b,\,i:\,M[b,i]=1} \big| \hat{T}_{img}[b,i] - Patches[b,i] \big|$
where $Patches[b, i]$ and $\hat{T}_{img}[b, i]$ refer to the i-th patch of the b-th image in the original image and in the output, respectively.
5. The unsupervised pre-training method for an image-processing neural network according to claim 4, wherein the total loss function is:
$\mathcal{L} = \mathcal{L}_{perc} + \beta \mathcal{L}_{con} + \gamma \mathcal{L}_{rec}$
where ξ, β, and γ are hyperparameter coefficients; ξ refers to the set of per-stage perceptual-loss weights $\xi_j$ of step 3.3, applied inside $\mathcal{L}_{perc}$.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310656829.6A | 2023-06-05 | 2023-06-05 | Unsupervised pre-training method of neural network for image processing |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116739075A | 2023-09-12 |
Cited By (2)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117459737A | 2023-12-22 | 2024-01-26 | 中国科学技术大学 | Training method of image preprocessing network and image preprocessing method |
| CN117459737B | 2023-12-22 | 2024-03-29 | 中国科学技术大学 | Training method of image preprocessing network and image preprocessing method |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |