CN116343109A

CN116343109A - Text pedestrian searching method based on self-supervision mask model and cross-mode codebook

Info

Publication number: CN116343109A
Application number: CN202310093067.3A
Authority: CN
Inventors: 吴一鸣; 潘企何; 高楠; 梁荣华
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2023-02-10
Filing date: 2023-02-10
Publication date: 2023-06-27

Abstract

A text pedestrian searching method based on a self-supervision mask model and a cross-modal codebook comprises the following steps: the input text and picture are masked and fed into feature extraction backbone networks to obtain a visual feature F_V and a text feature F_T; F_V and F_T are then input to mapping layers to obtain a picture global feature F_V1 and a text global feature F_T1, which are then aligned. At the same time, F_V and F_T are input into a cross-modal codebook, where each feature vector is replaced by its nearest feature in the codebook; the replaced features are then input into a picture decoder and a text classification network respectively, and the results are finally compared with the original input. The invention can improve the feature learning capability of the model and its ability to align the features of the two modalities.

Description

Text pedestrian searching method based on self-supervision mask model and cross-mode codebook

Technical Field

The invention relates to the field of cross-modal retrieval, and in particular to a feature alignment method based on masks and a cross-modal codebook.

Background

Text-based pedestrian search aims to match a text description query with the correct pedestrian images, and has great potential in surveillance systems, activity analysis and intelligent photo albums. In most cases a text description is easier to obtain than the query image required by pedestrian re-identification (also referred to as image-based pedestrian re-identification), which has made text-based person search popular in recent years. Methods for cross-modal retrieval fall mainly into two types: one learns better feature representations, and the other extracts the features of the two modalities and then aligns them.

To better learn suitable features from pictures and text, some methods use a generative adversarial network together with the text description to colorize person images that have first been converted to grayscale; others obtain prior knowledge via CLIP using self-supervised learning and then pass it into a cross-modal momentum contrastive learning framework. At the same time, to address the differences between the two modalities, many works use an attention mechanism to help align text and image features. This may require a pre-trained object detection model, or manually defined regions, to obtain the picture information for each body part of the person in the picture; the picture information and the corresponding text information are then input together into the attention module to achieve feature alignment. This places a significant computational load on training and testing. To improve the accuracy of cross-modal pedestrian retrieval, it is important to solve the problems of feature learning and feature alignment between the two modalities.

Disclosure of Invention

In order to overcome the defects of the prior art in cross-modal feature learning and alignment, the invention provides a method for combining a mask model and a cross-modal codebook to enhance the model feature learning and alignment capability, and further improves the accuracy of cross-modal pedestrian retrieval.

In order to achieve the above purpose, the invention discloses a text pedestrian searching method based on a self-supervision mask model and a cross-mode codebook, which adopts the following technical scheme:

step 1, reading a data set, and inputting each pair of matched text description and pictures as data of a model;

step 1.1, first scaling the picture to a preset size and enhancing the data by horizontal flipping, randomly adding Gaussian noise and the like, and then dividing the picture into (h/p) × (w/p) square small blocks, wherein p is the side length of each small block, and h and w are the height and width of the picture respectively;

step 1.2, randomly selecting part of picture blocks, and covering by using a unified mask token;

step 1.3, inputting the text description into a tokenizer, which converts words and phrases into corresponding numbers; meanwhile, randomly selecting part of the text blocks (tokens) and covering them with a unified mask token;
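The block counting and random masking of step 1 can be sketched as follows. This is a minimal illustration, assuming the preferred 384px × 128px picture size and 16px block side given later in this document; the helper names `patch_grid` and `choose_masked`, the fixed random seed, the chosen ratios (10% and 25%) and the 100-token stand-in text are hypothetical and not part of the source.

```python
import random

def patch_grid(h, w, p):
    """Return (rows, cols, total) for an h x w picture cut into p x p blocks."""
    if h % p or w % p:
        raise ValueError("picture size must be a multiple of the block size")
    rows, cols = h // p, w // p
    return rows, cols, rows * cols

def choose_masked(n_items, ratio, rng):
    """Pick round(n_items * ratio) distinct positions to cover with the mask token."""
    return sorted(rng.sample(range(n_items), round(n_items * ratio)))

rng = random.Random(0)                           # fixed seed, for illustration only
rows, cols, total = patch_grid(384, 128, 16)     # 24 rows x 8 columns of blocks
masked_blocks = choose_masked(total, 0.10, rng)  # 10% of the picture blocks
token_ids = list(range(100))                     # hypothetical stand-in for tokenizer output
masked_tokens = choose_masked(len(token_ids), 0.25, rng)  # 25% of the text tokens
print(rows, cols, total, len(masked_blocks), len(masked_tokens))
```

With these sizes the grid has 24 × 8 = 192 blocks, which agrees with the block count stated in the preferred embodiment.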

step 2, inputting the processed picture covered by the mask and the description text into feature encoders of two modes; the method specifically comprises the following steps:

step 2.1, the visual backbone network E_V loads model parameters pre-trained on the ImageNet dataset and processes the image input to obtain the visual feature F_V;

step 2.2, the text backbone network E_T likewise loads pre-trained model parameters and processes the text input to obtain the text feature F_T;

Step 2.3, respectively inputting the two features into a mapping layer to obtain global features of the two modes;

step 2.4, calculating a CMPC loss function and a CMPM loss function for the obtained global features of the two modes to measure the distance between matched text pictures and the distance size relation between unmatched text pictures;

wherein the CMPC loss function is expressed as follows:

L_cmpc = L_tpi + L_ipt    (4)

The CMPM loss function is expressed as follows:

L_cmpm = L_i2t + L_t2i    (10)

wherein x_i is the visual feature, z_i is the text feature, W_j is a weight matrix, y_i,j indicates whether the input text and picture form a matching pair, and e is a very small positive number that prevents division by 0;

step 3, the visual feature F_V and the text feature F_T obtained in step 2 through the feature extraction backbone networks are input into a cross-modal codebook; the dimension of the visual feature F_V is (h/p) × (w/p) × D, and the dimension of the text feature F_T is L × D, where L is the length of the text and D is the number of channels, which is the same for the visual feature and the text feature; the features obtained in step 2 are further processed according to the following steps:

step 3.1, the visual feature F_V and the text feature F_T together contain ((h/p) × (w/p) + L) feature vectors, whose number of channels is the same as that of the feature vectors in the codebook; the distances between the feature vectors in the codebook and all the text and visual feature vectors are then calculated, and each feature vector of F_V and F_T is replaced by the codebook feature vector closest to it; the search is given by the following formula:

wherein z_i represents a feature vector of the visual feature F_V or the text feature F_T, c_i represents a feature vector in the codebook, and K represents the number of feature vectors in the codebook;

step 3.2, after the vectors in the original visual feature F_V and text feature F_T are replaced with feature vectors from the codebook, a new visual feature F_V2 and a new text feature F_T2 are obtained. Because the replacement feature vectors are discrete and the replacement process is not differentiable, a straight-through gradient estimator is required to back-propagate the gradient to the preceding module, as shown in the following formula:

where sg(·) represents the stop-gradient operator and l_2 represents a normalization operation;
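As a rough illustration of a straight-through estimator with a stop-gradient operator, the sketch below writes out the forward and backward passes by hand for the commonly used form z + sg(c - z). That specific form is an assumption and may differ from the formula of the invention; the normalization operation is omitted, and both function names are hypothetical.

```python
def straight_through(z, c):
    """Forward value of z + sg(c - z); numerically equal to the codebook vector c."""
    return [zi + (ci - zi) for zi, ci in zip(z, c)]

def straight_through_grad(grad_output):
    """Backward pass: sg(c - z) is treated as a constant, so d(output)/dz is the identity."""
    return list(grad_output)

z = [0.25, 0.5]   # feature vector from the backbone (toy values)
c = [1.0, 0.0]    # its nearest codebook vector (toy values)
out = straight_through(z, c)                  # forward pass uses the codebook vector
grad_z = straight_through_grad([0.5, -2.0])   # gradient is copied straight through to z
print(out, grad_z)
```

The forward pass therefore sees the discrete codebook vector, while the gradient reaches the preceding module as if no replacement had taken place.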

step 3.3, after the input feature vectors are replaced, the features in the codebook are synchronously updated by momentum, using the following formula:

wherein λ_mom is the weight used to update the codebook, and c_h is a feature vector in the codebook;
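A momentum update of the codebook might look like the sketch below. It assumes a standard exponential-moving-average form, c_h ← λ_mom · c_h + (1 − λ_mom) · mean(assigned features), which is an assumption and may differ from the formula of the invention; the function name `momentum_update` and the toy data are hypothetical, and λ_mom = 0.8 is the preferred value given in this document.

```python
def momentum_update(codebook, features, indices, lam=0.8):
    """Move each codebook vector toward the mean of the features assigned to it.

    Codebook vectors that received no feature in this batch are left unchanged.
    """
    dim = len(codebook[0])
    updated = []
    for h, c in enumerate(codebook):
        assigned = [z for z, k in zip(features, indices) if k == h]
        if not assigned:
            updated.append(list(c))
            continue
        mean = [sum(z[d] for z in assigned) / len(assigned) for d in range(dim)]
        updated.append([lam * c[d] + (1 - lam) * mean[d] for d in range(dim)])
    return updated

codebook = [[0.0, 0.0], [1.0, 0.0]]
features = [[1.0, 1.0], [3.0, 1.0]]   # both assigned to codebook vector 1
new_codebook = momentum_update(codebook, features, indices=[1, 1])
print(new_codebook)
```

A weight close to 1 keeps the codebook stable between batches, while a smaller weight lets it follow the backbone features more quickly.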

step 4, reconstructing the input pictures and text:

step 4.1, the image decoder uses a single-layer deconvolution network to restore the feature to the size and number of channels of the input picture; the result is then compared with the original picture, and a reconstruction loss function is calculated;

step 4.2, the text classifier is the one pre-trained together with the text encoder E_T (it is fine-tuned in the training stage); the features are classified through the final Linear layer of the text classifier, the prediction is compared with the input text, and a classification loss function is calculated;

wherein Ω_T is a function that counts how many tokens x_T contains, x_T is the visual feature, and y_T is the correct text label;

step 5, optimizing the model by using a back propagation algorithm and a gradient descent algorithm according to the three loss functions in the step 2, the step 3 and the step 4; the method specifically comprises the following steps:

step 5.1, obtaining an overall error formula according to the actual input and the expected output, wherein the formula is as follows:

L_total = L_align + λ_1 L_recon + λ_2 L_codebook    (17)

in which L_align is the CMPC and CMPM loss function that measures the degree of alignment of the two modalities; L_recon is the loss function that measures the difference between the text and picture reconstructed by the model and the original, unmasked text and picture; L_codebook measures the difference between the replaced feature vectors and the input feature vectors in order to optimize the cross-modal codebook; λ_1 and λ_2 are the weights of the two loss functions L_recon and L_codebook in the overall loss function.
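Formula (17) is a plain weighted sum, sketched below with the preferred weights λ_1 = λ_2 = 0.2 from this document. The function name `total_loss` and the three example loss values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def total_loss(l_align, l_recon, l_codebook, lam1=0.2, lam2=0.2):
    """L_total = L_align + lam1 * L_recon + lam2 * L_codebook, as in formula (17)."""
    return l_align + lam1 * l_recon + lam2 * l_codebook

# Hypothetical per-batch loss values.
loss = total_loss(l_align=1.0, l_recon=0.5, l_codebook=0.25)
print(round(loss, 4))
```

With both weights at 0.2, the alignment loss dominates and the reconstruction and codebook terms act as auxiliary objectives.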

step 5.2, optimizing model parameters by using a back propagation algorithm and a gradient descent algorithm; the batch size is set to 64, the Adam optimizer is used, and the initial learning rate of the network is set to 4 × 10^-5; in the first ten training epochs the learning rate is increased linearly from 4 × 10^-6 to 4 × 10^-5; the learning rate is reduced to 0.1 times its value at the 50th and 80th epochs, for a total of 120 training epochs.
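The learning-rate schedule of step 5.2 can be written as a small function. This sketch assumes 0-indexed epochs, a warm-up that reaches the initial rate exactly at epoch 10, and reductions by a factor of 0.1 that compound at epochs 50 and 80; those indexing details are assumptions, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base=4e-5, warm_start=4e-6, warm_epochs=10,
                  milestones=(50, 80), gamma=0.1):
    """Learning rate for a 0-indexed epoch: linear warm-up, then step decay."""
    if epoch < warm_epochs:
        # Linear growth from warm_start toward base over the warm-up epochs.
        return warm_start + (base - warm_start) * epoch / warm_epochs
    # Multiply by gamma once for every milestone already reached.
    drops = sum(1 for m in milestones if epoch >= m)
    return base * gamma ** drops

schedule = [learning_rate(e) for e in range(120)]   # 120 training epochs in total
print(schedule[0], schedule[10], schedule[50], schedule[80])
```

Under these assumptions the rate starts at 4 × 10^-6, reaches 4 × 10^-5 after the warm-up, and ends at 4 × 10^-7 after the second reduction.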

step 6, during testing, the features after the backbone network and the mapping layer are selected; the text features serve as the input and the picture features as the query set, and the corresponding retrieval result is obtained by calculating cosine similarity and then sorting.
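The retrieval of step 6 amounts to ranking the pictures by cosine similarity to the text feature. The following is a minimal sketch with toy 2-dimensional global features; the function names `cosine` and `rank_pictures` and all feature values are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_pictures(text_feature, picture_features):
    """Return picture indices sorted from most to least similar to the text."""
    sims = [cosine(text_feature, v) for v in picture_features]
    return sorted(range(len(picture_features)), key=lambda i: sims[i], reverse=True)

text_feature = [1.0, 0.0]                                  # toy text global feature
picture_features = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]    # toy picture global features
ranking = rank_pictures(text_feature, picture_features)
print(ranking)
```

Since every picture and every sentence is encoded once, the similarity step only involves cheap vector products.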

Preferably, in step 1.1, the picture is scaled to a size of 384px × 128px, data enhancement such as horizontal flipping and randomly adding Gaussian noise is performed, and the picture is then divided into 192 small blocks, each of size 16px × 16px.

Preferably, the proportion of picture blocks randomly selected for masking in step 1.2 is 5%, 10% or 15%.

Preferably, the proportion of text blocks randomly selected for masking in step 1.3 is 20%, 25%, 30% or 35%.

Preferably, the visual encoder E_V of step 2.1 is a ResNet network or a Vision Transformer network.

Preferably, the text encoder E_T of step 2.2 is a BERT network, an LSTM network or a Bi-LSTM network.

Preferably, the number of feature vectors in the codebook of step 3.1 is 512, 1024 or 2048.

Preferably, the codebook update weight λ_mom of step 3.3 is set to 0.8.

Preferably, the weights λ_1 and λ_2 of the loss functions L_recon and L_codebook of step 5.1 are both set to 0.2.

Preferably, step 5.2 specifically comprises: setting the batch size to 64, using the Adam optimizer, and setting the initial learning rate of the network to 4 × 10^-5; in the first ten training epochs the learning rate is increased linearly from 4 × 10^-6 to 4 × 10^-5; the learning rate is reduced to 10% of its value at the 50th and 80th epochs, for a total of 120 training epochs.

The working principle of the invention is as follows: matching accuracy is improved mainly through two modules. First, a mask module enables the model to reconstruct the original pictures and text while extracting features, so that the learned feature vectors carry high-level semantic information. Second, a discrete cross-modal codebook aligns text features and picture features by having the features of both modalities look up their nearest tokens in the same codebook, thereby improving the accuracy of cross-modal retrieval and matching on pedestrian data.

The simple mask strategy provided by the invention can be applied to any network; reconstructing the original image and text improves feature learning, so that the features extracted by the feature extraction network carry rich semantic information. Secondly, instead of the complex feature alignment schemes used by other methods, the invention establishes a discrete codebook that stores visual tokens and aligns the text features and the picture features at the semantic level through this codebook, which improves the ability of the model to align cross-modal features. Furthermore, compared with a complex multi-modal model (whose time complexity is O(mn)), the invention adopts a dual-stream feature extraction network: only a text feature extraction network and a picture feature extraction network are needed to extract text and picture features separately, and the time complexity is O(m+n). The invention can therefore achieve the fastest matching and retrieval for a given backbone network while improving feature extraction and feature alignment.
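The O(m+n) versus O(mn) comparison can be made concrete by counting network forward passes. This is a simplified sketch that assumes a joint multi-modal model needs one forward pass per picture-sentence pair, whereas the dual-stream network encodes each picture and each sentence once; the function names and the sizes m = 1000 and n = 50 are hypothetical.

```python
def dual_stream_passes(m, n):
    """Each of m pictures and n sentences is encoded exactly once: m + n passes."""
    return m + n

def joint_model_passes(m, n):
    """Every picture-sentence pair goes through the joint model: m * n passes."""
    return m * n

m, n = 1000, 50   # hypothetical gallery of 1000 pictures and 50 query sentences
print(dual_stream_passes(m, n), joint_model_passes(m, n))
```

For these sizes the dual-stream design needs 1050 forward passes against 50000 for the pairwise model; the remaining similarity computation is a set of inexpensive vector products.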

The invention has the following advantages: 1. The model can learn semantic information in both the text modality and the visual modality, which provides a theoretical basis for cross-modal alignment. 2. Establishing a cross-modal discrete codebook facilitates alignment of the features of the two modalities, and is simpler and easier to understand than achieving modal alignment through a complex attention mechanism. 3. Using a simple dual-stream network minimizes the time complexity, which is O(m+n) for m pictures and n sentences. 4. On the CUHK-PEDES dataset, rank@1 accuracy is improved by 1.98% and rank@10 accuracy by 1.31%; on the ICFG-PEDES dataset, rank@1 accuracy is improved by 1.78% and rank@10 accuracy by 2.4%.

Drawings

FIG. 1 is an overall flow chart of the method of the present invention for cross-modal pedestrian retrieval.

Detailed Description

The invention will be further described with reference to the drawings and the implementation method.

A cross-modal pedestrian retrieval method based on a mask model and a cross-modal codebook is implemented as shown in fig. 1, specifically according to the following steps:

1) The data set is read, and each pair of matched text description and picture is used as data input of a model.

11) The picture is scaled to a size of 384px × 128px, and data enhancement such as horizontal flipping and randomly adding Gaussian noise is performed.

12) The picture is then divided into 192 small blocks of size 16px × 16px. Then 5%, 10% or 15% of the small blocks are randomly selected and covered with a unified mask token.

13) The text description is input into a tokenizer, which converts words and phrases into corresponding numbers. At the same time, 20%, 25%, 30% or 35% of the numbers are randomly selected and covered with a unified mask token.

2) As shown in fig. 1, the processed masked pictures and descriptive text are input into feature encoders of both modalities.

21) The visual feature extraction backbone network uses the visual encoder Vision Transformer-base, which consists mainly of normalization (Norm) layers, multi-head self-attention layers and fully connected (MLP) layers, and loads model parameters pre-trained on the ImageNet dataset. The text feature extraction backbone uses the text encoder BERT-base, whose main structure is similar to that of the Vision Transformer, and likewise loads a pre-trained model for initialization. When the visual encoder is a ResNet, it is mainly composed of convolutional (CNN) layers and residual blocks; when the text encoder is an LSTM or Bi-LSTM, it is mainly composed of a recurrent neural network (RNN).

22) After the feature extraction networks of the two modalities, the visual feature F_V and the text feature F_T are obtained. To facilitate alignment of the global features of the two modalities, F_V and F_T are input into two mapping layers. The visual mapping layer consists of a single Linear layer followed by a global max pooling layer, which yields the visual global feature F_V1. For the text feature, three stacked residual blocks are used; within each residual block the input passes through two branches, one consisting of three 1×1 convolutions with ReLU activation functions and the other of a single 1×1 convolution with a ReLU activation function, and the outputs of the two branches are added. The resulting feature is finally input to an average pooling layer to obtain the text global feature F_T1.

23) For the resulting global features of the two modalities, a CMPC loss function and a CMPM loss function are calculated to measure the distance between matching text-picture pairs and its relation to the distance between non-matching text-picture pairs. The CMPC loss function is expressed as follows:

L_cmpc = L_tpi + L_ipt    (4)

The CMPM loss function is expressed as follows:

L_cmpm = L_i2t + L_t2i    (10)

where e is a very small positive number that prevents division by 0.

24) Overall, the loss function that measures the difference between the features of the two modalities and helps align them is expressed as follows:

L_align = L_cmpc + L_cmpm    (18)

3) The visual feature F_V and the text feature F_T obtained in step 2 through the feature extraction backbone networks (the dimension of F_V is 24 × 8 × 768, and the dimension of F_T is 100 × 768) are input into the cross-modal codebook.

31) The visual feature F_V and the text feature F_T together contain (24 × 8 + 100) feature vectors with 768 channels; the feature vectors in the codebook also have 768 channels, the same as the visual and text features, and the codebook is set to contain 512, 1024 or 2048 feature vectors. The distances between the feature vectors in the codebook and all the text and visual feature vectors are then calculated, and each feature vector of F_V and F_T is replaced by the codebook feature vector closest to it. The search is given by the following formula:

wherein z_i represents a feature vector of the visual feature F_V or the text feature F_T, c_i represents a feature vector in the codebook, and K represents the number of feature vectors in the codebook, for which the values 512, 1024 and 2048 are used.

32) After the vectors in the original visual feature F_V and text feature F_T are replaced with feature vectors from the codebook, a new visual feature F_V2 and a new text feature F_T2 are obtained. Because the replacement features are discrete and the replacement process is not differentiable, a straight-through gradient estimator is required to back-propagate the gradient to the preceding module, as shown in the following formula:

where sg(·) represents the stop-gradient operator and l_2 represents a normalization operation.

33) After the input feature vectors are replaced, the features in the codebook are synchronously updated by momentum, using the following formula:

4) The visual feature F_V2 and the text feature F_T2 obtained through the cross-modal codebook are input into the decoders, and a self-supervised learning method is used so that the model reconstructs the input images and text.

41) The visual feature F_V2 is input into the visual decoder. The visual decoder used in the present invention is a single deconvolution layer, which restores the height, width and number of channels of F_V2 to match the input image. The model is guided to perform self-supervised learning through a reconstruction loss, which measures the difference between the picture recovered by the decoder and the input picture; the loss function is calculated as follows:

42) The text classifier selected by the invention is the text classifier of the pre-trained BERT text encoder model. The input features are classified by the final Linear layer of the text classifier in order to determine whether the model can correctly restore the words masked at the input stage; the loss function of the text mask model is calculated as follows:

5) According to the three loss functions, a back propagation algorithm and a gradient descent algorithm are used to optimize the model.

51) According to the actual input and the expected output, the overall error formula is obtained:

L_total = L_align + λ_1 L_recon + λ_2 L_codebook    (17)

in which L_align is the CMPC and CMPM loss function that measures the degree of alignment of the two modalities, and L_recon is the loss function that measures the difference between the text and picture reconstructed by the model and the original, unmasked text and picture. L_codebook is used to optimize the cross-modal codebook and measures the difference between the replaced feature vectors and the input feature vectors. λ_1 and λ_2 are the weights of L_recon and L_codebook in the overall loss function and are both set to 0.2 during training; for the codebook update, λ_mom is set to 0.8.

52) A back propagation algorithm and a gradient descent algorithm are used to optimize the model parameters. The batch size is set to 64, the Adam optimizer is used, and the initial learning rate of the network is set to 4 × 10^-5; in the first ten training epochs the learning rate is increased linearly from 4 × 10^-6 to 4 × 10^-5; the learning rate is reduced to 10% of its value at the 50th and 80th epochs, for a total of 120 training epochs.

6) When testing on the validation set, only the visual backbone network and the text backbone network are used to extract features from the inputs of the two modalities. For N given text descriptions and a query set consisting of M pictures, the two backbone networks extract N text features and M image features, respectively.

61) Cosine similarity is then used to calculate the similarity matrix between the N descriptions and the M images, which has size N × M. The pictures are sorted by similarity to obtain, for each text input, the ten pictures with the highest matching score; these are compared with the ground truth, and the accuracy-related metrics are calculated.
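The rank@k accuracy computed from the sorted results can be sketched as follows, assuming that a query counts as a hit when at least one of its top-k pictures shows the described person. The function name `rank_at_k`, the identity labels and the two toy rankings are hypothetical.

```python
def rank_at_k(rankings, picture_ids, query_ids, k):
    """Fraction of text queries whose top-k pictures contain the correct identity.

    rankings[q] lists picture indices for query q, best match first.
    """
    hits = 0
    for ranking, true_id in zip(rankings, query_ids):
        top_ids = [picture_ids[i] for i in ranking[:k]]
        if true_id in top_ids:
            hits += 1
    return hits / len(query_ids)

picture_ids = ["a", "b", "c"]          # identity shown in each picture
query_ids = ["a", "b"]                 # identity described by each text query
rankings = [[0, 1, 2], [2, 0, 1]]      # sorted picture indices per query
r1 = rank_at_k(rankings, picture_ids, query_ids, k=1)
r3 = rank_at_k(rankings, picture_ids, query_ids, k=3)
print(r1, r3)
```

In this toy case the first query is answered correctly at rank 1, while the second finds its identity only in third place.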

62) To verify the accuracy and effectiveness of the method of the present invention in practical application, the CMC metric is used to represent the retrieval results, and the rank@1, rank@5 and rank@10 accuracy values are calculated on the CUHK-PEDES and ICFG-PEDES datasets to evaluate the performance of the algorithm; the experimental results are shown in Tables 1 and 2. The following conclusions can be drawn from the experimental results: (1) The accuracy is clearly improved. The mask-based model of the invention brings an improvement on both the CUHK-PEDES dataset and the more challenging ICFG-PEDES dataset. On the CUHK-PEDES dataset, rank@1 accuracy is improved by 1.98% and rank@10 accuracy by 1.31%; on the ICFG-PEDES dataset, rank@1 accuracy is improved by 1.78% and rank@10 accuracy by 2.4%. (2) The model can reconstruct the input text and image while the retrieval performance improves, which shows that the model of the invention improves retrieval accuracy by improving its feature learning and alignment capabilities.

TABLE 1 search results on CUHK-PEDES dataset

TABLE 2 search results on ICFG-PEDES dataset

Claims

1. A text pedestrian searching method based on a self-supervision mask model and a cross-modal codebook, characterized in that a proportion of the picture blocks and text blocks is covered and a cross-modal codebook is created, the method comprising the following steps:

wherein the CMPC loss function is expressed as follows:

L_cmpc = L_tpi + L_ipt    (4)

the CMPM loss function is expressed as follows:

L_cmpm = L_i2t + L_t2i    (10)

e is a very small positive number, preventing division by 0;

step 3, the visual feature F_V and the text feature F_T obtained in step 2 through the feature extraction backbone networks are input into a cross-modal codebook; the dimension of the visual feature F_V is (h/p) × (w/p) × D, and the dimension of the text feature F_T is L × D, where L is the length of the text and D is the number of channels, which is the same for the visual feature and the text feature; the features obtained in step 2 are further processed according to the following steps:

step 3.2, after the vectors in the original visual feature F_V and text feature F_T are replaced with feature vectors from the codebook, a new visual feature F_V2 and a new text feature F_T2 are obtained; because the replacement feature vectors are discrete and the replacement process is not differentiable, a straight-through gradient estimator is required to back-propagate the gradient to the preceding module, the specific method being as follows:

step 4, reconstructing the input pictures and text:

wherein Ω_T is a function that counts how many tokens x_T contains, x_T is the visual feature, and y_T is the correct text label;

L_total = L_align + λ_1 L_recon + λ_2 L_codebook    (17)

in which L_align is the CMPC and CMPM loss function that measures the degree of alignment of the two modalities; L_recon is the loss function that measures the difference between the text and picture reconstructed by the model and the original, unmasked text and picture; L_codebook measures the difference between the replaced feature vectors and the input feature vectors in order to optimize the cross-modal codebook; λ_1 and λ_2 are the weights of the two loss functions L_recon and L_codebook in the overall loss function;

step 5.2, optimizing model parameters by using a back propagation algorithm and a gradient descent algorithm;

2. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: in step 1.1, the picture is scaled to a size of 384px × 128px, data enhancement by horizontal flipping and randomly adding Gaussian noise is performed, and the picture is then divided into 192 small blocks, each of size 16px × 16px.

3. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the proportion of picture blocks randomly selected for masking in step 1.2 is 5%, 10% or 15%.

4. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the proportion of text blocks randomly selected for masking in step 1.3 is 20%, 25%, 30% or 35%.

5. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the visual encoder E_V of step 2.1 is a ResNet network or a Vision Transformer network.

6. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the text encoder E_T of step 2.2 is a BERT network, an LSTM network or a Bi-LSTM network.

7. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the number of feature vectors in the codebook of step 3.1 is 512, 1024 or 2048.

8. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the weights λ_1 and λ_2 of the loss functions L_recon and L_codebook of step 5.1 are both set to 0.2.

9. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: step 5.2 specifically comprises: setting the batch size to 64, using the Adam optimizer, and setting the initial learning rate of the network to 4 × 10^-5; in the first ten training epochs the learning rate is increased linearly from 4 × 10^-6 to 4 × 10^-5; the learning rate is reduced to 10% of its value at the 50th and 80th epochs, for a total of 120 training epochs.