CN116343109A - Text pedestrian searching method based on self-supervision mask model and cross-mode codebook


Info

Publication number
CN116343109A
CN116343109A (application number CN202310093067.3A)
Authority
CN
China
Prior art keywords
text
feature
codebook
picture
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310093067.3A
Other languages
Chinese (zh)
Inventor
吴一鸣
潘企何
高楠
梁荣华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310093067.3A
Publication of CN116343109A
Legal status: Pending

Classifications

    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; coarse-fine approaches, e.g. multi-scale approaches; using context analysis; selection of dictionaries
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Optimization (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Algebra (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A text pedestrian searching method based on a self-supervision mask model and a cross-modal codebook comprises the following steps: the input text and picture are masked and fed into feature extraction backbone networks to obtain a visual feature F_V and a text feature F_T; F_V and F_T are then input into mapping layers to obtain a global picture feature F_V1 and a global text feature F_T1, which are aligned. At the same time, F_V and F_T are input into a cross-modal codebook, where each feature vector is replaced by the nearest feature in the codebook; the replaced features are then fed into a picture decoder and a text classification network, and the outputs are compared with the original inputs. The invention improves the feature learning capability of the model and its ability to align the features of the two modalities.

Description

Text pedestrian searching method based on self-supervision mask model and cross-mode codebook
Technical Field
The invention relates to the field of cross-modal retrieval, and in particular to a feature alignment method based on masks and a cross-modal codebook.
Background
Text-based pedestrian search aims to match a text description query with the correct pedestrian images, and has great potential in surveillance systems, activity analysis and intelligent photo albums. In most cases a text description is easier to obtain than an image query (image-based pedestrian re-identification), which has made text-based person search popular in recent years. Methods for this cross-modal retrieval problem fall mainly into two categories: one learns feature representations, and the other extracts the features of the two modalities and then performs feature alignment.
To better learn suitable features from pictures and text, some works use a generative adversarial network to colorize person images (after first converting them to grayscale) guided by the text description; others use self-supervised learning to obtain prior knowledge through CLIP and then pass it into a cross-modal momentum contrastive learning framework. At the same time, to bridge the gap between the two modalities, many works use an attention mechanism to help align text and image features. This may require a pre-trained object detection model, or picture information for each body position obtained by manually setting regions, which is then fed together with the corresponding text information into an attention module to achieve feature alignment. This undoubtedly places significant computational pressure on training and testing. To improve the accuracy of cross-modal pedestrian retrieval, it is therefore essential to solve feature alignment and feature learning between the two modalities.
Disclosure of Invention
In order to overcome the shortcomings of the prior art in cross-modal feature learning and alignment, the invention provides a method that combines a mask model with a cross-modal codebook to enhance the model's feature learning and alignment capability, thereby improving the accuracy of cross-modal pedestrian retrieval.
To achieve the above purpose, the invention discloses a text pedestrian searching method based on a self-supervision mask model and a cross-modal codebook, which adopts the following technical scheme:
step 1, reading a data set, and inputting each pair of matched text description and pictures as data of a model;
step 1.1, first scaling the picture to a preset size and applying data augmentation such as horizontal flipping and random Gaussian noise, then dividing the picture into (h/p) x (w/p) square patches, where p is the side length of each patch and h and w are the height and width of the picture, respectively;
step 1.2, randomly selecting part of picture blocks, and covering by using a unified mask token;
step 1.3, inputting the text description into a tokenizer, which converts words and phrases into corresponding token IDs; meanwhile, randomly selecting part of the text tokens and covering them with a unified mask token;
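A minimal sketch of the masking in step 1 follows, assuming a PyTorch-style implementation (the patent does not name a framework); helper names such as mask_image_patches and the zero-valued placeholder mask token are illustrative, and the default ratios are taken from the ranges given later (5-15% for picture patches, 20-35% for text tokens).

```python
# Hypothetical sketch of step 1: patchify the picture and randomly mask patches/tokens.
import torch

def mask_image_patches(img, patch=16, mask_ratio=0.10, mask_token=None):
    """img: (C, H, W) tensor; returns (N, C*patch*patch) patches with a fraction masked."""
    c, h, w = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)     # (C, h/p, w/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    n = patches.shape[0]                                              # (h/p)*(w/p) patches
    if mask_token is None:
        mask_token = torch.zeros(c * patch * patch)                   # placeholder "unified mask token"
    idx = torch.randperm(n)[: int(n * mask_ratio)]                    # randomly chosen patches
    patches[idx] = mask_token
    return patches, idx

def mask_text_tokens(token_ids, mask_id, mask_ratio=0.25):
    """token_ids: (L,) LongTensor from the tokenizer; replaces a fraction with mask_id."""
    ids = token_ids.clone()
    idx = torch.randperm(ids.shape[0])[: int(ids.shape[0] * mask_ratio)]
    ids[idx] = mask_id
    return ids, idx
```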
step 2, inputting the processed picture covered by the mask and the description text into feature encoders of two modes; the method specifically comprises the following steps:
step 2.1, the visual backbone network E_V loads model parameters pre-trained on the ImageNet dataset and processes the image input to obtain the visual feature F_V;
step 2.2, the text backbone network E_T likewise loads pre-trained model parameters and processes the text input to obtain the text feature F_T;
Step 2.3, respectively inputting the two features into a mapping layer to obtain global features of the two modes;
step 2.4, calculating a CMPC loss function and a CMPM loss function on the obtained global features of the two modalities to measure the distance between matched text-picture pairs and the relative distances of unmatched pairs;
wherein the CMPC loss is given by equations (1)-(3) together with
L_cmpc = L_tpi + L_ipt    (4)
and the CMPM loss is given by equations (5)-(9) together with
L_cmpm = L_i2t + L_t2i    (10)
(equations (1)-(3), (5)-(9) and the auxiliary definition following them are rendered as images in the original publication and are not reproduced here); x_i denotes a visual feature, z_i a text feature, W_j a weight matrix, y_{i,j} indicates whether the input picture and text form a matching pair, and ε is a very small positive number that prevents division by zero;
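For reference, below is a hedged reconstruction of the standard CMPC/CMPM formulation from the cross-modal projection learning literature (Zhang and Lu, ECCV 2018), whose notation matches the symbols defined above; the patent's image-rendered equations (1)-(3) and (5)-(9) are assumed, not confirmed, to follow this form (L_t2i and L_tpi are obtained by swapping the roles of the two modalities).

```latex
% Assumed standard CMPM / CMPC losses; N is the batch size.
\begin{align*}
\bar{z}_j &= \frac{z_j}{\lVert z_j\rVert}, \qquad
p_{i,j} = \frac{\exp(x_i^{\top}\bar{z}_j)}{\sum_{k}\exp(x_i^{\top}\bar{z}_k)}, \qquad
q_{i,j} = \frac{y_{i,j}}{\sum_{k} y_{i,k}}, \qquad
L_{i2t} = \frac{1}{N}\sum_{i}\sum_{j} p_{i,j}\,\log\frac{p_{i,j}}{q_{i,j}+\epsilon},\\
\hat{x}_i &= (x_i^{\top}\bar{z}_i)\,\bar{z}_i, \qquad
L_{ipt} = -\frac{1}{N}\sum_{i}\log\frac{\exp(W_{y_i}^{\top}\hat{x}_i)}{\sum_{j}\exp(W_j^{\top}\hat{x}_i)}, \qquad
L_{cmpm} = L_{i2t}+L_{t2i}, \qquad
L_{cmpc} = L_{ipt}+L_{tpi}.
\end{align*}
```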
step 3, inputting the visual feature F_V and the text feature F_T obtained by the feature extraction backbone networks in step 2 into the cross-modal codebook; F_V has dimensions (h/p) x (w/p) x D and F_T has dimensions L x D, where L is the length of the text and D is the number of channels, which is the same for the visual and text features; specifically, the features obtained in step 2 are further processed according to the following steps;
step 3.1, the visual feature F_V and the text feature F_T together contain ((h/p) x (w/p) + L) feature vectors, whose number of channels equals that of the feature vectors in the codebook; the distances between the codebook feature vectors and all text and visual feature vectors are then computed, and each vector of F_V and F_T is replaced by the closest feature vector found in the codebook; the lookup, given by equations (11)-(12) (rendered as images in the original publication), selects for each z_i the codebook entry c_k with k = argmin_j ||z_i - c_j||_2, where z_i denotes a feature vector of F_V or F_T, c_j a feature vector in the codebook, and K the number of feature vectors in the codebook;
step 3.2, after the original vectors of the visual feature F_V and the text feature F_T have been replaced by codebook vectors, a new visual feature F_V2 and a new text feature F_T2 are obtained; because the replacement feature vectors are discrete and the replacement process is not differentiable, a straight-through gradient estimator is needed to propagate the gradient back to the preceding modules, as shown in equation (13) (rendered as an image in the original publication), where sg(·) denotes the stop-gradient operator and l2 denotes a normalization operation;
step 3.3, after the input feature vectors have been replaced, the features in the codebook are updated with momentum according to equation (14) (rendered as an image in the original publication), where λ_mom is the weight of the codebook update and c_h is a feature vector in the codebook;
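A minimal sketch of steps 3.1-3.3, assuming a vector-quantization-style nearest-neighbour lookup with a straight-through estimator; since equation (14) is not reproduced in the source, the exponential-moving-average update below (with λ_mom = 0.8) is an assumed form, and the commitment-style MSE term stands in for L_codebook.

```python
# Hypothetical sketch of steps 3.1-3.3: cross-modal codebook lookup, straight-through
# gradients, and a momentum (EMA) codebook update. The EMA form of eq. (14) and the
# MSE codebook loss are assumptions; only lambda_mom = 0.8 and the codebook sizes are stated.
import torch
import torch.nn.functional as F

class CrossModalCodebook(torch.nn.Module):
    def __init__(self, num_codes=1024, dim=768, lambda_mom=0.8):
        super().__init__()
        self.lambda_mom = lambda_mom
        self.register_buffer("codes", F.normalize(torch.randn(num_codes, dim), dim=-1))

    def forward(self, feats):                        # feats: (N, D), F_V and F_T concatenated
        z = F.normalize(feats, dim=-1)               # l2-normalise inputs, as in eq. (13)
        c = F.normalize(self.codes, dim=-1)
        dist = torch.cdist(z, c)                     # (N, K) pairwise distances
        idx = dist.argmin(dim=-1)                    # nearest codebook entry (eqs. 11-12)
        quantized = c[idx]
        # straight-through: forward uses quantized codes, backward copies gradients to z
        out = z + (quantized - z).detach()
        # difference between replaced and input feature segments, used as L_codebook
        codebook_loss = F.mse_loss(z, quantized.detach())
        if self.training:                            # momentum update of the assigned codes
            with torch.no_grad():
                self.codes[idx] = (self.lambda_mom * self.codes[idx]
                                   + (1 - self.lambda_mom) * z)
        return out, codebook_loss
```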
step 4, reconstructing the input pictures and text:
step 4.1, the image decoder uses a single-layer deconvolution network to restore the picture to the input size and channel number; the result is then compared with the original picture and a reconstruction loss is calculated according to equation (15) (rendered as an image in the original publication);
step 4.2, the text classifier pre-trained with the text encoder E_T is selected (and fine-tuned during the training stage); the features are classified through the final Linear layer of the text classifier, the masked input is predicted, and a classification loss is calculated according to equation (16) (rendered as an image in the original publication), where Ω_T is a function counting the number of tokens in x_T, x_T is the feature input to the classifier, and y_T is the correct text label;
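A minimal sketch of the two reconstruction heads in step 4; because equations (15) and (16) are not reproduced in the source, the mean-squared-error image loss and the cross-entropy loss over masked tokens below are assumptions, and the class and function names are illustrative.

```python
# Hypothetical sketch of step 4: single-layer deconvolution image decoder and a
# masked-token classification head. The MSE / cross-entropy forms of eqs. (15)-(16)
# are assumptions; the source renders those equations only as images.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageDecoder(nn.Module):
    def __init__(self, dim=768, patch=16, out_channels=3):
        super().__init__()
        # one deconvolution expands each patch feature back to a patch x patch pixel block
        self.deconv = nn.ConvTranspose2d(dim, out_channels, kernel_size=patch, stride=patch)

    def forward(self, f_v2, grid_hw=(24, 8)):        # f_v2: (B, N, D) quantised patch features
        b, n, d = f_v2.shape
        x = f_v2.transpose(1, 2).reshape(b, d, *grid_hw)
        return self.deconv(x)                        # (B, 3, H, W), e.g. (B, 3, 384, 128)

def reconstruction_loss(pred_img, target_img):
    return F.mse_loss(pred_img, target_img)          # assumed form of eq. (15)

def masked_text_loss(token_logits, target_ids, masked_idx):
    # token_logits: (B, L, V) from the classifier's final Linear layer
    logits = token_logits[:, masked_idx, :]          # score only the masked positions
    targets = target_ids[:, masked_idx]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))      # assumed form of eq. (16)
```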
step 5, optimizing the model by using a back propagation algorithm and a gradient descent algorithm according to the three loss functions in the step 2, the step 3 and the step 4; the method specifically comprises the following steps:
step 5.1, obtaining an overall error formula according to the actual input and the expected output, wherein the formula is as follows:
L_total = L_align + λ1·L_recon + λ2·L_codebook    (17)
where L_align is the combined CMPC and CMPM loss measuring the alignment of the two modalities; L_recon is the loss measuring the difference between the reconstructed text and picture and the original, unmasked text and picture; L_codebook is the loss used to optimize the cross-modal codebook, measuring the difference between the replaced feature segments and the input feature segments; and λ1 and λ2 are the weights of L_recon and L_codebook in the overall loss.
Step 5.2, optimizing the model parameters with a back-propagation algorithm and gradient descent; the batch size is set to 64 and the Adam optimizer is used with an initial network learning rate of 4×10^-5; during the first ten training epochs the learning rate is linearly warmed up from 4×10^-6 to 4×10^-5, and it is reduced to one tenth of its value at the 50th and 80th epochs, for a total of 120 training epochs.
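A minimal sketch of the optimization schedule in step 5.2, assuming PyTorch; the warm-up range, decay epochs, weights and batch size come from the text, while composing them with LambdaLR (and compounding the two decays) is an illustrative choice.

```python
# Hypothetical sketch of step 5.2: Adam with linear warm-up (epochs 0-9) and 0.1x
# decay at epochs 50 and 80, over 120 epochs. Whether the decays compound is not
# specified in the source; compounding is assumed here.
import torch

def build_optimizer_and_scheduler(model):
    opt = torch.optim.Adam(model.parameters(), lr=4e-5)

    def lr_lambda(epoch):
        if epoch < 10:                               # linear warm-up from 4e-6 to 4e-5
            return 0.1 + 0.9 * epoch / 10
        factor = 1.0
        if epoch >= 50:
            factor *= 0.1                            # decay at epoch 50
        if epoch >= 80:
            factor *= 0.1                            # decay again at epoch 80
        return factor

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

def total_loss(l_align, l_recon, l_codebook, lam1=0.2, lam2=0.2):
    # eq. (17): L_total = L_align + lambda1 * L_recon + lambda2 * L_codebook
    return l_align + lam1 * l_recon + lam2 * l_codebook
```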
And step 6, during testing of the model, the features after the backbone networks and mapping layers are selected, the features of the two modalities serve as the query and gallery sets respectively, and the corresponding retrieval result is obtained by computing cosine similarity and then ranking.
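A minimal sketch of the retrieval in step 6: both global feature sets are l2-normalized so that a matrix product gives cosine similarities, which are then ranked; the function name and top-k default are illustrative.

```python
# Hypothetical sketch of step 6: rank gallery images for each text query by cosine similarity.
import torch
import torch.nn.functional as F

def retrieve(text_feats, image_feats, topk=10):
    """text_feats: (N, D) global text features; image_feats: (M, D) global image features."""
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    sim = t @ v.t()                                  # (N, M) cosine-similarity matrix
    return sim.topk(topk, dim=-1).indices            # indices of the top-k images per query
```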
Preferably, in step 1.1 the picture is scaled to 384px x 128px, data augmentation such as horizontal flipping and random Gaussian noise is applied, and the picture is then divided into 192 patches of size 16px x 16px each.
Preferably, the proportion of randomly selected picture patches masked in step 1.2 is 5%, 10% or 15%.
Preferably, the proportion of randomly selected text tokens masked in step 1.3 is 20%, 25%, 30% or 35%.
Preferably, the visual encoder E_V of step 2.1 is a ResNet network or a Vision Transformer network.
Preferably, the text encoder E_T of step 2.2 is a BERT network, an LSTM network or a Bi-LSTM network.
Preferably, the number of feature vectors in the codebook of step 3.1 is 512, 1024 or 2048.
Preferably, the codebook update weight λ_mom of step 3.3 is set to 0.8.
Preferably, the weights λ1 and λ2 of the loss functions L_recon and L_codebook in step 5.1 are both set to 0.2.
Preferably, step 5.2 specifically comprises: setting the batch size to 64, using the Adam optimizer with an initial network learning rate of 4×10^-5, linearly increasing the learning rate from 4×10^-6 to 4×10^-5 over the first ten training epochs, and reducing the learning rate to 10% of its value at the 50th and 80th epochs, for a total of 120 training epochs.
The working principle of the invention is as follows: matching accuracy is improved mainly through two modules. First, the mask module lets the model learn to reconstruct the original picture and original text while extracting features, so that the learned feature vectors carry the corresponding high-level semantic information. Second, the discrete cross-modal codebook aligns text features and picture features by making the features of the two modalities find the same nearest tokens in the codebook, which improves the accuracy of cross-modal retrieval and matching on pedestrian data.
The simple mask strategy provided by the invention can be applied to any network; by reconstructing the original image and the original text it improves feature learning, so that the features extracted by the feature extraction networks carry rich semantic information. Second, compared with other methods that use complex feature alignment schemes, the invention establishes a discrete codebook storing visual tokens and aligns the text features and picture features at the semantic level through this codebook, improving the model's ability to align cross-modal features. Furthermore, compared with complex multi-modal models (whose time complexity is O(mn)), the invention adopts a dual-stream feature extraction network in which a text feature extraction network and a picture feature extraction network extract text and picture features separately, giving a time complexity of O(m+n); with the same backbone it therefore achieves the fastest matching and retrieval while improving feature extraction and feature alignment.
The invention has the following advantages: 1. The method lets the model learn the semantic information in both the text modality and the visual modality, providing a basis for cross-modal alignment. 2. The cross-modal discrete codebook facilitates the alignment of the features of the two modalities, and is simpler and easier to understand than achieving modal alignment through a complex attention mechanism. 3. A simple dual-stream network minimizes the time complexity: O(m+n) for m pictures and n sentences. 4. On the CUHK-PEDES dataset, the Rank@1 accuracy improves by 1.98% and the Rank@10 accuracy by 1.31%; on the ICFG-PEDES dataset, the Rank@1 accuracy improves by 1.78% and the Rank@10 accuracy by 2.4%.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention for cross-modal pedestrian retrieval.
Detailed Description
The invention will be further described with reference to the drawings and the implementation method.
A cross-modal pedestrian retrieval method based on a mask model and a cross-modal codebook is implemented as shown in fig. 1, specifically according to the following steps:
1) The data set is read, and each pair of matched text description and picture is used as data input of a model.
1.1) The picture is scaled to 384px x 128px, and data augmentation such as horizontal flipping and random Gaussian noise is applied.
1.2) The picture is then divided into 192 patches of size 16px x 16px, and 5%, 10% or 15% of the patches are randomly selected and covered with a unified mask token.
1.3) The text description is input into a tokenizer, which converts words and phrases into corresponding token IDs. At the same time, 20%, 25%, 30% or 35% of the token IDs are randomly selected and covered with a unified mask token.
2) As shown in fig. 1, the processed masked pictures and descriptive text are input into feature encoders of both modalities.
2.1) The visual feature extraction backbone uses the visual encoder Vision Transformer-base, which consists mainly of normalization (Norm) layers, multi-head self-attention layers and fully connected MLP layers, and loads model parameters pre-trained on the ImageNet dataset. The text feature extraction backbone uses the text encoder Bert-base, whose main structure is similar to that of the Vision Transformer, and it likewise loads a pre-trained model for initialization. When the visual encoder is a ResNet it consists mainly of convolutional (CNN) layers and residual blocks, and when the text encoder is an LSTM or Bi-LSTM it is based on a recurrent neural network (RNN).
2.2) The two feature extraction networks produce the visual feature F_V and the text feature F_T. To facilitate alignment of the global features of the two modalities, F_V and F_T are input into two mapping layers. The visual mapping layer consists of a single Linear layer followed by a global max-pooling layer, producing the global visual feature F_V1. For the text features, three stacked residual blocks are used: each feature entering a residual block passes on one branch through three 1x1 convolutions with ReLU activations and on the other branch through a single 1x1 convolution with a ReLU activation, the two branches are added, and the result is finally fed to an average-pooling layer to obtain the global text feature F_T1.
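A minimal sketch of the mapping layers described in 2.2), treating the text branch's 1x1 convolutions as 1-D convolutions over the token dimension; the 768-channel width comes from the embodiment, everything else (class names, shapes) is illustrative.

```python
# Hypothetical sketch of the mapping layers in 2.2): a Linear + global max-pool visual
# head and a stack of 1x1-convolution residual blocks + average-pool text head.
import torch
import torch.nn as nn

class VisualMappingLayer(nn.Module):
    def __init__(self, dim=768, out_dim=768):
        super().__init__()
        self.fc = nn.Linear(dim, out_dim)

    def forward(self, f_v):                        # f_v: (B, N, D) patch features
        return self.fc(f_v).max(dim=1).values      # global max-pool over patches -> (B, out_dim)

class TextResidualBlock(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.branch_a = nn.Sequential(             # three 1x1 convolutions with ReLU
            nn.Conv1d(dim, dim, 1), nn.ReLU(),
            nn.Conv1d(dim, dim, 1), nn.ReLU(),
            nn.Conv1d(dim, dim, 1), nn.ReLU(),
        )
        self.branch_b = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.ReLU())

    def forward(self, x):                          # x: (B, D, L) text features
        return self.branch_a(x) + self.branch_b(x)

class TextMappingLayer(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.blocks = nn.Sequential(*[TextResidualBlock(dim) for _ in range(3)])

    def forward(self, f_t):                        # f_t: (B, L, D) token features
        x = self.blocks(f_t.transpose(1, 2))       # -> (B, D, L)
        return x.mean(dim=-1)                      # average-pool over tokens -> (B, D)
```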
2.3) For the resulting global features of the two modalities, a CMPC loss and a CMPM loss are calculated to measure the distance between matching text-picture pairs and the relative distances of non-matching pairs, where
L_cmpc = L_tpi + L_ipt    (4)
L_cmpm = L_i2t + L_t2i    (10)
(equations (1)-(3), (5)-(9) and the auxiliary definition following them are rendered as images in the original publication and are not reproduced here); x_i denotes a visual feature, z_i a text feature, W_j a weight matrix, y_{i,j} indicates whether the input picture and text form a matching pair, and ε is a very small positive number that prevents division by zero.
2.4) Overall, the loss used to measure the difference between the features of the two modalities and to assist their alignment is expressed as follows:
L_align = L_cmpc + L_cmpm    (18)
3) The visual feature F_V and the text feature F_T obtained through the feature extraction backbones in step 2 (F_V has dimensions 24 x 8 x 768 and F_T has dimensions 100 x 768) are input into the cross-modal codebook.
3.1) F_V and F_T together contain (24 x 8 + 100) feature vectors with 768 channels; the feature vectors in the codebook also have 768 channels, and the codebook is set to contain 512, 1024 or 2048 feature vectors. The distances between the codebook feature vectors and all text and visual feature vectors are then computed, and each vector of F_V and F_T is replaced by the closest feature vector found in the codebook; the lookup, given by equations (11)-(12) (rendered as images in the original publication), selects for each z_i the codebook entry c_k with k = argmin_j ||z_i - c_j||_2, where z_i denotes a feature vector of F_V or F_T, c_j a feature vector in the codebook, and K the number of feature vectors in the codebook (512, 1024 or 2048).
3.2) After the original vectors of the visual feature F_V and the text feature F_T have been replaced by codebook vectors, a new visual feature F_V2 and a new text feature F_T2 are obtained. Because the replacement features are discrete and the replacement process is not differentiable, a straight-through gradient estimator is used to propagate the gradient back to the preceding modules, as shown in equation (13) (rendered as an image in the original publication), where sg(·) denotes the stop-gradient operator and l2 denotes a normalization operation.
3.3) After the input feature vectors have been replaced, the features in the codebook are updated with momentum according to equation (14) (rendered as an image in the original publication), where λ_mom is the weight of the codebook update and c_h is a feature vector in the codebook.
4) The visual feature F_V2 and the text feature F_T2 obtained through the cross-modal codebook are input into decoders, and a self-supervised learning objective lets the model reconstruct the input images and text.
4.1) The visual feature F_V2 is input into the visual decoder, which in the present invention is a single deconvolution layer that restores F_V2 to the height, width and channel number of the input image. Self-supervised learning is guided by a reconstruction loss: the difference between the picture recovered by the decoder and the input picture is computed according to equation (15) (rendered as an image in the original publication).
4.2) The text classifier selected by the invention is the classifier head of the pre-trained Bert text encoder; the input features are classified by its final Linear layer, the aim being to check whether the model can correctly restore the words masked at the input stage. The loss of the text mask model is computed according to equation (16) (rendered as an image in the original publication), where Ω_T is a function counting the number of tokens in x_T, x_T is the feature input to the classifier, and y_T is the correct text label.
5) According to the three loss functions, we use a back propagation algorithm and a gradient descent algorithm to optimize the model.
5.1) From the actual input and the expected output, the overall error formula is obtained:
L_total = L_align + λ1·L_recon + λ2·L_codebook    (17)
where L_align is the combined CMPC and CMPM loss measuring the alignment of the two modalities, L_recon is the loss measuring the difference between the reconstructed text and picture and the original, unmasked text and picture, and L_codebook is the loss used to optimize the cross-modal codebook, measuring the difference between the replaced feature segments and the input feature segments. λ1 and λ2, the weights of L_recon and L_codebook in the overall loss, are both set to 0.2 during training; in L_codebook, λ_mom is set to 0.8 for updating the codebook.
5.2) Model parameters are optimized with a back-propagation algorithm and gradient descent. The batch size is set to 64, the Adam optimizer is used with an initial network learning rate of 4×10^-5, the learning rate is linearly warmed up from 4×10^-6 to 4×10^-5 over the first ten training epochs and reduced to 10% of its value at the 50th and 80th epochs, for a total of 120 training epochs.
6) When testing on the validation set, only the visual backbone network and the text backbone network are used to extract the features of the two modalities. For N given text descriptions and a gallery set consisting of M pictures, the two backbone networks extract N text features and M image features, respectively.
6.1) Cosine similarity is then used to compute the similarity matrix between the N descriptions and the M images, of size N x M; the images are sorted by similarity to obtain the ten best-matching pictures for each text input, these are compared with the ground-truth answers, and accuracy-related indices are computed (see the sketch below).
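A minimal sketch of the Rank@K (CMC) computation in 6.1)-6.2), assuming a single ground-truth gallery image per query; datasets where several gallery images share the query identity would compare identity labels instead of indices.

```python
# Hypothetical sketch of 6.1)-6.2): compute Rank@K accuracy from the N x M cosine-similarity
# matrix; gt_image_idx[i] is assumed to be the gallery index matching query i.
import torch

def rank_at_k(sim, gt_image_idx, ks=(1, 5, 10)):
    """sim: (N, M) similarity matrix; gt_image_idx: (N,) ground-truth gallery indices."""
    order = sim.argsort(dim=-1, descending=True)                 # ranked gallery per query
    hits = order == gt_image_idx.unsqueeze(-1)                   # (N, M) boolean hit matrix
    return {f"rank@{k}": hits[:, :k].any(dim=-1).float().mean().item() for k in ks}
```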
6.2) To verify the accuracy and effectiveness of the method in practical application, the CMC index is used to report the search results, and the Rank@1, Rank@5 and Rank@10 accuracies are computed on the CUHK-PEDES and ICFG-PEDES datasets to evaluate the algorithm; the experimental results are shown in Tables 1 and 2. The following conclusions can be drawn from them: (1) Accuracy is clearly improved. The mask-based model of the invention brings a large gain on both CUHK-PEDES and the more challenging ICFG-PEDES: on CUHK-PEDES, Rank@1 improves by 1.98% and Rank@10 by 1.31%; on ICFG-PEDES, Rank@1 improves by 1.78% and Rank@10 by 2.4%. (2) The model can reconstruct the input text and image while retrieval improves, which shows that the gains indeed come from improving the model's feature learning and alignment capability.
TABLE 1 Search results on the CUHK-PEDES dataset (the table is rendered as an image in the original publication and is not reproduced here)
TABLE 2 Search results on the ICFG-PEDES dataset (the table is rendered as an image in the original publication and is not reproduced here)

Claims (9)

1. A text pedestrian searching method based on a self-supervision mask model and a cross-modal codebook, characterized in that a proportion of picture patches and text tokens are masked and a cross-modal codebook is created, the method comprising the following steps:
step 1, reading a data set, and inputting each pair of matched text description and pictures as data of a model;
step 1.1, first scaling the picture to a preset size and applying data augmentation such as horizontal flipping and random Gaussian noise, then dividing the picture into (h/p) x (w/p) square patches, where p is the side length of each patch and h and w are the height and width of the picture, respectively;
step 1.2, randomly selecting part of picture blocks, and covering by using a unified mask token;
step 1.3, inputting the text description into a tokenizer, which converts words and phrases into corresponding token IDs; meanwhile, randomly selecting part of the text tokens and covering them with a unified mask token;
step 2, inputting the processed picture covered by the mask and the description text into feature encoders of two modes; the method specifically comprises the following steps:
step 2.1, the visual backbone network E_V loads model parameters pre-trained on the ImageNet dataset and processes the image input to obtain the visual feature F_V;
step 2.2, the text backbone network E_T likewise loads pre-trained model parameters and processes the text input to obtain the text feature F_T;
Step 2.3, respectively inputting the two features into a mapping layer to obtain global features of the two modes;
step 2.4, calculating a CMPC loss function and a CMPM loss function on the obtained global features of the two modalities to measure the distance between matched text-picture pairs and the relative distances of unmatched pairs;
wherein the CMPC loss is given by equations (1)-(3) together with
L_cmpc = L_tpi + L_ipt    (4)
and the CMPM loss is given by equations (5)-(9) together with
L_cmpm = L_i2t + L_t2i    (10)
(equations (1)-(3), (5)-(9) and the auxiliary definition following them are rendered as images in the original publication and are not reproduced here); x_i denotes a visual feature, z_i a text feature, W_j a weight matrix, y_{i,j} indicates whether the input picture and text form a matching pair, and ε is a very small positive number that prevents division by zero;
step 3, inputting the visual feature F_V and the text feature F_T obtained by the feature extraction backbone networks in step 2 into the cross-modal codebook; F_V has dimensions (h/p) x (w/p) x D and F_T has dimensions L x D, where L is the length of the text and D is the number of channels, which is the same for the visual and text features; specifically, the features obtained in step 2 are further processed according to the following steps;
step 3.1, the visual feature F_V and the text feature F_T together contain ((h/p) x (w/p) + L) feature vectors, whose number of channels equals that of the feature vectors in the codebook; the distances between the codebook feature vectors and all text and visual feature vectors are then computed, and each vector of F_V and F_T is replaced by the closest feature vector found in the codebook; the lookup, given by equations (11)-(12) (rendered as images in the original publication), selects for each z_i the codebook entry c_k with k = argmin_j ||z_i - c_j||_2, where z_i denotes a feature vector of F_V or F_T, c_j a feature vector in the codebook, and K the number of feature vectors in the codebook;
step 3.2, after the original vectors of the visual feature F_V and the text feature F_T have been replaced by codebook vectors, a new visual feature F_V2 and a new text feature F_T2 are obtained; because the replaced feature vectors are discrete and the replacement process is not differentiable, a straight-through gradient estimator is needed to propagate the gradient back to the preceding modules, as shown in equation (13) (rendered as an image in the original publication), where sg(·) denotes the stop-gradient operator and l2 denotes a normalization operation;
step 3.3, after the input feature vectors have been replaced, the features in the codebook are updated with momentum according to equation (14) (rendered as an image in the original publication), where λ_mom is the weight of the codebook update and c_h is a feature vector in the codebook;
step 4, reconstructing the input pictures and text:
step 4.1, the image decoder uses a single-layer deconvolution network to restore the picture to the input size and channel number; the result is then compared with the original picture and a reconstruction loss is calculated according to equation (15) (rendered as an image in the original publication);
step 4.2, the text classifier pre-trained with the text encoder E_T is selected (and fine-tuned during the training stage); the features are classified through the final Linear layer of the text classifier, the masked input is predicted, and a classification loss is calculated according to equation (16) (rendered as an image in the original publication), where Ω_T is a function counting the number of tokens in x_T, x_T is the feature input to the classifier, and y_T is the correct text label;
step 5, optimizing the model by using a back propagation algorithm and a gradient descent algorithm according to the three loss functions in the step 2, the step 3 and the step 4; the method specifically comprises the following steps:
step 5.1, obtaining an overall error formula according to the actual input and the expected output, wherein the formula is as follows:
L_total = L_align + λ1·L_recon + λ2·L_codebook    (17)
where L_align is the combined CMPC and CMPM loss measuring the alignment of the two modalities; L_recon is the loss measuring the difference between the reconstructed text and picture and the original, unmasked text and picture; L_codebook is the loss used to optimize the cross-modal codebook, measuring the difference between the replaced feature segments and the input feature segments; and λ1 and λ2 are the weights of L_recon and L_codebook in the overall loss;
step 5.2, optimizing model parameters by using a back propagation algorithm and a gradient descent algorithm;
and step 6, during testing of the model, selecting the features after the backbone networks and mapping layers, using the features of the two modalities as the query and gallery sets respectively, and obtaining the corresponding retrieval result by computing cosine similarity and then ranking.
2. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: in step 1.1 the picture is scaled to 384px x 128px, data augmentation by horizontal flipping and random Gaussian noise is applied, and the picture is then divided into 192 patches of size 16px x 16px each.
3. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the proportion of the randomly selected part of the picture blocks in the step 1.2 to be covered is 5%, 10% and 15%.
4. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the proportion of the randomly selected part of text blocks in the step 1.3 to be covered is 20%, 25%, 30% and 35%.
5. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the visual encoder E_V of step 2.1 is a ResNet network or a Vision Transformer network.
6. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the text encoder E_T of step 2.2 is a BERT network, an LSTM network or a Bi-LSTM network.
7. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the number of feature vectors in the codebook described in step 3.1 is 512, 1024, 2048.
8. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the weights λ1 and λ2 of the loss functions L_recon and L_codebook in step 5.1 are both set to 0.2.
9. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: step 5.2 specifically comprises: setting the batch size to 64, using the Adam optimizer with an initial network learning rate of 4×10^-5, linearly increasing the learning rate from 4×10^-6 to 4×10^-5 over the first ten training epochs, and reducing the learning rate to 10% of its value at the 50th and 80th epochs, for a total of 120 training epochs.
CN202310093067.3A 2023-02-10 2023-02-10 Text pedestrian searching method based on self-supervision mask model and cross-mode codebook Pending CN116343109A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310093067.3A CN116343109A (en) 2023-02-10 2023-02-10 Text pedestrian searching method based on self-supervision mask model and cross-mode codebook

Publications (1)

Publication Number Publication Date
CN116343109A true CN116343109A (en) 2023-06-27

Family

ID=86879756

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310093067.3A Pending CN116343109A (en) 2023-02-10 2023-02-10 Text pedestrian searching method based on self-supervision mask model and cross-mode codebook

Country Status (1)

Country Link
CN (1) CN116343109A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116758562A (en) * 2023-08-22 2023-09-15 杭州实在智能科技有限公司 Universal text verification code identification method and system
CN116758562B (en) * 2023-08-22 2023-12-08 杭州实在智能科技有限公司 Universal text verification code identification method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination