CN116343109A - Text pedestrian searching method based on self-supervision mask model and cross-mode codebook - Google Patents
- Publication number
- CN116343109A (application CN202310093067.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- codebook
- picture
- visual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/53—Recognition of crowd images, e.g. recognition of crowd congestion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/15—Correlation function computation including computation of convolution operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
Abstract
A text-based pedestrian search method based on a self-supervised mask model and a cross-modal codebook comprises the following steps. The input text and picture are masked and fed into feature extraction backbone networks to obtain visual features $F_V$ and text features $F_T$. The visual features $F_V$ and text features $F_T$ are then input to mapping layers to obtain a picture global feature $F_{V1}$ and a text global feature $F_{T1}$, which are aligned. At the same time, $F_V$ and $F_T$ are input into a cross-modal codebook, where each feature vector is replaced by the nearest feature vector in the codebook. The replaced features are then input to a picture decoder and a text classification network, respectively, and the results are compared with the original input. The invention improves the feature learning capability of the model and its ability to align the features of the two modalities.
Description
Technical Field
The invention relates to the field of cross-modal retrieval, and in particular to a feature alignment method based on masking and a cross-modal codebook.
Background
Text-based pedestrian search aims to match a text description query with the correct pedestrian images, and has great potential in surveillance systems, activity analysis and intelligent photo albums. In most cases a text description is easier to obtain than the query image required by image-based pedestrian re-identification, which has made text-based person search popular in recent years. Methods for cross-modal retrieval fall mainly into two types: one learns feature representations, and the other extracts the features of the two modalities and then aligns them.
To better learn suitable features from pictures and text, some methods use a generative adversarial network together with the text description to colorize person images that were first converted to grayscale; others use self-supervised learning to obtain prior knowledge via CLIP and then pass it into a cross-modal momentum contrastive learning framework. Meanwhile, to address the differences between the two modalities, many works use an attention mechanism to help align text and image features. This may require a pre-trained object detection model, or manually set regions, to obtain picture information for each part of the person in the picture; the corresponding picture information and text information are then input together into the attention module to achieve feature alignment. This places a significant computational burden on training and testing. To improve the accuracy of cross-modal pedestrian retrieval, it is important to solve the problems of feature learning and feature alignment between the two modalities.
Disclosure of Invention
To overcome the shortcomings of the prior art in cross-modal feature learning and alignment, the invention provides a method that combines a mask model with a cross-modal codebook to strengthen the model's feature learning and alignment capabilities, and thereby improves the accuracy of cross-modal pedestrian retrieval.
To achieve the above purpose, the invention discloses a text-based pedestrian search method based on a self-supervised mask model and a cross-modal codebook, which adopts the following technical scheme:
Step 1, reading a data set, and using each matched pair of text description and picture as input data of the model;
Step 1.1, first scaling the picture to a preset size and performing data augmentation such as horizontal flipping and randomly adding Gaussian noise, and then dividing the picture into (h/p) × (w/p) square small blocks, where p is the side length of each small block and h and w are the length and width of the picture, respectively;
Step 1.2, randomly selecting a portion of the picture blocks and covering them with a unified mask token;
Step 1.3, inputting the text description into a tokenizer, which converts the words and phrases into corresponding numbers; meanwhile, randomly selecting a portion of the text blocks and covering them with a unified mask token;
Step 2, inputting the masked picture and description text into the feature encoders of the two modalities; specifically:
Step 2.1, the visual backbone network $E_V$ loads model parameters pre-trained on the ImageNet dataset and processes the image input to obtain visual features $F_V$;
Step 2.2, the text backbone network $E_T$ likewise loads pre-trained model parameters and processes the text input to obtain text features $F_T$;
Step 2.3, inputting the two features into mapping layers, respectively, to obtain the global features of the two modalities;
Step 2.4, calculating a CMPC loss function and a CMPM loss function on the obtained global features of the two modalities, to measure the distance between matched text-picture pairs and the relative distances between unmatched text-picture pairs;
wherein the CMPC loss function is expressed as follows:
$$L_{cmpc} = L_{tpi} + L_{ipt} \quad (4)$$
The CMPM loss function is expressed as follows:
$$L_{cmpm} = L_{i2t} + L_{t2i} \quad (10)$$
wherein $x_i$ is the visual feature, $z_i$ is the text feature, $W_j$ is the weight matrix, $y_{i,j}$ indicates whether the input is a matching pair, and $\epsilon$ is a very small positive number that prevents division by 0;
Step 3, the visual features $F_V$ and text features $F_T$ obtained from the feature extraction backbone networks in step 2 are input into a cross-modal codebook; the dimension of $F_V$ is (h/p) × (w/p) × D and the dimension of $F_T$ is L × D, where L is the length of the text and D is the number of channels, which is the same for the visual and text features; specifically, the features obtained in step 2 are further processed according to the following steps:
Step 3.1, the visual features $F_V$ and text features $F_T$ together comprise ((h/p) × (w/p) + L) feature vectors, whose number of channels equals that of the feature vectors in the codebook; the distances between the feature vectors in the codebook and all text and visual feature vectors are calculated, and each feature vector in $F_V$ and $F_T$ is replaced by the closest feature vector found in the codebook; the search is given by the following formula:
wherein $z_i$ denotes a feature vector of the visual features $F_V$ or text features $F_T$, $c_i$ denotes a feature vector in the codebook, and K denotes the number of feature vectors in the codebook;
Step 3.2, after the vectors in the original $F_V$ and $F_T$ are replaced with feature vectors from the codebook, new visual features $F_{V2}$ and new text features $F_{T2}$ are obtained; because the replacement feature vectors are discrete and the replacement process is not differentiable, a straight-through gradient estimator is required to back-propagate the gradient to the preceding module, as shown in the following formula:
where sg(·) denotes the stop-gradient operation and $l_2$ denotes a normalization operation;
Step 3.3, after the input feature vectors are replaced, the features in the codebook are synchronously updated with momentum, using the following formula:
wherein $\lambda_{mom}$ is the codebook update weight and $c_h$ is a feature vector in the codebook;
Step 4, reconstructing the input picture and text:
Step 4.1, the image decoder uses a single-layer deconvolution network to restore the features to the size and number of channels of the input picture; the result is compared with the original picture and a reconstruction loss function is calculated;
Step 4.2, the text classifier is the classifier of the pre-trained model of the text encoder $E_T$ (fine-tuned during the training stage); the features are classified by the final Linear layer of the text classifier, the difference between the predicted text and the input is measured, and a classification loss function is calculated;
wherein $\Omega_T$ is a function that counts the number of tokens in $x_T$, $x_T$ is the visual feature, and $y_T$ is the correct text label;
Step 5, optimizing the model with the back-propagation algorithm and a gradient descent algorithm according to the three loss functions of step 2, step 3 and step 4; specifically:
Step 5.1, obtaining the overall error formula from the actual input and the expected output, as follows:
$$L_{total} = L_{align} + \lambda_1 L_{recon} + \lambda_2 L_{codebook} \quad (17)$$
wherein $L_{align}$ is the CMPC and CMPM loss function, which measures the degree of alignment of the two modalities; $L_{recon}$ is the loss function measuring the difference between the text and picture reconstructed by the model and the original unmasked text and picture; $L_{codebook}$ optimizes the cross-modal codebook by measuring the difference between the replaced feature fragments and the input feature fragments; $\lambda_1$ and $\lambda_2$ are the weights of the two loss functions $L_{recon}$ and $L_{codebook}$ in the overall loss function.
Step 5.2, optimizing the model parameters with the back-propagation algorithm and a gradient descent algorithm; the batch size is set to 64, the Adam optimizer is used, and the initial learning rate of the network is set to $4 \times 10^{-5}$; during the first ten training epochs the learning rate increases linearly from $4 \times 10^{-6}$ to $4 \times 10^{-5}$, and at the 50th and 80th epochs the learning rate is reduced to 0.1 times its value, for a total of 120 training epochs.
Step 6, when the model is tested, the features output by the backbone networks and the mapping layers are used; the features of the two modalities serve as the input and the query set, respectively, and the corresponding query results are obtained by calculating the cosine similarity and then sorting.
Preferably, in step 1.1, the picture is scaled to 384 px × 128 px, data augmentation such as horizontal flipping and randomly added Gaussian noise is applied, and the picture is then divided into 192 small blocks of 16 px × 16 px each.
Preferably, the proportion of randomly selected picture blocks masked in step 1.2 is 5%, 10% or 15%.
Preferably, the proportion of randomly selected text blocks masked in step 1.3 is 20%, 25%, 30% or 35%.
Preferably, the visual encoder $E_V$ of step 2.1 is a ResNet network or a Vision Transformer network.
Preferably, the text encoder $E_T$ of step 2.2 is a BERT network, an LSTM network or a Bi-LSTM network.
Preferably, the number of feature vectors in the codebook of step 3.1 is 512, 1024 or 2048.
Preferably, the codebook update weight $\lambda_{mom}$ of step 3.3 is set to 0.8.
Preferably, the weights $\lambda_1$ and $\lambda_2$ of the loss functions $L_{recon}$ and $L_{codebook}$ of step 5.1 are both set to 0.2.
Preferably, step 5.2 specifically comprises: setting the batch size to 64, using the Adam optimizer, and setting the initial learning rate of the network to $4 \times 10^{-5}$; during the first ten training epochs the learning rate increases linearly from $4 \times 10^{-6}$ to $4 \times 10^{-5}$, and at the 50th and 80th epochs the learning rate is reduced to 10% of the original, for a total of 120 training epochs.
The working principle of the invention is as follows. Matching accuracy is improved mainly through two modules. First, a mask module gives the model the ability to reconstruct the original pictures and texts while extracting features, so that the learned feature vectors carry high-level semantic information. Second, a discrete cross-modal codebook aligns text features and picture features by making the features of both modalities find their nearest tokens in the same codebook, which improves the accuracy of cross-modal retrieval and matching on pedestrian data.
The simple mask strategy provided by the invention can be applied to any network, and reconstructing the original image and text improves feature learning, so that the features extracted by the feature extraction network carry rich semantic information. In addition, unlike methods that rely on complex feature alignment, the invention establishes a discrete codebook that stores visual tokens and uses it to align text features and picture features at the semantic level, which improves the model's ability to align cross-modal features. Furthermore, compared with a complex multi-modal model (time complexity O(mn)), the invention uses a dual-stream feature extraction network: only a text feature extraction network and a picture feature extraction network are needed to extract the text and picture features separately, and the time complexity is O(m+n). The invention therefore achieves the fastest matching and retrieval for a given backbone network while improving feature extraction and feature alignment.
The advantages of the invention are: 1. The method enables the model to learn semantic information in both the text modality and the visual modality, which provides a theoretical basis for cross-modal alignment. 2. Establishing a cross-modal discrete codebook facilitates alignment of the features of the two modalities, and is simpler and easier to understand than achieving modal alignment through a complex attention mechanism. 3. A simple dual-stream network minimizes the time complexity: O(m+n) for m pictures and n sentences. 4. On the CUHK-PEDES dataset, rank@1 accuracy improves by 1.98% and rank@10 accuracy by 1.31%; on the ICFG-PEDES dataset, rank@1 accuracy improves by 1.78% and rank@10 accuracy by 2.4%.
Drawings
FIG. 1 is an overall flow chart of the method of the present invention for cross-modal pedestrian retrieval.
Detailed Description
The invention is further described below with reference to the drawing and an embodiment.
A cross-modal pedestrian retrieval method based on a mask model and a cross-modal codebook, as shown in FIG. 1, is implemented according to the following steps:
1) The data set is read, and each matched pair of text description and picture is used as input data of the model.
11) The picture is scaled to 384 px × 128 px, and data augmentation such as horizontal flipping and randomly added Gaussian noise is applied.
12) The picture is then divided into 192 small blocks of 16 px × 16 px each. Then 5%, 10% or 15% of the small blocks are randomly selected and covered with a unified mask token.
13) The text description is input into a tokenizer, which converts the words and phrases into corresponding numbers. At the same time, 20%, 25%, 30% or 35% of the numbers are randomly selected and covered with a unified mask token.
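The block division and random masking of step 1 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the patented implementation: the function names `patchify` and `random_mask`, the zero-valued patch standing in for the unified mask token, the token ids and `MASK_ID` are all hypothetical, and in a real model the mask token would typically be a learned embedding.

```python
import numpy as np

def patchify(image, p=16):
    """Split an (h, w, c) image into (h/p)*(w/p) flattened p x p blocks."""
    h, w, c = image.shape
    assert h % p == 0 and w % p == 0, "image sides must be multiples of p"
    blocks = image.reshape(h // p, p, w // p, p, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape((h // p) * (w // p), p * p * c)

def random_mask(tokens, ratio, mask_value, rng):
    """Overwrite a random `ratio` of the entries (rows) with one shared mask value."""
    n = len(tokens)
    n_mask = int(round(n * ratio))
    idx = rng.choice(n, size=n_mask, replace=False)
    masked = np.array(tokens, copy=True)
    masked[idx] = mask_value
    return masked, np.sort(idx)

rng = np.random.default_rng(0)

# Picture branch: 384 x 128 input, 16 x 16 blocks -> 24 * 8 = 192 blocks.
image = rng.random((384, 128, 3))
patches = patchify(image, p=16)
masked_patches, patch_idx = random_mask(patches, 0.10, 0.0, rng)  # assumed stand-in mask

# Text branch: hypothetical token ids and mask id.
token_ids = np.arange(1000, 1100)
MASK_ID = 103
masked_ids, text_idx = random_mask(token_ids, 0.25, MASK_ID, rng)
```

With a 10% ratio, 19 of the 192 picture blocks are covered; with a 25% ratio, 25 of the 100 text tokens are covered.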
2) As shown in FIG. 1, the masked picture and description text are input into the feature encoders of the two modalities.
21) The visual feature extraction backbone network uses the visual encoder Vision Transformer-base, which consists mainly of normalization (Norm) layers, multi-head self-attention layers and fully connected (MLP) layers, and loads model parameters pre-trained on the ImageNet dataset. The text feature extraction backbone network uses the text encoder BERT-base, whose main structure is similar to that of the visual encoder Vision Transformer, and likewise loads a pre-trained model for initialization. When the visual encoder is a ResNet, it consists mainly of convolutional (CNN) layers and residual blocks; when the text encoder is an LSTM or Bi-LSTM, it consists mainly of a recurrent neural network (RNN).
22) After the feature extraction networks of the two modalities, the visual features $F_V$ and text features $F_T$ are obtained. To facilitate alignment of the global features of the two modalities, $F_V$ and $F_T$ are input into two mapping layers. The visual mapping layer consists of a single Linear layer followed by a global max pooling layer, which yields the visual global feature $F_{V1}$. For the text features, three stacked residual blocks are used. In each residual block, the input feature passes on one path through three 1×1 convolutions with ReLU activation functions and on the other path through a single 1×1 convolution with a ReLU activation function; the two outputs are added, and the resulting feature is finally input to an average pooling layer to obtain the text global feature $F_{T1}$.
23) For the resulting global features of the two modalities, a CMPC loss function and a CMPM loss function are calculated to measure the distance between matched text-picture pairs and the relative distances between unmatched text-picture pairs. The CMPC loss function is expressed as follows:
$$L_{cmpc} = L_{tpi} + L_{ipt} \quad (4)$$
The CMPM loss function is expressed as follows:
$$L_{cmpm} = L_{i2t} + L_{t2i} \quad (10)$$
wherein $x_i$ is the visual feature, $z_i$ is the text feature, $W_j$ is the weight matrix, $y_{i,j}$ indicates whether the input is a matching pair, and $\epsilon$ is a very small positive number that prevents division by 0.
24) Overall, the loss function that measures the difference between the features of the two modalities, and thereby assists modality alignment, is expressed as follows:
$$L_{align} = L_{cmpc} + L_{cmpm} \quad (18)$$
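The two mapping layers described in step 2 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: a 1×1 convolution over a token sequence is written as a per-token linear map, the output width `D_OUT`, the random initialization and all function names are hypothetical, and the weights are untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
D, D_OUT = 768, 512  # D follows the text; D_OUT is a hypothetical output width

def init(d_in, d_out):
    """Random weights and zero bias for one per-token linear map (1x1 convolution)."""
    return rng.normal(0.0, 0.02, (d_in, d_out)), np.zeros(d_out)

def linear(x, params):
    w, b = params
    return x @ w + b

def relu(x):
    return np.maximum(x, 0.0)

def visual_global(f_v, params):
    """Single Linear layer followed by global max pooling over the blocks."""
    return linear(f_v, params).max(axis=0)

def make_block(d_in, d_out):
    main = [init(d_in, d_out), init(d_out, d_out), init(d_out, d_out)]
    shortcut = init(d_in, d_out)
    return main, shortcut

def residual_block(x, block):
    main, shortcut = block
    y = x
    for params in main:                  # path 1: three 1x1 convolutions + ReLU
        y = relu(linear(y, params))
    s = relu(linear(x, shortcut))        # path 2: one 1x1 convolution + ReLU
    return y + s                         # add the two paths

def text_global(f_t, blocks):
    """Three stacked residual blocks followed by average pooling over tokens."""
    x = f_t
    for block in blocks:
        x = residual_block(x, block)
    return x.mean(axis=0)

f_v = rng.normal(size=(24 * 8, D))       # 192 visual feature vectors
f_t = rng.normal(size=(100, D))          # 100 text feature vectors
visual_params = init(D, D_OUT)
text_blocks = [make_block(D, D_OUT), make_block(D_OUT, D_OUT), make_block(D_OUT, D_OUT)]

f_v1 = visual_global(f_v, visual_params)
f_t1 = text_global(f_t, text_blocks)
```

Both global features end up with the same width, which is what allows the alignment losses to compare them directly.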
3) The visual features $F_V$ and text features $F_T$ obtained from the feature extraction backbone networks in step 2 (the dimension of $F_V$ is 24 × 8 × 768 and the dimension of $F_T$ is 100 × 768) are input into the cross-modal codebook.
31) The visual features $F_V$ and text features $F_T$ together comprise (24 × 8 + 100) feature vectors with 768 channels; the feature vectors in the codebook also have 768 channels, the same as the visual and text features, and the codebook is set to contain 512, 1024 or 2048 feature vectors. The distances between the feature vectors in the codebook and all text and visual feature vectors are calculated, and each feature vector in $F_V$ and $F_T$ is replaced by the closest feature vector found in the codebook; the search is given by the following formula:
wherein $z_i$ denotes a feature vector of the visual features $F_V$ or text features $F_T$, $c_i$ denotes a feature vector in the codebook, and K denotes the number of feature vectors in the codebook, chosen as 512, 1024 or 2048.
32) After the vectors in the original $F_V$ and $F_T$ are replaced with feature vectors from the codebook, new visual features $F_{V2}$ and new text features $F_{T2}$ are obtained. Because the replacement features are discrete and the replacement process is not differentiable, a straight-through gradient estimator is required to back-propagate the gradient to the preceding module, as shown in the following formula:
where sg(·) denotes the stop-gradient operation and $l_2$ denotes a normalization operation.
33) After the input feature vectors are replaced, the features in the codebook are synchronously updated with momentum, using the following formula:
wherein $\lambda_{mom}$ is the codebook update weight and $c_h$ is a feature vector in the codebook.
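The codebook lookup and momentum update of step 3 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions rather than the patent's exact formulas: squared Euclidean distance is assumed for the nearest-vector search, the momentum update is assumed to move each used code toward the mean of the features assigned to it, and the function names are hypothetical. The straight-through gradient step needs an automatic differentiation framework and is not shown.

```python
import numpy as np

def quantize(features, codebook):
    """Return, for each feature vector, the index and value of its nearest code."""
    f2 = (features ** 2).sum(axis=1, keepdims=True)   # (n, 1)
    c2 = (codebook ** 2).sum(axis=1)[None, :]         # (1, K)
    dist = f2 - 2.0 * features @ codebook.T + c2      # squared distances, (n, K)
    idx = dist.argmin(axis=1)
    return idx, codebook[idx]

def momentum_update(codebook, features, idx, lam_mom=0.8):
    """Assumed update: c_k <- lam_mom * c_k + (1 - lam_mom) * mean of assigned features."""
    updated = codebook.copy()
    for k in np.unique(idx):
        assigned_mean = features[idx == k].mean(axis=0)
        updated[k] = lam_mom * codebook[k] + (1.0 - lam_mom) * assigned_mean
    return updated

rng = np.random.default_rng(0)
K, D = 512, 768
codebook = rng.normal(size=(K, D))
f_v = rng.normal(size=(24 * 8, D))     # 192 visual feature vectors
f_t = rng.normal(size=(100, D))        # 100 text feature vectors

# Both modalities share one codebook, so they are quantized together.
features = np.concatenate([f_v, f_t], axis=0)
idx, quantized = quantize(features, codebook)
f_v2, f_t2 = quantized[:192], quantized[192:]
new_codebook = momentum_update(codebook, features, idx, lam_mom=0.8)
```

In frameworks with automatic differentiation, a straight-through estimator is commonly written as `z + sg(z_q - z)`, so the forward pass uses the quantized vector while the gradient flows to `z` unchanged.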
4) The visual features $F_{V2}$ and text features $F_{T2}$ obtained through the cross-modal codebook are input into decoders, and a self-supervised learning method is used so that the model reconstructs the input image and text.
41) The visual features $F_{V2}$ are input into the visual decoder. The visual decoder used in the invention is a single-layer deconvolution layer, which restores the length, width and number of channels of $F_{V2}$ to match the input image. The model is guided to perform self-supervised learning through the reconstruction loss, which measures the difference between the picture recovered by the decoder and the input picture; the loss function is calculated as follows:
42) The text classifier selected by the invention is the text classifier of the pre-trained model of the text encoder BERT. The input features are classified by the final Linear layer of the text classifier, in order to determine whether the model can correctly restore the words covered at the input stage; the loss function of the text mask model is calculated as follows:
wherein $\Omega_T$ is a function that counts the number of tokens in $x_T$, $x_T$ is the visual feature, and $y_T$ is the correct text label.
5) According to the three loss functions, the back-propagation algorithm and a gradient descent algorithm are used to optimize the model.
51) From the actual input and the expected output, the overall error formula is obtained:
$$L_{total} = L_{align} + \lambda_1 L_{recon} + \lambda_2 L_{codebook} \quad (17)$$
wherein $L_{align}$ is the CMPC and CMPM loss function, which measures the degree of alignment of the two modalities; $L_{recon}$ is the loss function measuring the difference between the text and picture reconstructed by the model and the original unmasked text and picture; $L_{codebook}$ optimizes the cross-modal codebook by measuring the difference between the replaced feature fragments and the input feature fragments. $\lambda_1$ and $\lambda_2$ are the weights of $L_{recon}$ and $L_{codebook}$ in the overall loss function, and both are set to 0.2 during training; in $L_{codebook}$, $\lambda_{mom}$ is set to 0.8 to update the codebook.
52) The back-propagation algorithm and a gradient descent algorithm are used to optimize the model parameters. The batch size is set to 64, the Adam optimizer is used, and the initial learning rate of the network is set to $4 \times 10^{-5}$; during the first ten training epochs the learning rate increases linearly from $4 \times 10^{-6}$ to $4 \times 10^{-5}$, and at the 50th and 80th epochs the learning rate is reduced to 10% of the original, for a total of 120 training epochs.
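The weighted objective and the learning-rate schedule of step 5 can be sketched as follows. This is a minimal pure-Python sketch under stated assumptions: epochs are 0-indexed, the warm-up reaches the base rate at epoch 10, and each milestone multiplies the current rate by 0.1 (a standard multi-step decay). The function names are hypothetical.

```python
def total_loss(l_align, l_recon, l_codebook, lam1=0.2, lam2=0.2):
    """L_total = L_align + lam1 * L_recon + lam2 * L_codebook."""
    return l_align + lam1 * l_recon + lam2 * l_codebook

def learning_rate(epoch, base_lr=4e-5, warmup_start=4e-6, warmup_epochs=10,
                  milestones=(50, 80), gamma=0.1):
    """Learning rate for a 0-indexed epoch: linear warm-up, then step decay."""
    if epoch < warmup_epochs:
        # Linear ramp from warmup_start to base_lr over the warm-up epochs.
        t = epoch / warmup_epochs
        return warmup_start + t * (base_lr - warmup_start)
    # Multiply by gamma once for every milestone already reached.
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * gamma ** n_decays

# One learning rate per epoch for the 120 training epochs.
schedule = [learning_rate(e) for e in range(120)]
```

Under these assumptions the rate is $4 \times 10^{-5}$ for epochs 10 to 49, $4 \times 10^{-6}$ for epochs 50 to 79, and $4 \times 10^{-7}$ from epoch 80 onward.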
6) When testing on the validation set, only the visual backbone network and the text backbone network are used to extract features from the inputs of the two modalities. For N given text descriptions and a query set consisting of M pictures, the two backbone networks extract N text features and M image features, respectively.
61) Cosine similarity is then used to calculate the similarity matrix between the N descriptions and the M images, which has size N × M. The images are sorted by similarity to obtain, for each text input, the ten best-matching pictures; these are compared with the ground truth, and the accuracy metrics are calculated.
62) To verify the accuracy and effectiveness of the method in practical application, the invention uses the CMC metric to represent the retrieval results, and calculates the accuracy values rank@1, rank@5 and rank@10 on the CUHK-PEDES and ICFG-PEDES datasets to evaluate the algorithm's performance; the experimental results are shown in Tables 1 and 2. The following conclusions can be drawn from the experimental results: (1) The accuracy is clearly improved. Using the mask-based model of the invention gives an improvement on both the CUHK-PEDES dataset and the more challenging ICFG-PEDES dataset. On CUHK-PEDES, rank@1 accuracy improves by 1.98% and rank@10 accuracy by 1.31%; on ICFG-PEDES, rank@1 accuracy improves by 1.78% and rank@10 accuracy by 2.4%. (2) The model can reconstruct the input text and image while the retrieval performance improves, which indicates that the model improves retrieval accuracy by improving its feature learning and alignment capabilities.
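The similarity ranking and rank@k accuracy of step 6 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the function names, the identity labels `text_ids` and `image_ids`, and the toy features are hypothetical, and a query counts as a hit at rank k if any of its k highest-scoring pictures shares its identity.

```python
import numpy as np

def cosine_similarity_matrix(text_feats, image_feats):
    """N x M matrix of cosine similarities between text and image features."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    return t @ v.T

def rank_at_k(sim, text_ids, image_ids, ks=(1, 5, 10)):
    """Fraction of text queries whose identity appears among the top-k pictures."""
    order = np.argsort(-sim, axis=1)          # most similar picture first
    ranked_ids = image_ids[order]             # (N, M) identities in ranked order
    hits = ranked_ids == text_ids[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Toy example: 4 pictures with orthogonal features, 3 text queries.
image_feats = np.eye(4)
image_ids = np.array([0, 1, 2, 3])
text_feats = np.array([
    [1.0, 0.1, 0.0, 0.0],   # closest to picture 0 (correct at rank 1)
    [0.0, 0.2, 1.0, 0.0],   # closest to picture 2 (correct at rank 1)
    [0.5, 1.0, 0.0, 0.0],   # closest to picture 1, picture 0 is second
])
text_ids = np.array([0, 2, 0])

sim = cosine_similarity_matrix(text_feats, image_feats)
scores = rank_at_k(sim, text_ids, image_ids, ks=(1, 2))
```

In the toy example two of the three queries are correct at rank 1 and all three are correct at rank 2, so `scores[1]` is 2/3 and `scores[2]` is 1.0.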
TABLE 1 search results on CUHK-PEDES dataset
TABLE 2 search results on ICFG-PEDES dataset
Claims (9)
1. A text-based pedestrian search method based on a self-supervised mask model and a cross-modal codebook, characterized in that a proportion of the picture blocks and text blocks is masked and a cross-modal codebook is created, the method comprising the following steps:
Step 1, reading a data set, and using each matched pair of text description and picture as input data of the model;
Step 1.1, first scaling the picture to a preset size and performing data augmentation such as horizontal flipping and randomly adding Gaussian noise, and then dividing the picture into (h/p) × (w/p) square small blocks, where p is the side length of each small block and h and w are the length and width of the picture, respectively;
Step 1.2, randomly selecting a portion of the picture blocks and covering them with a unified mask token;
Step 1.3, inputting the text description into a tokenizer, which converts the words and phrases into corresponding numbers; meanwhile, randomly selecting a portion of the text blocks and covering them with a unified mask token;
Step 2, inputting the masked picture and description text into the feature encoders of the two modalities; specifically:
Step 2.1, the visual backbone network $E_V$ loads model parameters pre-trained on the ImageNet dataset and processes the image input to obtain visual features $F_V$;
Step 2.2, the text backbone network $E_T$ likewise loads pre-trained model parameters and processes the text input to obtain text features $F_T$;
Step 2.3, inputting the two features into mapping layers, respectively, to obtain the global features of the two modalities;
Step 2.4, calculating a CMPC loss function and a CMPM loss function on the obtained global features of the two modalities, to measure the distance between matched text-picture pairs and the relative distances between unmatched text-picture pairs;
wherein the CMPC loss function is expressed as follows:
$$L_{cmpc} = L_{tpi} + L_{ipt} \tag{4}$$
the CMPM loss function is expressed as follows:
$$L_{cmpm} = L_{i2t} + L_{t2i} \tag{10}$$
wherein $x_i$ is the visual feature, $z_i$ is the text feature, $W_j$ is a weight matrix, $y_{i,j}$ indicates whether the input is a matched picture-text pair, and $\epsilon$ is a very small positive number that prevents division by 0;
step 3, inputting the visual features $F_V$ and the text features $F_T$ obtained from the feature extraction backbone networks in step 2 into a cross-modal codebook, wherein the dimension of $F_V$ is $(h/p) \times (w/p) \times D$ and the dimension of $F_T$ is $L \times D$; $L$ is the length of the text and $D$ is the number of channels, which is the same for the visual features and the text features; the features obtained in step 2 are further processed according to the following steps:
step 3.1, the visual features $F_V$ and the text features $F_T$ together comprise $(h/p) \times (w/p) + L$ feature vectors, whose number of channels equals that of the feature vectors in the codebook; the distances between the feature vectors in the codebook and all text and visual features are then calculated, and each feature vector in $F_V$ and $F_T$ is replaced by the codebook feature vector closest to it; the search is performed according to the following formula:
wherein $z_i$ denotes a feature vector of the visual features $F_V$ or the text features $F_T$, $c_i$ denotes a feature vector in the codebook, and $K$ denotes the number of feature vectors in the codebook;
step 3.2, after the vectors in the original visual features $F_V$ and text features $F_T$ are replaced with feature vectors from the codebook, new visual features $F_{V2}$ and new text features $F_{T2}$ are obtained; because the replaced feature vectors are discrete and the replacement is not differentiable, the gradient must be estimated, and a straight-through estimator is used to back-propagate the gradient to the preceding module, as follows:
wherein $sg(\cdot)$ denotes the stop-gradient operation and $l_2$ denotes the normalization operation;
step 3.3, after the input feature vectors are replaced, the feature vectors in the codebook are updated synchronously by momentum, using the following update formula:
wherein $\lambda_{mom}$ is the weight for updating the codebook and $c_h$ is a feature vector in the codebook;
step 4, reconstructing the input pictures and text:
step 4.1, the image decoder uses a single-layer deconvolution network to restore the picture into the input size and channel number, then compares the picture with the original picture, and calculates a reconstruction loss function;
step 4.2, for the text, a text classifier pre-trained with the text encoder $E_T$ is selected (and fine-tuned during training); the features are classified by the final linear layer of the text classifier, the difference between the predicted text and the input is computed, and a classification loss function is calculated;
wherein $\Omega_T$ is a function that counts the number of tokens in $x_T$, $x_T$ is the visual feature, and $y_T$ is the correct label of the text;
step 5, optimizing the model with the back-propagation algorithm and the gradient descent algorithm according to the three loss functions of steps 2, 3 and 4, specifically comprising:
step 5.1, obtaining the overall error from the actual input and the expected output according to the following formula:
$$L_{total} = L_{align} + \lambda_1 L_{recon} + \lambda_2 L_{codebook} \tag{17}$$
wherein $L_{align}$ denotes the CMPC and CMPM loss functions, which measure the degree of alignment between the two modalities; $L_{recon}$ is the loss function measuring the difference between the text and picture reconstructed by the model and the original unmasked text and picture; $L_{codebook}$ is the difference between the replaced feature fragments and the input feature fragments, calculated in order to optimize the cross-modal codebook; $\lambda_1$ and $\lambda_2$ are the weights of the two loss functions $L_{recon}$ and $L_{codebook}$ in the overall loss function;
step 5.2, optimizing model parameters by using a back propagation algorithm and a gradient descent algorithm;
step 6, at test time, selecting the features output by the backbone networks and the mapping layers, taking the features of the two modalities as the input set and the query set respectively, and obtaining the corresponding query results by calculating cosine similarity and then sorting.
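The codebook replacement of step 3.1 and the momentum update of step 3.3 in claim 1 can be sketched as below. The claim text does not spell out the distance measure or the exact update rule, so this sketch assumes squared Euclidean distance for the nearest-vector search and an exponential-moving-average update of the form c ← λ_mom·c + (1 − λ_mom)·mean(z). Both choices, and all function names and toy values, are assumptions for illustration rather than the formulas of the invention.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """For each feature vector, return the index of its nearest codebook vector."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in features]

def momentum_update(codebook, features, indices, lam_mom=0.9):
    """Assumed EMA rule: move each used codebook vector toward the mean
    of the feature vectors that were assigned to it."""
    new_codebook = [list(c) for c in codebook]
    for k in set(indices):
        assigned = [z for z, i in zip(features, indices) if i == k]
        mean = [sum(col) / len(assigned) for col in zip(*assigned)]
        new_codebook[k] = [lam_mom * c + (1 - lam_mom) * m
                           for c, m in zip(codebook[k], mean)]
    return new_codebook

# Toy example: K = 2 codebook vectors with D = 2 channels.
codebook = [[0.0, 0.0], [1.0, 1.0]]
# Visual and text feature vectors are pooled into one list, as in step 3.1.
features = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.0]]
idx = quantize(features, codebook)
replaced = [codebook[i] for i in idx]   # discrete replacements of the inputs
updated = momentum_update(codebook, features, idx, lam_mom=0.5)
```

The straight-through gradient of step 3.2 is not shown, since it requires an automatic-differentiation framework.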
2. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: in step 1.1, the picture is scaled to 384 px × 128 px, the data are augmented by horizontal flipping and randomly adding Gaussian noise, and the picture is then divided into 192 blocks, each of size 16 px × 16 px.
3. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the proportion of picture blocks randomly selected for masking in step 1.2 is 5%, 10% or 15%.
4. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the proportion of text blocks randomly selected for masking in step 1.3 is 20%, 25%, 30% or 35%.
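The block count in claim 2 and the masking proportions in claims 3 and 4 can be checked with a short sketch. The helper names, the fixed random seed, the toy token numbers, the placeholder mask value and the rounding of the masked count are all illustrative assumptions.

```python
import random

def num_patches(h, w, p):
    """Number of p x p blocks in an h x w picture (h and w divisible by p)."""
    return (h // p) * (w // p)

def choose_masked(n_items, ratio, rng):
    """Randomly pick round(n_items * ratio) distinct positions to mask."""
    n_mask = round(n_items * ratio)
    return sorted(rng.sample(range(n_items), n_mask))

rng = random.Random(0)                       # fixed seed, only for reproducibility
n = num_patches(384, 128, 16)                # (384/16) * (128/16) = 24 * 8 blocks
picture_mask = choose_masked(n, 0.10, rng)   # mask 10% of the picture blocks

MASK_ID = -1                                 # hypothetical stand-in for the mask token
token_ids = [11, 27, 35, 42, 58, 63, 74, 89, 96, 5]    # toy tokenizer output
text_mask = choose_masked(len(token_ids), 0.30, rng)   # mask 30% of the text blocks
masked_tokens = [MASK_ID if i in text_mask else t
                 for i, t in enumerate(token_ids)]
```

With these settings the picture yields 192 blocks, of which 19 are masked, and 3 of the 10 toy tokens are replaced by the mask value.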
5. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the visual encoder $E_V$ of step 2.1 is a ResNet network or a Vision Transformer network.
6. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the text encoder $E_T$ of step 2.2 is a BERT network, an LSTM network or a Bi-LSTM network.
7. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the number of feature vectors in the codebook of step 3.1 is 512, 1024 or 2048.
8. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: the weights $\lambda_1$ and $\lambda_2$ of the loss functions $L_{recon}$ and $L_{codebook}$ of step 5.1 are both set to 0.2.
9. The text pedestrian search method based on the self-supervision mask model and the cross-modal codebook according to claim 1, wherein: step 5.2 specifically comprises: setting the batch size to 64 and using the Adam optimizer with an initial network learning rate of $4 \times 10^{-5}$; during the first ten training epochs, the learning rate increases linearly from $4 \times 10^{-6}$ to $4 \times 10^{-5}$; the learning rate is reduced to 10% of the original value at the 50th and 80th epochs, for a total of 120 training epochs.
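The learning-rate schedule of claim 9 can be written as a small function. This is a sketch under stated assumptions: epochs are counted from 0, the warm-up interpolates linearly so that the rate reaches 4e-5 at epoch 10, and each decay multiplies the current rate by 0.1. The claim does not fix these details, and the function name and parameter names are hypothetical.

```python
def learning_rate(epoch, base_lr=4e-5, warmup_start=4e-6,
                  warmup_epochs=10, milestones=(50, 80), gamma=0.1):
    """Learning rate for a 0-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warm-up from warmup_start toward base_lr.
        frac = epoch / warmup_epochs
        return warmup_start + frac * (base_lr - warmup_start)
    # After warm-up, apply one decay factor per milestone already reached.
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays

schedule = [learning_rate(e) for e in range(120)]   # 120 training epochs
```

Under these assumptions the rate starts at 4e-6, holds at 4e-5 from epoch 10 to epoch 49, drops to 4e-6 at epoch 50 and to 4e-7 at epoch 80.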
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310093067.3A CN116343109A (en) | 2023-02-10 | 2023-02-10 | Text pedestrian searching method based on self-supervision mask model and cross-mode codebook |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116343109A (en) | 2023-06-27 |
Family
ID=86879756
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310093067.3A Pending CN116343109A (en) | 2023-02-10 | 2023-02-10 | Text pedestrian searching method based on self-supervision mask model and cross-mode codebook |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116343109A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116758562A (en) * | 2023-08-22 | 2023-09-15 | 杭州实在智能科技有限公司 | Universal text verification code identification method and system |
CN116758562B (en) * | 2023-08-22 | 2023-12-08 | 杭州实在智能科技有限公司 | Universal text verification code identification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||