CN116383671A - Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment - Google Patents
Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
- Publication number
- CN116383671A CN116383671A CN202310328349.7A CN202310328349A CN116383671A CN 116383671 A CN116383671 A CN 116383671A CN 202310328349 A CN202310328349 A CN 202310328349A CN 116383671 A CN116383671 A CN 116383671A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- attention
- layer
- cross
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 32
- 230000006870 function Effects 0.000 claims abstract description 63
- 230000000007 visual effect Effects 0.000 claims abstract description 29
- 230000007246 mechanism Effects 0.000 claims abstract description 27
- 230000003993 interaction Effects 0.000 claims abstract description 15
- 230000000873 masking effect Effects 0.000 claims abstract description 13
- 238000005065 mining Methods 0.000 claims abstract description 11
- 239000011159 matrix material Substances 0.000 claims description 49
- 230000002452 interceptive effect Effects 0.000 claims description 15
- 210000002569 neuron Anatomy 0.000 claims description 14
- 230000009466 transformation Effects 0.000 claims description 13
- 230000004913 activation Effects 0.000 claims description 10
- 230000009467 reduction Effects 0.000 claims description 10
- 238000012549 training Methods 0.000 claims description 8
- 239000003550 marker Substances 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000002708 enhancing effect Effects 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000012216 screening Methods 0.000 claims description 3
- 230000013016 learning Effects 0.000 description 17
- 238000010586 diagram Methods 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000001629 suppression Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
- 230000031836 visual learning Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text-image cross-modal pedestrian retrieval method and system with implicit relation reasoning alignment. First, an image encoder and a text encoder are used to convert the pedestrian image to be processed and its corresponding text description into feature vector representations through self-attention and cross-attention mechanisms; the global image features and the text features are aligned through an SDM loss function, constructing the positional relation of the two modalities in a common feature space. Then, a cross-modal visual-text interaction encoder implicitly mines fine-grained relations through masked language modeling to learn discriminative global features for fine-grained interaction. Finally, based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distributions of the N image-text pair features are optimized using KL divergence, and alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching. The invention achieves high efficiency in text-to-image pedestrian recognition.
Description
Technical Field
The invention belongs to the technical field of cross-mode pedestrian re-recognition, relates to a text image cross-mode pedestrian retrieval method and system, and in particular relates to a text image cross-mode pedestrian retrieval method and system based on implicit relation reasoning alignment.
Background
In recent years, the task of text-to-image pedestrian retrieval has attracted more and more attention; it is widely applied in public security scenarios where an image of the target cannot be obtained. Text-to-image pedestrian retrieval aims to retrieve, from a large-scale image database, the target person that best matches a given text description, and is a comprehensive task integrating image-text retrieval and pedestrian re-identification. The core problem of this task is how to map the two different data modalities, text and images, into a common latent feature space.
Text-to-image pedestrian retrieval is extremely challenging due to the differences in internal features and the modal heterogeneity between vision and language. The visual appearance of the target pedestrian may be affected by many factors, such as pose, viewing angle, and illumination, while the text description may be affected by word order and ambiguity. The cross-modal feature alignment problem caused by the modal gap between vision and language is the core research question of this task. Researchers therefore need to explore better methods for obtaining more discriminative feature representations and to design better cross-modal matching methods that align images and text in a joint feature space. This is one of the research hotspots of text-to-image pedestrian retrieval.
Early text-to-image pedestrian retrieval efforts utilized VGG and LSTM to learn representations of the visual and text modalities and aligned images and text in a joint feature space by designing cross-modal matching loss functions. "Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation 9(8):1735-1780, 1997."
Later work improved the feature extraction backbone networks using ResNet50/101 and BERT and designed a new cross-modal projection matching loss for aligning global image-text features in the joint feature space. (1) "Yucheng Chen, Rui Huang, Hong Chang, Chuanqi Tan, Tao Xue, and Bingpeng Ma. Cross-modal knowledge adaptation for language-based person search. IEEE Transactions on Image Processing, 30:4057-4069, 2021." (2) "Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5814-5824, 2019." (3) "Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), 2018."
Recent research has widely utilized additional local feature learning branches, and some works explicitly use external tools such as body segmentation, body part information, color information, and text phrase segmentation. Some works also use attention mechanisms to perform local feature learning; although this local matching strategy improves retrieval performance, it inevitably introduces noise and increases uncertainty in the retrieval process. The limitation of these efforts is that they do not utilize the recently popular visual-language pre-trained models and therefore lack powerful cross-modal alignment capabilities.
Some recent works apply CLIP to text-to-image pedestrian retrieval, enabling knowledge transfer from CLIP through a momentum contrastive learning framework or a fine-grained information mining framework. (1) "Xiao Han, Sen He, Li Zhang, and Tao Xiang. Text-based person search with limited data. arXiv preprint arXiv:2110.10807, 2021." (2) "Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. CLIP-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022."
However, these methods use only the image encoder of CLIP and fail to migrate the knowledge of the complete CLIP image-text encoder pair to the text-to-image pedestrian retrieval datasets, and thus fail to achieve optimal performance.
Disclosure of Invention
Aiming at the problems of lack of corresponding relation between multi-modal data of visual-text characteristics, intra-modal information distortion caused by explicit local matching and the like in the prior art, the invention provides a text image cross-modal pedestrian retrieval method and system based on implicit relation reasoning alignment.
The technical scheme adopted by the method is as follows: a cross-modal pedestrian retrieval method for text images with implicit relation reasoning alignment comprises the following steps:
step 1: the image encoder and the text encoder are respectively utilized, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms, the global image features and the text features are aligned through the SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
the image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
The multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, randomly distributes weight and bias to each neuron, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the bias value to the result, so that the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
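As an illustrative sketch only (not the claimed implementation; the dimensions, the random weight initialization, and the ReLU activation are assumptions for demonstration), the encoder layer described above — multi-head self-attention, a shortcut addition, and a feed-forward layer — can be expressed as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads):
    # x: (seq_len, d_model); per-head projection matrices are omitted for
    # brevity — each head attends over its own slice of the features.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]
        # scale the query-key dot products by the square root of the head
        # feature dimension, normalize the scores with softmax to obtain
        # weights, then take a weighted sum of the value vectors
        attn = softmax(q @ k.T / np.sqrt(d_head))
        heads.append(attn @ v)
    # splice the head outputs together; a final linear map would follow here
    return np.concatenate(heads, axis=-1)

def encoder_layer(x, num_heads=4, seed=0):
    # residual (shortcut) connection: add the layer input to its output
    h = x + multi_head_self_attention(x, num_heads)
    # feed-forward layer: multiply inputs by (randomly initialized) weights,
    # add a bias, then pass the result through an activation function
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(h.shape[1], h.shape[1])) * 0.02
    b = np.zeros(h.shape[1])
    return h + np.maximum(h @ w + b, 0.0)
```

In this sketch the residual addition and the softmax normalization follow the standard Transformer layer layout that the claims describe.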
Step 2: implicitly mining the fine-grained relations by using a cross-modal visual-text interaction encoder through masked language modeling so as to learn discriminative global features, thereby carrying out fine-grained interaction;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention mechanism layer splits the input into two parts: one for generating a query matrix and the other for generating a key-value matrix; the query matrix is intended to learn the representation of each spatial location, while the key-value matrix is used to learn the correlation between different locations; the query matrix is then applied to the key-value matrix to obtain an attention matrix, which is used to compute a weighted sum over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V), thereby obtaining the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
step 3: based on the image-text similarity distribution matching (SDM) loss, optimizing the cosine similarity distributions of the N image-text pair features using KL divergence, and realizing alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby realizing cross-modal matching.
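The SDM objective of step 3 can be illustrated with the following hedged sketch (the temperature `tau` and the exact form of the matching distribution are illustrative assumptions, not the patented formula):

```python
import numpy as np

def sdm_loss(img_feats, txt_feats, labels, tau=0.02, eps=1e-8):
    """Similarity-distribution-matching sketch: a KL divergence between the
    softmax over image-text cosine similarities and the normalized
    label-matching distribution; minimizing it aligns the two."""
    # L2-normalize so that dot products are cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = img @ txt.T                          # (N, N) cosine similarities
    p = np.exp(sim / tau)
    p = p / p.sum(axis=1, keepdims=True)       # predicted matching distribution
    match = (labels[:, None] == labels[None, :]).astype(float)
    q = match / match.sum(axis=1, keepdims=True)  # normalized label distribution
    kl = (p * np.log((p + eps) / (q + eps))).sum(axis=1)
    return float(kl.mean())
```

The `eps` term guards the logarithm where the label distribution assigns zero probability; a production loss would typically also symmetrize over the text-to-image direction.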
The system of the invention adopts the technical proposal that: a text image cross-modality pedestrian retrieval system with implicit relationship reasoning alignment, comprising the following modules:
the first module is used for converting the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms by utilizing an image encoder and a text encoder respectively, aligning the global image features and the text features through an SDM loss function, and constructing the positional relation of the two modalities in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, randomly distributes weight and bias to each neuron, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the bias value to the result, so that the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
The second module is used for implicitly mining fine-grained relations through masked language modeling by utilizing the cross-modal visual-text interactive encoder so as to assist the image encoder and the text encoder in learning discriminative global features, thereby enhancing the retrieval performance of a text-to-image pedestrian retrieval system;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention mechanism layer splits the input into two parts: one for generating a query matrix and the other for generating a key-value matrix; the query matrix is intended to learn the representation of each spatial location, while the key-value matrix is used to learn the correlation between different locations; the query matrix is then applied to the key-value matrix to obtain an attention matrix, which is used to compute a weighted sum over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V), thereby obtaining the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
and the third module is used for, based on the image-text similarity distribution matching (SDM) loss, optimizing the cosine similarity distributions of the N image-text pair features using KL divergence, and realizing alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, so as to realize cross-modal matching.
The invention has the advantages that:
1. a new cross-modal matching loss function is designed, and the image-text alignment capability can be remarkably improved.
2. The designed implicit relation reasoning module utilizes a masked language modeling task to implicitly mine fine-grained relations so as to assist the image encoder and the text encoder in learning discriminative global features, thereby enhancing the retrieval performance of a text-to-image pedestrian retrieval system without additional supervision and inference cost.
3. And the knowledge of the general image-text large model CLIP is successfully transferred to the special text-to-image pedestrian re-identification data, so that the basic alignment capability of the image text is remarkably improved.
4. The performance of the proposed cross-modal implicit relation reasoning alignment network (IRRA) on multiple public datasets is remarkably improved compared with previous work, making it the most advanced text-to-image pedestrian re-identification method.
Drawings
FIG. 1 is a diagram of a cross-modal implicit relationship inference alignment network (IRRA) architecture in accordance with an embodiment of the present invention.
FIG. 2 is a block diagram of a text encoder according to an embodiment of the present invention;
FIG. 3 is a block diagram of an image encoder according to an embodiment of the present invention;
fig. 4 is a diagram of a visual text interactive encoder according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments; it should be understood that the embodiments described herein are for illustration and explanation only and are not intended to limit the invention.
The present invention proposes a cross-modal implicit relationship inference alignment network (IRRA) that enhances global image-text matching by learning and inferring relationships between local visual-text labels, and without additional supervision and inference costs.
The cross-modal implicit relationship inference alignment network (IRRA) of the present embodiment is comprised of an image encoder, a text encoder, and a cross-modal visual text interaction encoder.
Referring to fig. 1, the method for searching the text image cross-mode pedestrians with the aligned implicit relation reasoning provided by the invention comprises the following steps:
step 1: the image encoder and the text encoder are respectively utilized, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms, the global image features and the text features are aligned through the SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
referring to fig. 2 and 3, the image encoder and the text encoder of the present embodiment each include a multi-head self-attention layer, a residual connection layer, and a feedforward full connection layer;
the multi-head self-attention layer of the embodiment transmits the query vector, the key vector and the value vector to a plurality of independent attention heads respectively; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual connection layer of the embodiment adds a shortcut connection to the output of the multi-head self-attention layer of the network and directly connects to the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer; the shortcut connection in the residual connection layer refers to: the output of the input after passing through one multi-head self-attention layer is added with the output of the input without passing through the multi-head self-attention layer to obtain the final output. This shortcut addition operation is the implementation of a shortcut connection. The connection mode can avoid the degradation phenomenon of the deep neural network, so that the network is trained better.
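The shortcut addition described in this paragraph reduces to a one-line sketch (illustrative only):

```python
import numpy as np

def with_shortcut(sublayer, x):
    """Residual (shortcut) connection: the input bypasses the sub-layer and
    is added to the sub-layer's output, so even when the sub-layer
    contributes nothing the identity mapping is preserved, which helps deep
    networks avoid the degradation phenomenon during training."""
    return x + sublayer(x)
```

For example, a sub-layer that outputs all zeros leaves the input unchanged, which is exactly the property that makes very deep stacks trainable.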
The feedforward full-connection layer of the embodiment takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by its weight value, adds the same, and then adds the offset value to the result, which is a single number; this number is then passed to an activation function which maps it to another range and generates the final output.
The present embodiment employs an image and text encoder of the CLIP model to initialize the model backbone network in order to enhance the image text alignment capabilities of the text-to-image pedestrian retrieval model base.
The image encoder takes a given image as input and uses the CLIP pre-trained Vision Transformer (ViT) to obtain image features. "Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020."
The image encoder segments a given input image into a sequence of non-overlapping, fixed-size image patches. The patch sequence is then mapped onto the corresponding tokens by a trainable linear projection. The token sequence is input into L layers of Transformer blocks, and the correlation between the image patches is modeled using their positional features and an additional marker token. Finally, the image patches are encoded as features having the same dimension as the text features to obtain a global image representation.
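A minimal sketch of the patch-splitting and projection steps (the 16-pixel patch size, the 512-dimensional token size, and the random projection are illustrative assumptions, not the patented values):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into a sequence of non-overlapping,
    fixed-size patches, each flattened for the linear projection."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    # reshape so that index [r, i, c, j, ch] = image[r*patch+i, c*patch+j, ch]
    return (image.reshape(rows, patch, cols, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(rows * cols, patch * patch * c))

def embed_patches(patches, d_model=512, seed=0):
    """Map each flattened patch onto a token via a trainable linear
    projection (here randomly initialized), then add position features."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(patches.shape[1], d_model)) * 0.02
    tokens = patches @ proj
    positions = rng.normal(size=tokens.shape) * 0.02  # positional features
    return tokens + positions
```

For a 224x224 RGB image and 16-pixel patches this yields 196 tokens, matching the usual ViT setup.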
The CLIP text encoder is used to extract features of the input pedestrian text description and of the text description after randomly masking words. The encoder uses byte pair encoding (BPE) to tokenize the input text; the original text and the randomly masked text share the same text feature encoder. The extracted original text description features are linearly projected into the image-text joint feature space, and the feature at the [EOS] token is taken as the global text representation. For the randomly masked text, the extracted features at each token are fused with the features at the image tokens through a cross-modal cross-attention mechanism, and the fused multi-modal features are sent to the implicit relation reasoning module to learn a masked-word prediction task, improving the model's ability to mine and align cross-modal fine-grained features. (1) "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017." (2) "Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021."
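The random word masking applied before the shared text encoder can be sketched as follows (the `[MASK]` symbol and the 15% default rate are assumptions borrowed from standard masked language modeling, not necessarily the values used by the invention):

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol

def mask_words(tokens, mask_ratio=0.15, seed=42):
    """Randomly replace a fraction of text tokens with a mask token,
    returning the masked sequence plus the positions (and original words)
    that the masked-word prediction task must recover."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            masked.append(MASK_TOKEN)
            targets[i] = tok  # the original word is the prediction target
        else:
            masked.append(tok)
    return masked, targets
```

The returned `targets` dictionary plays the role of the supervision signal: the implicit relation reasoning module predicts each original word from the fused image-text features at the masked positions.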
Step 2: implicitly mining the fine-grained relations by using the cross-modal visual-text interaction encoder through masked language modeling so as to learn discriminative global features, thereby carrying out fine-grained interaction;
please refer to fig. 4, the cross-modal visual text interactive encoder of the present embodiment includes a cross-attention mechanism layer, a multi-head self-attention layer, a residual connection layer and a feed-forward full connection layer;
the cross-attention mechanism layer of this embodiment splits the input into two parts: one for generating a query matrix and the other for generating a key-value matrix; the query matrix is intended to learn the representation of each spatial location, while the key-value matrix is used to learn the correlation between different locations; the query matrix is then applied to the key-value matrix to obtain an attention matrix, which is used in the weighted summation of the feature matrices input to the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V) to obtain the final feature representation;
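A minimal sketch of this cross-attention layer, with the masked-text features as queries and the image features serving as both keys and values (all dimensions are illustrative):

```python
import numpy as np

def cross_attention(text_q, image_kv):
    """Masked-text feature matrix Q attends over image features (K = V):
    the attention matrix provides, per text token, weights for a sum over
    the image value vectors, yielding a fused text-image representation."""
    d = text_q.shape[1]
    scores = text_q @ image_kv.T / np.sqrt(d)          # (L_text, L_image)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)            # one distribution per token
    return attn @ image_kv                             # (L_text, d)
```

The output has one fused vector per masked-text token, which is exactly the shape the subsequent multi-head self-attention layer expects.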
the multi-head self-attention layer of the embodiment transmits the query vector, the key vector and the value vector to a plurality of independent attention heads respectively; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual connection layer of this embodiment adds a shortcut connection that carries the input of the multi-head self-attention layer directly to its output; the shortcut output and the layer output are added to obtain the final output of the layer;
the feed-forward fully connected layer of this embodiment takes the output of the multi-head self-attention layer as input; each neuron multiplies its inputs by their weights, sums the products, and adds a bias value, yielding a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
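The cross-attention, multi-head self-attention, residual-connection and feed-forward layers described above can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: pre-norm residual blocks, a ReLU standing in for the unnamed activation function, and no learnable layer-norm parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def multi_head(x_q, x_kv, Wq, Wk, Wv, Wo, n_heads):
    # project, split into heads, scale dot products by sqrt(d_head),
    # softmax-normalize, weight-sum the values, concatenate heads, project down
    (Lq, d), Lkv = x_q.shape, x_kv.shape[0]
    dh = d // n_heads
    split = lambda x, W, L: (x @ W).reshape(L, n_heads, dh).transpose(1, 0, 2)
    Q, K, V = split(x_q, Wq, Lq), split(x_kv, Wk, Lkv), split(x_kv, Wv, Lkv)
    att = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh)) @ V   # (heads, Lq, dh)
    return att.transpose(1, 0, 2).reshape(Lq, d) @ Wo

def interaction_layer(text_masked, image_feats, p, n_heads=2):
    # cross-attention: masked text supplies the queries Q,
    # image features supply the keys K and values V
    h = text_masked + multi_head(layer_norm(text_masked), layer_norm(image_feats),
                                 p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)
    # multi-head self-attention over the fused sequence, with residual shortcut
    h = h + multi_head(layer_norm(h), layer_norm(h),
                       p["Wq2"], p["Wk2"], p["Wv2"], p["Wo2"], n_heads)
    # feed-forward fully connected layer: weighted sum plus bias, then activation
    ff = np.maximum(layer_norm(h) @ p["W1"] + p["b1"], 0.0)
    return h + ff @ p["W2"] + p["b2"]
```

The parameter dictionary p holds randomly initialized projection matrices in this sketch; in the described system they are trained end to end.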
the cross-modal visual-text interactive encoder of this embodiment learns discriminative global features by implicitly mining fine-grained relations through the masked-language modeling task;
the specific implementation comprises the following substeps:
step 2.1: the visual text interactive encoder consists of a multi-head cross attention layer and four layers of Transformer blocks;
wherein h represents the fused image and masked-text contextual representation, computed as h = Transformer(LN(MCA(LN(T^m), LN(V), LN(V)))), where LN(·) represents layer normalization, MCA(·) represents the multi-head cross-attention mechanism, and Transformer(·) represents inputting the corresponding data into the Transformer blocks to obtain an output; the superscript m indicates a feature representation at the masked text, N represents the total number of image-text representation pairs, and d represents the feature dimension of the mask token; T^m ∈ R^{N×d} is the masked text feature serving as the query Q, and V ∈ R^{N×d} is the image feature serving as the key K and the value V;
step 2.2: for each masked position i ∈ M, an MLP classifier is used to predict the probability distribution m_i ∈ R^{|V|} of the corresponding original token, where |V| is the size of the vocabulary V and M represents the set of masked text positions;
in the MLP classifier of this embodiment, the input vector passes through several fully connected layers; non-linear transformations and Dropout layers are inserted between the fully connected layers for regularization to prevent overfitting, and a softmax function follows the last fully connected layer to convert the network output into a probability distribution, thereby performing classification prediction of the masked text word;
Wherein the loss is the cross-entropy L_irr = -(1/|M|) Σ_{i∈M} y_i^T log(m_i), where M represents the set of masked text tokens, m_i is the predicted token probability distribution, and y_i is the one-hot vector of the real token, in which the probability of the real token is 1.
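A minimal sketch of the MLP classifier and masked-token prediction loss of steps 2.1-2.2, under the assumptions of a single hidden layer with ReLU (the Dropout regularization mentioned above is omitted, being active only during training) and integer token ids in place of one-hot label vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp_predict(h_masked, W1, b1, W2, b2):
    # fully connected layers with a non-linearity in between; the final softmax
    # converts the network output into a probability distribution per position
    z = np.maximum(h_masked @ W1 + b1, 0.0)
    return softmax(z @ W2 + b2)          # one distribution over the vocabulary per token

def irr_loss(probs, target_ids):
    # cross-entropy between predicted distributions m_i and one-hot labels y_i,
    # averaged over the set M of masked positions
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked + 1e-12)))
```

Because the labels are one-hot, the dot product y_i^T log(m_i) reduces to picking the log-probability of the true token, which is what the indexing in irr_loss does.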
Step 3: based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distributions of the N image-text feature pairs are compared with the normalized label matching distribution through the KL divergence, and alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby achieving cross-modal matching.
The specific implementation of step 3 in this embodiment includes the following sub-steps:
step 3.1: for each image global representation f_i^v, define the set of image-text representation pairs as {(f_i^v, f_j^t), y_{i,j}}, j = 1, ..., N, wherein N represents the total number of image-text representation pairs; y_{i,j} is a true matching label, y_{i,j} = 1 meaning that (f_i^v, f_j^t) is a matched pair from the same identity, and y_{i,j} = 0 representing a non-matched pair; let sim(u, v) = u^T v / (||u|| ||v||) represent the normalized dot product of u and v (i.e., cosine similarity);
the probability p_{i,j} of a matched pair is calculated using the following softmax function: p_{i,j} = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ), wherein τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_{i,j} is thus, within a mini-batch, the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t;
Wherein ε is a small value to avoid numerical overflow, q_{i,j} = y_{i,j} / Σ_{k=1}^{N} y_{i,k} is the true matching probability, and KL(p_i || q_i) represents the KL divergence from p_i to q_i, so that the image-to-text loss is L_i2t = KL(p_i || q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} log(p_{i,j} / (q_{i,j} + ε)); (4)
Wherein i2t represents matching from the image to the text direction, and t2i represents matching from the text to the image direction; the calculation of the matching-pair probability p_{i,j} differs in direction: L_i2t represents the loss function for matching from image to text, calculated as shown in equation (4), while L_t2i represents the loss function for matching from text to image, obtained from equation (4) by exchanging the roles of the image and text features; the overall SDM loss is L_sdm = L_i2t + L_t2i;
Step 3.4: alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby achieving cross-modal matching.
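The SDM computation of steps 3.1-3.4 can be sketched as follows. One simplifying assumption is made explicit in the code: since the label matrix y is symmetric, the same normalized matching distribution q serves both the image-to-text and text-to-image directions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-8):
    # KL(p || q) per row, averaged over the mini-batch; eps avoids overflow
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def sdm_loss(f_img, f_txt, pids, tau=0.02):
    # cosine similarity = dot product of L2-normalized features
    f_img = f_img / np.linalg.norm(f_img, axis=1, keepdims=True)
    f_txt = f_txt / np.linalg.norm(f_txt, axis=1, keepdims=True)
    sim = f_img @ f_txt.T                               # N x N similarity matrix
    y = (pids[:, None] == pids[None, :]).astype(float)  # y_ij = 1 for same identity
    q = y / y.sum(axis=1, keepdims=True)                # normalized label matching distribution
    p_i2t = softmax(sim / tau, axis=1)                  # image-to-text matching probabilities
    p_t2i = softmax(sim.T / tau, axis=1)                # text-to-image matching probabilities
    return kl_div(p_i2t, q) + kl_div(p_t2i, q)          # bi-directional SDM loss
```

When image and text features of the same identity coincide, the predicted distributions collapse onto the label distribution and the loss approaches zero; misaligned pairs drive it up.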
The cross-modal implicit relationship reasoning alignment network of the embodiment is a trained network;
the functions adopted in the training process are as follows:
L = L_irr + L_sdm + L_id, wherein L represents the overall training objective, L_irr represents the objective function of the IRR module, L_sdm represents the bi-directional SDM loss function, and L_id represents the ID loss function;
wherein the ID loss is the cross-entropy L_id = -(1/N) Σ_{i=1}^{N} [ y_i^T log softmax(s_i^v) + y_i^T log softmax(s_i^t) ], in which s_i^v and s_i^t respectively represent the logits output by the image and text classification networks for category i, and y represents the real label.
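As a sketch, the ID loss reduces to cross-entropy between each modality's identity-classification logits and the shared identity label. The classifier producing the logits is assumed, not specified here; integer labels stand in for one-hot vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def id_loss(img_logits, txt_logits, labels):
    # cross-entropy of the image and text classification logits against
    # the shared identity labels, summed over the two modalities
    def ce(logits, y):
        p = softmax(logits, axis=1)
        return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))
    return ce(img_logits, labels) + ce(txt_logits, labels)
```

Because both modalities are classified against the same identity label, minimizing this loss pulls same-identity image and text features together in the joint space.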
The present embodiment trains the IRRA framework using the ID loss function and the SDM loss function together with the IRR loss function. The ID loss groups images or texts according to their corresponding identities and explicitly considers intra-modal distances, so that feature representations of the same image-text group lie closer in the joint feature space. Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1-23, 2020.
The CUHK-PEDES dataset was used in this example during training. The CUHK-PEDES dataset was the first dataset dedicated to text-to-image pedestrian retrieval, containing a total of 40206 images of 13003 people and 80412 text descriptions. The training set includes 11003 characters, 34054 images, and 68108 text descriptions; the validation set includes 1000 people, 3078 images, and 6158 text descriptions; the test set includes 1000 people, 3074 images, and 6156 text descriptions.
The method of the present application is further illustrated by specific experiments below.
In the experiments, the hidden size of each layer of the visual-text interactive encoder is set to 512 and the number of attention heads is set to 8. The dimension of all image and text representations is set to 512. All input images are resized to 384×128. The maximum length of the text sequence is set to 77. The learning rate is initialized to 1×10⁻⁵ with cosine learning-rate decay. For randomly initialized modules, the initial learning rate is set to 5×10⁻⁵. The temperature parameter τ in the SDM loss is set to 0.02.
Experiments were performed on a single RTX 3090 24GB GPU using PyTorch. During training, random horizontal flipping, random cropping with padding, and random erasing are used to augment the image data. The model is trained with the Adam optimizer [23] for 60 epochs. Initially, the learning rate is linearly increased from 1×10⁻⁶ to 1×10⁻⁵ over 5 warm-up epochs.
The common Rank-k metric is adopted as the main evaluation index, namely: the probability that at least one matching pedestrian image appears in the top-k candidate list when retrieving with a given text description. In addition, for comprehensive evaluation, mean average precision (mAP) and mINP [51] are used as two further retrieval criteria. Higher Rank-k, mAP and mINP indicate better performance.
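The Rank-k and mAP metrics described above can be computed from the text-to-image similarity matrix as below. This is a common implementation of the metrics, not code from the patent; mINP is omitted.

```python
import numpy as np

def evaluate(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    # sim[i, j]: similarity of text query i to gallery image j
    order = np.argsort(-sim, axis=1)                       # rank gallery by similarity
    matches = gallery_ids[order] == query_ids[:, None]     # relevance flags in rank order
    rank_k = {k: float(matches[:, :k].any(axis=1).mean()) for k in ks}
    aps = []
    for row in matches:                                    # average precision per query
        hits = np.where(row)[0]
        if len(hits) == 0:
            aps.append(0.0)                                # no relevant image in gallery
        else:
            aps.append(((np.arange(len(hits)) + 1) / (hits + 1)).mean())
    return rank_k, float(np.mean(aps))                     # Rank-k dict and mAP
```

Rank-k asks whether any relevant image lands in the top k, while mAP averages the precision at every relevant position, rewarding rankings that place all matches early.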
In order to verify the effectiveness of the present invention, it is compared with the existing state-of-the-art methods, mainly including:
(1) ISANet: Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. arXiv preprint arXiv:2208.14365, 2022.
(2) LBUL: Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984–1992, 2022.
(3) SAF: Shiping Li, Min Cao, and Min Zhang. Learning semantic-aligned feature representation for text-based person search. In ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2724–2728. IEEE, 2022.
(4) TIPCB: Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, and Yuhui Zheng. TIPCB: A simple but effective part-based convolutional baseline for text-based person search. Neurocomputing, 494:171–181, 2022.
(5) CAIBC: Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. CAIBC: Capturing all-round information beyond color for text-based person retrieval. arXiv preprint arXiv:2209.05773, 2022.
(6) AXM-Net: Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. AXM-Net: Implicit cross-modal feature alignment for person re-identification. 36(4):4477–4485, 2022.
(7) LGUR: Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. arXiv preprint arXiv:2207.07802, 2022.
(8) IVT: Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. arXiv preprint arXiv:2208.08608, 2022.
(9) CFine: Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. CLIP-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022.
The results of the tests on the CUHK-PEDES dataset are shown in Table 1:
TABLE 1
As can be seen from table 1: the method has the advantages that all indexes are higher than those of the existing method, and the performance is obviously improved. There are two main reasons: 1. the implicit relation reasoning module used in the invention utilizes mask masking modeling to enable the model to learn the alignment relation of fine granularity information among image text modes, thereby realizing full cross-mode interaction. 2. The similarity distribution matching loss provided by the invention effectively expands the variance between non-matching pairs and the correlation between matching pairs, and realizes the alignment between the image text similarity distribution and the standardized label matching distribution by minimizing KL divergence, thereby achieving the effect of cross-modal matching.
The innovation of the invention comprises:
1. an IRRA framework is presented that implicitly utilizes fine-grained interactions to enhance global alignment without additional supervision and reasoning costs.
2. A new cross-modal matching loss function, namely the image-text Similarity Distribution Matching (SDM) loss, is designed to minimize the KL divergence between the image-text similarity distribution and the normalized label matching distribution.
3. An implicit relation reasoning module is designed, and mask modeling is utilized to enable the model to learn the alignment relation of fine granularity information between the image and the text mode.
It should be understood that the foregoing description of the preferred embodiments is illustrative and is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the claims.
Claims (8)
1. The cross-modal pedestrian retrieval method for the text images with the implicit relation reasoning alignment is characterized by comprising the following steps of:
step 1: using an image encoder and a text encoder respectively, converting the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms, aligning the global image features and text features through the SDM loss function, and constructing the positional relationship of the two modalities in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input and assigns each neuron a randomly initialized weight and bias; each neuron multiplies its inputs by the weights, sums the products, and adds the bias value, yielding a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
Step 2: implicitly mining the fine granularity relation by using a cross-modal visual text interaction encoder through mask masking modeling so as to learn global features with discrimination, thereby carrying out fine granularity interaction;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention layer splits the input into two parts: one generates the query matrix and the other generates the key and value matrices; the query matrix learns a representation for each spatial location, while the key and value matrices capture the correlations between different locations; the query matrix is then applied to the key matrix to obtain an attention matrix, which is used in the weighted summation of the masked text feature matrix Q and the image feature matrices K and V input to the cross-attention mechanism, so as to obtain the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
step 3: based on the image-text similarity distribution matching (SDM) loss, comparing the cosine similarity distributions of the N image-text feature pairs with the normalized label matching distribution through the KL divergence, and achieving alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby achieving cross-modal matching.
2. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval method of claim 1, wherein: step 2, the cross-modal visual text interactive encoder learns the global feature with discriminant by implicitly mining fine granularity relations through a mask masking modeling task;
The specific implementation comprises the following substeps:
step 2.1: the visual text interactive encoder consists of a multi-head cross attention layer and four layers of Transformer blocks;
wherein h represents the fused image and masked-text contextual representation, computed as h = Transformer(LN(MCA(LN(T^m), LN(V), LN(V)))), where LN(·) represents layer normalization, MCA(·) represents the multi-head cross-attention mechanism, and Transformer(·) represents inputting the corresponding data into the Transformer blocks to obtain an output; the superscript m indicates a feature representation at the masked text, N represents the total number of image-text representation pairs, and d represents the feature dimension of the mask token; T^m ∈ R^{N×d} is the masked text feature serving as the query Q, and V ∈ R^{N×d} is the image feature serving as the key K and the value V;
step 2.2: for each masked position i ∈ M, an MLP classifier is used to predict the probability distribution m_i ∈ R^{|V|} of the corresponding original token, where |V| is the size of the vocabulary V and M represents the set of masked text positions;
the MLP classifier is characterized in that an input vector passes through a plurality of full-connection layers, nonlinear transformation and a Dropout layer are added between the full-connection layers to perform regularization so as to prevent overfitting, a softmax function is added after the last full-connection layer, and the output of a network is converted into probability distribution, so that classified prediction of a shielded text word is performed;
3. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval method of claim 1, wherein the specific implementation of step 3 includes the sub-steps of:
step 3.1: for each image global representation f_i^v, define the set of image-text representation pairs as {(f_i^v, f_j^t), y_{i,j}}, j = 1, ..., N, wherein N represents the total number of image-text representation pairs; y_{i,j} is a true matching label, y_{i,j} = 1 meaning that (f_i^v, f_j^t) is a matched pair from the same identity, and y_{i,j} = 0 representing a non-matched pair; let sim(u, v) = u^T v / (||u|| ||v||) represent the normalized dot product of u and v;
the probability p_{i,j} of a matched pair is calculated using the following softmax function: p_{i,j} = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ), wherein τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_{i,j} is thus, within a mini-batch, the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t;
Wherein ε is a small value to avoid numerical overflow, q_{i,j} = y_{i,j} / Σ_{k=1}^{N} y_{i,k} is the true matching probability, and KL(p_i || q_i) represents the KL divergence from p_i to q_i, so that the image-to-text loss is L_i2t = KL(p_i || q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} log(p_{i,j} / (q_{i,j} + ε));
Where i2t represents a match from the image to the text direction, and t2i represents a match from the text to the image direction;
step 3.4: alignment between image text similarity distribution and standardized label matching distribution is achieved by minimizing KL divergence, and cross-mode matching is achieved.
4. A method for cross-modal pedestrian retrieval of text images aligned by implicit relationship reasoning according to any one of claims 1-3, characterized in that: the cross-modal visual text interaction encoder and the visual text interaction encoder form a cross-modal implicit relation reasoning alignment network, and the cross-modal implicit relation reasoning alignment network is a trained network;
the functions adopted in the training process are as follows:
L = L_irr + L_sdm + L_id, wherein L represents the overall training objective, L_irr represents the objective function of the IRR module, L_sdm represents the bi-directional SDM loss function, and L_id represents the ID loss function;
5. A text image cross-modality pedestrian retrieval system with implicit relationship reasoning alignment comprising the following modules:
the first module is used for converting the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms, using an image encoder and a text encoder respectively, aligning the global image features and the text features through an SDM loss function, and constructing the positional relationship of the two modalities in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input and assigns each neuron a randomly initialized weight and bias; each neuron multiplies its inputs by the weights, sums the products, and adds the bias value, yielding a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
The second module is used for implicitly mining fine granularity relations through mask masking modeling by utilizing the cross-mode visual text interactive encoder so as to assist the image encoder and the text encoder to learn global features with discrimination, thereby enhancing the retrieval performance of a pedestrian retrieval system from text to image;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention layer splits the input into two parts: one generates the query matrix and the other generates the key and value matrices; the query matrix learns a representation for each spatial location, while the key and value matrices capture the correlations between different locations; the query matrix is then applied to the key matrix to obtain an attention matrix, which is used in the weighted summation of the masked text feature matrix Q and the image feature matrices K and V input to the cross-attention mechanism, so as to obtain the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
and the third module is used for, based on the image-text similarity distribution matching (SDM) loss, comparing the cosine similarity distributions of the N image-text feature pairs with the normalized label matching distribution through the KL divergence, and achieving alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, so as to achieve cross-modal matching.
6. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval system of claim 5 wherein: the cross-modal visual text interactive encoder in the second module learns the global features with discriminant by implicitly mining fine-grained relationships through a mask masking modeling task;
The specific implementation comprises the following sub-modules:
the module 2.1 is used for the visual text interactive encoder and consists of a multi-head cross attention layer and four layers of Transformer blocks;
wherein h represents the fused image and masked-text contextual representation, computed as h = Transformer(LN(MCA(LN(T^m), LN(V), LN(V)))), where LN(·) represents layer normalization, MCA(·) represents the multi-head cross-attention mechanism, and Transformer(·) represents inputting the corresponding data into the Transformer blocks to obtain an output; the superscript m indicates a feature representation at the masked text, N represents the total number of image-text representation pairs, and d represents the feature dimension of the mask token; T^m ∈ R^{N×d} is the masked text feature serving as the query Q, and V ∈ R^{N×d} is the image feature serving as the key K and the value V;
a module 2.2, for predicting, for each masked position i ∈ M, the probability distribution m_i ∈ R^{|V|} of the corresponding original token using an MLP classifier, where |V| is the size of the vocabulary V and M represents the set of masked text positions;
the MLP classifier is characterized in that an input vector passes through a plurality of full-connection layers, nonlinear transformation and a Dropout layer are added between the full-connection layers to perform regularization so as to prevent overfitting, a softmax function is added after the last full-connection layer, and the output of a network is converted into probability distribution, so that classified prediction of a shielded text word is performed;
7. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval system of claim 5 wherein the third module includes the following sub-modules:
a first sub-module, for defining, for each image global representation f_i^v, the set of image-text representation pairs as {(f_i^v, f_j^t), y_{i,j}}, j = 1, ..., N, wherein N represents the total number of image-text representation pairs; y_{i,j} is a true matching label, y_{i,j} = 1 meaning that (f_i^v, f_j^t) is a matched pair from the same identity, and y_{i,j} = 0 representing a non-matched pair; let sim(u, v) = u^T v / (||u|| ||v||) represent the normalized dot product of u and v;
the probability p_{i,j} of a matched pair is calculated using the following softmax function: p_{i,j} = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ), wherein τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_{i,j} is thus, within a mini-batch, the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t;
Wherein ε is a small value to avoid numerical overflow, q_{i,j} = y_{i,j} / Σ_{k=1}^{N} y_{i,k} is the true matching probability, and KL(p_i || q_i) represents the KL divergence from p_i to q_i, so that the image-to-text loss is L_i2t = KL(p_i || q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} log(p_{i,j} / (q_{i,j} + ε));
Where i2t represents a match from the image to the text direction, and t2i represents a match from the text to the image direction;
and the fourth sub-module is used for realizing alignment between the image text similarity distribution and the standardized label matching distribution by minimizing KL divergence so as to realize cross-mode matching.
8. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval system of any of claims 5-7, wherein: the cross-modal visual text interaction encoder and the visual text interaction encoder form a cross-modal implicit relation reasoning alignment network, and the cross-modal implicit relation reasoning alignment network is a trained network;
the functions adopted in the training process are as follows:
L = L_irr + L_sdm + L_id, wherein L represents the overall training objective, L_irr represents the objective function of the IRR module, L_sdm represents the bi-directional SDM loss function, and L_id represents the ID loss function;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310328349.7A CN116383671B (en) | 2023-03-27 | 2023-03-27 | Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116383671A true CN116383671A (en) | 2023-07-04 |
CN116383671B CN116383671B (en) | 2024-05-28 |
Family
ID=86980048
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114663677A (en) * | 2022-04-08 | 2022-06-24 | 杭州电子科技大学 | Visual question answering method based on cross-modal pre-training feature enhancement |
CN114926835A (en) * | 2022-05-20 | 2022-08-19 | 京东科技控股股份有限公司 | Text generation method and device, and model training method and device |
CN115033670A (en) * | 2022-06-02 | 2022-09-09 | 西安电子科技大学 | Cross-modal image-text retrieval method with multi-granularity feature fusion |
CN115292533A (en) * | 2022-08-17 | 2022-11-04 | 苏州大学 | Cross-modal pedestrian retrieval method driven by visual positioning |
CN115311389A (en) * | 2022-08-05 | 2022-11-08 | 西北大学 | Multi-mode visual prompting technology representation learning method based on pre-training model |
WO2022261570A1 (en) * | 2021-08-04 | 2022-12-15 | Innopeak Technology, Inc. | Cross-attention system and method for fast video-text retrieval task with image clip |
WO2023004206A1 (en) * | 2021-08-04 | 2023-01-26 | Innopeak Technology, Inc. | Unsupervised hashing method for cross-modal video-text retrieval with clip |
Non-Patent Citations (4)
Title |
---|
ASHISH VASWANI et al.: "Attention Is All You Need", ARXIV.ORG, 12 June 2017 (2017-06-12), pages 1 - 15 *
DING JIANG, MANG YE: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 22 August 2023 (2023-08-22), pages 2787 - 2797 *
JACOB DEVLIN et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", ARXIV.ORG, 24 May 2019 (2019-05-24), pages 1 - 16 *
ZHAO JINWEI et al.: "Research on a CLIP-model-based multimodal search tool for military-domain image resources", CHINESE JOURNAL OF MEDICAL LIBRARY AND INFORMATION SCIENCE, vol. 31, no. 08, 31 August 2022 (2022-08-31), pages 14 - 20 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116383671B (en) | Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment | |
Jiang et al. | Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval | |
Zhang et al. | HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization | |
Zeng et al. | Beyond ocr+ vqa: involving ocr into the flow for robust and accurate textvqa | |
Suo et al. | A simple and robust correlation filtering method for text-based person search | |
Harizi et al. | Convolutional neural network with joint stepwise character/word modeling based system for scene text recognition | |
CN113033438A (en) | Data feature learning method for modal imperfect alignment | |
Liu et al. | Facial attractiveness computation by label distribution learning with deep CNN and geometric features | |
CN114398681A (en) | Method and device for training privacy information classification model and method and device for identifying privacy information | |
Patel et al. | Abstractive information extraction from scanned invoices (AIESI) using end-to-end sequential approach | |
He et al. | Cross-modal retrieval by real label partial least squares | |
Lu et al. | Domain-aware se network for sketch-based image retrieval with multiplicative euclidean margin softmax | |
Sharma et al. | Multilevel attention and relation network based image captioning model | |
Fu et al. | Look back again: Dual parallel attention network for accurate and robust scene text recognition | |
Luo et al. | An efficient multi-scale channel attention network for person re-identification | |
CN116578734B (en) | Probability embedding combination retrieval method based on CLIP | |
Yang et al. | Facial expression recognition based on multi-dataset neural network | |
Zhang et al. | Transformer-based global–local feature learning model for occluded person re-identification | |
Hasnat et al. | Robust license plate signatures matching based on multi-task learning approach | |
Lin et al. | A deep learning based bank card detection and recognition method in complex scenes | |
Hao et al. | A lightweight attention-based network for micro-expression recognition | |
Sharma et al. | A framework for image captioning based on relation network and multilevel attention mechanism | |
Yang et al. | Robust feature mining transformer for occluded person re-identification | |
Zhang et al. | Multiplicative angular margin loss for text-based person search | |
Wu et al. | Naster: non-local attentional scene text recognizer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||