CN116383671A - Text-image cross-modal pedestrian retrieval method and system with implicit relation reasoning alignment

Text-image cross-modal pedestrian retrieval method and system with implicit relation reasoning alignment

Info

Publication number
CN116383671A
CN116383671A (application CN202310328349.7A; granted as CN116383671B)
Authority
CN
China
Prior art keywords
text
image
attention
layer
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310328349.7A
Other languages
Chinese (zh)
Other versions
CN116383671B (en
Inventor
Ye Mang (叶茫)
Jiang Ding (姜定)
Pan Sitian (潘思甜)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202310328349.7A priority Critical patent/CN116383671B/en
Publication of CN116383671A publication Critical patent/CN116383671A/en
Application granted granted Critical
Publication of CN116383671B publication Critical patent/CN116383671B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition
    • G06F 18/20 — Analysing
    • G06F 18/22 — Matching criteria, e.g. proximity measures
    • G06F 18/24 — Classification techniques
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/04 — Architecture, e.g. interconnection topology
    • G06N 3/0499 — Feedforward networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text-image cross-modal pedestrian retrieval method and system with implicit relation reasoning alignment. First, an image encoder and a text encoder are used to convert the pedestrian images to be processed and the corresponding text descriptions into feature vector representations through self-attention and cross-attention mechanisms; the global image features and the text features are aligned with an SDM loss function, and the positional relation of the two modalities in a common feature space is constructed. Then, a cross-modal visual-text interaction encoder implicitly mines fine-grained relations through masked language modeling to learn discriminative global features and perform fine-grained interaction. Finally, based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distributions of the N image-text pair features are optimized with KL divergence, and alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching. The invention achieves high efficiency in text-to-image pedestrian retrieval.

Description

Text-image cross-modal pedestrian retrieval method and system with implicit relation reasoning alignment
Technical Field
The invention belongs to the technical field of cross-modal pedestrian re-identification, relates to a text-image cross-modal pedestrian retrieval method and system, and in particular relates to a text-image cross-modal pedestrian retrieval method and system based on implicit relation reasoning alignment.
Background
In recent years, the task of text-to-image pedestrian retrieval has attracted increasing attention; it is widely applied in the public security field in scenarios where a target image cannot be obtained. Text-to-image pedestrian retrieval aims to retrieve, from a large-scale image database, the target person that best matches a given textual description, and is a comprehensive task combining image-text retrieval and pedestrian re-identification. The core problem of this task is how to map the two different modalities, text and images, into a common latent feature space.
Text-to-image pedestrian retrieval is extremely challenging due to the differences in internal features and the modal heterogeneity between the visual and language modalities. The visual appearance of the target pedestrian may be affected by many factors, such as pose, viewpoint and illumination, while the textual description may be affected by its description order and ambiguity. The cross-modal feature alignment problem caused by the modal gap between vision and language is the core research problem of this task. Researchers therefore need to explore better methods to obtain more discriminative feature representations and to design better cross-modal matching methods that align images and text in a joint feature space. This is one of the research hotspots of text-to-image pedestrian retrieval.
Early text-to-image pedestrian retrieval work used VGG and LSTM to learn representations of the visual and text modalities and aligned images and text in a joint feature space by designing cross-modal matching loss functions. "Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997."
Later work improved the feature extraction backbone using ResNet50/101 and BERT and designed new cross-modal projection matching losses for aligning global image-text features in the joint feature space. (1) Yucheng Chen, Rui Huang, Hong Chang, Chuanqi Tan, Tao Xue, and Bingpeng Ma. Cross-modal knowledge adaptation for language-based person search. IEEE Transactions on Image Processing, 30:4057-4069, 2021. (2) Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5814-5824, 2019. (3) Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pages 686-701, 2018.
Recent research has widely used additional local feature learning branches, and some work explicitly relies on external tools such as body segmentation, body part information, color information, and text phrase parsing. In addition, some work uses attention mechanisms for local feature learning. Although such local matching strategies improve retrieval performance, they inevitably introduce noise and increase the uncertainty of the retrieval process. The limitation of these efforts is that they do not exploit the recently popular visual-language pre-training models and therefore lack powerful cross-modal alignment capability.
Recently, some work has applied CLIP to text-to-image pedestrian retrieval, transferring knowledge from CLIP through a momentum contrastive learning framework or a fine-grained information mining framework. (1) Xiao Han, Sen He, Li Zhang, and Tao Xiang. Text-based person search with limited data. arXiv preprint arXiv:2110.10807, 2021. (2) Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. CLIP-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022.
However, these methods only use the image encoder of CLIP and fail to transfer the knowledge of the complete CLIP image-text encoders to the text-to-image pedestrian retrieval dataset, and thus fail to achieve optimal performance.
Disclosure of Invention
Aiming at problems in the prior art, such as the lack of correspondence between the multi-modal data of visual and textual features and the intra-modal information distortion caused by explicit local matching, the invention provides a text-image cross-modal pedestrian retrieval method and system based on implicit relation reasoning alignment.
The technical scheme adopted by the method of the invention is as follows: a text-image cross-modal pedestrian retrieval method with implicit relation reasoning alignment, comprising the following steps:
Step 1: using an image encoder and a text encoder respectively, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms; the global image features and the text features are aligned with an SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
the image encoder and the text encoder each comprise a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the multi-head self-attention layer feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input, randomly initializes a weight and a bias for each neuron, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
Step 2: using a cross-modal visual-text interaction encoder, fine-grained relations are implicitly mined through masked language modeling to learn discriminative global features, thereby performing fine-grained interaction;
the cross-modal visual-text interaction encoder comprises a cross-attention layer, a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the cross-attention layer splits the input into two parts: one part is used to generate the query matrix and the other to generate the key-value matrices; the query matrix is intended to learn the representation of each spatial position, while the key-value matrices are used to learn the correlations between different positions; the query matrix is then applied to the key-value matrices to obtain an attention matrix, which is used in the weighted summation over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V) to obtain the final feature representation;
the multi-head self-attention layer feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
Step 3: based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distribution of the N image-text pair features is compared with the normalized label matching distribution using KL divergence; alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching.
The technical scheme adopted by the system of the invention is as follows: a text-image cross-modal pedestrian retrieval system with implicit relation reasoning alignment, comprising the following modules:
a first module, configured to use an image encoder and a text encoder respectively to convert the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms, align the global image features and the text features with an SDM loss function, and construct the positional relation of the two modalities in a common feature space;
the image encoder and the text encoder each comprise a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the multi-head self-attention layer feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input, randomly initializes a weight and a bias for each neuron, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
a second module, configured to use the cross-modal visual-text interaction encoder to implicitly mine fine-grained relations through masked language modeling, so as to assist the image encoder and the text encoder in learning discriminative global features, thereby enhancing the retrieval performance of the text-to-image pedestrian retrieval system;
the cross-modal visual-text interaction encoder comprises a cross-attention layer, a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the cross-attention layer splits the input into two parts: one part is used to generate the query matrix and the other to generate the key-value matrices; the query matrix is intended to learn the representation of each spatial position, while the key-value matrices are used to learn the correlations between different positions; the query matrix is then applied to the key-value matrices to obtain an attention matrix, which is used in the weighted summation over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V) to obtain the final feature representation;
the multi-head self-attention layer feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
a third module, configured to, based on the image-text similarity distribution matching (SDM) loss, compare the cosine similarity distribution of the N image-text pair features with the normalized label matching distribution using KL divergence, and achieve alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby realizing cross-modal matching.
The invention has the advantages that:
1. a new cross-modal matching loss function is designed, and the image-text alignment capability can be remarkably improved.
2. The designed implicit relation reasoning module uses a masked language modeling (MLM) task to implicitly mine fine-grained relations, so as to assist the image encoder and the text encoder in learning discriminative global features, thereby enhancing the retrieval performance of the text-to-image pedestrian retrieval system without additional supervision and inference cost.
3. The knowledge of the general-purpose large image-text model CLIP is successfully transferred to the specific text-to-image pedestrian re-identification data, significantly improving the basic image-text alignment capability.
4. The performance of the proposed cross-modal implicit relationship inference alignment network (IRRA) on a plurality of public data sets is remarkably improved compared with the previous work, and the method is the most advanced text-to-image pedestrian re-identification method.
Drawings
FIG. 1 is a diagram of a cross-modal implicit relationship inference alignment network (IRRA) architecture in accordance with an embodiment of the present invention.
FIG. 2 is a block diagram of a text encoder according to an embodiment of the present invention;
FIG. 3 is a block diagram of an image encoder according to an embodiment of the present invention;
fig. 4 is a diagram of a visual text interactive encoder according to an embodiment of the present invention.
Detailed Description
To facilitate understanding and implementation of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the embodiments described herein are for illustration and explanation only and are not intended to limit the invention.
The present invention proposes a cross-modal implicit relation reasoning alignment network (IRRA) that enhances global image-text matching by learning and reasoning about relations between local visual and textual tokens, without additional supervision and inference cost.
The cross-modal implicit relationship inference alignment network (IRRA) of the present embodiment is comprised of an image encoder, a text encoder, and a cross-modal visual text interaction encoder.
Referring to fig. 1, the text-image cross-modal pedestrian retrieval method with implicit relation reasoning alignment provided by the invention comprises the following steps:
Step 1: using the image encoder and the text encoder respectively, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms; the global image features and the text features are aligned with the SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
referring to fig. 2 and 3, the image encoder and the text encoder of the present embodiment each include a multi-head self-attention layer, a residual connection layer, and a feedforward full connection layer;
the multi-head self-attention layer of the embodiment transmits the query vector, the key vector and the value vector to a plurality of independent attention heads respectively; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual connection layer of the embodiment adds a shortcut connection to the output of the multi-head self-attention layer of the network and directly connects to the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer; the shortcut connection in the residual connection layer refers to: the output of the input after passing through one multi-head self-attention layer is added with the output of the input without passing through the multi-head self-attention layer to obtain the final output. This shortcut addition operation is the implementation of a shortcut connection. The connection mode can avoid the degradation phenomenon of the deep neural network, so that the network is trained better.
The feedforward full-connection layer of the embodiment takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by its weight value, adds the same, and then adds the offset value to the result, which is a single number; this number is then passed to an activation function which maps it to another range and generates the final output.
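As a concrete illustration of the encoder block just described (multi-head self-attention with shortcut connections, layer normalization and a feed-forward layer), a minimal PyTorch sketch is given below. It is not the patent's exact implementation; the dimensions (embed_dim=512, num_heads=8) are assumptions taken from the experimental settings reported later, and the feed-forward width is illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal sketch of one encoder block: multi-head self-attention,
    shortcut (residual) connections and a feed-forward fully connected layer."""
    def __init__(self, embed_dim: int = 512, num_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim),  # per-neuron weights and bias
            nn.GELU(),                      # activation maps the value to another range
            nn.Linear(ffn_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h)    # queries, keys and values all come from x
        x = x + attn_out                    # shortcut: input added to the layer output
        x = x + self.ffn(self.ln2(x))       # second shortcut around the feed-forward layer
        return x

tokens = torch.randn(2, 77, 512)            # a batch of 2 sequences of 77 tokens
print(EncoderBlock()(tokens).shape)         # torch.Size([2, 77, 512])
```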
This embodiment initializes the backbone network with the image and text encoders of the CLIP model, in order to endow the text-to-image pedestrian retrieval model with basic image-text alignment capability.
The image encoder takes a given image as input and obtains image features using the CLIP pre-trained Vision Transformer (ViT). "Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020."
The image encoder splits a given input image into a sequence of fixed-size, non-overlapping image patches. The patch sequence is then mapped to the corresponding tokens through a trainable linear projection. The token sequence is fed into L Transformer blocks, and the correlation between the image patches is modeled with positional features and an additional classification token. Finally, the image patches are encoded into features with the same dimension as the text features to obtain a global image representation.
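A minimal sketch of the patch-splitting and linear projection step described above is given below; the patch size of 16 and the 384×128 input resolution are assumptions based on common ViT settings and the experimental section, not values fixed by this paragraph.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size non-overlapping patches and map each patch
    to a token embedding with a trainable linear projection (a strided convolution)."""
    def __init__(self, patch_size: int = 16, in_chans: int = 3, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)                # (B, D, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)  # (B, num_patches, D)

imgs = torch.randn(2, 3, 384, 128)           # resized pedestrian images
print(PatchEmbed()(imgs).shape)              # torch.Size([2, 192, 512]): 24 x 8 patches
```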
The CLIP text encoder is used to extract features of the input pedestrian text description and of the text description after random word masking. The encoder segments the input text with Byte Pair Encoding (BPE); the original text and the randomly masked text share the same text encoder. The extracted features of the original text description are linearly projected into the image-text joint feature space, and the feature at the [EOS] token is taken as the global text representation. For the randomly masked text, the extracted feature at each token is fused with the image token features through a cross-modal cross-attention mechanism, and the fused multi-modal features are fed into the implicit relation reasoning module to learn a masked word prediction task, which improves the model's ability to mine and align cross-modal fine-grained features. (1) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017. (2) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021.
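The random word masking applied to the text description can be sketched as follows; the 15% masking ratio and the BERT-style 80/10/10 replacement scheme are common defaults and are assumptions here, not values stated in this paragraph.

```python
import torch

def random_mask_tokens(token_ids: torch.Tensor, mask_id: int, vocab_size: int,
                       special_ids: set, p_mask: float = 0.15):
    """Return (masked_ids, labels): labels keep the original id at masked positions
    and -100 elsewhere, so unmasked positions are ignored by the prediction loss."""
    masked = token_ids.clone()
    labels = torch.full_like(token_ids, -100)
    for i, tid in enumerate(token_ids.tolist()):
        if tid in special_ids:               # never mask [SOS]/[EOS]/padding tokens
            continue
        if torch.rand(1).item() < p_mask:
            labels[i] = tid
            r = torch.rand(1).item()
            if r < 0.8:
                masked[i] = mask_id          # replace with the [MASK] token
            elif r < 0.9:
                masked[i] = torch.randint(vocab_size, (1,)).item()  # random token
            # otherwise keep the original token unchanged
    return masked, labels
```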
Step 2: using the cross-modal visual-text interaction encoder, fine-grained relations are implicitly mined through masked language modeling to learn discriminative global features, thereby performing fine-grained interaction;
Referring to fig. 4, the cross-modal visual-text interaction encoder of this embodiment comprises a cross-attention layer, a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the cross-attention layer of this embodiment splits the input into two parts: one part is used to generate the query matrix and the other to generate the key-value matrices; the query matrix is intended to learn the representation of each spatial position, while the key-value matrices are used to learn the correlations between different positions; the query matrix is then applied to the key-value matrices to obtain an attention matrix, which is used in the weighted summation over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V) to obtain the final feature representation;
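A minimal sketch of this cross-attention layer, with the masked-text features supplying the query matrix Q and the image features supplying the keys K and values V; the dimensions and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

text_feats = torch.randn(2, 77, 512)     # masked-text token features -> queries Q
image_feats = torch.randn(2, 193, 512)   # image token features -> keys K and values V

fused, attn = cross_attn(query=text_feats, key=image_feats, value=image_feats)
print(fused.shape)   # torch.Size([2, 77, 512]): one fused feature per text token
print(attn.shape)    # torch.Size([2, 77, 193]): attention matrix over image tokens
```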
the multi-head self-attention layer of this embodiment feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer of this embodiment adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer of this embodiment takes the output of the multi-head self-attention layer as input, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
the cross-modal visual-text interaction encoder of this embodiment learns discriminative global features by implicitly mining fine-grained relations through the masked language modeling task;
the specific implementation comprises the following substeps:
Step 2.1: the visual-text interaction encoder consists of a multi-head cross-attention layer and four Transformer blocks;
ĥ_i^m = MCA(LN(Q_i), LN(K_i), LN(V_i)),  (1)
{h_i^m}_{i=1}^{N} = Transformer(LN(ĥ_i^m)) ∈ R^{N×d},  (2)
where h_i^m denotes the contextual representation fusing the image and the masked text, LN(·) denotes layer normalization, and MCA(·) denotes the multi-head cross-attention mechanism; the superscript m indicates a feature representation at a masked text position, N denotes the total number of image-text representation pairs, and Transformer(·) denotes feeding the corresponding data into the Transformer to obtain its output; d denotes the feature dimension of the masked tokens, Q_i is the masked text feature, and K_i and V_i are image features;
Step 2.2: for each masked position ĥ_j^m, an MLP classifier is used to predict the probability of the corresponding original token y_j; |V| is the size of the vocabulary V; M denotes the set of masked text positions;
In the MLP classifier of this embodiment, the input vector passes through several fully connected layers; non-linear transformations and Dropout layers are added between the fully connected layers for regularization to prevent over-fitting, and a softmax function is added after the last fully connected layer to convert the network output into a probability distribution, so as to perform classification prediction of the masked text tokens;
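The masked-token classifier described above can be sketched as a small MLP head; the hidden width, dropout rate and the CLIP BPE vocabulary size of 49408 are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskedTokenClassifier(nn.Module):
    """MLP head: fully connected layers with a non-linearity and Dropout for
    regularization, followed by a softmax over the vocabulary."""
    def __init__(self, embed_dim: int = 512, vocab_size: int = 49408, p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Dropout(p_drop),
            nn.Linear(embed_dim, vocab_size),
        )

    def forward(self, fused_tokens: torch.Tensor) -> torch.Tensor:
        # probability distribution over the vocabulary for every token position
        return self.net(fused_tokens).softmax(dim=-1)
```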
Step 2.3: the objective function of IRR (Implicit Relation Reasoning) is obtained as
L_irr = -(1/|M|) Σ_{i∈M} Σ_{k=1}^{|V|} y_{i,k} log m_{i,k},  (3)
where M denotes the set of masked text tokens, m_i is the predicted token probability distribution, and y_i is the one-hot vector of the ground-truth token, in which the probability of the true token is 1.
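A sketch of the IRR objective above: cross-entropy between the predicted token distributions m_i and the one-hot ground-truth tokens y_i, averaged over the masked positions only. The -100 label convention matches the masking sketch earlier and is an assumption.

```python
import torch

def irr_loss(pred_probs: torch.Tensor, labels: torch.Tensor, eps: float = 1e-8):
    """pred_probs: (B, L, |V|) softmax outputs of the MLP classifier.
    labels: (B, L) original token ids at masked positions, -100 elsewhere."""
    mask = labels != -100                       # the set M of masked positions
    if mask.sum() == 0:
        return pred_probs.new_zeros(())
    probs = pred_probs[mask]                    # (|M|, |V|)
    targets = labels[mask]                      # (|M|,)
    # -log probability assigned to the true token (cross-entropy with one-hot y)
    nll = -torch.log(probs.gather(1, targets.unsqueeze(1)).squeeze(1) + eps)
    return nll.mean()
```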
Step 3: based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distribution of the N image-text pair features is compared with the normalized label matching distribution using KL divergence; alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching.
The specific implementation of step 3 in this embodiment comprises the following sub-steps:
Step 3.1: for each global image representation f_i^v, define the set of image-text representation pairs as {(f_i^v, f_j^t), y_i,j}, where N denotes the total number of image-text representation pairs; y_i,j is the true matching label: y_i,j = 1 means that (f_i^v, f_j^t) is a matched pair from the same identity, and y_i,j = 0 denotes an unmatched pair; let sim(u, v) = u^T v / (||u|| ||v||) denote the normalized dot product (i.e. cosine similarity) of u and v;
the probability p_i,j of a matched pair is computed with the following softmax function:
p_i,j = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ),
where τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_i,j is the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t in the mini-batch;
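A sketch of the matching probability p_i,j defined above: cosine similarities between all image and text global features in a mini-batch followed by a temperature-scaled softmax; τ = 0.02 follows the experimental settings reported later.

```python
import torch
import torch.nn.functional as F

def matching_probs(img_feats: torch.Tensor, txt_feats: torch.Tensor, tau: float = 0.02):
    """img_feats, txt_feats: (N, D) global features of N image-text pairs.
    Returns p of shape (N, N), where p[i, j] is the image-to-text matching probability."""
    img = F.normalize(img_feats, dim=-1)   # so that a dot product equals cosine similarity
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t()                    # sim[i, j] = sim(f_i^v, f_j^t)
    return F.softmax(sim / tau, dim=1)     # softmax over all texts for each image
```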
Step 3.2: compute the image-to-text SDM loss function L_i2t in a mini-batch:
L_i2t = KL(p_i ∥ q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_i,j log( p_i,j / (q_i,j + ε) ),  (4)
where ε is a small value used to avoid numerical overflow, q_i,j = y_i,j / Σ_{k=1}^{N} y_i,k is the true matching probability, and p_i ∥ q_i denotes the KL divergence from p_i to q_i;
Step 3.3: compute the bidirectional SDM loss function L_sdm:
L_sdm = L_i2t + L_t2i,
where i2t denotes matching from image to text and t2i denotes matching from text to image; the computation direction of the matching probability p_i,j differs between the two terms: L_i2t denotes the image-to-text matching loss, computed as shown in equation (4), and L_t2i denotes the text-to-image matching loss, obtained by exchanging the roles of f_i^v and f_j^t in equation (4);
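A sketch of the bidirectional SDM loss under the formulas above: the predicted matching distributions in both directions are compared with the normalized label matching distribution via KL divergence, with ε guarding against log(0). This is a minimal illustration, not the exact training code.

```python
import torch
import torch.nn.functional as F

def sdm_loss(img_feats, txt_feats, pids, tau: float = 0.02, eps: float = 1e-8):
    """pids: (N,) identity labels of the N image-text pairs in the mini-batch."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t() / tau

    y = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()  # y[i, j] = 1 for same identity
    q_i2t = y / y.sum(dim=1, keepdim=True)                # normalized label matching distribution
    q_t2i = y.t() / y.t().sum(dim=1, keepdim=True)

    p_i2t = F.softmax(sim, dim=1)                         # image -> text
    p_t2i = F.softmax(sim.t(), dim=1)                     # text -> image

    kl = lambda p, q: (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()
    return kl(p_i2t, q_i2t) + kl(p_t2i, q_t2i)            # L_sdm = L_i2t + L_t2i
```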
Step 3.4: alignment between image text similarity distribution and standardized label matching distribution is achieved by minimizing KL divergence, and cross-mode matching is achieved.
The cross-modal implicit relation reasoning alignment network of this embodiment is a trained network;
the objective function adopted during training is:
L = L_irr + L_sdm + L_id,
where L_irr denotes the objective function of the IRR module, L_sdm denotes the bidirectional SDM loss function, and L_id denotes the ID loss function;
L_id = - Σ_i y_i log( exp(s_i^v) / Σ_k exp(s_k^v) ) - Σ_i y_i log( exp(s_i^t) / Σ_k exp(s_k^t) ),
where s_i^v and s_i^t denote the logits output by the image and text classification networks for class i, respectively, and y denotes the ground-truth label (y_i = 1 for the true class and 0 otherwise).
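A sketch of the ID loss and the overall training objective L = L_irr + L_sdm + L_id described above; a single identity classifier shared by both modalities is a common design choice and is an assumption here, as is the number of training identities taken from the CUHK-PEDES statistics below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_identities = 11003                        # identities in the CUHK-PEDES training set
classifier = nn.Linear(512, num_identities)   # shared identity classification head (assumption)

def id_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor, pids: torch.Tensor):
    """Cross-entropy on the identity logits s^v and s^t of both modalities."""
    logits_v = classifier(img_feats)          # s^v: image logits over identity classes
    logits_t = classifier(txt_feats)          # s^t: text logits over identity classes
    return F.cross_entropy(logits_v, pids) + F.cross_entropy(logits_t, pids)

def total_loss(l_irr: torch.Tensor, l_sdm: torch.Tensor, l_id: torch.Tensor):
    return l_irr + l_sdm + l_id               # overall objective used during training
```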
This embodiment trains the IRRA framework with the ID loss function and the SDM loss function together with the IRR loss function. The ID loss function groups images or texts according to their corresponding identities and explicitly considers intra-modal distances, so that feature representations of the same image-text group are closer in the joint feature space. "Zhedong Zheng, Liang Zheng, Michael Garratt, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1-23, 2020."
The CUHK-PEDES dataset is used for training in this embodiment. CUHK-PEDES was the first dataset dedicated to text-to-image pedestrian retrieval, containing a total of 40206 images of 13003 identities and 80412 text descriptions. The training set includes 11003 identities, 34054 images and 68108 text descriptions; the validation set includes 1000 identities, 3078 images and 6158 text descriptions; the test set includes 1000 identities, 3074 images and 6156 text descriptions.
The method of the present application is further illustrated by specific experiments below.
In the experiments, the hidden size of each layer of the visual-text interaction encoder is set to 512 and the number of attention heads is set to 8. The dimension of all image and text representations is set to 512. All input images are resized to 384×128. The maximum length of the text sequence is set to 77. The learning rate is initialized to 1×10^-5 with cosine learning rate decay. For randomly initialized modules, the initial learning rate is set to 5×10^-5. The temperature parameter τ in the SDM loss is set to 0.02.
The experiments are performed on a single RTX 3090 24GB GPU using PyTorch. During training, random horizontal flipping, random cropping with padding, and random erasing are used to augment the image data. The model is trained with the Adam optimizer [23] for 60 epochs. At the beginning of training, the learning rate is linearly increased from 1×10^-6 to 1×10^-5 during 5 warm-up epochs.
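A sketch of the training schedule described above: Adam with 5 linear warm-up epochs from 1×10^-6 to 1×10^-5, followed by cosine decay over the remaining epochs. The `model` here is a placeholder, not the IRRA network itself.

```python
import math
import torch

model = torch.nn.Linear(512, 512)             # placeholder for the IRRA network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

warmup_epochs, total_epochs = 5, 60
warmup_start_lr, base_lr = 1e-6, 1e-5

def lr_at(epoch: int) -> float:
    if epoch < warmup_epochs:                 # linear warm-up
        return warmup_start_lr + (base_lr - warmup_start_lr) * epoch / warmup_epochs
    t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))   # cosine decay

for epoch in range(total_epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... run one training epoch here ...
```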
The widely used Rank-k accuracy is adopted as the main evaluation metric, i.e., the probability of finding at least one matching pedestrian image in the top-k candidate list when retrieving with a given text description. In addition, for a comprehensive evaluation, mean average precision (mAP) and mINP [51] are used as two additional retrieval criteria. Higher Rank-k, mAP and mINP indicate better performance.
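A sketch of the Rank-k metric described above: for each text query, the gallery images are ranked by similarity and a hit is counted when an image of the matching identity appears among the top k candidates; the tensor shapes in the usage example are illustrative.

```python
import torch

def rank_k_accuracy(sim: torch.Tensor, query_pids: torch.Tensor,
                    gallery_pids: torch.Tensor, k: int = 1) -> float:
    """sim: (num_queries, num_gallery) text-to-image similarity matrix."""
    topk = sim.topk(k, dim=1).indices                                  # k most similar images
    hits = (gallery_pids[topk] == query_pids.unsqueeze(1)).any(dim=1)  # match within top k?
    return hits.float().mean().item()

sim = torch.randn(100, 500)                   # 100 text queries vs. 500 gallery images
q_pids = torch.randint(0, 50, (100,))
g_pids = torch.randint(0, 50, (500,))
print(rank_k_accuracy(sim, q_pids, g_pids, k=10))
```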
In order to verify the effectiveness of the present invention, it is compared with existing state-of-the-art methods, which mainly include:
(1) ISANet: Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. arXiv preprint arXiv:2208.14365, 2022.
(2) LBUL: Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984-1992, 2022.
(3) SAF: Shiping Li, Min Cao, and Min Zhang. Learning semantic-aligned feature representation for text-based person search. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2724-2728. IEEE, 2022.
(4) TIPCB: Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, and Yuhui Zheng. TIPCB: A simple but effective part-based convolutional baseline for text-based person search. Neurocomputing, 494:171-181, 2022.
(5) CAIBC: Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. CAIBC: Capturing all-round information beyond color for text-based person retrieval. arXiv preprint arXiv:2209.05773, 2022.
(6) AXM-Net: Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. AXM-Net: Implicit cross-modal feature alignment for person re-identification. 36(4):4477-4485, 2022.
(7) LGUR: Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. arXiv preprint arXiv:2207.07802, 2022.
(8) IVT: Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. arXiv preprint arXiv:2208.08608, 2022.
(9) CFine: Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. CLIP-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022.
The results of the tests on the CUHK-PEDES dataset are shown in Table 1:
TABLE 1
[Table 1: comparison with state-of-the-art methods on the CUHK-PEDES dataset; the numerical results are provided as an image in the original publication.]
As can be seen from Table 1, all metrics of the proposed method are higher than those of existing methods, and the performance is significantly improved. There are two main reasons: 1. the implicit relation reasoning module used in the invention employs masked language modeling so that the model learns the fine-grained alignment of information between the image and text modalities, thereby realizing full cross-modal interaction; 2. the proposed similarity distribution matching loss effectively enlarges the variance between non-matched pairs and the correlation between matched pairs, and achieves alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby achieving cross-modal matching.
The innovation of the invention comprises:
1. an IRRA framework is presented that implicitly utilizes fine-grained interactions to enhance global alignment without additional supervision and reasoning costs.
2. A new cross-modal matching loss function, the image-text similarity distribution matching (SDM) loss, is designed to minimize the KL divergence between the image-text similarity distribution and the normalized label matching distribution.
3. An implicit relation reasoning module is designed, which uses masked language modeling so that the model learns the fine-grained alignment of information between the image and text modalities.
It should be understood that the foregoing description of the preferred embodiments is illustrative only; the scope of protection of the invention is defined by the claims, and those skilled in the art may make substitutions or modifications under the teaching of the invention without departing from the scope of the appended claims.

Claims (8)

1. A text-image cross-modal pedestrian retrieval method with implicit relation reasoning alignment, characterized by comprising the following steps:
Step 1: using an image encoder and a text encoder respectively, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms; the global image features and the text features are aligned with an SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
the image encoder and the text encoder each comprise a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the multi-head self-attention layer feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input, randomly initializes a weight and a bias for each neuron, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
Step 2: using a cross-modal visual-text interaction encoder, fine-grained relations are implicitly mined through masked language modeling to learn discriminative global features, thereby performing fine-grained interaction;
the cross-modal visual-text interaction encoder comprises a cross-attention layer, a multi-head self-attention layer, a residual connection layer and a feed-forward fully connected layer;
the cross-attention layer splits the input into two parts: one part is used to generate the query matrix and the other to generate the key-value matrices; the query matrix is intended to learn the representation of each spatial position, while the key-value matrices are used to learn the correlations between different positions; the query matrix is then applied to the key-value matrices to obtain an attention matrix, which is used in the weighted summation over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V) to obtain the final feature representation;
the multi-head self-attention layer feeds the query, key and value vectors to several independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the scores are normalized with a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and reduced in dimension by a linear transformation to obtain the final output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection to the multi-head self-attention layer of the network, connecting the input of the layer directly to its output; the shortcut-connected input and the output of the layer are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input, multiplies each input of a neuron by its weight, sums the products and adds the bias, yielding a single number; this number is then passed through an activation function, which maps it to another range and produces the final output;
Step 3: based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distribution of the N image-text pair features is compared with the normalized label matching distribution using KL divergence; alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching.
2. The text-image cross-modal pedestrian retrieval method with implicit relation reasoning alignment according to claim 1, wherein: in step 2, the cross-modal visual-text interaction encoder learns discriminative global features by implicitly mining fine-grained relations through a masked language modeling task;
the specific implementation comprises the following sub-steps:
Step 2.1: the visual-text interaction encoder consists of a multi-head cross-attention layer and four Transformer blocks;
ĥ_i^m = MCA(LN(Q_i), LN(K_i), LN(V_i)),  (1)
{h_i^m}_{i=1}^{N} = Transformer(LN(ĥ_i^m)) ∈ R^{N×d},  (2)
where h_i^m denotes the contextual representation fusing the image and the masked text, LN(·) denotes layer normalization, and MCA(·) denotes the multi-head cross-attention mechanism; the superscript m indicates a feature representation at a masked text position, N denotes the total number of image-text representation pairs, and Transformer(·) denotes feeding the corresponding data into the Transformer to obtain its output; d denotes the feature dimension of the masked tokens, Q_i is the masked text feature, and K_i and V_i are image features;
Step 2.2: for each masked position ĥ_j^m, an MLP classifier is used to predict the probability of the corresponding original token y_j; |V| is the size of the vocabulary V; M denotes the set of masked text positions;
in the MLP classifier, the input vector passes through several fully connected layers; non-linear transformations and Dropout layers are added between the fully connected layers for regularization to prevent over-fitting, and a softmax function is added after the last fully connected layer to convert the network output into a probability distribution, so as to perform classification prediction of the masked text tokens;
Step 2.3: the objective function of IRR is obtained as
L_irr = -(1/|M|) Σ_{i∈M} Σ_{k=1}^{|V|} y_{i,k} log m_{i,k},  (3)
where M denotes the set of masked text tokens, m_i is the predicted token probability distribution, and y_i is the one-hot vector of the ground-truth token, in which the probability of the true token is 1.
3. The text-image cross-modal pedestrian retrieval method with implicit relation reasoning alignment according to claim 1, wherein the specific implementation of step 3 comprises the following sub-steps:
Step 3.1: for each global image representation f_i^v, define the set of image-text representation pairs as {(f_i^v, f_j^t), y_i,j}, where N denotes the total number of image-text representation pairs; y_i,j is the true matching label: y_i,j = 1 means that (f_i^v, f_j^t) is a matched pair from the same identity, and y_i,j = 0 denotes an unmatched pair; let sim(u, v) = u^T v / (||u|| ||v||) denote the normalized dot product of u and v;
the probability p_i,j of a matched pair is computed with the following softmax function:
p_i,j = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ),
where τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_i,j is the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t in the mini-batch;
Step 3.2: compute the image-to-text SDM loss function L_i2t in a mini-batch:
L_i2t = KL(p_i ∥ q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_i,j log( p_i,j / (q_i,j + ε) ),  (4)
where ε is a small value used to avoid numerical overflow, q_i,j = y_i,j / Σ_{k=1}^{N} y_i,k is the true matching probability, and p_i ∥ q_i denotes the KL divergence from p_i to q_i;
Step 3.3: compute the bidirectional SDM loss function L_sdm:
L_sdm = L_i2t + L_t2i,
where i2t denotes matching from image to text and t2i denotes matching from text to image;
Step 3.4: alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching.
4. The text-image cross-modal pedestrian retrieval method with implicit relation reasoning alignment according to any one of claims 1-3, characterized in that: the image encoder, the text encoder and the cross-modal visual-text interaction encoder form a cross-modal implicit relation reasoning alignment network, and the cross-modal implicit relation reasoning alignment network is a trained network;
the objective function adopted during training is:
L = L_irr + L_sdm + L_id,
where L_irr denotes the objective function of the IRR module, L_sdm denotes the bidirectional SDM loss function, and L_id denotes the ID loss function;
L_id = - Σ_i y_i log( exp(s_i^v) / Σ_k exp(s_k^v) ) - Σ_i y_i log( exp(s_i^t) / Σ_k exp(s_k^t) ),
where s_i^v and s_i^t denote the logits output by the image and text classification networks for class i, respectively, and y denotes the ground-truth label.
5. A text image cross-modality pedestrian retrieval system with implicit relationship reasoning alignment comprising the following modules:
the first module is used for converting the pedestrian image to be processed and the corresponding text description into feature vector representation through a self-attention and cross-attention mechanism by utilizing an image encoder and a text encoder respectively, aligning the full image features and the text features through an SDM loss function, and constructing the position relation of the two modes in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, randomly distributes weight and bias to each neuron, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the bias value to the result, so that the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
The second module is used for implicitly mining fine granularity relations through mask masking modeling by utilizing the cross-mode visual text interactive encoder so as to assist the image encoder and the text encoder to learn global features with discrimination, thereby enhancing the retrieval performance of a pedestrian retrieval system from text to image;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention mechanism layer splits its input into two parts: one part is used to generate the query matrix and the other part is used to generate the key-value matrices; the query matrix is intended to learn the representation of each spatial location, while the key-value matrices are used to learn the correlations between different locations; the query matrix is then applied to the key matrix to obtain an attention matrix, which is used to compute a weighted summation over the inputs of the cross-attention mechanism, namely the masked text feature matrix Q and the image feature matrices K and V, thereby obtaining the final feature representation;
the multi-head self-attention layer feeds the query, key and value vectors into a plurality of independent attention heads; in each attention head, the dot product of the query and key vectors is scaled by the square root of the model feature dimension, the resulting scores are normalized by a softmax function to obtain weights, and the value vectors are weighted and summed with these weights to obtain the output of each attention head; the outputs of the attention heads are concatenated and projected back to the model dimension by a linear transformation, finally yielding the output of the multi-head self-attention layer;
the residual connection layer adds a shortcut connection that carries the input of the multi-head self-attention layer directly to its output; the shortcut branch and the layer output are added to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input; each neuron multiplies its inputs by the corresponding weights, sums the products, and adds its bias to obtain a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
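A minimal sketch of such a cross-attention layer, in which the masked text features act as the query Q and the image features supply K and V, could look like the following; the hyperparameters are illustrative assumptions, not values taken from the claim:

```python
import torch
import torch.nn as nn

class CrossAttentionLayer(nn.Module):
    """Cross-attention in which masked text tokens form the query Q and
    image tokens supply the key K and value V. Illustrative sizes only."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, text_q: torch.Tensor, image_kv: torch.Tensor) -> torch.Tensor:
        # softmax(Q K^T / sqrt(d)) weights the image values V internally;
        # a shortcut connection keeps the masked-text stream.
        kv = self.norm_kv(image_kv)
        out, _ = self.mca(self.norm_q(text_q), kv, kv)
        return text_q + out

masked_text = torch.randn(2, 77, 512)    # masked caption tokens
image_feats = torch.randn(2, 197, 512)   # image patch tokens
print(CrossAttentionLayer()(masked_text, image_feats).shape)  # [2, 77, 512]
```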
and a third module, configured to, based on the image-text similarity distribution matching (SDM) loss, optimize the cosine similarity distributions of the N image-text pair features with the KL divergence, and to achieve alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby achieving cross-modal matching.
6. The text-image cross-modal pedestrian retrieval system with implicit relation reasoning alignment according to claim 5, wherein: the cross-modal visual-text interaction encoder in the second module learns discriminative global features by implicitly mining fine-grained relations through a masked-text modeling task;
The specific implementation comprises the following sub-modules:
a module 2.1, which implements the visual-text interaction encoder, consisting of a multi-head cross-attention layer and four Transformer blocks:
$$\hat{h}^{m}=\mathrm{LN}\big(\mathrm{MCA}(Q,K,V)\big)+Q,\qquad \mathrm{MCA}(Q,K,V)=\mathrm{softmax}\!\Big(\frac{QK^{\top}}{\sqrt{d}}\Big)V$$
$$h^{m}=\mathrm{Transformer}\big(\mathrm{LN}(\hat{h}^{m})\big)$$
wherein $\hat{h}^{m}$ and $h^{m}$ denote representations that fuse the image and the masked text, the superscript $m$ indicating a feature representation at a masked text position; LN(·) denotes layer normalization; MCA(·) denotes the multi-head cross-attention mechanism; Transformer(·) denotes feeding the corresponding data into the Transformer blocks to obtain an output; $N$ denotes the total number of image-text representation pairs; $d$ denotes the feature dimension of the mask token; $Q$ is the masked text feature, and $K$ and $V$ are the image features;
a module 2.2, configured to, for each masked position $i\in\mathcal{M}$, predict the probability distribution $m_{i}\in\mathbb{R}^{|\mathcal{V}|}$ over the corresponding original token using an MLP classifier, where $|\mathcal{V}|$ is the size of the vocabulary $\mathcal{V}$ and $\mathcal{M}$ denotes the set of masked text tokens;
wherein, in the MLP classifier, the input vector passes through a plurality of fully connected layers, nonlinear transformations and Dropout layers are inserted between the fully connected layers for regularization to prevent overfitting, and a softmax function is applied after the last fully connected layer to convert the network output into a probability distribution, thereby performing classification prediction of the masked text tokens;
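A minimal sketch of such an MLP classification head is shown below; the hidden width, Dropout rate and CLIP-style vocabulary size of 49408 are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class MLMHead(nn.Module):
    """Stack of fully connected layers with a non-linearity and Dropout,
    projecting masked-position features onto the vocabulary; a softmax
    turns the outputs into a probability distribution over tokens."""
    def __init__(self, dim: int = 512, vocab_size: int = 49408,
                 p_drop: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                 nn.Dropout(p_drop),
                                 nn.Linear(dim, vocab_size))

    def forward(self, masked_feats: torch.Tensor) -> torch.Tensor:
        return self.net(masked_feats).softmax(dim=-1)

feats = torch.randn(5, 512)              # features at 5 masked positions
probs = MLMHead()(feats)
print(probs.shape, probs.sum(dim=-1))    # [5, 49408], each row sums to ~1
```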
a module 2.3, configured to obtain the objective function $\mathcal{L}_{irr}$ of the IRR:
$$\mathcal{L}_{irr}=-\frac{1}{|\mathcal{M}|}\sum_{i\in\mathcal{M}} y_{i}^{\top}\log m_{i}$$
wherein $\mathcal{M}$ denotes the set of masked text tokens, $m_{i}$ is the predicted token probability distribution, and $y_{i}$ is the one-hot vector of the real label, in which the probability of the real label is 1.
7. The text-image cross-modal pedestrian retrieval system with implicit relation reasoning alignment according to claim 5, wherein the third module comprises the following sub-modules:
a first sub-module, configured to, for each image global representation $f_{i}^{v}$, define the set of image-text representation pairs as $\{(f_{i}^{v},f_{j}^{t}),y_{i,j}\}_{j=1}^{N}$, wherein $N$ denotes the total number of image-text representation pairs; $y_{i,j}$ is the true matching label, $y_{i,j}=1$ meaning that $(f_{i}^{v},f_{j}^{t})$ is a matched pair from the same identity, and $y_{i,j}=0$ denoting a non-matched pair; let $\mathrm{sim}(u,v)=u^{\top}v/(\lVert u\rVert\,\lVert v\rVert)$ denote the normalized dot product of $u$ and $v$;
the probability $p_{i,j}$ of a matched pair is calculated using the following softmax function:
$$p_{i,j}=\frac{\exp\big(\mathrm{sim}(f_{i}^{v},f_{j}^{t})/\tau\big)}{\sum_{k=1}^{N}\exp\big(\mathrm{sim}(f_{i}^{v},f_{k}^{t})/\tau\big)}$$
wherein $\tau$ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability $p_{i,j}$ is the ratio of the exponentiated, temperature-scaled cosine similarity between $f_{i}^{v}$ and $f_{j}^{t}$ to the sum of such similarities between $f_{i}^{v}$ and all text representations $f_{k}^{t}$ in the mini-batch;
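A short sketch of this temperature-scaled softmax over cosine similarities is given below; the temperature value 0.02 and the feature dimension are assumptions made only for the example:

```python
import torch
import torch.nn.functional as F

def match_probabilities(img_feats: torch.Tensor, txt_feats: torch.Tensor,
                        tau: float = 0.02) -> torch.Tensor:
    """p[i, j]: temperature-scaled softmax over the cosine similarities
    between image i and every text j in the mini-batch."""
    img = F.normalize(img_feats, dim=-1)   # unit norm -> dot product = cosine
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t()                    # (N, N) cosine similarity matrix
    return torch.softmax(sim / tau, dim=1)

p = match_probabilities(torch.randn(4, 512), torch.randn(4, 512))
print(p.shape, p.sum(dim=1))               # [4, 4], each row sums to 1
```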
a second sub-module, configured to calculate the image-to-text SDM loss function $\mathcal{L}_{i2t}$ within a mini-batch:
$$\mathcal{L}_{i2t}=\mathrm{KL}(\mathbf{p}_{i}\,\Vert\,\mathbf{q}_{i})=\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}p_{i,j}\log\frac{p_{i,j}}{q_{i,j}+\epsilon}$$
wherein $\epsilon$ is a small value used to avoid the numerical overflow problem, $q_{i,j}$ is the true matching probability, and $\mathbf{p}_{i}\,\Vert\,\mathbf{q}_{i}$ denotes the KL divergence from $\mathbf{p}_{i}$ to $\mathbf{q}_{i}$;
a third sub-module, configured to calculate the bidirectional SDM loss function $\mathcal{L}_{sdm}$:
$$\mathcal{L}_{sdm}=\mathcal{L}_{i2t}+\mathcal{L}_{t2i}$$
wherein i2t denotes matching from the image to the text direction, and t2i denotes matching from the text to the image direction;
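The bidirectional loss can be sketched as follows, assuming identity labels are available to build the true matching distribution q; the batch averaging and epsilon handling are illustrative choices rather than the claimed formulation verbatim:

```python
import torch
import torch.nn.functional as F

def sdm_loss(img_feats: torch.Tensor, txt_feats: torch.Tensor,
             pids: torch.Tensor, tau: float = 0.02,
             eps: float = 1e-8) -> torch.Tensor:
    """Bidirectional SDM sketch: KL divergence between the predicted matching
    distribution p and the normalized true-match distribution q, evaluated in
    both the image-to-text and text-to-image directions."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    sim = img @ txt.t() / tau                         # (N, N) scaled cosines

    labels = (pids.unsqueeze(1) == pids.unsqueeze(0)).float()  # y[i, j]
    q = labels / labels.sum(dim=1, keepdim=True)      # true matching distribution
    # pid-based labels are symmetric, so q also serves the t2i direction.

    p_i2t = torch.softmax(sim, dim=1)
    p_t2i = torch.softmax(sim.t(), dim=1)
    l_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(1).mean()
    l_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(1).mean()
    return l_i2t + l_t2i

loss = sdm_loss(torch.randn(8, 512), torch.randn(8, 512),
                torch.randint(0, 4, (8,)))
print(loss)
```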
and a fourth sub-module, configured to achieve alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby achieving cross-modal matching.
8. The text-image cross-modal pedestrian retrieval system with implicit relation reasoning alignment according to any one of claims 5-7, wherein: the cross-modal visual-text interaction encoder and the image and text encoders form a cross-modal implicit relation reasoning alignment network, and the cross-modal implicit relation reasoning alignment network is a trained network;
the functions adopted in the training process are as follows:
$$\mathcal{L}=\mathcal{L}_{irr}+\mathcal{L}_{sdm}+\mathcal{L}_{id}$$
wherein $\mathcal{L}_{irr}$ represents the objective function of the IRR module, $\mathcal{L}_{sdm}$ represents the bidirectional SDM loss function, and $\mathcal{L}_{id}$ represents the ID loss function;
$$\mathcal{L}_{id}=-\sum_{i}\Big(y_{i}\log\frac{e^{p_{i}^{v}}}{\sum_{j}e^{p_{j}^{v}}}+y_{i}\log\frac{e^{p_{i}^{t}}}{\sum_{j}e^{p_{j}^{t}}}\Big)$$
wherein $p_{i}^{v}$ and $p_{i}^{t}$ represent the logits output by the image and text classification networks for category $i$, respectively, and $y$ represents the real (one-hot) label, with $y_{i}$ its $i$-th component.
CN202310328349.7A 2023-03-27 2023-03-27 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment Active CN116383671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310328349.7A CN116383671B (en) 2023-03-27 2023-03-27 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment


Publications (2)

Publication Number Publication Date
CN116383671A true CN116383671A (en) 2023-07-04
CN116383671B CN116383671B (en) 2024-05-28

Family

ID=86980048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310328349.7A Active CN116383671B (en) 2023-03-27 2023-03-27 Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment

Country Status (1)

Country Link
CN (1) CN116383671B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022261570A1 (en) * 2021-08-04 2022-12-15 Innopeak Technology, Inc. Cross-attention system and method for fast video-text retrieval task with image clip
WO2023004206A1 (en) * 2021-08-04 2023-01-26 Innopeak Technology, Inc. Unsupervised hashing method for cross-modal video-text retrieval with clip
CN114663677A (en) * 2022-04-08 2022-06-24 杭州电子科技大学 Visual question answering method based on cross-modal pre-training feature enhancement
CN114926835A (en) * 2022-05-20 2022-08-19 京东科技控股股份有限公司 Text generation method and device, and model training method and device
CN115033670A (en) * 2022-06-02 2022-09-09 西安电子科技大学 Cross-modal image-text retrieval method with multi-granularity feature fusion
CN115311389A (en) * 2022-08-05 2022-11-08 西北大学 Multi-mode visual prompting technology representation learning method based on pre-training model
CN115292533A (en) * 2022-08-17 2022-11-04 苏州大学 Cross-modal pedestrian retrieval method driven by visual positioning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI et al.: "Attention Is All You Need", ARXIV.ORG, 12 June 2017 (2017-06-12), pages 1 - 15 *
DING JIANG, MANG YE: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 22 August 2023 (2023-08-22), pages 2787 - 2797 *
JACOB DEVLIN et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", ARXIV.ORG, 24 May 2019 (2019-05-24), pages 1 - 16 *
ZHAO Jinwei et al.: "Research on a CLIP-model-based multimodal search tool for military-domain image resources", Chinese Journal of Medical Library and Information Science, vol. 31, no. 08, 31 August 2022 (2022-08-31), pages 14 - 20 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152142A (en) * 2023-10-30 2023-12-01 菲特(天津)检测技术有限公司 Bearing defect detection model construction method and system
CN117152142B (en) * 2023-10-30 2024-02-02 菲特(天津)检测技术有限公司 Bearing defect detection model construction method and system
CN117312592A (en) * 2023-11-28 2023-12-29 云南联合视觉科技有限公司 Text-pedestrian image retrieval method based on modal invariant feature learning
CN117312592B (en) * 2023-11-28 2024-02-09 云南联合视觉科技有限公司 Text-pedestrian image retrieval method based on modal invariant feature learning
CN117612071A (en) * 2024-01-23 2024-02-27 中国科学技术大学 Video action recognition method based on transfer learning
CN117612071B (en) * 2024-01-23 2024-04-19 中国科学技术大学 Video action recognition method based on transfer learning
CN118038497A (en) * 2024-04-10 2024-05-14 四川大学 SAM-based text information driven pedestrian retrieval method and system
CN118114124A (en) * 2024-04-26 2024-05-31 武汉大学 Text-guided controllable portrait generation method, system and equipment based on diffusion model
CN118170936A (en) * 2024-05-08 2024-06-11 齐鲁工业大学(山东省科学院) Multi-mode data and relation enhancement-based pedestrian shielding retrieval method
CN118170938A (en) * 2024-05-12 2024-06-11 西北工业大学 Information guiding target searching method based on cross-modal self-evolution knowledge generalization
CN118170938B (en) * 2024-05-12 2024-08-23 西北工业大学 Information guiding target searching method based on cross-modal self-evolution knowledge generalization
CN118411739A (en) * 2024-07-02 2024-07-30 江西财经大学 Visual language pedestrian re-recognition network method and system based on dynamic attention

Also Published As

Publication number Publication date
CN116383671B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN116383671B (en) Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
Jiang et al. Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval
Zhang et al. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization
Zeng et al. Beyond ocr+ vqa: involving ocr into the flow for robust and accurate textvqa
Suo et al. A simple and robust correlation filtering method for text-based person search
Harizi et al. Convolutional neural network with joint stepwise character/word modeling based system for scene text recognition
CN113033438A (en) Data feature learning method for modal imperfect alignment
Liu et al. Facial attractiveness computation by label distribution learning with deep CNN and geometric features
CN114398681A (en) Method and device for training privacy information classification model and method and device for identifying privacy information
Patel et al. Abstractive information extraction from scanned invoices (AIESI) using end-to-end sequential approach
He et al. Cross-modal retrieval by real label partial least squares
Lu et al. Domain-aware se network for sketch-based image retrieval with multiplicative euclidean margin softmax
Sharma et al. Multilevel attention and relation network based image captioning model
Fu et al. Look back again: Dual parallel attention network for accurate and robust scene text recognition
Luo et al. An efficient multi-scale channel attention network for person re-identification
CN116578734B (en) Probability embedding combination retrieval method based on CLIP
Yang et al. Facial expression recognition based on multi-dataset neural network
Zhang et al. Transformer-based global–local feature learning model for occluded person re-identification
Hasnat et al. Robust license plate signatures matching based on multi-task learning approach
Lin et al. A deep learning based bank card detection and recognition method in complex scenes
Hao et al. A lightweight attention-based network for micro-expression recognition
Sharma et al. A framework for image captioning based on relation network and multilevel attention mechanism
Yang et al. Robust feature mining transformer for occluded person re-identification
Zhang et al. Multiplicative angular margin loss for text-based person search
Wu et al. Naster: non-local attentional scene text recognizer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant