CN116383671A - Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment - Google Patents
Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment
- Publication number
- CN116383671A CN116383671A CN202310328349.7A CN202310328349A CN116383671A CN 116383671 A CN116383671 A CN 116383671A CN 202310328349 A CN202310328349 A CN 202310328349A CN 116383671 A CN116383671 A CN 116383671A
- Authority
- CN
- China
- Prior art keywords
- text
- image
- attention
- layer
- cross
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 32
- 230000006870 function Effects 0.000 claims abstract description 63
- 230000000007 visual effect Effects 0.000 claims abstract description 29
- 230000007246 mechanism Effects 0.000 claims abstract description 27
- 230000003993 interaction Effects 0.000 claims abstract description 15
- 230000000873 masking effect Effects 0.000 claims abstract description 13
- 238000005065 mining Methods 0.000 claims abstract description 11
- 239000011159 matrix material Substances 0.000 claims description 49
- 230000002452 interceptive effect Effects 0.000 claims description 15
- 210000002569 neuron Anatomy 0.000 claims description 14
- 230000009466 transformation Effects 0.000 claims description 13
- 230000004913 activation Effects 0.000 claims description 10
- 230000009467 reduction Effects 0.000 claims description 10
- 238000012549 training Methods 0.000 claims description 8
- 239000003550 marker Substances 0.000 claims description 4
- 230000008569 process Effects 0.000 claims description 4
- 230000002708 enhancing effect Effects 0.000 claims description 3
- 238000010606 normalization Methods 0.000 claims description 3
- 238000012216 screening Methods 0.000 claims description 3
- 230000013016 learning Effects 0.000 description 17
- 238000010586 diagram Methods 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 3
- 238000012545 processing Methods 0.000 description 3
- 238000011160 research Methods 0.000 description 3
- 230000006978 adaptation Effects 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000010365 information processing Effects 0.000 description 2
- 238000010801 machine learning Methods 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 238000012360 testing method Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000000605 extraction Methods 0.000 description 1
- 238000005286 illumination Methods 0.000 description 1
- 230000007787 long-term memory Effects 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000001537 neural effect Effects 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000001629 suppression Effects 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 238000010200 validation analysis Methods 0.000 description 1
- 230000031836 visual learning Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0499—Feedforward networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a text-image cross-modal pedestrian retrieval method and system with implicit relation reasoning alignment. First, an image encoder and a text encoder are used to convert the pedestrian image to be processed and its corresponding text description into feature vector representations through self-attention and cross-attention mechanisms; the global image features and the text features are aligned through an SDM loss function, constructing the positional relation of the two modalities in a common feature space. Then, a cross-modal visual-text interaction encoder implicitly mines fine-grained relations through masked language modeling to learn discriminative global features for fine-grained interaction. Finally, based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distributions of the N image-text pair features are optimized using KL divergence, and alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby realizing cross-modal matching. The invention achieves high efficiency in text-to-image pedestrian recognition.
Description
Technical Field
The invention belongs to the technical field of cross-mode pedestrian re-recognition, relates to a text image cross-mode pedestrian retrieval method and system, and in particular relates to a text image cross-mode pedestrian retrieval method and system based on implicit relation reasoning alignment.
Background
In recent years, the task of text-to-image pedestrian retrieval has attracted more and more attention; it is widely applied in public security scenarios where an image of the target cannot be obtained. Text-to-image pedestrian retrieval aims to retrieve, from a large-scale image database, the target person that best matches a given text description, and is a comprehensive task integrating image-text retrieval and pedestrian re-identification. The core problem of this task is how to map the two different data modalities, text and images, into a common latent feature space.
Text-to-image pedestrian retrieval is extremely challenging due to the differences in internal features and the modal heterogeneity between vision and language. The visual appearance of the target pedestrian may be affected by many factors, such as pose, viewing angle, and illumination, while the text description may be affected by word order and ambiguity. The cross-modal feature alignment problem caused by the modal gap between vision and language is the core research question of this task. Researchers therefore need to explore better methods for obtaining more discriminative feature representations and to design better cross-modal matching methods that align images and text in a joint feature space. This is one of the research hotspots of text-to-image pedestrian retrieval.
Early text-to-image pedestrian retrieval efforts utilized VGG and LSTM to learn representations of the visual and text modalities and aligned images and text in a joint feature space by designing cross-modal matching loss functions. "Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation 9(8):1735-1780, 1997."
Later work improved the feature extraction backbone networks using ResNet50/101 and BERT and designed a new cross-modal projection matching loss for aligning global image-text features in the joint feature space. (1) "Yucheng Chen, Rui Huang, Hong Chang, Chuanqi Tan, Tao Xue, and Bingpeng Ma. Cross-modal knowledge adaptation for language-based person search. IEEE Transactions on Image Processing, 30:4057-4069, 2021." (2) "Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. Adversarial representation learning for text-to-image matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5814-5824, 2019." (3) "Ying Zhang and Huchuan Lu. Deep cross-modal projection learning for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), 2018."
Recent research has widely utilized additional local feature learning branches, and some works explicitly use external tools such as body segmentation, body part information, color information, and text phrase segmentation. Some works also use attention mechanisms to perform local feature learning; although this local matching strategy improves retrieval performance, it inevitably introduces noise and increases uncertainty in the retrieval process. The limitation of these efforts is that they do not utilize the recently popular visual-language pre-trained models and therefore lack powerful cross-modal alignment capabilities.
Some recent works apply CLIP to text-to-image pedestrian retrieval, enabling knowledge transfer from CLIP through a momentum contrastive learning framework or a fine-grained information mining framework. (1) "Xiao Han, Sen He, Li Zhang, and Tao Xiang. Text-based person search with limited data. arXiv preprint arXiv:2110.10807, 2021." (2) "Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. CLIP-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022."
However, these methods use only the image encoder of CLIP and fail to migrate the knowledge of the complete CLIP image-text encoder pair to the text-to-image pedestrian retrieval datasets, and thus fail to achieve optimal performance.
Disclosure of Invention
Aiming at the problems of lack of corresponding relation between multi-modal data of visual-text characteristics, intra-modal information distortion caused by explicit local matching and the like in the prior art, the invention provides a text image cross-modal pedestrian retrieval method and system based on implicit relation reasoning alignment.
The technical scheme adopted by the method is as follows: a cross-modal pedestrian retrieval method for text images with implicit relation reasoning alignment comprises the following steps:
step 1: the image encoder and the text encoder are respectively utilized, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms, the global image features and the text features are aligned through the SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
the image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
The multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, randomly distributes weight and bias to each neuron, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the bias value to the result, so that the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
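As an illustrative sketch only (not the claimed implementation; the dimensions, the random weight initialization, and the ReLU activation are assumptions for demonstration), the encoder layer described above — multi-head self-attention, a shortcut addition, and a feed-forward layer — can be expressed as:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads):
    # x: (seq_len, d_model); per-head projection matrices are omitted for
    # brevity — each head attends over its own slice of the features.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        q = k = v = x[:, h * d_head:(h + 1) * d_head]
        # scale the query-key dot products by the square root of the head
        # feature dimension, normalize the scores with softmax to obtain
        # weights, then take a weighted sum of the value vectors
        attn = softmax(q @ k.T / np.sqrt(d_head))
        heads.append(attn @ v)
    # splice the head outputs together; a final linear map would follow here
    return np.concatenate(heads, axis=-1)

def encoder_layer(x, num_heads=4, seed=0):
    # residual (shortcut) connection: add the layer input to its output
    h = x + multi_head_self_attention(x, num_heads)
    # feed-forward layer: multiply inputs by (randomly initialized) weights,
    # add a bias, then pass the result through an activation function
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(h.shape[1], h.shape[1])) * 0.02
    b = np.zeros(h.shape[1])
    return h + np.maximum(h @ w + b, 0.0)
```

In this sketch the residual addition and the softmax normalization follow the standard Transformer layer layout that the claims describe.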
Step 2: implicitly mining the fine-grained relations by using a cross-modal visual-text interaction encoder through masked language modeling so as to learn discriminative global features, thereby carrying out fine-grained interaction;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention mechanism layer splits the input into two parts: one for generating a query matrix and the other for generating a key-value matrix; the query matrix is intended to learn the representation of each spatial location, while the key-value matrix is used to learn the correlation between different locations; the query matrix is then applied to the key-value matrix to obtain an attention matrix, which is used to compute a weighted sum over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V), thereby obtaining the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
step 3: based on the image-text similarity distribution matching (SDM) loss, optimizing the cosine similarity distributions of the N image-text pair features using KL divergence, and realizing alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby realizing cross-modal matching.
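The SDM objective of step 3 can be illustrated with the following hedged sketch (the temperature `tau` and the exact form of the matching distribution are illustrative assumptions, not the patented formula):

```python
import numpy as np

def sdm_loss(img_feats, txt_feats, labels, tau=0.02, eps=1e-8):
    """Similarity-distribution-matching sketch: a KL divergence between the
    softmax over image-text cosine similarities and the normalized
    label-matching distribution; minimizing it aligns the two."""
    # L2-normalize so that dot products are cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sim = img @ txt.T                          # (N, N) cosine similarities
    p = np.exp(sim / tau)
    p = p / p.sum(axis=1, keepdims=True)       # predicted matching distribution
    match = (labels[:, None] == labels[None, :]).astype(float)
    q = match / match.sum(axis=1, keepdims=True)  # normalized label distribution
    kl = (p * np.log((p + eps) / (q + eps))).sum(axis=1)
    return float(kl.mean())
```

The `eps` term guards the logarithm where the label distribution assigns zero probability; a production loss would typically also symmetrize over the text-to-image direction.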
The system of the invention adopts the technical proposal that: a text image cross-modality pedestrian retrieval system with implicit relationship reasoning alignment, comprising the following modules:
the first module is used for converting the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms by utilizing an image encoder and a text encoder respectively, aligning the global image features and the text features through an SDM loss function, and constructing the positional relation of the two modalities in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, randomly distributes weight and bias to each neuron, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the bias value to the result, so that the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
The second module is used for implicitly mining fine-grained relations through masked language modeling by utilizing the cross-modal visual-text interactive encoder so as to assist the image encoder and the text encoder in learning discriminative global features, thereby enhancing the retrieval performance of a text-to-image pedestrian retrieval system;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention mechanism layer splits the input into two parts: one for generating a query matrix and the other for generating a key-value matrix; the query matrix is intended to learn the representation of each spatial location, while the key-value matrix is used to learn the correlation between different locations; the query matrix is then applied to the key-value matrix to obtain an attention matrix, which is used to compute a weighted sum over the inputs of the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V), thereby obtaining the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
and the third module is used for, based on the image-text similarity distribution matching (SDM) loss, optimizing the cosine similarity distributions of the N image-text pair features using KL divergence, and realizing alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, so as to realize cross-modal matching.
The invention has the advantages that:
1. a new cross-modal matching loss function is designed, and the image-text alignment capability can be remarkably improved.
2. The designed implicit relation reasoning module utilizes a masked language modeling task to implicitly mine fine-grained relations so as to assist the image encoder and the text encoder in learning discriminative global features, thereby enhancing the retrieval performance of a text-to-image pedestrian retrieval system without additional supervision and inference cost.
3. And the knowledge of the general image-text large model CLIP is successfully transferred to the special text-to-image pedestrian re-identification data, so that the basic alignment capability of the image text is remarkably improved.
4. The performance of the proposed cross-modal implicit relation reasoning alignment network (IRRA) on multiple public datasets is remarkably improved compared with previous work, making it the most advanced text-to-image pedestrian re-identification method.
Drawings
FIG. 1 is a diagram of a cross-modal implicit relationship inference alignment network (IRRA) architecture in accordance with an embodiment of the present invention.
FIG. 2 is a block diagram of a text encoder according to an embodiment of the present invention;
FIG. 3 is a block diagram of an image encoder according to an embodiment of the present invention;
fig. 4 is a diagram of a visual text interactive encoder according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the invention by those of ordinary skill in the art, the invention is described in further detail below with reference to the drawings and embodiments; it should be understood that the embodiments described herein are for illustration and explanation only and are not intended to limit the invention.
The present invention proposes a cross-modal implicit relationship inference alignment network (IRRA) that enhances global image-text matching by learning and inferring relationships between local visual-text labels, and without additional supervision and inference costs.
The cross-modal implicit relationship inference alignment network (IRRA) of the present embodiment is comprised of an image encoder, a text encoder, and a cross-modal visual text interaction encoder.
Referring to fig. 1, the method for searching the text image cross-mode pedestrians with the aligned implicit relation reasoning provided by the invention comprises the following steps:
step 1: the image encoder and the text encoder are respectively utilized, the pedestrian image to be processed and the corresponding text description are converted into feature vector representations through self-attention and cross-attention mechanisms, the global image features and the text features are aligned through the SDM loss function, and the positional relation of the two modalities in a common feature space is constructed;
referring to fig. 2 and 3, the image encoder and the text encoder of the present embodiment each include a multi-head self-attention layer, a residual connection layer, and a feedforward full connection layer;
the multi-head self-attention layer of the embodiment transmits the query vector, the key vector and the value vector to a plurality of independent attention heads respectively; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual connection layer of the embodiment adds a shortcut connection to the output of the multi-head self-attention layer of the network and directly connects to the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer; the shortcut connection in the residual connection layer refers to: the output of the input after passing through one multi-head self-attention layer is added with the output of the input without passing through the multi-head self-attention layer to obtain the final output. This shortcut addition operation is the implementation of a shortcut connection. The connection mode can avoid the degradation phenomenon of the deep neural network, so that the network is trained better.
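The shortcut addition described in this paragraph reduces to a one-line sketch (illustrative only):

```python
import numpy as np

def with_shortcut(sublayer, x):
    """Residual (shortcut) connection: the input bypasses the sub-layer and
    is added to the sub-layer's output, so even when the sub-layer
    contributes nothing the identity mapping is preserved, which helps deep
    networks avoid the degradation phenomenon during training."""
    return x + sublayer(x)
```

For example, a sub-layer that outputs all zeros leaves the input unchanged, which is exactly the property that makes very deep stacks trainable.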
The feedforward full-connection layer of the embodiment takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by its weight value, adds the same, and then adds the offset value to the result, which is a single number; this number is then passed to an activation function which maps it to another range and generates the final output.
The present embodiment employs an image and text encoder of the CLIP model to initialize the model backbone network in order to enhance the image text alignment capabilities of the text-to-image pedestrian retrieval model base.
The image encoder takes a given image as input and uses the CLIP pre-trained Vision Transformer (ViT) to obtain image features. "Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020."
The image encoder segments a given input image into a sequence of non-overlapping, fixed-size image patches. The patch sequence is then mapped onto the corresponding tokens by a trainable linear projection. The token sequence is input into L layers of Transformer blocks, and the correlation between the image patches is modeled using their positional features and an additional marker token. Finally, the image patches are encoded as features having the same dimension as the text features to obtain a global image representation.
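A minimal sketch of the patch-splitting and projection steps (the 16-pixel patch size, the 512-dimensional token size, and the random projection are illustrative assumptions, not the patented values):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into a sequence of non-overlapping,
    fixed-size patches, each flattened for the linear projection."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    # reshape so that index [r, i, c, j, ch] = image[r*patch+i, c*patch+j, ch]
    return (image.reshape(rows, patch, cols, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(rows * cols, patch * patch * c))

def embed_patches(patches, d_model=512, seed=0):
    """Map each flattened patch onto a token via a trainable linear
    projection (here randomly initialized), then add position features."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(patches.shape[1], d_model)) * 0.02
    tokens = patches @ proj
    positions = rng.normal(size=tokens.shape) * 0.02  # positional features
    return tokens + positions
```

For a 224x224 RGB image and 16-pixel patches this yields 196 tokens, matching the usual ViT setup.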
The CLIP text encoder is used to extract features of the input pedestrian text description and of the text description after randomly masking words. The encoder uses byte pair encoding (BPE) to tokenize the input text; the original text and the randomly masked text share the same text feature encoder. The extracted original text description features are linearly projected into the image-text joint feature space, and the feature at the [EOS] token is taken as the global text representation. For the randomly masked text, the extracted features at each token are fused with the features at the image tokens through a cross-modal cross-attention mechanism, and the fused multi-modal features are sent to the implicit relation reasoning module to learn a masked-word prediction task, improving the model's ability to mine and align cross-modal fine-grained features. (1) "Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017." (2) "Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748-8763. PMLR, 2021."
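The random word masking applied before the shared text encoder can be sketched as follows (the `[MASK]` symbol and the 15% default rate are assumptions borrowed from standard masked language modeling, not necessarily the values used by the invention):

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol

def mask_words(tokens, mask_ratio=0.15, seed=42):
    """Randomly replace a fraction of text tokens with a mask token,
    returning the masked sequence plus the positions (and original words)
    that the masked-word prediction task must recover."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            masked.append(MASK_TOKEN)
            targets[i] = tok  # the original word is the prediction target
        else:
            masked.append(tok)
    return masked, targets
```

The returned `targets` dictionary plays the role of the supervision signal: the implicit relation reasoning module predicts each original word from the fused image-text features at the masked positions.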
Step 2: implicitly mining the fine-grained relations by using the cross-modal visual-text interaction encoder through masked language modeling so as to learn discriminative global features, thereby carrying out fine-grained interaction;
please refer to fig. 4, the cross-modal visual text interactive encoder of the present embodiment includes a cross-attention mechanism layer, a multi-head self-attention layer, a residual connection layer and a feed-forward full connection layer;
the cross-attention mechanism layer of this embodiment splits the input into two parts: one for generating a query matrix and the other for generating a key-value matrix; the query matrix is intended to learn the representation of each spatial location, while the key-value matrix is used to learn the correlation between different locations; the query matrix is then applied to the key-value matrix to obtain an attention matrix, which is used in the weighted summation of the feature matrices input to the cross-attention mechanism (the masked-text feature matrix Q and the image feature matrices K and V) to obtain the final feature representation;
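A minimal sketch of this cross-attention layer, with the masked-text features as queries and the image features serving as both keys and values (all dimensions are illustrative):

```python
import numpy as np

def cross_attention(text_q, image_kv):
    """Masked-text feature matrix Q attends over image features (K = V):
    the attention matrix provides, per text token, weights for a sum over
    the image value vectors, yielding a fused text-image representation."""
    d = text_q.shape[1]
    scores = text_q @ image_kv.T / np.sqrt(d)          # (L_text, L_image)
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn = e / e.sum(axis=1, keepdims=True)            # one distribution per token
    return attn @ image_kv                             # (L_text, d)
```

The output has one fused vector per masked-text token, which is exactly the shape the subsequent multi-head self-attention layer expects.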
the multi-head self-attention layer of the embodiment transmits the query vector, the key vector and the value vector to a plurality of independent attention heads respectively; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual connection layer of this embodiment adds a shortcut connection that carries the input of the multi-head self-attention layer directly to its output; the shortcut output and the layer output are added to obtain the final output of the layer;
the feed-forward fully connected layer of this embodiment takes the output of the multi-head self-attention layer as input; each neuron multiplies its inputs by their weights, sums the products, and adds a bias value, yielding a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
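The cross-attention, multi-head self-attention, residual-connection and feed-forward layers described above can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: pre-norm residual blocks, a ReLU standing in for the unnamed activation function, and no learnable layer-norm parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def multi_head(x_q, x_kv, Wq, Wk, Wv, Wo, n_heads):
    # project, split into heads, scale dot products by sqrt(d_head),
    # softmax-normalize, weight-sum the values, concatenate heads, project down
    (Lq, d), Lkv = x_q.shape, x_kv.shape[0]
    dh = d // n_heads
    split = lambda x, W, L: (x @ W).reshape(L, n_heads, dh).transpose(1, 0, 2)
    Q, K, V = split(x_q, Wq, Lq), split(x_kv, Wk, Lkv), split(x_kv, Wv, Lkv)
    att = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh)) @ V   # (heads, Lq, dh)
    return att.transpose(1, 0, 2).reshape(Lq, d) @ Wo

def interaction_layer(text_masked, image_feats, p, n_heads=2):
    # cross-attention: masked text supplies the queries Q,
    # image features supply the keys K and values V
    h = text_masked + multi_head(layer_norm(text_masked), layer_norm(image_feats),
                                 p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)
    # multi-head self-attention over the fused sequence, with residual shortcut
    h = h + multi_head(layer_norm(h), layer_norm(h),
                       p["Wq2"], p["Wk2"], p["Wv2"], p["Wo2"], n_heads)
    # feed-forward fully connected layer: weighted sum plus bias, then activation
    ff = np.maximum(layer_norm(h) @ p["W1"] + p["b1"], 0.0)
    return h + ff @ p["W2"] + p["b2"]
```

The parameter dictionary p holds randomly initialized projection matrices in this sketch; in the described system they are trained end to end.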
the cross-modal visual-text interactive encoder of this embodiment learns discriminative global features by implicitly mining fine-grained relations through the masked-language modeling task;
the specific implementation comprises the following substeps:
step 2.1: the visual text interactive encoder consists of a multi-head cross attention layer and four layers of Transformer blocks;
wherein h represents the fused image and masked-text contextual representation, computed as h = Transformer(LN(MCA(LN(T^m), LN(V), LN(V)))), where LN(·) represents layer normalization, MCA(·) represents the multi-head cross-attention mechanism, and Transformer(·) represents inputting the corresponding data into the Transformer blocks to obtain an output; the superscript m indicates a feature representation at the masked text, N represents the total number of image-text representation pairs, and d represents the feature dimension of the mask token; T^m ∈ R^{N×d} is the masked text feature serving as the query Q, and V ∈ R^{N×d} is the image feature serving as the key K and the value V;
step 2.2: for each masked position i ∈ M, an MLP classifier is used to predict the probability distribution m_i ∈ R^{|V|} of the corresponding original token, where |V| is the size of the vocabulary V and M represents the set of masked text positions;
in the MLP classifier of this embodiment, the input vector passes through several fully connected layers; non-linear transformations and Dropout layers are inserted between the fully connected layers for regularization to prevent overfitting, and a softmax function follows the last fully connected layer to convert the network output into a probability distribution, thereby performing classification prediction of the masked text word;
Wherein the loss is the cross-entropy L_irr = -(1/|M|) Σ_{i∈M} y_i^T log(m_i), where M represents the set of masked text tokens, m_i is the predicted token probability distribution, and y_i is the one-hot vector of the real token, in which the probability of the real token is 1.
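A minimal sketch of the MLP classifier and masked-token prediction loss of steps 2.1-2.2, under the assumptions of a single hidden layer with ReLU (the Dropout regularization mentioned above is omitted, being active only during training) and integer token ids in place of one-hot label vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp_predict(h_masked, W1, b1, W2, b2):
    # fully connected layers with a non-linearity in between; the final softmax
    # converts the network output into a probability distribution per position
    z = np.maximum(h_masked @ W1 + b1, 0.0)
    return softmax(z @ W2 + b2)          # one distribution over the vocabulary per token

def irr_loss(probs, target_ids):
    # cross-entropy between predicted distributions m_i and one-hot labels y_i,
    # averaged over the set M of masked positions
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.mean(np.log(picked + 1e-12)))
```

Because the labels are one-hot, the dot product y_i^T log(m_i) reduces to picking the log-probability of the true token, which is what the indexing in irr_loss does.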
Step 3: based on the image-text similarity distribution matching (SDM) loss, the cosine similarity distributions of the N image-text feature pairs are compared with the normalized label matching distribution through the KL divergence, and alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby achieving cross-modal matching.
The specific implementation of step 3 in this embodiment includes the following sub-steps:
step 3.1: for each image global representation f_i^v, define the set of image-text representation pairs as {(f_i^v, f_j^t), y_{i,j}}, j = 1, ..., N, wherein N represents the total number of image-text representation pairs; y_{i,j} is a true matching label, y_{i,j} = 1 meaning that (f_i^v, f_j^t) is a matched pair from the same identity, and y_{i,j} = 0 representing a non-matched pair; let sim(u, v) = u^T v / (||u|| ||v||) represent the normalized dot product of u and v (i.e., cosine similarity);
the probability p_{i,j} of a matched pair is calculated using the following softmax function: p_{i,j} = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ), wherein τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_{i,j} is thus, within a mini-batch, the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t;
Wherein ε is a small value to avoid numerical overflow, q_{i,j} = y_{i,j} / Σ_{k=1}^{N} y_{i,k} is the true matching probability, and KL(p_i || q_i) represents the KL divergence from p_i to q_i, so that the image-to-text loss is L_i2t = KL(p_i || q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} log(p_{i,j} / (q_{i,j} + ε)); (4)
Wherein i2t represents matching from the image to the text direction, and t2i represents matching from the text to the image direction; the calculation of the matching-pair probability p_{i,j} differs in direction: L_i2t represents the loss function for matching from image to text, calculated as shown in equation (4), while L_t2i represents the loss function for matching from text to image, obtained from equation (4) by exchanging the roles of the image and text features; the overall SDM loss is L_sdm = L_i2t + L_t2i;
Step 3.4: alignment between the image-text similarity distribution and the normalized label matching distribution is achieved by minimizing the KL divergence, thereby achieving cross-modal matching.
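The SDM computation of steps 3.1-3.4 can be sketched as follows. One simplifying assumption is made explicit in the code: since the label matrix y is symmetric, the same normalized matching distribution q serves both the image-to-text and text-to-image directions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-8):
    # KL(p || q) per row, averaged over the mini-batch; eps avoids overflow
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def sdm_loss(f_img, f_txt, pids, tau=0.02):
    # cosine similarity = dot product of L2-normalized features
    f_img = f_img / np.linalg.norm(f_img, axis=1, keepdims=True)
    f_txt = f_txt / np.linalg.norm(f_txt, axis=1, keepdims=True)
    sim = f_img @ f_txt.T                               # N x N similarity matrix
    y = (pids[:, None] == pids[None, :]).astype(float)  # y_ij = 1 for same identity
    q = y / y.sum(axis=1, keepdims=True)                # normalized label matching distribution
    p_i2t = softmax(sim / tau, axis=1)                  # image-to-text matching probabilities
    p_t2i = softmax(sim.T / tau, axis=1)                # text-to-image matching probabilities
    return kl_div(p_i2t, q) + kl_div(p_t2i, q)          # bi-directional SDM loss
```

When image and text features of the same identity coincide, the predicted distributions collapse onto the label distribution and the loss approaches zero; misaligned pairs drive it up.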
The cross-modal implicit relationship reasoning alignment network of the embodiment is a trained network;
the functions adopted in the training process are as follows:
L = L_irr + L_sdm + L_id, wherein L represents the overall training objective, L_irr represents the objective function of the IRR module, L_sdm represents the bi-directional SDM loss function, and L_id represents the ID loss function;
wherein the ID loss is the cross-entropy L_id = -(1/N) Σ_{i=1}^{N} [ y_i^T log softmax(s_i^v) + y_i^T log softmax(s_i^t) ], in which s_i^v and s_i^t respectively represent the logits output by the image and text classification networks for category i, and y represents the real label.
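As a sketch, the ID loss reduces to cross-entropy between each modality's identity-classification logits and the shared identity label. The classifier producing the logits is assumed, not specified here; integer labels stand in for one-hot vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def id_loss(img_logits, txt_logits, labels):
    # cross-entropy of the image and text classification logits against
    # the shared identity labels, summed over the two modalities
    def ce(logits, y):
        p = softmax(logits, axis=1)
        return float(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))
    return ce(img_logits, labels) + ce(txt_logits, labels)
```

Because both modalities are classified against the same identity label, minimizing this loss pulls same-identity image and text features together in the joint space.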
The present embodiment trains the IRRA framework using the ID loss function and the SDM loss function together with the IRR loss function. The ID loss groups images or texts according to their corresponding identities and explicitly considers intra-modal distances, so that feature representations of the same image-text group lie closer in the joint feature space. Zhedong Zheng, Liang Zheng, Michael Garrett, Yi Yang, Mingliang Xu, and Yi-Dong Shen. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1-23, 2020.
The CUHK-PEDES dataset was used in this example during training. The CUHK-PEDES dataset was the first dataset dedicated to text-to-image pedestrian retrieval, containing a total of 40206 images of 13003 people and 80412 text descriptions. The training set includes 11003 characters, 34054 images, and 68108 text descriptions; the validation set includes 1000 people, 3078 images, and 6158 text descriptions; the test set includes 1000 people, 3074 images, and 6156 text descriptions.
The method of the present application is further illustrated by specific experiments below.
In the experiments, the hidden size of each layer of the visual-text interactive encoder is set to 512 and the number of attention heads is set to 8. The dimension of all image and text representations is set to 512. All input images are resized to 384×128. The maximum length of the text sequence is set to 77. The learning rate is initialized to 1×10⁻⁵ with cosine learning-rate decay. For randomly initialized modules, the initial learning rate is set to 5×10⁻⁵. The temperature parameter τ in the SDM loss is set to 0.02.
Experiments were performed on a single RTX 3090 24GB GPU using PyTorch. During training, random horizontal flipping, random cropping with padding, and random erasing are used to augment the image data. The model is trained with the Adam optimizer [23] for 60 epochs. Initially, the learning rate is linearly increased from 1×10⁻⁶ to 1×10⁻⁵ over 5 warm-up epochs.
The common Rank-k metric is adopted as the main evaluation index, namely: the probability that at least one matching pedestrian image appears in the top-k candidate list when retrieving with a given text description. In addition, for comprehensive evaluation, mean average precision (mAP) and mINP [51] are used as two further retrieval criteria. Higher Rank-k, mAP and mINP indicate better performance.
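The Rank-k and mAP metrics described above can be computed from the text-to-image similarity matrix as below. This is a common implementation of the metrics, not code from the patent; mINP is omitted.

```python
import numpy as np

def evaluate(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    # sim[i, j]: similarity of text query i to gallery image j
    order = np.argsort(-sim, axis=1)                       # rank gallery by similarity
    matches = gallery_ids[order] == query_ids[:, None]     # relevance flags in rank order
    rank_k = {k: float(matches[:, :k].any(axis=1).mean()) for k in ks}
    aps = []
    for row in matches:                                    # average precision per query
        hits = np.where(row)[0]
        if len(hits) == 0:
            aps.append(0.0)                                # no relevant image in gallery
        else:
            aps.append(((np.arange(len(hits)) + 1) / (hits + 1)).mean())
    return rank_k, float(np.mean(aps))                     # Rank-k dict and mAP
```

Rank-k asks whether any relevant image lands in the top k, while mAP averages the precision at every relevant position, rewarding rankings that place all matches early.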
In order to verify the effectiveness of the present invention, it is compared with the existing state-of-the-art methods, mainly including:
(1) ISANet: Shuanglin Yan, Hao Tang, Liyan Zhang, and Jinhui Tang. Image-specific information suppression and implicit local alignment for text-based person search. arXiv preprint arXiv:2208.14365, 2022.
(2) LBUL: Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. Look before you leap: Improving text-based person retrieval by learning a consistent cross-modal common manifold. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1984–1992, 2022.
(3) SAF: Shiping Li, Min Cao, and Min Zhang. Learning semantic-aligned feature representation for text-based person search. In ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2724–2728. IEEE, 2022.
(4) TIPCB: Yuhao Chen, Guoqing Zhang, Yujiang Lu, Zhenxing Wang, and Yuhui Zheng. TIPCB: A simple but effective part-based convolutional baseline for text-based person search. Neurocomputing, 494:171–181, 2022.
(5) CAIBC: Zijie Wang, Aichun Zhu, Jingyi Xue, Xili Wan, Chao Liu, Tian Wang, and Yifeng Li. CAIBC: Capturing all-round information beyond color for text-based person retrieval. arXiv preprint arXiv:2209.05773, 2022.
(6) AXM-Net: Ammarah Farooq, Muhammad Awais, Josef Kittler, and Syed Safwan Khalid. AXM-Net: Implicit cross-modal feature alignment for person re-identification. 36(4):4477–4485, 2022.
(7) LGUR: Zhiyin Shao, Xinyu Zhang, Meng Fang, Zhifeng Lin, Jian Wang, and Changxing Ding. Learning granularity-unified representations for text-to-image person re-identification. arXiv preprint arXiv:2207.07802, 2022.
(8) IVT: Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, and Xiao Wang. See finer, see more: Implicit modality alignment for text-based person retrieval. arXiv preprint arXiv:2208.08608, 2022.
(9) CFine: Shuanglin Yan, Neng Dong, Liyan Zhang, and Jinhui Tang. CLIP-driven fine-grained text-image person re-identification. arXiv preprint arXiv:2210.10276, 2022.
The results of the tests on the CUHK-PEDES dataset are shown in Table 1:
TABLE 1
As can be seen from table 1: the method has the advantages that all indexes are higher than those of the existing method, and the performance is obviously improved. There are two main reasons: 1. the implicit relation reasoning module used in the invention utilizes mask masking modeling to enable the model to learn the alignment relation of fine granularity information among image text modes, thereby realizing full cross-mode interaction. 2. The similarity distribution matching loss provided by the invention effectively expands the variance between non-matching pairs and the correlation between matching pairs, and realizes the alignment between the image text similarity distribution and the standardized label matching distribution by minimizing KL divergence, thereby achieving the effect of cross-modal matching.
The innovation of the invention comprises:
1. an IRRA framework is presented that implicitly utilizes fine-grained interactions to enhance global alignment without additional supervision and reasoning costs.
2. A new cross-modal matching loss function, namely the image-text Similarity Distribution Matching (SDM) loss, is designed to minimize the KL divergence between the image-text similarity distribution and the normalized label matching distribution.
3. An implicit relation reasoning module is designed, and mask modeling is utilized to enable the model to learn the alignment relation of fine granularity information between the image and the text mode.
It should be understood that the foregoing description of the preferred embodiments is illustrative and is not intended to limit the scope of the invention, which is defined by the appended claims; those skilled in the art may make substitutions or modifications without departing from the scope of the claims.
Claims (8)
1. The cross-modal pedestrian retrieval method for the text images with the implicit relation reasoning alignment is characterized by comprising the following steps of:
step 1: using an image encoder and a text encoder respectively, converting the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms, aligning the global image features and text features through the SDM loss function, and constructing the positional relationship of the two modalities in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input and assigns each neuron a randomly initialized weight and bias; each neuron multiplies its inputs by the weights, sums the products, and adds the bias value, yielding a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
Step 2: implicitly mining the fine granularity relation by using a cross-modal visual text interaction encoder through mask masking modeling so as to learn global features with discrimination, thereby carrying out fine granularity interaction;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention layer splits the input into two parts: one generates the query matrix and the other generates the key and value matrices; the query matrix learns a representation for each spatial location, while the key and value matrices capture the correlations between different locations; the query matrix is then applied to the key matrix to obtain an attention matrix, which is used in the weighted summation of the masked text feature matrix Q and the image feature matrices K and V input to the cross-attention mechanism, so as to obtain the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
step 3: based on the image-text similarity distribution matching (SDM) loss, comparing the cosine similarity distributions of the N image-text feature pairs with the normalized label matching distribution through the KL divergence, and achieving alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, thereby achieving cross-modal matching.
2. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval method of claim 1, wherein: step 2, the cross-modal visual text interactive encoder learns the global feature with discriminant by implicitly mining fine granularity relations through a mask masking modeling task;
The specific implementation comprises the following substeps:
step 2.1: the visual text interactive encoder consists of a multi-head cross attention layer and four layers of Transformer blocks;
wherein h represents the fused image and masked-text contextual representation, computed as h = Transformer(LN(MCA(LN(T^m), LN(V), LN(V)))), where LN(·) represents layer normalization, MCA(·) represents the multi-head cross-attention mechanism, and Transformer(·) represents inputting the corresponding data into the Transformer blocks to obtain an output; the superscript m indicates a feature representation at the masked text, N represents the total number of image-text representation pairs, and d represents the feature dimension of the mask token; T^m ∈ R^{N×d} is the masked text feature serving as the query Q, and V ∈ R^{N×d} is the image feature serving as the key K and the value V;
step 2.2: for each masked position i ∈ M, an MLP classifier is used to predict the probability distribution m_i ∈ R^{|V|} of the corresponding original token, where |V| is the size of the vocabulary V and M represents the set of masked text positions;
the MLP classifier is characterized in that an input vector passes through a plurality of full-connection layers, nonlinear transformation and a Dropout layer are added between the full-connection layers to perform regularization so as to prevent overfitting, a softmax function is added after the last full-connection layer, and the output of a network is converted into probability distribution, so that classified prediction of a shielded text word is performed;
3. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval method of claim 1, wherein the specific implementation of step 3 includes the sub-steps of:
step 3.1: for each image global representation f_i^v, define the set of image-text representation pairs as {(f_i^v, f_j^t), y_{i,j}}, j = 1, ..., N, wherein N represents the total number of image-text representation pairs; y_{i,j} is a true matching label, y_{i,j} = 1 meaning that (f_i^v, f_j^t) is a matched pair from the same identity, and y_{i,j} = 0 representing a non-matched pair; let sim(u, v) = u^T v / (||u|| ||v||) represent the normalized dot product of u and v;
the probability p_{i,j} of a matched pair is calculated using the following softmax function: p_{i,j} = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ), wherein τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_{i,j} is thus, within a mini-batch, the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t;
Wherein ε is a small value to avoid numerical overflow, q_{i,j} = y_{i,j} / Σ_{k=1}^{N} y_{i,k} is the true matching probability, and KL(p_i || q_i) represents the KL divergence from p_i to q_i, so that the image-to-text loss is L_i2t = KL(p_i || q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} log(p_{i,j} / (q_{i,j} + ε));
Where i2t represents a match from the image to the text direction, and t2i represents a match from the text to the image direction;
step 3.4: alignment between image text similarity distribution and standardized label matching distribution is achieved by minimizing KL divergence, and cross-mode matching is achieved.
4. A method for cross-modal pedestrian retrieval of text images aligned by implicit relationship reasoning according to any one of claims 1-3, characterized in that: the cross-modal visual text interaction encoder and the visual text interaction encoder form a cross-modal implicit relation reasoning alignment network, and the cross-modal implicit relation reasoning alignment network is a trained network;
the functions adopted in the training process are as follows:
L = L_irr + L_sdm + L_id, wherein L represents the overall training objective, L_irr represents the objective function of the IRR module, L_sdm represents the bi-directional SDM loss function, and L_id represents the ID loss function;
5. A text image cross-modality pedestrian retrieval system with implicit relationship reasoning alignment comprising the following modules:
the first module is used for converting the pedestrian image to be processed and the corresponding text description into feature vector representations through self-attention and cross-attention mechanisms, using an image encoder and a text encoder respectively, aligning the global image features and the text features through an SDM loss function, and constructing the positional relationship of the two modalities in a common feature space;
The image encoder and the text encoder comprise a multi-head self-attention layer, a residual error connecting layer and a feedforward full connecting layer;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
the residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feed-forward fully connected layer takes the output of the multi-head self-attention layer as input and assigns each neuron a randomly initialized weight and bias; each neuron multiplies its inputs by the weights, sums the products, and adds the bias value, yielding a single number; this number is then passed to an activation function, which maps it to another range and produces the final output;
The second module is used for implicitly mining fine granularity relations through mask masking modeling by utilizing the cross-mode visual text interactive encoder so as to assist the image encoder and the text encoder to learn global features with discrimination, thereby enhancing the retrieval performance of a pedestrian retrieval system from text to image;
the cross-modal visual text interactive encoder comprises a cross-attention mechanism layer, a multi-head self-attention layer, a residual error connection layer and a feedforward full connection layer;
the cross-attention layer splits the input into two parts: one generates the query matrix and the other generates the key and value matrices; the query matrix learns a representation for each spatial location, while the key and value matrices capture the correlations between different locations; the query matrix is then applied to the key matrix to obtain an attention matrix, which is used in the weighted summation of the masked text feature matrix Q and the image feature matrices K and V input to the cross-attention mechanism, so as to obtain the final feature representation;
the multi-head self-attention layer respectively transmits the query vector, the key vector and the value vector to a plurality of independent attention heads; in each attention head, scaling the dot product of the query vector and the key vector by the square root of the model feature dimension, normalizing the scores through a softmax function to obtain weights, and weighting and summing each value vector by using the weights to obtain the output of each attention head; splicing the outputs of a plurality of attention heads together, and performing dimension reduction through linear transformation to finally obtain the output of a multi-head self-attention layer;
The residual error connection layer is used for adding a shortcut connection to the output of the multi-head self-attention layer of the network and directly connecting the output of the layer; adding the output connected with the shortcut and the output of the layer to obtain the final output of the layer;
the feedforward full-connection layer takes the output of the multi-head self-attention layer as input, multiplies the input of each neuron by the weight value thereof, adds the weight values, and then adds the offset value to the result, and the result is a single number; this number is then passed to an activation function which maps it to another range and generates the final output;
and the third module is used for, based on the image-text similarity distribution matching (SDM) loss, comparing the cosine similarity distributions of the N image-text feature pairs with the normalized label matching distribution through the KL divergence, and achieving alignment between the image-text similarity distribution and the normalized label matching distribution by minimizing the KL divergence, so as to achieve cross-modal matching.
6. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval system of claim 5 wherein: the cross-modal visual text interactive encoder in the second module learns the global features with discriminant by implicitly mining fine-grained relationships through a mask masking modeling task;
The specific implementation comprises the following sub-modules:
the module 2.1 is used for the visual text interactive encoder and consists of a multi-head cross attention layer and four layers of Transformer blocks;
wherein h represents the fused image and masked-text contextual representation, computed as h = Transformer(LN(MCA(LN(T^m), LN(V), LN(V)))), where LN(·) represents layer normalization, MCA(·) represents the multi-head cross-attention mechanism, and Transformer(·) represents inputting the corresponding data into the Transformer blocks to obtain an output; the superscript m indicates a feature representation at the masked text, N represents the total number of image-text representation pairs, and d represents the feature dimension of the mask token; T^m ∈ R^{N×d} is the masked text feature serving as the query Q, and V ∈ R^{N×d} is the image feature serving as the key K and the value V;
a module 2.2, for predicting, for each masked position i ∈ M, the probability distribution m_i ∈ R^{|V|} of the corresponding original token using an MLP classifier, where |V| is the size of the vocabulary V and M represents the set of masked text positions;
the MLP classifier is characterized in that an input vector passes through a plurality of full-connection layers, nonlinear transformation and a Dropout layer are added between the full-connection layers to perform regularization so as to prevent overfitting, a softmax function is added after the last full-connection layer, and the output of a network is converted into probability distribution, so that classified prediction of a shielded text word is performed;
7. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval system of claim 5 wherein the third module includes the following sub-modules:
a first sub-module, for defining, for each image global representation f_i^v, the set of image-text representation pairs as {(f_i^v, f_j^t), y_{i,j}}, j = 1, ..., N, wherein N represents the total number of image-text representation pairs; y_{i,j} is a true matching label, y_{i,j} = 1 meaning that (f_i^v, f_j^t) is a matched pair from the same identity, and y_{i,j} = 0 representing a non-matched pair; let sim(u, v) = u^T v / (||u|| ||v||) represent the normalized dot product of u and v;
the probability p_{i,j} of a matched pair is calculated using the following softmax function: p_{i,j} = exp(sim(f_i^v, f_j^t) / τ) / Σ_{k=1}^{N} exp(sim(f_i^v, f_k^t) / τ), wherein τ is a temperature hyperparameter controlling the peak of the probability distribution; the matching probability p_{i,j} is thus, within a mini-batch, the ratio of the cosine similarity between f_i^v and f_j^t to the sum of the cosine similarities between f_i^v and all f_k^t;
Wherein ε is a small value to avoid numerical overflow, q_{i,j} = y_{i,j} / Σ_{k=1}^{N} y_{i,k} is the true matching probability, and KL(p_i || q_i) represents the KL divergence from p_i to q_i, so that the image-to-text loss is L_i2t = KL(p_i || q_i) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{N} p_{i,j} log(p_{i,j} / (q_{i,j} + ε));
Where i2t represents a match from the image to the text direction, and t2i represents a match from the text to the image direction;
and the fourth sub-module is used for realizing alignment between the image text similarity distribution and the standardized label matching distribution by minimizing KL divergence so as to realize cross-mode matching.
8. The implicit relationship reasoning aligned text image cross-modality pedestrian retrieval system of any of claims 5-7, wherein: the cross-modal visual text interaction encoder and the visual text interaction encoder form a cross-modal implicit relation reasoning alignment network, and the cross-modal implicit relation reasoning alignment network is a trained network;
the functions adopted in the training process are as follows:
L = L_irr + L_sdm + L_id, wherein L represents the overall training objective, L_irr represents the objective function of the IRR module, L_sdm represents the bi-directional SDM loss function, and L_id represents the ID loss function;
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310328349.7A CN116383671B (en) | 2023-03-27 | 2023-03-27 | Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116383671A true CN116383671A (en) | 2023-07-04 |
CN116383671B CN116383671B (en) | 2024-05-28 |
Family
ID=86980048
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114663677A (en) * | 2022-04-08 | 2022-06-24 | 杭州电子科技大学 | Visual question answering method based on cross-modal pre-training feature enhancement |
CN114926835A (en) * | 2022-05-20 | 2022-08-19 | 京东科技控股股份有限公司 | Text generation method and device, and model training method and device |
CN115033670A (en) * | 2022-06-02 | 2022-09-09 | 西安电子科技大学 | Cross-modal image-text retrieval method with multi-granularity feature fusion |
CN115292533A (en) * | 2022-08-17 | 2022-11-04 | 苏州大学 | Cross-modal pedestrian retrieval method driven by visual positioning |
CN115311389A (en) * | 2022-08-05 | 2022-11-08 | 西北大学 | Multi-mode visual prompting technology representation learning method based on pre-training model |
WO2022261570A1 (en) * | 2021-08-04 | 2022-12-15 | Innopeak Technology, Inc. | Cross-attention system and method for fast video-text retrieval task with image clip |
WO2023004206A1 (en) * | 2021-08-04 | 2023-01-26 | Innopeak Technology, Inc. | Unsupervised hashing method for cross-modal video-text retrieval with clip |
Non-Patent Citations (4)
Title |
---|
ASHISH VASWANI et al.: "Attention Is All You Need", ARXIV.ORG, 12 June 2017 (2017-06-12), pages 1 - 15 *
DING JIANG, MANG YE: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", 2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 22 August 2023 (2023-08-22), pages 2787 - 2797 *
JACOB DEVLIN et al.: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", ARXIV.ORG, 24 May 2019 (2019-05-24), pages 1 - 16 *
ZHAO JINWEI et al.: "Research on a CLIP-model-based multimodal search tool for military-domain image resources", CHINESE JOURNAL OF MEDICAL LIBRARY AND INFORMATION SCIENCE, vol. 31, no. 08, 31 August 2022 (2022-08-31), pages 14 - 20 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116383671B (en) | Text image cross-mode pedestrian retrieval method and system with implicit relation reasoning alignment | |
Jiang et al. | Cross-modal implicit relation reasoning and aligning for text-to-image person retrieval | |
Zhang et al. | HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization | |
Zeng et al. | Beyond ocr+ vqa: involving ocr into the flow for robust and accurate textvqa | |
Suo et al. | A simple and robust correlation filtering method for text-based person search | |
Harizi et al. | Convolutional neural network with joint stepwise character/word modeling based system for scene text recognition | |
CN113033438A (en) | Data feature learning method for modal imperfect alignment | |
Liu et al. | Facial attractiveness computation by label distribution learning with deep CNN and geometric features | |
CN114398681A (en) | Method and device for training privacy information classification model and method and device for identifying privacy information | |
Patel et al. | Abstractive information extraction from scanned invoices (AIESI) using end-to-end sequential approach | |
He et al. | Cross-modal retrieval by real label partial least squares | |
Lu et al. | Domain-aware se network for sketch-based image retrieval with multiplicative euclidean margin softmax | |
Sharma et al. | Multilevel attention and relation network based image captioning model | |
Fu et al. | Look back again: Dual parallel attention network for accurate and robust scene text recognition | |
Luo et al. | An efficient multi-scale channel attention network for person re-identification | |
CN116578734B (en) | Probability embedding combination retrieval method based on CLIP | |
Yang et al. | Facial expression recognition based on multi-dataset neural network | |
Zhang et al. | Transformer-based global–local feature learning model for occluded person re-identification | |
Hasnat et al. | Robust license plate signatures matching based on multi-task learning approach | |
Lin et al. | A deep learning based bank card detection and recognition method in complex scenes | |
Hao et al. | A lightweight attention-based network for micro-expression recognition | |
Sharma et al. | A framework for image captioning based on relation network and multilevel attention mechanism | |
Yang et al. | Robust feature mining transformer for occluded person re-identification | |
Zhang et al. | Multiplicative angular margin loss for text-based person search | |
Wu et al. | Naster: non-local attentional scene text recognizer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||