CN115311687A - Natural language pedestrian retrieval method and system combining token and feature alignment - Google Patents

Natural language pedestrian retrieval method and system combining token and feature alignment

Info

Publication number
CN115311687A
Authority
CN
China
Prior art keywords
text
image
feature
features
token
Prior art date
Legal status
Pending
Application number
CN202210951558.2A
Other languages
Chinese (zh)
Inventor
李成龙
李尚泽
鹿安东
黄岩
王亮
程致远
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202210951558.2A
Publication of CN115311687A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a natural language pedestrian retrieval method and system combining token and feature alignment, comprising the following steps: extracting visual features of an input pedestrian image with the image branch of a dual-stream feature learning network; extracting text features of the input pedestrian description with the text branch of the dual-stream feature learning network; aligning the global feature maps extracted by the image and text branches in a feature space; generating a token sequence from the aligned image global features; performing token alignment between the generated token sequence and the real token sequence; performing cross-modal fusion and interaction between image and text features; training the natural language pedestrian retrieval model that combines token and feature alignment; and testing the trained model to obtain the retrieval result. The method addresses the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance.

Description

Natural language pedestrian retrieval method and system combining token and feature alignment
Technical Field
The invention relates to the technical field of deep learning and criminal investigation, in particular to a natural language pedestrian retrieval method and system combining token and feature alignment.
Background
In the field of natural language pedestrian retrieval, most existing advanced methods are dedicated to mining local features of the two modalities and then performing fine-grained visual-text matching. From the viewpoint of how local regions are divided, these methods can be roughly classified into two types: methods based on manually designed multi-scale priors, and methods assisted by additional models.
The first type of prior-art method uses a series of manually designed local regions to construct matches between features of different scales. For example, the prior patent application published as CN113221680A (2021-08-06, Northwestern Polytechnical University) proposes a natural language pedestrian retrieval method based on text-guided dynamic visual feature extraction. The technique uses MobileNet and Bi-LSTM as the image and text feature extraction networks, respectively. To obtain fine-grained image feature representations, it divides the input pedestrian image into k horizontal regions from top to bottom and feeds them into the feature extraction network to obtain local image features. Next, the horizontal regions in which the visual objects mentioned in the natural language description appear are given different weights, so that visual feature extraction is dynamically guided by the text for natural language pedestrian retrieval. Although this first type of method can outperform approaches that use only global features by learning discriminative local feature representations, the local features of the two modalities are difficult to align accurately, which easily leads to embedding ambiguity and limits further performance improvement.
The second type of prior-art method attempts to pre-process the data with additional models or natural language processing tools in order to segment valuable image regions or text phrases. For example, the prior patent application published as CN114036336A, "Pedestrian image search method based on semantic division and visual-text attribute alignment" (Shanghai Jiao Tong University, 2022-02-11), proposes a natural language pedestrian retrieval method based on semantic division and visual-text attribute alignment, which not only extracts global feature representations of pedestrian images and text descriptions but also processes the raw data of the image and text modalities. The technique uses an existing human-body segmentation network to segment the pedestrian image into image blocks for the head, upper body, lower body, shoes and backpack; in addition, the text phrase corresponding to each body part is extracted with the natural language processing toolkit NLTK. Next, using ResNet50 as the image feature extraction network and Bi-LSTM as the text feature extraction network, global and local features of the image and text modalities are extracted, and cross-modal feature alignment is then performed at both the global and local scales. This second type of method greatly increases the parameter count and network complexity, its preprocessing is time-consuming and involves complicated steps, and the performance of the subsequent pedestrian retrieval model depends heavily on how well image regions or text phrases are divided in the preprocessing stage.
Furthermore, both types of approach still focus essentially on feature-level alignment: the modality distance and the intra-class distance are optimized simultaneously on the same features, which clearly makes it difficult to obtain an optimal result.
In conclusion, the prior art suffers from the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and the intra-class distance.
Disclosure of Invention
The invention aims to solve the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance in the prior art.
The invention adopts the following technical scheme to solve these problems. The natural language pedestrian retrieval method combining token and feature alignment comprises the following steps:
S1, processing the image branch of a preset dual-stream feature learning network, and extracting input pedestrian image features using a pyramid vision Transformer as the backbone network;
S2, processing the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
S3, aligning the global feature maps extracted by the image branch and the text branch in a preset feature space to obtain aligned global features, learning discriminative visual-text features with the cross-modal projection matching loss function CMPM, associating the image and text modalities according to the learned discriminative visual-text features, and reducing the modality distance between image and text;
S4, generating a token sequence from the aligned image global features, converting the features of the image and text modalities into the same space for measurement, bridging the two modalities, reducing the distance between them with a new serial optimization paradigm to obtain modality-invariant features, generating a text description from the deep semantic features of the input image with a text generation module, mapping image and text features to the same space, and adding token-space supervision on top of the feature space, thereby reducing the intra-class distance while drawing the image and text modalities closer;
S5, using the joint token and feature alignment framework TFAF, taking the cross-entropy loss as a reconstruction loss function, and constraining the distance between the generated token sequence and the real token sequence so as to achieve token-space alignment;
S6, performing cross-modal fusion and interaction between image and text features: the cross-modal interaction module maps the high-level image global features and the generated text features to their respective feature spaces by convolution, down-samples and strengthens them, obtains a weight matrix between the high-level image global features and the generated text features, normalizes and weights the weight matrix to obtain an applicable attention matrix, processes the applicable attention matrix with a residual connection to obtain an applicable fusion output, supervises the applicable fusion output together with the high-level text global features extracted by the text branch in step S2 using the cross-modal projection matching loss function as the interaction loss function, and reduces the modality difference by shortening the distance between the image and text modalities;
S7, extracting image features and text features according to steps S1 to S6, and training the natural language pedestrian retrieval model with the Adam neural network optimizer;
and S8, testing the natural language pedestrian retrieval model to obtain a pedestrian retrieval result.
The method guides network learning by aligning two spaces, the token space and the feature space. First, a powerful dual-stream feature learning network based on a pyramid vision Transformer and a convolutional neural network is constructed, and feature-space alignment is performed with global features only, which effectively shortens the modality distance. Second, a text generation module is designed to reduce the intra-class distance in the token space by cross-modal text generation. Finally, a cross-modal interaction module is proposed to aggregate the image features and the generated text features, further shortening the distance between the image and text modalities and reducing the modality difference. The invention avoids the embedding ambiguity caused by using local features, requires no additional preprocessing steps, and reduces time and resource overhead.
In addition, through a novel framework combining token and feature alignment that optimizes the modality distance and the intra-class distance in the feature space and the token space respectively, the invention performs fine-grained natural language pedestrian retrieval and effectively overcomes the shortcomings of existing methods.
In a more specific technical solution, step S1 includes:
S11, the pyramid vision Transformer comprises four stages, each consisting of a patch embedding and a Transformer encoder; in the training stage, a batch of training data is set as

$\{(I_i, T_i)\}_{i=1}^{N}$

where N represents the number of image-text pairs that match each other and belong to the same identity (a minimal sketch of the assumed data format is given after this step);
S12, given a pedestrian image I, the high-level global feature map generated by the fourth stage of the pyramid vision Transformer is represented as

$F_I \in \mathbb{R}^{H \times W \times C}$

where H, W and C respectively represent the height, width and number of channels of the feature map.
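As a non-authoritative illustration of the data format assumed in S11, the following PyTorch sketch shows a dataset whose samples are matched image-text pairs sharing one identity label; the field layout and the transform/tokenize callables are assumptions made for the example, not part of the patent.

```python
# Hedged sketch of the assumed training data format: each sample is a matched
# image-text pair with one identity label, so a batch holds N such pairs.
from torch.utils.data import Dataset
from PIL import Image

class PedestrianTextDataset(Dataset):
    def __init__(self, samples, transform, tokenize):
        # samples: list of (image_path, caption, identity_label) triples (assumed layout)
        self.samples = samples
        self.transform = transform      # image preprocessing, e.g. resize + normalize
        self.tokenize = tokenize        # caption -> fixed-length token id sequence

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption, label = self.samples[idx]
        image = self.transform(Image.open(path).convert("RGB"))
        token_ids = self.tokenize(caption)
        return image, token_ids, label
```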
In a more specific technical solution, step S2 includes:
S21, in the text branch, converting the text description into a token sequence and extracting word vectors with a BERT model;
S22, setting a fixed value L to control the sentence length;
S23, during conversion of the text description into a token sequence, applying zero padding to sequences to be converted whose length is smaller than the preset length threshold L;
S24, for sequences to be converted whose length exceeds the preset length threshold L, taking the first L tokens to obtain fixed-length token sequences, and inputting the fixed-length token sequences into the BERT model to obtain the word vectors

$E \in \mathbb{R}^{L \times D}$

where D is the dimension of each word vector;
S25, extending the dimensions of the word vectors from $\mathbb{R}^{L \times D}$ to $\mathbb{R}^{1 \times L \times D}$ so that the global feature map of the pedestrian description can be extracted by the subsequent convolutional layers;
S26, converting the word-vector dimension D with a convolution layer and a batch norm operation, so that D is mapped to the channel number C of the high-level image global feature map;
S27, extracting the high-level global feature map $F_T$ of each sentence description with a deep convolutional neural network, wherein the deep convolutional neural network contains a text residual bottleneck structure.
In the text branch, the invention adopts the BERT model widely applied in natural language processing to convert the text description into the token sequence and extract the word vector. The invention sets a fixed value L to control the sentence length so as to facilitate the subsequent processing.
In a more specific technical solution, step S3 includes:
S31, given the image and text features of a batch, representing the image-text pairs as

$\{(f_i^{I}, f_i^{T})\}_{i=1}^{N}$

S32, processing the image and text features with global max pooling to obtain the pooled features $f_i^{I}, f_i^{T} \in \mathbb{R}^{C}$, so that important global context information is retained, and using the scalar projection between an image feature $f_i^{I}$ and a text feature $f_j^{T}$ to characterize the similarity of the two feature vectors;
S33, taking the ratio of the scalar projection value of the pair $(f_i^{I}, f_j^{T})$ to the scalar projection values of all feature pairs in the batch, and obtaining the probability that the image feature $f_i^{I}$ and the text feature $f_j^{T}$ belong to the same identity as

$p_{i,j} = \dfrac{\exp\big((f_i^{I})^{\top}\bar{f}_j^{T}\big)}{\sum_{k=1}^{N}\exp\big((f_i^{I})^{\top}\bar{f}_k^{T}\big)} \qquad (1)$

where $\bar{f}_j^{T} = f_j^{T} / \lVert f_j^{T} \rVert$ represents the normalized text feature;
S34, associating each image feature $f_i^{I}$ in the batch with its correctly matched text features and optimizing the objective function

$L_i = \sum_{j=1}^{N} p_{i,j}\,\log\dfrac{p_{i,j}}{q_{i,j}+\varepsilon} \qquad (2)$

where $\varepsilon$ is a small constant used to avoid numerical problems and $q_{i,j}$ is the normalized probability that the image feature $f_i^{I}$ and the text feature $f_j^{T}$ are a correct match;
S35, since no fewer than two text features in a batch may match an image feature $f_i^{I}$, the correct match probability is normalized as $q_{i,j} = y_{i,j} / \sum_{k=1}^{N} y_{i,k}$, where $y_{i,j}$ equals 1 when the pair shares the same identity and 0 otherwise; in a batch, the image-to-text projection loss function is then defined as

$L_{I2T} = \dfrac{1}{N}\sum_{i=1}^{N} L_i \qquad (3)$

where the subscript I2T denotes image-to-text, $L_{I2T}$ is the image-to-text projection loss function, and $L_{T2I}$ represents the analogous text-to-image projection loss function;
the CMPM loss function closes the image-text modality distance in both directions:

$L_{CMPM} = L_{I2T} + L_{T2I} \qquad (4)$
aiming at the problem that a remarkable modal distance exists between an image and a text in a cross-modal pedestrian retrieval task, the invention learns discriminant visual text characteristics by using a cross-modal projection matching loss function (CMPM), and can combine the cross-modal projection into KL divergence to associate the two modes of the image and the text.
In a more specific technical solution, step S4 includes:
S41, encoding the input image into a fixed-dimension feature vector with an encoder, and converting the fixed-dimension feature vector into generated text features with a decoder;
the pyramid vision Transformer in the image branch serves as the encoder; after feature extraction by the backbone network and processing of the input image by a global max pooling layer, the fixed-dimension feature vector $f^{I}$ is obtained, and the probability of generating the correct text from $f^{I}$ is maximized:

$\omega^{*} = \arg\max_{\omega} \sum \log p\big(T_r \mid f^{I}; \omega\big)$

where $\omega$ represents the parameters of the model and $T_r$ represents the real token sequence that has the same identity as $f^{I}$;
S42, predicting the current word from the previous words of the sentence with the chain rule so as to generate text:

$\log p\big(T_r \mid f^{I}\big) = \sum_{t=1}^{l} \log p\big(x_t \mid f^{I}, x_1, \dots, x_{t-1}\big)$

where $l$ is the length of the sentence description;
S43, modeling this process with a long short-term memory network LSTM, which comprises an input gate IG, an output gate OG and a forgetting gate FG for controlling the flow of information, wherein the input gate IG and the output gate OG determine whether information is input or output, and the forgetting gate FG determines the proportion of discarded information;
S44, denoting the candidate memory cell as $\tilde{C}_t$, which uses the Tanh activation function to map values to the interval $[-1, 1]$ and is used to determine the state of the memory cell at the current moment; the information of the memory cell at the previous moment and of the candidate memory cell at the current moment is processed under the control of the forgetting gate and the input gate, and the memory cell $C_t$ at the current moment is determined accordingly;
S45, given the input $X_t$ at the current time $t$ and the hidden state $H_{t-1}$ at the previous time $t-1$, the output gate determines the amount of information transmitted to the hidden state $H_t$:

$\tilde{C}_t = \tanh\big(W_{xc} X_t + W_{hc} H_{t-1} + b_c\big)$

$C_t = FG_t \odot C_{t-1} + IG_t \odot \tilde{C}_t$

$H_t = OG_t \odot \tanh(C_t)$

where $W_{xc}$ and $W_{hc}$ are weight parameters and $b_c$ is a bias parameter.
The invention bridges the image and text modalities, converts the features of the two modalities into the same space for measurement, and reduces the distance between them with a new serial optimization paradigm, thereby obtaining modality-invariant features.
The invention provides a text generation module that generates a text description from the deep semantic features of the input image and then constrains the difference between the generated text and the real text with a reconstruction loss function. In this way, visual and textual features are mapped to the same space. By adding extra token-space supervision on top of the feature space, the modality distance is also drawn closer while the intra-class distance is reduced. The invention adopts the long short-term memory network LSTM to alleviate the gradient anomaly problem and better model long-term dependencies in the sequence.
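As a sketch of how such an LSTM decoder could be wired in PyTorch (the hidden size, vocabulary size and teacher-forcing interface are assumptions, not values fixed by the patent):

```python
import torch
import torch.nn as nn

class TextGenerator(nn.Module):
    """Image feature -> token sequence decoder (illustrative sketch)."""
    def __init__(self, feat_dim=512, hidden_dim=512, embed_dim=512, vocab_size=30522):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, token_ids):
        """Teacher forcing: predict token t from tokens < t and the image feature."""
        h0 = self.init_h(image_feat).unsqueeze(0)       # (1, N, hidden_dim)
        c0 = self.init_c(image_feat).unsqueeze(0)
        x = self.embed(token_ids[:, :-1])               # inputs are the shifted real tokens
        out, _ = self.lstm(x, (h0, c0))
        return self.classifier(out)                     # (N, L-1, vocab_size) logits
```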
In a more specific technical solution, in step S5, the cross-entropy loss function is used to constrain the distance between the generated token sequence and the real token sequence with the following logic, thereby achieving token-space alignment:

$L_{CE} = -\sum_{x} p(x)\,\log q(x)$

where $p(x)$ is the true distribution of the sample and $q(x)$ is the predicted distribution.
The invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, which improves the quality with which the decoder of the long short-term memory network converts image features into text descriptions and makes the generated descriptions more realistic.
The token sequence is generated from the pedestrian image features and aligned in the token space, so that the intra-class distance is further reduced and cross-modal text generation promotes natural language pedestrian retrieval.
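A minimal sketch of this reconstruction loss, assuming decoder logits over the vocabulary and the shifted real token sequence; the padding index is an assumption.

```python
import torch.nn.functional as F

def reconstruction_loss(logits, real_tokens, pad_id=0):
    """logits: (N, L-1, V) decoder outputs; real_tokens: (N, L) ground-truth token ids."""
    targets = real_tokens[:, 1:]                              # align targets with shifted inputs
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)               # cross entropy between predicted and true tokens
```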
In a more specific embodiment, step S6 includes:
S61, mapping the high-level image global features and the generated text features to their respective feature spaces by convolution;
S62, down-sampling the processed high-level image global features and generated text features, and strengthening their channel information with a fully connected layer and an activation function so as to apply attention to the down-sampled features;
S63, obtaining a weight matrix between the image features and the generated text features by matrix multiplication;
S64, normalizing the weight matrix with a Softmax activation function, and weighting and summing the normalized weight matrix with the image features so as to obtain the applicable attention matrix;
S65, adding the applicable attention matrix back to the original image features with a residual connection so as to obtain the applicable fusion output;
and S66, taking the cross-modal projection matching loss function as the interaction loss function, supervising the fusion output together with the high-level text global features extracted by the preset convolutional neural network in step S2, and reducing the modality difference by shortening the distance between the image and text modalities.
The invention performs cross-modal feature fusion and interaction, and further reduces the distance between the image and text modalities by multi-stage feature fusion. Using the cross-modal projection matching loss function as the interaction loss function, the fusion output obtained by this module and the high-level text global features extracted by the convolutional neural network are supervised, so that the whole model gradually shortens the distance between the image and text modalities, reduces the modality difference, and is further improved.
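The sketch below shows one possible PyTorch reading of this cross-modal interaction module: 1x1 projections, channel gating with a fully connected layer and activation, a Softmax-normalized weight matrix between the image and generated text features, a weighted sum, and a residual connection. All layer sizes and the exact placement of the weighted sum are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    def __init__(self, channels=512, reduction=4):
        super().__init__()
        self.img_proj = nn.Conv2d(channels, channels, kernel_size=1)   # map image features
        self.txt_proj = nn.Conv1d(channels, channels, kernel_size=1)   # map generated text features
        # channel attention on the pooled (down-sampled) image feature: FC + activation
        self.channel_attn = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, img_feat, txt_feat):
        """img_feat: (N, C, H, W) high-level image global feature map;
        txt_feat: (N, C, L) generated text features."""
        n, c, h, w = img_feat.shape
        q = self.img_proj(img_feat).flatten(2)                 # (N, C, HW)
        k = self.txt_proj(txt_feat)                            # (N, C, L)

        gate = self.channel_attn(img_feat.mean(dim=(2, 3)))    # (N, C) channel weights
        q = q * gate.unsqueeze(-1)                             # strengthen channel information

        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # (N, HW, L) normalized weight matrix
        fused = (attn @ k.transpose(1, 2)).transpose(1, 2)     # weighted sum -> (N, C, HW)
        fused = fused.reshape(n, c, h, w)
        return img_feat + fused                                # residual connection to the image features
```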
In a more specific embodiment, step S7 includes:
S71, extracting image features and text features on the natural language pedestrian retrieval dataset CUHK-PEDES;
and S72, training the natural language pedestrian retrieval model with the Adam neural network optimizer under the supervision of the loss functions of all modules (a minimal training-step sketch follows).
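A hedged sketch of one training step on CUHK-PEDES: the losses of the individual modules are summed and optimized with Adam. The model interface, the unit loss weights and the learning rate are assumptions made for illustration only.

```python
import torch

def train_one_epoch(model, loader, optimizer):
    model.train()
    for images, token_ids, labels in loader:
        out = model(images, token_ids, labels)      # assumed to return the per-module losses
        loss = (out["cmpm_loss"]                    # feature-space alignment (S3)
                + out["reconstruction_loss"]        # token-space alignment (S4-S5)
                + out["interaction_loss"])          # cross-modal interaction (S6)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed hyper-parameter
```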
In a more specific aspect, a natural language pedestrian retrieval system combining token and feature alignment includes:
a pedestrian image feature extraction module, used to process the image branch of a preset dual-stream feature learning network and extract input pedestrian image features with a pyramid vision Transformer as the backbone network;
a pedestrian text feature extraction module, used to process the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
a feature space alignment module, used to align the global feature maps extracted by the image and text branches in a preset feature space to obtain aligned global features, learn discriminative visual-text features with the cross-modal projection matching loss function CMPM, associate the image and text modalities according to the learned discriminative visual-text features, and reduce the modality distance between image and text; the feature space alignment module is connected with the pedestrian image feature extraction module and the pedestrian text feature extraction module;
a text generation module, used to generate a token sequence from the aligned image global features, convert the features of the image and text modalities into the same space for measurement, bridge the two modalities, reduce the distance between them with a new serial optimization paradigm to obtain modality-invariant features, generate a text description from the deep semantic features of the input image, map image and text features to the same space, and add token-space supervision on top of the feature space, thereby reducing the intra-class distance and shortening the distance between the image and text modalities; the text generation module is connected with the pedestrian image feature extraction module, the pedestrian text feature extraction module and the feature space alignment module;
a token space alignment module, used to apply the joint token and feature alignment framework TFAF, take the cross-entropy loss as the reconstruction loss function, and constrain the distance between the generated token sequence and the real token sequence to achieve token-space alignment; the token space alignment module is connected with the text generation module;
a cross-modal fusion interaction module, used for cross-modal fusion and interaction of image and text features: the module maps the high-level image global features and the generated text features to their respective feature spaces by convolution, down-samples and strengthens them, obtains a weight matrix between the high-level image global features and the generated text features, normalizes and weights the weight matrix to obtain an applicable attention matrix, processes the applicable attention matrix with a residual connection to obtain an applicable fusion output, takes the cross-modal projection matching loss function as the interaction loss function, supervises the applicable fusion output together with the high-level text global features extracted by the text branch in step S2, and reduces the modality difference by shortening the distance between the image and text modalities;
a model training module, used to train the natural language pedestrian retrieval model with the Adam neural network optimizer according to the image and text features; the model training module is connected with the pedestrian image feature extraction module, the pedestrian text feature extraction module, the feature space alignment module, the text generation module, the token space alignment module and the cross-modal fusion interaction module;
and a retrieval result acquisition module, used to test the natural language pedestrian retrieval model so as to obtain the pedestrian retrieval result; the retrieval result acquisition module is connected with the model training module.
Compared with the prior art, the invention has the following advantages. The method guides network learning by aligning two spaces, the token space and the feature space. First, a powerful dual-stream feature learning network based on a pyramid vision Transformer and a convolutional neural network is constructed, and feature-space alignment is performed with global features only, effectively shortening the modality distance. Second, a text generation module is designed to reduce the intra-class distance in the token space by cross-modal text generation. Finally, a cross-modal interaction module is proposed to aggregate the image features and the generated text features, further shortening the distance between the image and text modalities and reducing the modality difference. The invention avoids the embedding ambiguity caused by using local features, requires no additional preprocessing steps, and reduces time and resource overhead.
In addition, through a novel framework combining token and feature alignment that optimizes the modality distance and the intra-class distance in the feature space and the token space respectively, the invention performs fine-grained natural language pedestrian retrieval and effectively overcomes the shortcomings of existing methods.
In the text branch, the invention adopts the BERT model, which is widely used in natural language processing, to convert the text description into a token sequence and extract word vectors, and sets a fixed value L to control the sentence length for convenient subsequent processing.
To address the significant modality distance between image and text in the cross-modal pedestrian retrieval task, the invention learns discriminative visual-text features with the cross-modal projection matching loss function (CMPM), which incorporates the cross-modal projection into a KL divergence to associate the image and text modalities.
The invention bridges the image and text modalities, converts the features of the two modalities into the same space for measurement, and reduces the distance between them with a new serial optimization paradigm, thereby obtaining modality-invariant features.
The invention provides a text generation module that generates a text description from the deep semantic features of the input image. In this way, visual and textual features are mapped to the same space. By adding extra token-space supervision on top of the feature space, the modality distance is also drawn closer while the intra-class distance is reduced. The invention adopts the LSTM to alleviate the gradient anomaly problem and better model long-term dependencies in the sequence.
The invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, which improves the quality with which the decoder of the long short-term memory network converts image features into text descriptions and makes the generated descriptions more realistic.
The method generates a token sequence from the pedestrian image features and aligns it in the token space so as to further reduce the intra-class distance, and uses cross-modal text generation to promote natural language pedestrian retrieval.
The invention performs cross-modal feature fusion and interaction, and further reduces the distance between the image and text modalities by multi-stage feature fusion. Using the cross-modal projection matching loss function as the interaction loss function, the fusion output obtained by this module and the high-level text global features extracted by the convolutional neural network are supervised, so that the whole model gradually shortens the distance between the image and text modalities, reduces the modality difference, and is further improved. The method solves the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance in the prior art.
Drawings
FIG. 1 is a schematic diagram of an overall network framework of a joint token and feature aligned natural language pedestrian retrieval method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the basic steps of the joint token and feature aligned natural language pedestrian retrieval method according to embodiment 1 of the present invention;
Fig. 3 is a schematic connection diagram of a cross-modal interaction module according to embodiment 1 of the present invention;
fig. 4 is a schematic diagram of specific steps of generating a token sequence in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present invention applies the PyTorch framework to the field of natural language pedestrian retrieval and proposes a new Token and Feature Alignment framework (TFAF) that pursues joint token and feature alignment to reduce the modality distance and the intra-class distance between images and text. Specifically, the method first constructs a new dual-stream feature learning network that extracts image and text features respectively and brings samples of the two modalities closer in the feature space for feature alignment. Second, a text generation module is designed: a token sequence is generated from the image features aligned in the feature space, and token alignment is then performed between the generated token sequence and the real token sequence, reducing the intra-class distance in the token space. Finally, a cross-modal interaction module is proposed that further reduces the distance between the image and text modalities using multi-stage feature fusion.
As shown in fig. 2, the joint token and feature aligned natural language pedestrian retrieval method provided by the invention comprises the following steps:
S1, extracting visual features of the input pedestrian image with the image branch of the dual-stream feature learning network;
The invention adopts the pyramid vision Transformer as the backbone network to extract the image feature map. It contains four stages, each consisting of a patch embedding and a Transformer encoder. In the training phase, a batch of training data is assumed to be $\{(I_i, T_i)\}_{i=1}^{N}$, where N represents the number of image-text pairs that match each other and belong to the same identity. Given a pedestrian image I, the high-level global feature map generated by the fourth stage of the pyramid vision Transformer is represented as $F_I \in \mathbb{R}^{H \times W \times C}$, where H, W and C represent the height, width and number of channels of the feature map, respectively.
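A hedged sketch of the image branch interface in PyTorch: any pyramid vision Transformer implementation whose forward pass returns the stage-4 feature map can be plugged in (for example the PVTv2 models available in the timm library, which is an assumption); the global max pooling used later for feature-space alignment is shown as well.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Wraps a backbone that returns the stage-4 feature map F_I of shape (N, C, H, W)."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone            # assumed: images -> (N, C, H, W) feature map

    def forward(self, images):
        f_img = self.backbone(images)               # high-level global feature map F_I
        f_vec = torch.amax(f_img, dim=(2, 3))       # global max pooling -> (N, C) vector
        return f_img, f_vec
```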
S2, extracting text features of the input pedestrian description with the text branch of the dual-stream feature learning network;
In the text branch, the invention uses a BERT model, which is widely used in natural language processing, to convert the text descriptions into token sequences and extract word vectors. For the convenience of subsequent processing, a fixed value L is set to control the sentence length. When converting a text description into a token sequence, sequences shorter than L are zero-padded; for sequences longer than L, the first L tokens are taken. A fixed-length token sequence is thus obtained and input into the BERT model to obtain the word vectors $E \in \mathbb{R}^{L \times D}$, where D is the dimension of each word vector.
To extract the global feature map of the pedestrian description, the invention first extends the dimensions of the word vectors from $\mathbb{R}^{L \times D}$ to $\mathbb{R}^{1 \times L \times D}$ so that they can be processed by the subsequent convolutional neural network. Next, the word-vector dimension D is converted to the same channel number C as the high-level image global feature map using a convolution layer and a batch norm operation. Finally, a deep convolutional neural network containing a text residual bottleneck structure is used to extract the high-level global feature map $F_T$ of each sentence description.
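A sketch of the text branch front end, assuming the Hugging Face transformers implementation of BERT and illustrative values for the sentence length L and channel number C:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

L, C = 64, 512                                             # assumed sentence length and channel number
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer(["a man wearing a black jacket and blue jeans"],
                padding="max_length", truncation=True, max_length=L,
                return_tensors="pt")                       # zero padding / first-L truncation
with torch.no_grad():
    word_vectors = bert(**enc).last_hidden_state           # (N, L, D) word vectors, D = 768

x = word_vectors.permute(0, 2, 1).unsqueeze(2)             # (N, D, 1, L): add a singleton height dimension
to_channels = nn.Sequential(nn.Conv2d(768, C, kernel_size=1), nn.BatchNorm2d(C))
text_map = to_channels(x)                                  # (N, C, 1, L), input to the text residual network
```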
S3, aligning the global feature maps extracted from the image and text branches in a feature space;
one of the major challenges for the cross-modal pedestrian retrieval task is the significant modal distance between the image and the text. In order to reduce the distance between the two modes. The invention learns discriminative visual text features using a cross-modal projection matching penalty function (CMPM), which can merge cross-modal projections into KL divergence to correlate image and text modalities.
Given a batch of image and text features, an image-text pair is represented as
Figure BDA0003789678030000111
In order to filter out important global context information and reduce the sensitivity of the network to modal differences, the invention firstly applies global maximum pooling to image features and text features to obtain
Figure BDA0003789678030000112
The similarity between two feature vectors can be reflected by the size of the scalar projection, the greater the value of which, the greater the similarity between two feature vectors. Thus, according to
Figure BDA0003789678030000113
And
Figure BDA0003789678030000114
all feature pairs in a batch with scalar projection values
Figure BDA0003789678030000115
Can obtain the proportion of
Figure BDA0003789678030000116
And
Figure BDA0003789678030000117
the probability of belonging to the same identity is:
Figure BDA0003789678030000118
wherein the content of the first and second substances,
Figure BDA0003789678030000119
representing a standardized text feature. The present invention requires optimizing each image feature in the batch process
Figure BDA00037896780300001110
The objective function associated with its correctly matched text feature is expressed as:
Figure BDA00037896780300001111
where e is used to avoid numerical problems, q i,j Is a feature of the image
Figure BDA00037896780300001112
And text features
Figure BDA00037896780300001113
Normalized correct match probability because there may be multiple text features and matches in a batch
Figure BDA00037896780300001114
Match, can be expressed as
Figure BDA00037896780300001115
Thus, in a batch, the loss function of the image-to-text projection can be summarized as:
Figure BDA00037896780300001116
vice versa, the loss function of the text-to-image projection can be expressed as L T2I . To zoom in the distance between the image and the text modality in both directions, the CMPM loss function can be defined as:
L CMPM =L I2T +L T2I (4)
s4, generating a token sequence by using the image global features aligned in the feature space;
to reduce the modal distance between the image and the text, many existing methods segment the image or extract attribute phrases from the sentence description. Visual features are encouraged to match given text features by establishing a variety of granular associations between images and text. However, these methods introduce additional pre-processing steps, resulting in a significant increase in the amount of computation and model complexity. The present invention attempts to bridge the image and text modalities, transform the features of the two modalities into the same space for metrology, and reduce the distance between the two modalities with a new paradigm of string optimization to obtain the modality invariance features.
In view of the above, the present invention proposes a text generation module that uses deep semantic features of an input image to generate a text description, and in this way, visual and text features are mapped to the same space. By adding extra supervision of token space on the basis of feature space, the modal distance is also pulled closer while the intra-class distance is reduced.
As shown in fig. 4, step S4 further includes the following specific steps:
In step S41, the whole framework can be regarded as an encoder-decoder structure. The input image is first encoded into a fixed-dimension feature vector using the encoder, and this feature vector is then converted into generated text using the decoder. The pyramid vision Transformer in the image branch plays the role of the encoder; after feature extraction by the backbone network and processing by the global max pooling layer, the fixed-dimension feature vector $f^{I}$ is obtained. The next goal is to maximize the probability of generating the correct text from $f^{I}$, as shown in Equation (5), where $\omega$ represents the parameters of the model (omitted in the subsequent formulas for brevity) and $T_r$ represents the real token sequence that has the same identity as $f^{I}$:

$\omega^{*} = \arg\max_{\omega} \sum \log p\big(T_r \mid f^{I}; \omega\big) \qquad (5)$

In step S42, taking one sentence description as an example, completing the text generation task requires predicting the current word from the previous words using the chain rule, as shown in Equation (6), where $l$ is the length of the sentence description. A recurrent neural network (RNN) is a network with recurrent connections that can pass information between different time steps and predict the state at the current time from previously remembered information, and it can therefore model this process. However, although the recurrent neural network has a certain memory capacity, it does not handle the long-term dependence problem well: when the predicted position is far away from the information on which it depends, the recurrent neural network has difficulty learning the relevant information accurately.

$\log p\big(T_r \mid f^{I}\big) = \sum_{t=1}^{l} \log p\big(x_t \mid f^{I}, x_1, \dots, x_{t-1}\big) \qquad (6)$

In step S43, to solve the above problem, the invention employs a long short-term memory network (LSTM). The network introduces three gating mechanisms, the Input Gate (IG), the Output Gate (OG) and the Forgetting Gate (FG), which control the flow of information. The first two gates determine whether to input or output information, and the last gate determines the proportion of information that should be discarded. In addition, the candidate memory cell is denoted $\tilde{C}_t$; it uses the Tanh activation function to map values to the interval $[-1, 1]$ and is used to determine the state of the memory cell at the current moment. The memory cell $C_t$ at the current moment is determined by the information of the memory cell at the previous moment and of the candidate memory cell at the current moment, under the control of the forgetting gate and the input gate. Finally, the output gate decides the amount of information passed to the hidden state $H_t$. Given the input $X_t$ at the current time $t$ and the hidden state $H_{t-1}$ at the previous time $t-1$, the above process is expressed as Equations (7)-(9), where $W_{xc}$ and $W_{hc}$ are weight parameters and $b_c$ is a bias parameter. In this way, the long short-term memory network alleviates the gradient anomaly problem and better models long-term dependencies in the sequence.

$\tilde{C}_t = \tanh\big(W_{xc} X_t + W_{hc} H_{t-1} + b_c\big) \qquad (7)$

$C_t = FG_t \odot C_{t-1} + IG_t \odot \tilde{C}_t \qquad (8)$

$H_t = OG_t \odot \tanh(C_t) \qquad (9)$
S5, performing token alignment between the generated token sequence and the real token sequence;
In order to improve the quality with which the decoder of the long short-term memory network converts image features into text descriptions and to make the generated descriptions more realistic, the invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, thereby achieving token-space alignment, as shown in Equation (10), where $p(x)$ is the true distribution of the sample and $q(x)$ is the predicted distribution:

$L_{CE} = -\sum_{x} p(x)\,\log q(x) \qquad (10)$
s6, performing cross-mode fusion interaction on the image and text features;
as shown in fig. 3, in order to further reduce the distance between the image and text modalities, the present invention designs a new cross-modality interaction module. And respectively mapping the image high-level global features extracted by the pyramid vision Transformer and the generated text features obtained by utilizing the image features in the text generation module to respective feature spaces through convolution operation. And performing attention enhancement on the down-sampled feature vector, namely enhancing the channel information of the input feature by using a module consisting of a full connection layer and an activation function. And obtaining a weight matrix between the image and the text characteristic through matrix multiplication, normalizing the weight matrix by using a Softmax activation function, and carrying out weighted summation on the result and the image characteristic to obtain a final attention matrix. In addition, the idea of residual error connection is introduced here, and the attention moment matrix is added back to the original image characteristics to obtain the final fusion output.
The generation of the text features is converted from the high-level global features of the images through a text generation module, is essentially another stage expression form of the image features, and the adoption of the module for cross-modal feature fusion interaction is equivalent to the multi-stage aggregation of the two forms of image features. And then, taking a cross-modal projection matching loss function as an interaction loss function, supervising the fusion output obtained by the module and the text high-level global features extracted by the convolutional neural network, and gradually shortening the distance between the image and the text mode by the whole model in a gradual mode to reduce the mode difference.
S7, training the joint token and feature aligned natural language pedestrian retrieval model;
Using the natural language pedestrian retrieval dataset CUHK-PEDES, image and text features are extracted according to the above steps, and the neural network model is trained under the supervision of the loss functions of all modules with the Adam neural network optimizer.
S8, testing the joint token and feature aligned natural language pedestrian retrieval model;
Given a natural language text description as the query, the cosine similarity between the global features extracted from the text description and the global features of each pedestrian image in the image gallery is compared, and the image with the highest similarity is the pedestrian retrieval result.
In summary, the invention guides network learning by aligning two spaces, the token space and the feature space. First, a powerful dual-stream feature learning network based on a pyramid vision Transformer and a convolutional neural network is constructed, and feature-space alignment is performed with global features only, effectively reducing the modality distance. Second, a text generation module is designed to reduce the intra-class distance in the token space by cross-modal text generation. Finally, a cross-modal interaction module is proposed to aggregate the image features and the generated text features, further shortening the distance between the image and text modalities and reducing the modality difference. The invention avoids the embedding ambiguity caused by using local features, requires no additional preprocessing steps, and reduces time and resource overhead.
In addition, through a novel framework combining token and feature alignment that optimizes the modality distance and the intra-class distance in the feature space and the token space respectively, the invention performs fine-grained natural language pedestrian retrieval and effectively overcomes the shortcomings of existing methods.
In the text branch, the invention adopts the BERT model, which is widely used in natural language processing, to convert the text description into a token sequence and extract word vectors, and sets a fixed value L to control the sentence length for convenient subsequent processing.
To address the significant modality distance between image and text in the cross-modal pedestrian retrieval task, the invention learns discriminative visual-text features with the cross-modal projection matching loss function (CMPM), which incorporates the cross-modal projection into a KL divergence to associate the image and text modalities.
The invention bridges the image and text modalities, converts the features of the two modalities into the same space for measurement, and reduces the distance between them with a new serial optimization paradigm, thereby obtaining modality-invariant features.
The invention provides a text generation module that generates a text description from the deep semantic features of the input image. In this way, visual and textual features are mapped to the same space. By adding extra token-space supervision on top of the feature space, the modality distance is also drawn closer while the intra-class distance is reduced. The invention adopts the LSTM to alleviate the gradient anomaly problem and better model long-term dependencies in the sequence.
The invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, which improves the quality with which the decoder of the long short-term memory network converts image features into text descriptions and makes the generated descriptions more realistic.
The method generates a token sequence from the pedestrian image features and aligns it in the token space so as to further reduce the intra-class distance, and uses cross-modal text generation to promote natural language pedestrian retrieval.
The invention performs cross-modal feature fusion and interaction, and further reduces the distance between the image and text modalities by multi-stage feature fusion. Using the cross-modal projection matching loss function as the interaction loss function, the fusion output obtained by this module and the high-level text global features extracted by the convolutional neural network are supervised, so that the whole model gradually shortens the distance between the image and text modalities, reduces the modality difference, and is further improved. The method solves the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance in the prior art.
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A natural language pedestrian retrieval method combining token and feature alignment, the method comprising:
S1, processing the image branch of a preset dual-stream feature learning network, and extracting features of an input pedestrian image from the image branch by using a pyramid vision Transformer as the backbone network;
S2, processing the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
S3, aligning the global feature maps extracted from the image branch and the text branch in a preset feature space to obtain aligned global image features, learning discriminative visual-textual features with a cross-modal projection matching loss function CMPM, and associating the image and text modalities so as to reduce the modal distance between them;
S4, generating the token sequence from the aligned global image features, converting the features of the image modality and the text modality into the same space for measurement, bridging the image and text modalities, and reducing the distance between them with a new string-based optimization paradigm so as to obtain modality-invariant features; generating the text description from deep semantic features of the input image by means of a text generation module, thereby mapping the image features and the text features into the same space and adding token-space supervision on top of the feature-space supervision, which reduces the intra-class distance and shortens the distance between the image and text modalities;
S5, using the joint token and feature alignment framework TFAF and taking a cross-entropy loss as the reconstruction loss function, constraining the distance between the generated token sequence and the real token sequence so as to realize token-space alignment;
S6, performing cross-modal fusion and interaction of the image features and the text features: the cross-modal interaction module uses convolutions to map the high-level global image features and the generated text features into their respective feature spaces, down-samples and enhances them, computes a weight matrix between them, normalizes and weights the weight matrix to obtain an applicable attention matrix, and processes the applicable attention matrix through a residual connection to obtain an applicable fusion output; a cross-modal projection matching loss function is then used as the interaction loss function to supervise the applicable fusion output together with the high-level global text features extracted by the text branch in step S2, shortening the distance between the image and text modalities and reducing the modal difference;
S7, extracting the image features and the text features according to steps S1 to S6, and training the natural language pedestrian retrieval model with an Adam neural network optimizer;
and S8, testing the natural language pedestrian retrieval model to obtain a pedestrian retrieval result.
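For orientation only, the skeletal PyTorch module below shows how steps S1 to S6 of claim 1 could be wired together; every constructor argument (image_branch, text_branch, text_generator, fusion) is a hypothetical stand-in for the components named in the claims, and the global max pooling in the forward pass anticipates step S32.

```python
import torch.nn as nn

class TFAFModel(nn.Module):
    """Skeleton of the joint token and feature alignment pipeline (claim 1, illustrative)."""

    def __init__(self, image_branch, text_branch, text_generator, fusion):
        super().__init__()
        self.image_branch = image_branch      # S1: pyramid vision Transformer backbone
        self.text_branch = text_branch        # S2: BERT word vectors + deep CNN
        self.text_generator = text_generator  # S4/S5: LSTM decoder over image features
        self.fusion = fusion                  # S6: cross-modal fusion interaction

    def forward(self, images, token_ids):
        """images: (B, 3, H, W); token_ids: (B, L) tokenized pedestrian descriptions."""
        img_map = self.image_branch(images)               # high-level global image feature map (B, C, H', W')
        txt_map = self.text_branch(token_ids)             # high-level global text feature map (B, C, L)
        img_feat = img_map.flatten(2).max(dim=2).values   # global max pooling (S3/S32)
        txt_feat = txt_map.max(dim=2).values
        token_logits, gen_txt = self.text_generator(img_feat, token_ids)  # generated token sequence and features
        fused_feat = self.fusion(img_map, gen_txt)        # applicable fusion output (S6)
        return img_feat, txt_feat, token_logits, fused_feat
```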
2. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S1 comprises:
S11, the pyramid vision Transformer comprises four stages, each of which comprises a patch embedding layer and a Transformer encoder; in the training phase, a batch of training data is denoted as
\{(I_i, T_i)\}_{i=1}^{N}
wherein N represents the number of image-text pairs that match each other and belong to the same identity;
S12, given a pedestrian image I, the high-level global feature map generated by the fourth stage of the pyramid vision Transformer is represented as
F^{I} \in \mathbb{R}^{H \times W \times C}
wherein H, W and C respectively represent the height, width and number of channels of the feature map.
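A minimal sketch of the image branch described in claim 2, assuming the pyramid vision Transformer is available as a module whose fourth stage returns a feature map of shape (batch, C, H, W); the class name and the pooling step (which anticipates step S32) are illustrative assumptions.

```python
import torch.nn as nn

class ImageBranch(nn.Module):
    """Image branch of the dual-stream network (claim 2, steps S11-S12), illustrative sketch.

    `pvt_backbone` is assumed to be any module returning the stage-4 feature map
    of shape (batch, C, H, W); a real pyramid vision Transformer implementation
    would be plugged in here.
    """

    def __init__(self, pvt_backbone):
        super().__init__()
        self.backbone = pvt_backbone

    def forward(self, images):                                # images: (batch, 3, height, width)
        feat_map = self.backbone(images)                      # (batch, C, H, W) high-level global feature map
        global_feat = feat_map.flatten(2).max(dim=2).values   # global max pooling -> (batch, C)
        return feat_map, global_feat
```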
3. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S2 comprises:
S21, converting the text description into a token sequence and extracting word vectors by using a BERT model in the text branch;
S22, setting a fixed value L to control the sentence length;
S23, during the conversion of the text description into the token sequence, zero-padding the sequences whose length is smaller than the preset length threshold L;
S24, for the sequences whose length exceeds the preset length threshold L, taking the first L tokens to obtain fixed-length token sequences, and inputting the fixed-length token sequences into the BERT model to obtain the word vectors in \mathbb{R}^{L \times D}, where D is the dimension of each word vector;
S25, expanding the dimensions of the word vectors from \mathbb{R}^{L \times D} by an additional dimension so that a global feature map of the pedestrian description can be extracted;
S26, converting the word-vector dimension D with a convolution layer and a batch normalization operation so that it matches the channel number C of the high-level global image feature map;
S27, extracting the high-level global feature map of each sentence description by using a deep convolutional neural network, wherein the deep convolutional neural network comprises a text residual bottleneck structure.
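The text branch of claim 3 could be sketched as follows with the HuggingFace transformers BERT tokenizer and encoder; the checkpoint name bert-base-uncased, the length threshold L=64, the channel number C=512 and the small residual CNN standing in for the text residual bottleneck structure are all assumptions for illustration.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextBranch(nn.Module):
    """Text branch (claim 3): BERT word vectors -> conv to C channels -> deep CNN (sketch)."""

    def __init__(self, channels_c=512, max_len=64):
        super().__init__()
        self.max_len = max_len                                    # preset length threshold L
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        d = self.bert.config.hidden_size                          # word-vector dimension D (768)
        # S26: convert dimension D to the channel number C of the image feature map
        self.proj = nn.Sequential(nn.Conv1d(d, channels_c, kernel_size=1),
                                  nn.BatchNorm1d(channels_c))
        # S27: stand-in for the "text residual bottleneck" deep CNN
        self.cnn = nn.Sequential(nn.Conv1d(channels_c, channels_c, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv1d(channels_c, channels_c, 3, padding=1))

    def forward(self, sentences):
        # S23/S24: zero-pad or truncate to the fixed length L
        tokens = self.tokenizer(sentences, padding="max_length", truncation=True,
                                max_length=self.max_len, return_tensors="pt")
        word_vectors = self.bert(**tokens).last_hidden_state      # (batch, L, D)
        x = word_vectors.transpose(1, 2)                          # (batch, D, L)
        x = self.proj(x)                                          # (batch, C, L)
        return self.cnn(x) + x                                    # residual, (batch, C, L)
```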
4. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein said step S3 comprises:
S31, given the image features and text features of a batch, expressing the image-text pairs as \{(f_i^{I}, f_i^{T})\}_{i=1}^{N};
S32, processing the image features and the text features by global max pooling to obtain the max-pooled feature vectors, thereby filtering out the important global context information, and using the scalar projection between the image feature f_i^{I} and the text feature f_j^{T} to characterize the similarity of the image and text feature vectors;
S33, computing the scalar projection values of all feature pairs in one batch, and obtaining the probability that the image feature f_i^{I} and the text feature f_j^{T} have the same identity by the following logic:
p_{i,j} = \frac{\exp((f_i^{I})^{\top} \bar{f}_j^{T})}{\sum_{k=1}^{N} \exp((f_i^{I})^{\top} \bar{f}_k^{T})}    (1)
wherein \bar{f}_j^{T} denotes the normalized text feature;
S34, using the following logic to associate each image feature f_i^{I} in the batch with the text features that it correctly matches, and optimizing the objective function:
L_i = \sum_{j=1}^{N} p_{i,j} \log \frac{p_{i,j}}{q_{i,j} + \epsilon}    (2)
wherein \epsilon is a parameter used to avoid numerical problems, and q_{i,j} is the normalized probability that the image feature f_i^{I} correctly matches the text feature f_j^{T};
S35, when no fewer than 2 text features in one batch match the image feature f_i^{I}, the image-to-text projection loss function over the batch is defined with the following logic:
L_{I2T} = \frac{1}{N} \sum_{i=1}^{N} L_i    (3)
where the subscript I2T denotes image-to-text, L_{I2T} is the image-to-text projection loss function, and L_{T2I} represents the text-to-image projection loss function;
the CMPM loss function, which bi-directionally closes the distance between the image and text modalities, is obtained by the following logic:
L_{CMPM} = L_{I2T} + L_{T2I}    (4)
5. The joint token and feature aligned natural language pedestrian retrieval method of claim 4, wherein in said step S35, the correct matching probability q_{i,j} is characterized by the following logic:
q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{N} y_{i,k}}
where y_{i,j} = 1 when the image feature f_i^{I} and the text feature f_j^{T} belong to the same identity, and y_{i,j} = 0 otherwise.
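Claims 4 and 5 describe the cross-modal projection matching (CMPM) loss; the following PyTorch sketch is consistent with formulas (1) to (4) above, with the identity labels used to build y_{i,j} and the epsilon value chosen as illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    """Cross-modal projection matching loss (claims 4-5), illustrative sketch.

    image_feats, text_feats: (N, C) globally max-pooled features of one batch
    labels: (N,) person identity labels used to build the matching indicator y_ij
    """
    # y_ij = 1 when image i and text j share the same identity (claim 5)
    y = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()
    q = y / y.sum(dim=1, keepdim=True)                  # normalized matching probability q_ij

    text_norm = F.normalize(text_feats, dim=1)          # normalized text features
    image_norm = F.normalize(image_feats, dim=1)        # normalized image features

    # p_ij: probability that image i and text j have the same identity (eq. 1)
    p_i2t = F.softmax(image_feats @ text_norm.t(), dim=1)
    p_t2i = F.softmax(text_feats @ image_norm.t(), dim=1)

    # KL-style projection losses in both directions (eqs. 2-3), then summed (eq. 4)
    l_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()
    l_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()
    return l_i2t + l_t2i
```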
6. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S4 comprises:
S41, encoding the input image into a fixed-dimension feature vector with an encoder, and converting the fixed-dimension feature vector into generated text features with a decoder;
taking the pyramid vision Transformer in the image branch as the encoder, performing feature extraction with the backbone network, and processing the input image with a global max pooling layer to obtain the fixed-dimension feature vector f_i^{I};
using the fixed-dimension feature vector f_i^{I}, maximizing the probability of generating the correct text with the following logic:
\omega^{*} = \arg\max_{\omega} \sum_{(f_i^{I}, T_r)} \log p(T_r \mid f_i^{I}; \omega)    (5)
where \omega represents the parameters of the model, and T_r represents the real token sequence having the same identity as f_i^{I};
S42, predicting the current word from the previous words in the sentence by the chain rule according to the following logic, so as to generate the text:
\log p(T_r \mid f_i^{I}; \omega) = \sum_{t=1}^{L_s} \log p(w_t \mid w_1, \ldots, w_{t-1}, f_i^{I}; \omega)    (6)
where L_s is the length of the sentence description;
S43, modeling the above logic with a long short-term memory network LSTM, which comprises an input gate IG, an output gate OG and a forget gate FG for controlling the flow of information, wherein the input gate IG and the output gate OG determine whether information is input or output, and the forget gate FG determines the proportion of information to be discarded;
S44, representing the candidate memory cell as \tilde{C}_t, and using the Tanh activation function to map values to the interval [-1, 1] so as to determine the state of the memory cell at the current moment; processing the information of the memory cell at the previous moment and the candidate memory cell at the current moment under the control of the forget gate and the input gate, and determining the memory cell C_t at the current moment accordingly;
S45, given the input X_t at the current time t and the hidden state H_{t-1} at the previous time t-1, determining through the output gate the amount of information transmitted to the hidden state H_t:
\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)    (7)
C_t = FG_t \odot C_{t-1} + IG_t \odot \tilde{C}_t    (8)
H_t = OG_t \odot \tanh(C_t)    (9)
in the formula, W_{xc} and W_{hc} are weight parameters and b_c is a bias parameter.
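A hedged sketch of the text generation module in claim 6: an nn.LSTMCell decoder (whose input, output and forget gates correspond to IG, OG and FG in formulas (7) to (9)) is conditioned on the pooled image feature and produces word logits step by step. The vocabulary size, embedding size and the teacher-forcing decoding scheme are assumptions.

```python
import torch
import torch.nn as nn

class LSTMTextGenerator(nn.Module):
    """Text generation module (claim 6): pooled image feature -> token sequence (sketch)."""

    def __init__(self, feat_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial memory cell
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)  # input/output/forget gates as in eqs. (7)-(9)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, real_tokens):
        """image_feat: (B, feat_dim); real_tokens: (B, L) ground truth used for teacher forcing."""
        h, c = self.init_h(image_feat), self.init_c(image_feat)
        logits, hiddens = [], []
        for t in range(real_tokens.size(1)):
            x_t = self.embed(real_tokens[:, t])   # previous word, following the chain rule of eq. (6)
            h, c = self.cell(x_t, (h, c))         # hidden state H_t and memory cell C_t
            logits.append(self.out(h))            # word distribution at step t
            hiddens.append(h)
        token_logits = torch.stack(logits, dim=1)   # (B, L, vocab_size) generated token distribution
        gen_txt_feat = torch.stack(hiddens, dim=2)  # (B, hidden_dim, L) generated text features
        return token_logits, gen_txt_feat
```

In practice the logits at step t would be compared with the token at position t+1, which is how the reconstruction loss sketch after claim 7 pairs inputs and targets.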
7. The method for natural language pedestrian retrieval combining token and feature alignment according to claim 1, wherein in step S5 the cross-entropy loss function is used to constrain the distance between the generated token sequence and the real token sequence with the following logic, so as to realize the token-space alignment:
L_{rec} = -\sum_{x} p(x) \log q(x)    (10)
where p(x) is the true distribution of the sample and q(x) is the predicted distribution.
8. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S6 comprises:
S61, applying convolutions to the high-level global image features and the generated text features so as to map them into their respective feature spaces;
S62, down-sampling the high-level global image features and the generated text features, and enhancing their channel information with a fully connected layer and an activation function so as to strengthen the attention of the down-sampled features;
S63, obtaining a weight matrix between the high-level global image features and the generated text features through matrix multiplication;
S64, normalizing the weight matrix with a Softmax activation function, and computing a weighted sum of the normalized weight matrix and the image features to obtain the applicable attention matrix;
S65, adding the applicable attention matrix to the original image features through a residual connection so as to obtain the applicable fusion output;
S66, taking the cross-modal projection matching loss function as the interaction loss function, supervising the fusion output together with the high-level global text features extracted by the preset convolutional neural network in step S2, and reducing the modal difference by shortening the distance between the image and text modalities.
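An illustrative sketch of the cross-modal fusion interaction module of claim 8 (steps S61 to S66); the kernel sizes, the fixed down-sampling resolution, the squeeze-and-excitation style channel enhancement and the way the residual connection is pooled are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-modal fusion interaction module (claim 8, steps S61-S66), illustrative sketch."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.img_proj = nn.Conv2d(channels, channels, kernel_size=1)   # S61: map image features
        self.txt_proj = nn.Conv1d(channels, channels, kernel_size=1)   # S61: map generated text features
        self.down = nn.AdaptiveAvgPool2d((8, 4))                       # S62: down-sampling (illustrative size)
        self.channel_fc = nn.Sequential(                               # S62: channel enhancement
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.softmax = nn.Softmax(dim=-1)                              # S64: normalization

    def forward(self, img_map, txt_feat):
        """img_map: (B, C, H, W) image features; txt_feat: (B, C, L) generated text features."""
        b, c, _, _ = img_map.shape
        img = self.down(self.img_proj(img_map))                        # (B, C, 8, 4)
        txt = self.txt_proj(txt_feat)                                  # (B, C, L)
        gate = self.channel_fc(img.mean(dim=(2, 3)))                   # (B, C) channel attention
        img = img * gate.view(b, c, 1, 1)
        weights = torch.bmm(txt.transpose(1, 2), img.flatten(2))       # S63: (B, L, 32) weight matrix
        attn = self.softmax(weights)                                   # S64: attention matrix
        attended = torch.bmm(attn, img.flatten(2).transpose(1, 2))     # weighted sum over image positions, (B, L, C)
        # S65: residual connection with the (pooled) original image features
        fused = attended.mean(dim=1) + img_map.flatten(2).max(dim=2).values
        return fused                                                   # (B, C) applicable fusion output
```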
9. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S7 comprises:
S71, obtaining the image features and the text features on the natural language pedestrian retrieval dataset CUHK-PEDES;
and S72, training the natural language pedestrian retrieval model with an Adam neural network optimizer under the supervision of the loss functions of all modules.
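A minimal sketch of the training step of claim 9, assuming the model and the loss sketches given with the earlier claims and a CUHK-PEDES dataloader that yields (images, token_ids, identity labels); the learning rate, epoch count and equal loss weights are illustrative assumptions.

```python
import torch

def train(model, dataloader, cmpm_loss, reconstruction_loss,
          epochs=60, lr=1e-4, device="cuda"):
    """Train the retrieval model with Adam under the supervision of all module losses (claim 9, sketch)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, token_ids, labels in dataloader:
            images, token_ids, labels = images.to(device), token_ids.to(device), labels.to(device)
            img_feat, txt_feat, token_logits, fused_feat = model(images, token_ids)
            loss = (cmpm_loss(img_feat, txt_feat, labels)            # feature-space alignment (S3)
                    + reconstruction_loss(token_logits, token_ids)   # token-space alignment (S5)
                    + cmpm_loss(fused_feat, txt_feat, labels))       # interaction supervision (S6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```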
10. A joint token and feature aligned natural language pedestrian retrieval system, the system comprising:
a pedestrian image feature extraction module, configured to process the image branch of a preset dual-stream feature learning network and to extract features of an input pedestrian image by using a pyramid vision Transformer as the backbone network;
a pedestrian text feature extraction module, configured to process the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
a feature space alignment module, configured to align the global feature maps extracted from the image branch and the text branch in a preset feature space to obtain aligned global features, and to learn discriminative visual-textual features with the cross-modal projection matching loss function CMPM so as to associate the image modality with the text modality and reduce the distance between them, wherein the feature space alignment module is connected to the pedestrian image feature extraction module and the pedestrian text feature extraction module;
a text generation module, configured to generate the token sequence from the aligned global image features, convert the features of the image modality and the text modality into the same space for measurement, bridge the image and text modalities, and reduce the distance between them with a new string-based optimization paradigm so as to obtain modality-invariant features, and further configured to generate the text description from deep semantic features of the input image, thereby mapping the image features and the text features into the same space, adding token-space supervision on top of the feature-space supervision, reducing the intra-class distance and shortening the distance between the image and text modalities, wherein the text generation module is connected to the pedestrian image feature extraction module, the pedestrian text feature extraction module and the feature space alignment module;
a token space alignment module, configured to use the joint token and feature alignment framework TFAF and to take a cross-entropy loss as the reconstruction loss function, constraining the distance between the generated token sequence and the real token sequence so as to implement token-space alignment, wherein the token space alignment module is connected to the text generation module;
a cross-modal fusion interaction module, configured to perform cross-modal fusion and interaction of the image features and the text features, that is, to map the high-level global image features and the generated text features into their respective feature spaces by convolution, down-sample and enhance them, compute a weight matrix between them, normalize and weight the weight matrix to obtain an applicable attention matrix, process the applicable attention matrix through a residual connection to obtain an applicable fusion output, use a cross-modal projection matching loss function as the interaction loss function to supervise the applicable fusion output together with the high-level global text features extracted by the text branch, and reduce the modal difference by shortening the distance between the image and text modalities, wherein the cross-modal fusion interaction module is connected to the token space alignment module;
a model training module, configured to train the natural language pedestrian retrieval model with an Adam neural network optimizer according to the image features and the text features, wherein the model training module is connected to the pedestrian image feature extraction module, the pedestrian text feature extraction module, the feature space alignment module, the text generation module, the token space alignment module and the cross-modal fusion interaction module;
and a retrieval result acquisition module, configured to test the natural language pedestrian retrieval model so as to obtain the pedestrian retrieval result, wherein the retrieval result acquisition module is connected to the model training module.
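For the retrieval result acquisition module (and step S8 of claim 1), testing typically reduces to ranking gallery pedestrian images by the similarity of their features to the feature of the query description; the cosine-similarity ranking below is an assumed, minimal illustration rather than the patented procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(text_query_feat, gallery_image_feats, top_k=10):
    """Rank pedestrian images for one natural-language query (step S8, sketch).

    text_query_feat:     (C,) feature of the query description
    gallery_image_feats: (M, C) features of the gallery pedestrian images
    """
    sims = F.cosine_similarity(text_query_feat.unsqueeze(0), gallery_image_feats, dim=1)  # (M,)
    scores, indices = sims.topk(top_k)
    return indices, scores   # ranked gallery indices and their similarity scores
```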
CN202210951558.2A 2022-08-09 2022-08-09 Natural language pedestrian retrieval method and system combining token and feature alignment Pending CN115311687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210951558.2A CN115311687A (en) 2022-08-09 2022-08-09 Natural language pedestrian retrieval method and system combining token and feature alignment

Publications (1)

Publication Number Publication Date
CN115311687A true CN115311687A (en) 2022-11-08

Family

ID=83860984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210951558.2A Pending CN115311687A (en) 2022-08-09 2022-08-09 Natural language pedestrian retrieval method and system combining token and feature alignment

Country Status (1)

Country Link
CN (1) CN115311687A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797655A (en) * 2022-12-13 2023-03-14 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116028631A (en) * 2023-03-30 2023-04-28 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116028631B (en) * 2023-03-30 2023-07-14 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116226434A (en) * 2023-05-04 2023-06-06 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116226434B (en) * 2023-05-04 2023-07-21 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116682144A (en) * 2023-06-20 2023-09-01 北京大学 Multi-modal pedestrian re-recognition method based on multi-level cross-modal difference reconciliation
CN116682144B (en) * 2023-06-20 2023-12-22 北京大学 Multi-modal pedestrian re-recognition method based on multi-level cross-modal difference reconciliation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination