CN115311687A - Natural language pedestrian retrieval method and system combining token and feature alignment - Google Patents

Natural language pedestrian retrieval method and system combining token and feature alignment

Info

Publication number
CN115311687A
Authority
CN
China
Prior art keywords
text
image
feature
features
token
Prior art date
Legal status
Pending
Application number
CN202210951558.2A
Other languages
Chinese (zh)
Inventor
李成龙
李尚泽
鹿安东
黄岩
王亮
程致远
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN202210951558.2A
Publication of CN115311687A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a natural language pedestrian retrieval method and system combining token and feature alignment, comprising the following steps: extracting visual features of an input pedestrian image with the image branch of a dual-stream feature learning network; extracting text features of the input pedestrian description with the text branch of the dual-stream feature learning network; aligning the global feature maps extracted by the image and text branches in a feature space; generating a token sequence from the aligned image global features; performing token alignment between the generated token sequence and the real token sequence; performing cross-modal fusion and interaction between image and text features; training the natural language pedestrian retrieval model that combines token and feature alignment; and testing the trained model to obtain the retrieval result. The method addresses the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance.

Description

Natural language pedestrian retrieval method and system combining token and feature alignment
Technical Field
The invention relates to the technical field of deep learning and criminal investigation, in particular to a natural language pedestrian retrieval method and system combining token and feature alignment.
Background
In the field of natural language pedestrian retrieval, most existing advanced methods are dedicated to mining local features of the two modalities and then performing fine-grained visual-text matching. From the viewpoint of how local regions are divided, these methods can be roughly classified into two types: methods based on manually designed multi-scale priors, and methods assisted by additional models.
The first type of prior-art method uses a series of manually designed local regions to construct matches between features of different scales. For example, the prior patent application published as CN113221680A (2021-08-06, Northwestern Polytechnical University) proposes a natural language pedestrian retrieval method based on text-guided dynamic visual feature extraction. The technique uses MobileNet and Bi-LSTM as the image and text feature extraction networks, respectively. To obtain fine-grained image feature representations, it divides the input pedestrian image into k horizontal regions from top to bottom and feeds them into the feature extraction network to obtain local image features. Next, the horizontal regions in which the visual objects mentioned in the natural language description appear are given different weights, so that visual feature extraction is dynamically guided by the text for natural language pedestrian retrieval. Although this first type of method can outperform approaches that use only global features by learning discriminative local feature representations, the local features of the two modalities are difficult to align accurately, which easily leads to embedding ambiguity and limits further performance improvement.
The second type of prior-art method attempts to pre-process the data with additional models or natural language processing tools in order to segment valuable image regions or text phrases. For example, the prior patent application published as CN114036336A, "Pedestrian image search method based on semantic division and visual-text attribute alignment" (Shanghai Jiao Tong University, 2022-02-11), proposes a natural language pedestrian retrieval method based on semantic division and visual-text attribute alignment, which not only extracts global feature representations of pedestrian images and text descriptions but also processes the raw data of the image and text modalities. The technique uses an existing human-body segmentation network to segment the pedestrian image into image blocks for the head, upper body, lower body, shoes and backpack; in addition, the text phrase corresponding to each body part is extracted with the natural language processing toolkit NLTK. Next, using ResNet50 as the image feature extraction network and Bi-LSTM as the text feature extraction network, global and local features of the image and text modalities are extracted, and cross-modal feature alignment is then performed at both the global and local scales. This second type of method greatly increases the parameter count and network complexity, its preprocessing is time-consuming and involves complicated steps, and the performance of the subsequent pedestrian retrieval model depends heavily on how well image regions or text phrases are divided in the preprocessing stage.
Furthermore, both types of approach still focus essentially on feature-level alignment: the modality distance and the intra-class distance are optimized simultaneously on the same features, which clearly makes it difficult to obtain an optimal result.
In conclusion, the prior art suffers from the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and the intra-class distance.
Disclosure of Invention
The invention aims to solve the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance in the prior art.
The invention adopts the following technical scheme to solve these problems. The natural language pedestrian retrieval method combining token and feature alignment comprises the following steps:
S1, processing the image branch of a preset dual-stream feature learning network, and extracting input pedestrian image features using a pyramid vision Transformer as the backbone network;
S2, processing the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
S3, aligning the global feature maps extracted by the image branch and the text branch in a preset feature space to obtain aligned global features, learning discriminative visual-text features with the cross-modal projection matching loss function CMPM, associating the image and text modalities according to the learned discriminative visual-text features, and reducing the modality distance between image and text;
S4, generating a token sequence from the aligned image global features, converting the features of the image and text modalities into the same space for measurement, bridging the two modalities, reducing the distance between them with a new serial optimization paradigm to obtain modality-invariant features, generating a text description from the deep semantic features of the input image with a text generation module, mapping image and text features to the same space, and adding token-space supervision on top of the feature space, thereby reducing the intra-class distance while drawing the image and text modalities closer;
S5, using the joint token and feature alignment framework TFAF, taking the cross-entropy loss as a reconstruction loss function, and constraining the distance between the generated token sequence and the real token sequence so as to achieve token-space alignment;
S6, performing cross-modal fusion and interaction between image and text features: the cross-modal interaction module maps the high-level image global features and the generated text features to their respective feature spaces by convolution, down-samples and strengthens them, obtains a weight matrix between the high-level image global features and the generated text features, normalizes and weights the weight matrix to obtain an applicable attention matrix, processes the applicable attention matrix with a residual connection to obtain an applicable fusion output, supervises the applicable fusion output together with the high-level text global features extracted by the text branch in step S2 using the cross-modal projection matching loss function as the interaction loss function, and reduces the modality difference by shortening the distance between the image and text modalities;
S7, extracting image features and text features according to steps S1 to S6, and training the natural language pedestrian retrieval model with the Adam neural network optimizer;
and S8, testing the natural language pedestrian retrieval model to obtain a pedestrian retrieval result.
The method guides network learning by aligning two spaces, the token space and the feature space. First, a powerful dual-stream feature learning network based on a pyramid vision Transformer and a convolutional neural network is constructed, and feature-space alignment is performed with global features only, which effectively shortens the modality distance. Second, a text generation module is designed to reduce the intra-class distance in the token space by cross-modal text generation. Finally, a cross-modal interaction module is proposed to aggregate the image features and the generated text features, further shortening the distance between the image and text modalities and reducing the modality difference. The invention avoids the embedding ambiguity caused by using local features, requires no additional preprocessing steps, and reduces time and resource overhead.
In addition, through a novel framework combining token and feature alignment that optimizes the modality distance and the intra-class distance in the feature space and the token space respectively, the invention performs fine-grained natural language pedestrian retrieval and effectively overcomes the shortcomings of existing methods.
In a more specific technical solution, step S1 includes:
S11, the pyramid vision Transformer comprises four stages, each consisting of a patch embedding and a Transformer encoder; in the training stage, a batch of training data is set as

$\{(I_i, T_i)\}_{i=1}^{N}$

where N represents the number of image-text pairs that match each other and belong to the same identity (a minimal sketch of the assumed data format is given after this step);
S12, given a pedestrian image I, the high-level global feature map generated by the fourth stage of the pyramid vision Transformer is represented as

$F_I \in \mathbb{R}^{H \times W \times C}$

where H, W and C respectively represent the height, width and number of channels of the feature map.
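As a non-authoritative illustration of the data format assumed in S11, the following PyTorch sketch shows a dataset whose samples are matched image-text pairs sharing one identity label; the field layout and the transform/tokenize callables are assumptions made for the example, not part of the patent.

```python
# Hedged sketch of the assumed training data format: each sample is a matched
# image-text pair with one identity label, so a batch holds N such pairs.
from torch.utils.data import Dataset
from PIL import Image

class PedestrianTextDataset(Dataset):
    def __init__(self, samples, transform, tokenize):
        # samples: list of (image_path, caption, identity_label) triples (assumed layout)
        self.samples = samples
        self.transform = transform      # image preprocessing, e.g. resize + normalize
        self.tokenize = tokenize        # caption -> fixed-length token id sequence

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, caption, label = self.samples[idx]
        image = self.transform(Image.open(path).convert("RGB"))
        token_ids = self.tokenize(caption)
        return image, token_ids, label
```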
In a more specific technical solution, step S2 includes:
S21, in the text branch, converting the text description into a token sequence and extracting word vectors with a BERT model;
S22, setting a fixed value L to control the sentence length;
S23, during conversion of the text description into a token sequence, applying zero padding to sequences to be converted whose length is smaller than the preset length threshold L;
S24, for sequences to be converted whose length exceeds the preset length threshold L, taking the first L tokens to obtain fixed-length token sequences, and inputting the fixed-length token sequences into the BERT model to obtain the word vectors

$E \in \mathbb{R}^{L \times D}$

where D is the dimension of each word vector;
S25, extending the dimensions of the word vectors from $\mathbb{R}^{L \times D}$ to $\mathbb{R}^{1 \times L \times D}$ so that the global feature map of the pedestrian description can be extracted by the subsequent convolutional layers;
S26, converting the word-vector dimension D with a convolution layer and a batch norm operation, so that D is mapped to the channel number C of the high-level image global feature map;
S27, extracting the high-level global feature map $F_T$ of each sentence description with a deep convolutional neural network, wherein the deep convolutional neural network contains a text residual bottleneck structure.
In the text branch, the invention adopts the BERT model widely applied in natural language processing to convert the text description into the token sequence and extract the word vector. The invention sets a fixed value L to control the sentence length so as to facilitate the subsequent processing.
In a more specific technical solution, step S3 includes:
S31, given the image and text features of a batch, representing the image-text pairs as

$\{(f_i^{I}, f_i^{T})\}_{i=1}^{N}$

S32, processing the image and text features with global max pooling to obtain the pooled features $f_i^{I}, f_i^{T} \in \mathbb{R}^{C}$, so that important global context information is retained, and using the scalar projection between an image feature $f_i^{I}$ and a text feature $f_j^{T}$ to characterize the similarity of the two feature vectors;
S33, taking the ratio of the scalar projection value of the pair $(f_i^{I}, f_j^{T})$ to the scalar projection values of all feature pairs in the batch, and obtaining the probability that the image feature $f_i^{I}$ and the text feature $f_j^{T}$ belong to the same identity as

$p_{i,j} = \dfrac{\exp\big((f_i^{I})^{\top}\bar{f}_j^{T}\big)}{\sum_{k=1}^{N}\exp\big((f_i^{I})^{\top}\bar{f}_k^{T}\big)} \qquad (1)$

where $\bar{f}_j^{T} = f_j^{T} / \lVert f_j^{T} \rVert$ represents the normalized text feature;
S34, associating each image feature $f_i^{I}$ in the batch with its correctly matched text features and optimizing the objective function

$L_i = \sum_{j=1}^{N} p_{i,j}\,\log\dfrac{p_{i,j}}{q_{i,j}+\varepsilon} \qquad (2)$

where $\varepsilon$ is a small constant used to avoid numerical problems and $q_{i,j}$ is the normalized probability that the image feature $f_i^{I}$ and the text feature $f_j^{T}$ are a correct match;
S35, since no fewer than two text features in a batch may match an image feature $f_i^{I}$, the correct match probability is normalized as $q_{i,j} = y_{i,j} / \sum_{k=1}^{N} y_{i,k}$, where $y_{i,j}$ equals 1 when the pair shares the same identity and 0 otherwise; in a batch, the image-to-text projection loss function is then defined as

$L_{I2T} = \dfrac{1}{N}\sum_{i=1}^{N} L_i \qquad (3)$

where the subscript I2T denotes image-to-text, $L_{I2T}$ is the image-to-text projection loss function, and $L_{T2I}$ represents the analogous text-to-image projection loss function;
the CMPM loss function closes the image-text modality distance in both directions:

$L_{CMPM} = L_{I2T} + L_{T2I} \qquad (4)$
aiming at the problem that a remarkable modal distance exists between an image and a text in a cross-modal pedestrian retrieval task, the invention learns discriminant visual text characteristics by using a cross-modal projection matching loss function (CMPM), and can combine the cross-modal projection into KL divergence to associate the two modes of the image and the text.
In a more specific technical solution, step S4 includes:
S41, encoding the input image into a fixed-dimension feature vector with an encoder, and converting the fixed-dimension feature vector into generated text features with a decoder;
the pyramid vision Transformer in the image branch serves as the encoder; after feature extraction by the backbone network and processing of the input image by a global max pooling layer, the fixed-dimension feature vector $f^{I}$ is obtained, and the probability of generating the correct text from $f^{I}$ is maximized:

$\omega^{*} = \arg\max_{\omega} \sum \log p\big(T_r \mid f^{I}; \omega\big)$

where $\omega$ represents the parameters of the model and $T_r$ represents the real token sequence that has the same identity as $f^{I}$;
S42, predicting the current word from the previous words of the sentence with the chain rule so as to generate text:

$\log p\big(T_r \mid f^{I}\big) = \sum_{t=1}^{l} \log p\big(x_t \mid f^{I}, x_1, \dots, x_{t-1}\big)$

where $l$ is the length of the sentence description;
S43, modeling this process with a long short-term memory network LSTM, which comprises an input gate IG, an output gate OG and a forgetting gate FG for controlling the flow of information, wherein the input gate IG and the output gate OG determine whether information is input or output, and the forgetting gate FG determines the proportion of discarded information;
S44, denoting the candidate memory cell as $\tilde{C}_t$, which uses the Tanh activation function to map values to the interval $[-1, 1]$ and is used to determine the state of the memory cell at the current moment; the information of the memory cell at the previous moment and of the candidate memory cell at the current moment is processed under the control of the forgetting gate and the input gate, and the memory cell $C_t$ at the current moment is determined accordingly;
S45, given the input $X_t$ at the current time $t$ and the hidden state $H_{t-1}$ at the previous time $t-1$, the output gate determines the amount of information transmitted to the hidden state $H_t$:

$\tilde{C}_t = \tanh\big(W_{xc} X_t + W_{hc} H_{t-1} + b_c\big)$

$C_t = FG_t \odot C_{t-1} + IG_t \odot \tilde{C}_t$

$H_t = OG_t \odot \tanh(C_t)$

where $W_{xc}$ and $W_{hc}$ are weight parameters and $b_c$ is a bias parameter.
The invention bridges the image and text modalities, converts the features of the two modalities into the same space for measurement, and reduces the distance between them with a new serial optimization paradigm, thereby obtaining modality-invariant features.
The invention provides a text generation module that generates a text description from the deep semantic features of the input image and then constrains the difference between the generated text and the real text with a reconstruction loss function. In this way, visual and textual features are mapped to the same space. By adding extra token-space supervision on top of the feature space, the modality distance is also drawn closer while the intra-class distance is reduced. The invention adopts the long short-term memory network LSTM to alleviate the gradient anomaly problem and better model long-term dependencies in the sequence.
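As a sketch of how such an LSTM decoder could be wired in PyTorch (the hidden size, vocabulary size and teacher-forcing interface are assumptions, not values fixed by the patent):

```python
import torch
import torch.nn as nn

class TextGenerator(nn.Module):
    """Image feature -> token sequence decoder (illustrative sketch)."""
    def __init__(self, feat_dim=512, hidden_dim=512, embed_dim=512, vocab_size=30522):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, token_ids):
        """Teacher forcing: predict token t from tokens < t and the image feature."""
        h0 = self.init_h(image_feat).unsqueeze(0)       # (1, N, hidden_dim)
        c0 = self.init_c(image_feat).unsqueeze(0)
        x = self.embed(token_ids[:, :-1])               # inputs are the shifted real tokens
        out, _ = self.lstm(x, (h0, c0))
        return self.classifier(out)                     # (N, L-1, vocab_size) logits
```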
In a more specific technical solution, in step S5, the cross-entropy loss function is used to constrain the distance between the generated token sequence and the real token sequence with the following logic, thereby achieving token-space alignment:

$L_{CE} = -\sum_{x} p(x)\,\log q(x)$

where $p(x)$ is the true distribution of the sample and $q(x)$ is the predicted distribution.
The invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, which improves the quality with which the decoder of the long short-term memory network converts image features into text descriptions and makes the generated descriptions more realistic.
The token sequence is generated from the pedestrian image features and aligned in the token space, so that the intra-class distance is further reduced and cross-modal text generation promotes natural language pedestrian retrieval.
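A minimal sketch of this reconstruction loss, assuming decoder logits over the vocabulary and the shifted real token sequence; the padding index is an assumption.

```python
import torch.nn.functional as F

def reconstruction_loss(logits, real_tokens, pad_id=0):
    """logits: (N, L-1, V) decoder outputs; real_tokens: (N, L) ground-truth token ids."""
    targets = real_tokens[:, 1:]                              # align targets with shifted inputs
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1),
                           ignore_index=pad_id)               # cross entropy between predicted and true tokens
```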
In a more specific embodiment, step S6 includes:
S61, mapping the high-level image global features and the generated text features to their respective feature spaces by convolution;
S62, down-sampling the processed high-level image global features and generated text features, and strengthening their channel information with a fully connected layer and an activation function so as to apply attention to the down-sampled features;
S63, obtaining a weight matrix between the image features and the generated text features by matrix multiplication;
S64, normalizing the weight matrix with a Softmax activation function, and weighting and summing the normalized weight matrix with the image features so as to obtain the applicable attention matrix;
S65, adding the applicable attention matrix back to the original image features with a residual connection so as to obtain the applicable fusion output;
and S66, taking the cross-modal projection matching loss function as the interaction loss function, supervising the fusion output together with the high-level text global features extracted by the preset convolutional neural network in step S2, and reducing the modality difference by shortening the distance between the image and text modalities.
The invention performs cross-modal feature fusion and interaction, and further reduces the distance between the image and text modalities by multi-stage feature fusion. Using the cross-modal projection matching loss function as the interaction loss function, the fusion output obtained by this module and the high-level text global features extracted by the convolutional neural network are supervised, so that the whole model gradually shortens the distance between the image and text modalities, reduces the modality difference, and is further improved.
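The sketch below shows one possible PyTorch reading of this cross-modal interaction module: 1x1 projections, channel gating with a fully connected layer and activation, a Softmax-normalized weight matrix between the image and generated text features, a weighted sum, and a residual connection. All layer sizes and the exact placement of the weighted sum are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class CrossModalInteraction(nn.Module):
    def __init__(self, channels=512, reduction=4):
        super().__init__()
        self.img_proj = nn.Conv2d(channels, channels, kernel_size=1)   # map image features
        self.txt_proj = nn.Conv1d(channels, channels, kernel_size=1)   # map generated text features
        # channel attention on the pooled (down-sampled) image feature: FC + activation
        self.channel_attn = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, img_feat, txt_feat):
        """img_feat: (N, C, H, W) high-level image global feature map;
        txt_feat: (N, C, L) generated text features."""
        n, c, h, w = img_feat.shape
        q = self.img_proj(img_feat).flatten(2)                 # (N, C, HW)
        k = self.txt_proj(txt_feat)                            # (N, C, L)

        gate = self.channel_attn(img_feat.mean(dim=(2, 3)))    # (N, C) channel weights
        q = q * gate.unsqueeze(-1)                             # strengthen channel information

        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)    # (N, HW, L) normalized weight matrix
        fused = (attn @ k.transpose(1, 2)).transpose(1, 2)     # weighted sum -> (N, C, HW)
        fused = fused.reshape(n, c, h, w)
        return img_feat + fused                                # residual connection to the image features
```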
In a more specific embodiment, step S7 includes:
S71, extracting image features and text features on the natural language pedestrian retrieval dataset CUHK-PEDES;
and S72, training the natural language pedestrian retrieval model with the Adam neural network optimizer under the supervision of the loss functions of all modules (a minimal training-step sketch follows).
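A hedged sketch of one training step on CUHK-PEDES: the losses of the individual modules are summed and optimized with Adam. The model interface, the unit loss weights and the learning rate are assumptions made for illustration only.

```python
import torch

def train_one_epoch(model, loader, optimizer):
    model.train()
    for images, token_ids, labels in loader:
        out = model(images, token_ids, labels)      # assumed to return the per-module losses
        loss = (out["cmpm_loss"]                    # feature-space alignment (S3)
                + out["reconstruction_loss"]        # token-space alignment (S4-S5)
                + out["interaction_loss"])          # cross-modal interaction (S6)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # assumed hyper-parameter
```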
In a more specific aspect, a natural language pedestrian retrieval system combining token and feature alignment includes:
a pedestrian image feature extraction module, used to process the image branch of a preset dual-stream feature learning network and extract input pedestrian image features with a pyramid vision Transformer as the backbone network;
a pedestrian text feature extraction module, used to process the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
a feature space alignment module, used to align the global feature maps extracted by the image and text branches in a preset feature space to obtain aligned global features, learn discriminative visual-text features with the cross-modal projection matching loss function CMPM, associate the image and text modalities according to the learned discriminative visual-text features, and reduce the modality distance between image and text; the feature space alignment module is connected with the pedestrian image feature extraction module and the pedestrian text feature extraction module;
a text generation module, used to generate a token sequence from the aligned image global features, convert the features of the image and text modalities into the same space for measurement, bridge the two modalities, reduce the distance between them with a new serial optimization paradigm to obtain modality-invariant features, generate a text description from the deep semantic features of the input image, map image and text features to the same space, and add token-space supervision on top of the feature space, thereby reducing the intra-class distance and shortening the distance between the image and text modalities; the text generation module is connected with the pedestrian image feature extraction module, the pedestrian text feature extraction module and the feature space alignment module;
a token space alignment module, used to apply the joint token and feature alignment framework TFAF, take the cross-entropy loss as the reconstruction loss function, and constrain the distance between the generated token sequence and the real token sequence to achieve token-space alignment; the token space alignment module is connected with the text generation module;
a cross-modal fusion interaction module, used for cross-modal fusion and interaction of image and text features: the module maps the high-level image global features and the generated text features to their respective feature spaces by convolution, down-samples and strengthens them, obtains a weight matrix between the high-level image global features and the generated text features, normalizes and weights the weight matrix to obtain an applicable attention matrix, processes the applicable attention matrix with a residual connection to obtain an applicable fusion output, takes the cross-modal projection matching loss function as the interaction loss function, supervises the applicable fusion output together with the high-level text global features extracted by the text branch in step S2, and reduces the modality difference by shortening the distance between the image and text modalities;
a model training module, used to train the natural language pedestrian retrieval model with the Adam neural network optimizer according to the image and text features; the model training module is connected with the pedestrian image feature extraction module, the pedestrian text feature extraction module, the feature space alignment module, the text generation module, the token space alignment module and the cross-modal fusion interaction module;
and a retrieval result acquisition module, used to test the natural language pedestrian retrieval model so as to obtain the pedestrian retrieval result; the retrieval result acquisition module is connected with the model training module.
Compared with the prior art, the invention has the following advantages. The method guides network learning by aligning two spaces, the token space and the feature space. First, a powerful dual-stream feature learning network based on a pyramid vision Transformer and a convolutional neural network is constructed, and feature-space alignment is performed with global features only, effectively shortening the modality distance. Second, a text generation module is designed to reduce the intra-class distance in the token space by cross-modal text generation. Finally, a cross-modal interaction module is proposed to aggregate the image features and the generated text features, further shortening the distance between the image and text modalities and reducing the modality difference. The invention avoids the embedding ambiguity caused by using local features, requires no additional preprocessing steps, and reduces time and resource overhead.
In addition, through a novel framework combining token and feature alignment that optimizes the modality distance and the intra-class distance in the feature space and the token space respectively, the invention performs fine-grained natural language pedestrian retrieval and effectively overcomes the shortcomings of existing methods.
In the text branch, the invention adopts the BERT model, which is widely used in natural language processing, to convert the text description into a token sequence and extract word vectors, and sets a fixed value L to control the sentence length for convenient subsequent processing.
To address the significant modality distance between image and text in the cross-modal pedestrian retrieval task, the invention learns discriminative visual-text features with the cross-modal projection matching loss function (CMPM), which incorporates the cross-modal projection into a KL divergence to associate the image and text modalities.
The invention bridges the image and text modalities, converts the features of the two modalities into the same space for measurement, and reduces the distance between them with a new serial optimization paradigm, thereby obtaining modality-invariant features.
The invention provides a text generation module that generates a text description from the deep semantic features of the input image. In this way, visual and textual features are mapped to the same space. By adding extra token-space supervision on top of the feature space, the modality distance is also drawn closer while the intra-class distance is reduced. The invention adopts the LSTM to alleviate the gradient anomaly problem and better model long-term dependencies in the sequence.
The invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, which improves the quality with which the decoder of the long short-term memory network converts image features into text descriptions and makes the generated descriptions more realistic.
The method generates a token sequence from the pedestrian image features and aligns it in the token space so as to further reduce the intra-class distance, and uses cross-modal text generation to promote natural language pedestrian retrieval.
The invention performs cross-modal feature fusion and interaction, and further reduces the distance between the image and text modalities by multi-stage feature fusion. Using the cross-modal projection matching loss function as the interaction loss function, the fusion output obtained by this module and the high-level text global features extracted by the convolutional neural network are supervised, so that the whole model gradually shortens the distance between the image and text modalities, reduces the modality difference, and is further improved. The method solves the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance in the prior art.
Drawings
FIG. 1 is a schematic diagram of an overall network framework of a joint token and feature aligned natural language pedestrian retrieval method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of the basic steps of the joint token and feature aligned natural language pedestrian retrieval method according to embodiment 1 of the present invention;
Fig. 3 is a schematic connection diagram of a cross-modal interaction module according to embodiment 1 of the present invention;
fig. 4 is a schematic diagram of specific steps of generating a token sequence in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present invention applies the PyTorch framework to the field of natural language pedestrian retrieval and proposes a new Token and Feature Alignment framework (TFAF) that pursues joint token and feature alignment to reduce the modality distance and the intra-class distance between images and text. Specifically, the method first constructs a new dual-stream feature learning network that extracts image and text features respectively and brings samples of the two modalities closer in the feature space for feature alignment. Second, a text generation module is designed: a token sequence is generated from the image features aligned in the feature space, and token alignment is then performed between the generated token sequence and the real token sequence, reducing the intra-class distance in the token space. Finally, a cross-modal interaction module is proposed that further reduces the distance between the image and text modalities using multi-stage feature fusion.
As shown in fig. 2, the joint token and feature aligned natural language pedestrian retrieval method provided by the invention comprises the following steps:
S1, extracting visual features of the input pedestrian image with the image branch of the dual-stream feature learning network;
The invention adopts the pyramid vision Transformer as the backbone network to extract the image feature map. It contains four stages, each consisting of a patch embedding and a Transformer encoder. In the training phase, a batch of training data is assumed to be $\{(I_i, T_i)\}_{i=1}^{N}$, where N represents the number of image-text pairs that match each other and belong to the same identity. Given a pedestrian image I, the high-level global feature map generated by the fourth stage of the pyramid vision Transformer is represented as $F_I \in \mathbb{R}^{H \times W \times C}$, where H, W and C represent the height, width and number of channels of the feature map, respectively.
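A hedged sketch of the image branch interface in PyTorch: any pyramid vision Transformer implementation whose forward pass returns the stage-4 feature map can be plugged in (for example the PVTv2 models available in the timm library, which is an assumption); the global max pooling used later for feature-space alignment is shown as well.

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """Wraps a backbone that returns the stage-4 feature map F_I of shape (N, C, H, W)."""
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone            # assumed: images -> (N, C, H, W) feature map

    def forward(self, images):
        f_img = self.backbone(images)               # high-level global feature map F_I
        f_vec = torch.amax(f_img, dim=(2, 3))       # global max pooling -> (N, C) vector
        return f_img, f_vec
```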
S2, extracting text features of the input pedestrian description with the text branch of the dual-stream feature learning network;
In the text branch, the invention uses a BERT model, which is widely used in natural language processing, to convert the text descriptions into token sequences and extract word vectors. For the convenience of subsequent processing, a fixed value L is set to control the sentence length. When converting a text description into a token sequence, sequences shorter than L are zero-padded; for sequences longer than L, the first L tokens are taken. A fixed-length token sequence is thus obtained and input into the BERT model to obtain the word vectors $E \in \mathbb{R}^{L \times D}$, where D is the dimension of each word vector.
To extract the global feature map of the pedestrian description, the invention first extends the dimensions of the word vectors from $\mathbb{R}^{L \times D}$ to $\mathbb{R}^{1 \times L \times D}$ so that they can be processed by the subsequent convolutional neural network. Next, the word-vector dimension D is converted to the same channel number C as the high-level image global feature map using a convolution layer and a batch norm operation. Finally, a deep convolutional neural network containing a text residual bottleneck structure is used to extract the high-level global feature map $F_T$ of each sentence description.
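A sketch of the text branch front end, assuming the Hugging Face transformers implementation of BERT and illustrative values for the sentence length L and channel number C:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

L, C = 64, 512                                             # assumed sentence length and channel number
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer(["a man wearing a black jacket and blue jeans"],
                padding="max_length", truncation=True, max_length=L,
                return_tensors="pt")                       # zero padding / first-L truncation
with torch.no_grad():
    word_vectors = bert(**enc).last_hidden_state           # (N, L, D) word vectors, D = 768

x = word_vectors.permute(0, 2, 1).unsqueeze(2)             # (N, D, 1, L): add a singleton height dimension
to_channels = nn.Sequential(nn.Conv2d(768, C, kernel_size=1), nn.BatchNorm2d(C))
text_map = to_channels(x)                                  # (N, C, 1, L), input to the text residual network
```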
S3, aligning the global feature maps extracted from the image and text branches in a feature space;
one of the major challenges for the cross-modal pedestrian retrieval task is the significant modal distance between the image and the text. In order to reduce the distance between the two modes. The invention learns discriminative visual text features using a cross-modal projection matching penalty function (CMPM), which can merge cross-modal projections into KL divergence to correlate image and text modalities.
Given a batch of image and text features, an image-text pair is represented as
Figure BDA0003789678030000111
In order to filter out important global context information and reduce the sensitivity of the network to modal differences, the invention firstly applies global maximum pooling to image features and text features to obtain
Figure BDA0003789678030000112
The similarity between two feature vectors can be reflected by the size of the scalar projection, the greater the value of which, the greater the similarity between two feature vectors. Thus, according to
Figure BDA0003789678030000113
And
Figure BDA0003789678030000114
all feature pairs in a batch with scalar projection values
Figure BDA0003789678030000115
Can obtain the proportion of
Figure BDA0003789678030000116
And
Figure BDA0003789678030000117
the probability of belonging to the same identity is:
Figure BDA0003789678030000118
wherein the content of the first and second substances,
Figure BDA0003789678030000119
representing a standardized text feature. The present invention requires optimizing each image feature in the batch process
Figure BDA00037896780300001110
The objective function associated with its correctly matched text feature is expressed as:
Figure BDA00037896780300001111
where e is used to avoid numerical problems, q i,j Is a feature of the image
Figure BDA00037896780300001112
And text features
Figure BDA00037896780300001113
Normalized correct match probability because there may be multiple text features and matches in a batch
Figure BDA00037896780300001114
Match, can be expressed as
Figure BDA00037896780300001115
Thus, in a batch, the loss function of the image-to-text projection can be summarized as:
Figure BDA00037896780300001116
vice versa, the loss function of the text-to-image projection can be expressed as L T2I . To zoom in the distance between the image and the text modality in both directions, the CMPM loss function can be defined as:
L CMPM =L I2T +L T2I (4)
s4, generating a token sequence by using the image global features aligned in the feature space;
to reduce the modal distance between the image and the text, many existing methods segment the image or extract attribute phrases from the sentence description. Visual features are encouraged to match given text features by establishing a variety of granular associations between images and text. However, these methods introduce additional pre-processing steps, resulting in a significant increase in the amount of computation and model complexity. The present invention attempts to bridge the image and text modalities, transform the features of the two modalities into the same space for metrology, and reduce the distance between the two modalities with a new paradigm of string optimization to obtain the modality invariance features.
In view of the above, the present invention proposes a text generation module that uses deep semantic features of an input image to generate a text description, and in this way, visual and text features are mapped to the same space. By adding extra supervision of token space on the basis of feature space, the modal distance is also pulled closer while the intra-class distance is reduced.
As shown in fig. 4, step S4 further includes the following specific steps:
In step S41, the whole framework can be regarded as an encoder-decoder structure. The input image is first encoded into a fixed-dimension feature vector using the encoder, and this feature vector is then converted into generated text using the decoder. The pyramid vision Transformer in the image branch plays the role of the encoder; after feature extraction by the backbone network and processing by the global max pooling layer, the fixed-dimension feature vector $f^{I}$ is obtained. The next goal is to maximize the probability of generating the correct text from $f^{I}$, as shown in Equation (5), where $\omega$ represents the parameters of the model (omitted in the subsequent formulas for brevity) and $T_r$ represents the real token sequence that has the same identity as $f^{I}$:

$\omega^{*} = \arg\max_{\omega} \sum \log p\big(T_r \mid f^{I}; \omega\big) \qquad (5)$

In step S42, taking one sentence description as an example, completing the text generation task requires predicting the current word from the previous words using the chain rule, as shown in Equation (6), where $l$ is the length of the sentence description. A recurrent neural network (RNN) is a network with recurrent connections that can pass information between different time steps and predict the state at the current time from previously remembered information, and it can therefore model this process. However, although the recurrent neural network has a certain memory capacity, it does not handle the long-term dependence problem well: when the predicted position is far away from the information on which it depends, the recurrent neural network has difficulty learning the relevant information accurately.

$\log p\big(T_r \mid f^{I}\big) = \sum_{t=1}^{l} \log p\big(x_t \mid f^{I}, x_1, \dots, x_{t-1}\big) \qquad (6)$

In step S43, to solve the above problem, the invention employs a long short-term memory network (LSTM). The network introduces three gating mechanisms, the Input Gate (IG), the Output Gate (OG) and the Forgetting Gate (FG), which control the flow of information. The first two gates determine whether to input or output information, and the last gate determines the proportion of information that should be discarded. In addition, the candidate memory cell is denoted $\tilde{C}_t$; it uses the Tanh activation function to map values to the interval $[-1, 1]$ and is used to determine the state of the memory cell at the current moment. The memory cell $C_t$ at the current moment is determined by the information of the memory cell at the previous moment and of the candidate memory cell at the current moment, under the control of the forgetting gate and the input gate. Finally, the output gate decides the amount of information passed to the hidden state $H_t$. Given the input $X_t$ at the current time $t$ and the hidden state $H_{t-1}$ at the previous time $t-1$, the above process is expressed as Equations (7)-(9), where $W_{xc}$ and $W_{hc}$ are weight parameters and $b_c$ is a bias parameter. In this way, the long short-term memory network alleviates the gradient anomaly problem and better models long-term dependencies in the sequence.

$\tilde{C}_t = \tanh\big(W_{xc} X_t + W_{hc} H_{t-1} + b_c\big) \qquad (7)$

$C_t = FG_t \odot C_{t-1} + IG_t \odot \tilde{C}_t \qquad (8)$

$H_t = OG_t \odot \tanh(C_t) \qquad (9)$
S5, performing token alignment between the generated token sequence and the real token sequence;
In order to improve the quality with which the decoder of the long short-term memory network converts image features into text descriptions and to make the generated descriptions more realistic, the invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, thereby achieving token-space alignment, as shown in Equation (10), where $p(x)$ is the true distribution of the sample and $q(x)$ is the predicted distribution:

$L_{CE} = -\sum_{x} p(x)\,\log q(x) \qquad (10)$
s6, performing cross-mode fusion interaction on the image and text features;
as shown in fig. 3, in order to further reduce the distance between the image and text modalities, the present invention designs a new cross-modality interaction module. And respectively mapping the image high-level global features extracted by the pyramid vision Transformer and the generated text features obtained by utilizing the image features in the text generation module to respective feature spaces through convolution operation. And performing attention enhancement on the down-sampled feature vector, namely enhancing the channel information of the input feature by using a module consisting of a full connection layer and an activation function. And obtaining a weight matrix between the image and the text characteristic through matrix multiplication, normalizing the weight matrix by using a Softmax activation function, and carrying out weighted summation on the result and the image characteristic to obtain a final attention matrix. In addition, the idea of residual error connection is introduced here, and the attention moment matrix is added back to the original image characteristics to obtain the final fusion output.
The generation of the text features is converted from the high-level global features of the images through a text generation module, is essentially another stage expression form of the image features, and the adoption of the module for cross-modal feature fusion interaction is equivalent to the multi-stage aggregation of the two forms of image features. And then, taking a cross-modal projection matching loss function as an interaction loss function, supervising the fusion output obtained by the module and the text high-level global features extracted by the convolutional neural network, and gradually shortening the distance between the image and the text mode by the whole model in a gradual mode to reduce the mode difference.
S7, training the joint token and feature aligned natural language pedestrian retrieval model;
Using the natural language pedestrian retrieval dataset CUHK-PEDES, image and text features are extracted according to the above steps, and the neural network model is trained under the supervision of the loss functions of all modules with the Adam neural network optimizer.
S8, testing the joint token and feature aligned natural language pedestrian retrieval model;
Given a natural language text description as the query, the cosine similarity between the global features extracted from the text description and the global features of each pedestrian image in the image gallery is compared, and the image with the highest similarity is the pedestrian retrieval result.
In summary, the invention guides network learning by aligning two spaces, the token space and the feature space. First, a powerful dual-stream feature learning network based on a pyramid vision Transformer and a convolutional neural network is constructed, and feature-space alignment is performed with global features only, effectively reducing the modality distance. Second, a text generation module is designed to reduce the intra-class distance in the token space by cross-modal text generation. Finally, a cross-modal interaction module is proposed to aggregate the image features and the generated text features, further shortening the distance between the image and text modalities and reducing the modality difference. The invention avoids the embedding ambiguity caused by using local features, requires no additional preprocessing steps, and reduces time and resource overhead.
In addition, through a novel framework combining token and feature alignment that optimizes the modality distance and the intra-class distance in the feature space and the token space respectively, the invention performs fine-grained natural language pedestrian retrieval and effectively overcomes the shortcomings of existing methods.
In the text branch, the invention adopts the BERT model, which is widely used in natural language processing, to convert the text description into a token sequence and extract word vectors, and sets a fixed value L to control the sentence length for convenient subsequent processing.
To address the significant modality distance between image and text in the cross-modal pedestrian retrieval task, the invention learns discriminative visual-text features with the cross-modal projection matching loss function (CMPM), which incorporates the cross-modal projection into a KL divergence to associate the image and text modalities.
The invention bridges the image and text modalities, converts the features of the two modalities into the same space for measurement, and reduces the distance between them with a new serial optimization paradigm, thereby obtaining modality-invariant features.
The invention provides a text generation module that generates a text description from the deep semantic features of the input image. In this way, visual and textual features are mapped to the same space. By adding extra token-space supervision on top of the feature space, the modality distance is also drawn closer while the intra-class distance is reduced. The invention adopts the LSTM to alleviate the gradient anomaly problem and better model long-term dependencies in the sequence.
The invention adopts the cross-entropy loss function to constrain the distance between the generated token sequence and the real token sequence, which improves the quality with which the decoder of the long short-term memory network converts image features into text descriptions and makes the generated descriptions more realistic.
The method generates a token sequence from the pedestrian image features and aligns it in the token space so as to further reduce the intra-class distance, and uses cross-modal text generation to promote natural language pedestrian retrieval.
The invention performs cross-modal feature fusion and interaction, and further reduces the distance between the image and text modalities by multi-stage feature fusion. Using the cross-modal projection matching loss function as the interaction loss function, the fusion output obtained by this module and the high-level text global features extracted by the convolutional neural network are supervised, so that the whole model gradually shortens the distance between the image and text modalities, reduces the modality difference, and is further improved. The method solves the technical problems of embedding ambiguity, high complexity, dependence on preprocessed data, and poor optimization of the modality distance and intra-class distance in the prior art.
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A natural language pedestrian retrieval method combining token and feature alignment, the method comprising:
S1, processing the image branch of a preset dual-stream feature learning network, and extracting features of an input pedestrian image from the image branch by using a pyramid vision Transformer as the backbone network;
S2, processing the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
S3, aligning the global feature maps extracted from the image branch and the text branch in a preset feature space to obtain aligned global image features, learning discriminative visual-textual features with a cross-modal projection matching loss function CMPM, and associating the image and text modalities so as to reduce the modal distance between them;
S4, generating the token sequence from the aligned global image features, converting the features of the image modality and the text modality into the same space for measurement, bridging the image and text modalities, and reducing the distance between them with a new string-based optimization paradigm so as to obtain modality-invariant features; generating the text description from deep semantic features of the input image by means of a text generation module, thereby mapping the image features and the text features into the same space and adding token-space supervision on top of the feature-space supervision, which reduces the intra-class distance and shortens the distance between the image and text modalities;
S5, using the joint token and feature alignment framework TFAF and taking a cross-entropy loss as the reconstruction loss function, constraining the distance between the generated token sequence and the real token sequence so as to realize token-space alignment;
S6, performing cross-modal fusion and interaction of the image features and the text features: the cross-modal interaction module uses convolutions to map the high-level global image features and the generated text features into their respective feature spaces, down-samples and enhances them, computes a weight matrix between them, normalizes and weights the weight matrix to obtain an applicable attention matrix, and processes the applicable attention matrix through a residual connection to obtain an applicable fusion output; a cross-modal projection matching loss function is then used as the interaction loss function to supervise the applicable fusion output together with the high-level global text features extracted by the text branch in step S2, shortening the distance between the image and text modalities and reducing the modal difference;
S7, extracting the image features and the text features according to steps S1 to S6, and training the natural language pedestrian retrieval model with an Adam neural network optimizer;
and S8, testing the natural language pedestrian retrieval model to obtain a pedestrian retrieval result.
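For orientation only, the skeletal PyTorch module below shows how steps S1 to S6 of claim 1 could be wired together; every constructor argument (image_branch, text_branch, text_generator, fusion) is a hypothetical stand-in for the components named in the claims, and the global max pooling in the forward pass anticipates step S32.

```python
import torch.nn as nn

class TFAFModel(nn.Module):
    """Skeleton of the joint token and feature alignment pipeline (claim 1, illustrative)."""

    def __init__(self, image_branch, text_branch, text_generator, fusion):
        super().__init__()
        self.image_branch = image_branch      # S1: pyramid vision Transformer backbone
        self.text_branch = text_branch        # S2: BERT word vectors + deep CNN
        self.text_generator = text_generator  # S4/S5: LSTM decoder over image features
        self.fusion = fusion                  # S6: cross-modal fusion interaction

    def forward(self, images, token_ids):
        """images: (B, 3, H, W); token_ids: (B, L) tokenized pedestrian descriptions."""
        img_map = self.image_branch(images)               # high-level global image feature map (B, C, H', W')
        txt_map = self.text_branch(token_ids)             # high-level global text feature map (B, C, L)
        img_feat = img_map.flatten(2).max(dim=2).values   # global max pooling (S3/S32)
        txt_feat = txt_map.max(dim=2).values
        token_logits, gen_txt = self.text_generator(img_feat, token_ids)  # generated token sequence and features
        fused_feat = self.fusion(img_map, gen_txt)        # applicable fusion output (S6)
        return img_feat, txt_feat, token_logits, fused_feat
```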
2. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S1 comprises:
S11, the pyramid vision Transformer comprises four stages, each of which comprises a patch embedding layer and a Transformer encoder; in the training phase, a batch of training data is denoted as
\{(I_i, T_i)\}_{i=1}^{N}
wherein N represents the number of image-text pairs that match each other and belong to the same identity;
S12, given a pedestrian image I, the high-level global feature map generated by the fourth stage of the pyramid vision Transformer is represented as
F^{I} \in \mathbb{R}^{H \times W \times C}
wherein H, W and C respectively represent the height, width and number of channels of the feature map.
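A minimal sketch of the image branch described in claim 2, assuming the pyramid vision Transformer is available as a module whose fourth stage returns a feature map of shape (batch, C, H, W); the class name and the pooling step (which anticipates step S32) are illustrative assumptions.

```python
import torch.nn as nn

class ImageBranch(nn.Module):
    """Image branch of the dual-stream network (claim 2, steps S11-S12), illustrative sketch.

    `pvt_backbone` is assumed to be any module returning the stage-4 feature map
    of shape (batch, C, H, W); a real pyramid vision Transformer implementation
    would be plugged in here.
    """

    def __init__(self, pvt_backbone):
        super().__init__()
        self.backbone = pvt_backbone

    def forward(self, images):                                # images: (batch, 3, height, width)
        feat_map = self.backbone(images)                      # (batch, C, H, W) high-level global feature map
        global_feat = feat_map.flatten(2).max(dim=2).values   # global max pooling -> (batch, C)
        return feat_map, global_feat
```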
3. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S2 comprises:
S21, converting the text description into a token sequence and extracting word vectors by using a BERT model in the text branch;
S22, setting a fixed value L to control the sentence length;
S23, during the conversion of the text description into the token sequence, zero-padding the sequences whose length is smaller than the preset length threshold L;
S24, for the sequences whose length exceeds the preset length threshold L, taking the first L tokens to obtain fixed-length token sequences, and inputting the fixed-length token sequences into the BERT model to obtain the word vectors in \mathbb{R}^{L \times D}, where D is the dimension of each word vector;
S25, expanding the dimensions of the word vectors from \mathbb{R}^{L \times D} by an additional dimension so that a global feature map of the pedestrian description can be extracted;
S26, converting the word-vector dimension D with a convolution layer and a batch normalization operation so that it matches the channel number C of the high-level global image feature map;
S27, extracting the high-level global feature map of each sentence description by using a deep convolutional neural network, wherein the deep convolutional neural network comprises a text residual bottleneck structure.
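The text branch of claim 3 could be sketched as follows with the HuggingFace transformers BERT tokenizer and encoder; the checkpoint name bert-base-uncased, the length threshold L=64, the channel number C=512 and the small residual CNN standing in for the text residual bottleneck structure are all assumptions for illustration.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class TextBranch(nn.Module):
    """Text branch (claim 3): BERT word vectors -> conv to C channels -> deep CNN (sketch)."""

    def __init__(self, channels_c=512, max_len=64):
        super().__init__()
        self.max_len = max_len                                    # preset length threshold L
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        d = self.bert.config.hidden_size                          # word-vector dimension D (768)
        # S26: convert dimension D to the channel number C of the image feature map
        self.proj = nn.Sequential(nn.Conv1d(d, channels_c, kernel_size=1),
                                  nn.BatchNorm1d(channels_c))
        # S27: stand-in for the "text residual bottleneck" deep CNN
        self.cnn = nn.Sequential(nn.Conv1d(channels_c, channels_c, 3, padding=1),
                                 nn.ReLU(inplace=True),
                                 nn.Conv1d(channels_c, channels_c, 3, padding=1))

    def forward(self, sentences):
        # S23/S24: zero-pad or truncate to the fixed length L
        tokens = self.tokenizer(sentences, padding="max_length", truncation=True,
                                max_length=self.max_len, return_tensors="pt")
        word_vectors = self.bert(**tokens).last_hidden_state      # (batch, L, D)
        x = word_vectors.transpose(1, 2)                          # (batch, D, L)
        x = self.proj(x)                                          # (batch, C, L)
        return self.cnn(x) + x                                    # residual, (batch, C, L)
```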
4. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein said step S3 comprises:
S31, given the image features and text features of a batch, expressing the image-text pairs as \{(f_i^{I}, f_i^{T})\}_{i=1}^{N};
S32, processing the image features and the text features by global max pooling to obtain the max-pooled feature vectors, thereby filtering out the important global context information, and using the scalar projection between the image feature f_i^{I} and the text feature f_j^{T} to characterize the similarity of the image and text feature vectors;
S33, computing the scalar projection values of all feature pairs in one batch, and obtaining the probability that the image feature f_i^{I} and the text feature f_j^{T} have the same identity by the following logic:
p_{i,j} = \frac{\exp((f_i^{I})^{\top} \bar{f}_j^{T})}{\sum_{k=1}^{N} \exp((f_i^{I})^{\top} \bar{f}_k^{T})}    (1)
wherein \bar{f}_j^{T} denotes the normalized text feature;
S34, using the following logic to associate each image feature f_i^{I} in the batch with the text features that it correctly matches, and optimizing the objective function:
L_i = \sum_{j=1}^{N} p_{i,j} \log \frac{p_{i,j}}{q_{i,j} + \epsilon}    (2)
wherein \epsilon is a parameter used to avoid numerical problems, and q_{i,j} is the normalized probability that the image feature f_i^{I} correctly matches the text feature f_j^{T};
S35, when no fewer than 2 text features in one batch match the image feature f_i^{I}, the image-to-text projection loss function over the batch is defined with the following logic:
L_{I2T} = \frac{1}{N} \sum_{i=1}^{N} L_i    (3)
where the subscript I2T denotes image-to-text, L_{I2T} is the image-to-text projection loss function, and L_{T2I} represents the text-to-image projection loss function;
the CMPM loss function, which bi-directionally closes the distance between the image and text modalities, is obtained by the following logic:
L_{CMPM} = L_{I2T} + L_{T2I}    (4)
5. The joint token and feature aligned natural language pedestrian retrieval method of claim 4, wherein in said step S35, the correct matching probability q_{i,j} is characterized by the following logic:
q_{i,j} = \frac{y_{i,j}}{\sum_{k=1}^{N} y_{i,k}}
where y_{i,j} = 1 when the image feature f_i^{I} and the text feature f_j^{T} belong to the same identity, and y_{i,j} = 0 otherwise.
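Claims 4 and 5 describe the cross-modal projection matching (CMPM) loss; the following PyTorch sketch is consistent with formulas (1) to (4) above, with the identity labels used to build y_{i,j} and the epsilon value chosen as illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    """Cross-modal projection matching loss (claims 4-5), illustrative sketch.

    image_feats, text_feats: (N, C) globally max-pooled features of one batch
    labels: (N,) person identity labels used to build the matching indicator y_ij
    """
    # y_ij = 1 when image i and text j share the same identity (claim 5)
    y = (labels.unsqueeze(1) == labels.unsqueeze(0)).float()
    q = y / y.sum(dim=1, keepdim=True)                  # normalized matching probability q_ij

    text_norm = F.normalize(text_feats, dim=1)          # normalized text features
    image_norm = F.normalize(image_feats, dim=1)        # normalized image features

    # p_ij: probability that image i and text j have the same identity (eq. 1)
    p_i2t = F.softmax(image_feats @ text_norm.t(), dim=1)
    p_t2i = F.softmax(text_feats @ image_norm.t(), dim=1)

    # KL-style projection losses in both directions (eqs. 2-3), then summed (eq. 4)
    l_i2t = (p_i2t * (torch.log(p_i2t + eps) - torch.log(q + eps))).sum(dim=1).mean()
    l_t2i = (p_t2i * (torch.log(p_t2i + eps) - torch.log(q + eps))).sum(dim=1).mean()
    return l_i2t + l_t2i
```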
6. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S4 comprises:
S41, encoding the input image into a fixed-dimension feature vector with an encoder, and converting the fixed-dimension feature vector into generated text features with a decoder;
taking the pyramid vision Transformer in the image branch as the encoder, performing feature extraction with the backbone network, and processing the input image with a global max pooling layer to obtain the fixed-dimension feature vector f_i^{I};
using the fixed-dimension feature vector f_i^{I}, maximizing the probability of generating the correct text with the following logic:
\omega^{*} = \arg\max_{\omega} \sum_{(f_i^{I}, T_r)} \log p(T_r \mid f_i^{I}; \omega)    (5)
where \omega represents the parameters of the model, and T_r represents the real token sequence having the same identity as f_i^{I};
S42, predicting the current word from the previous words in the sentence by the chain rule according to the following logic, so as to generate the text:
\log p(T_r \mid f_i^{I}; \omega) = \sum_{t=1}^{L_s} \log p(w_t \mid w_1, \ldots, w_{t-1}, f_i^{I}; \omega)    (6)
where L_s is the length of the sentence description;
S43, modeling the above logic with a long short-term memory network LSTM, which comprises an input gate IG, an output gate OG and a forget gate FG for controlling the flow of information, wherein the input gate IG and the output gate OG determine whether information is input or output, and the forget gate FG determines the proportion of information to be discarded;
S44, representing the candidate memory cell as \tilde{C}_t, and using the Tanh activation function to map values to the interval [-1, 1] so as to determine the state of the memory cell at the current moment; processing the information of the memory cell at the previous moment and the candidate memory cell at the current moment under the control of the forget gate and the input gate, and determining the memory cell C_t at the current moment accordingly;
S45, given the input X_t at the current time t and the hidden state H_{t-1} at the previous time t-1, determining through the output gate the amount of information transmitted to the hidden state H_t:
\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)    (7)
C_t = FG_t \odot C_{t-1} + IG_t \odot \tilde{C}_t    (8)
H_t = OG_t \odot \tanh(C_t)    (9)
in the formula, W_{xc} and W_{hc} are weight parameters and b_c is a bias parameter.
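A hedged sketch of the text generation module in claim 6: an nn.LSTMCell decoder (whose input, output and forget gates correspond to IG, OG and FG in formulas (7) to (9)) is conditioned on the pooled image feature and produces word logits step by step. The vocabulary size, embedding size and the teacher-forcing decoding scheme are assumptions.

```python
import torch
import torch.nn as nn

class LSTMTextGenerator(nn.Module):
    """Text generation module (claim 6): pooled image feature -> token sequence (sketch)."""

    def __init__(self, feat_dim, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)   # image feature -> initial memory cell
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)  # input/output/forget gates as in eqs. (7)-(9)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, real_tokens):
        """image_feat: (B, feat_dim); real_tokens: (B, L) ground truth used for teacher forcing."""
        h, c = self.init_h(image_feat), self.init_c(image_feat)
        logits, hiddens = [], []
        for t in range(real_tokens.size(1)):
            x_t = self.embed(real_tokens[:, t])   # previous word, following the chain rule of eq. (6)
            h, c = self.cell(x_t, (h, c))         # hidden state H_t and memory cell C_t
            logits.append(self.out(h))            # word distribution at step t
            hiddens.append(h)
        token_logits = torch.stack(logits, dim=1)   # (B, L, vocab_size) generated token distribution
        gen_txt_feat = torch.stack(hiddens, dim=2)  # (B, hidden_dim, L) generated text features
        return token_logits, gen_txt_feat
```

In practice the logits at step t would be compared with the token at position t+1, which is how the reconstruction loss sketch after claim 7 pairs inputs and targets.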
7. The method for natural language pedestrian retrieval combining token and feature alignment according to claim 1, wherein in step S5 the cross-entropy loss function is used to constrain the distance between the generated token sequence and the real token sequence with the following logic, so as to realize the token-space alignment:
L_{rec} = -\sum_{x} p(x) \log q(x)    (10)
where p(x) is the true distribution of the sample and q(x) is the predicted distribution.
8. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S6 comprises:
S61, applying convolutions to the high-level global image features and the generated text features so as to map them into their respective feature spaces;
S62, down-sampling the high-level global image features and the generated text features, and enhancing their channel information with a fully connected layer and an activation function so as to strengthen the attention of the down-sampled features;
S63, obtaining a weight matrix between the high-level global image features and the generated text features through matrix multiplication;
S64, normalizing the weight matrix with a Softmax activation function, and computing a weighted sum of the normalized weight matrix and the image features to obtain the applicable attention matrix;
S65, adding the applicable attention matrix to the original image features through a residual connection so as to obtain the applicable fusion output;
S66, taking the cross-modal projection matching loss function as the interaction loss function, supervising the fusion output together with the high-level global text features extracted by the preset convolutional neural network in step S2, and reducing the modal difference by shortening the distance between the image and text modalities.
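An illustrative sketch of the cross-modal fusion interaction module of claim 8 (steps S61 to S66); the kernel sizes, the fixed down-sampling resolution, the squeeze-and-excitation style channel enhancement and the way the residual connection is pooled are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-modal fusion interaction module (claim 8, steps S61-S66), illustrative sketch."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.img_proj = nn.Conv2d(channels, channels, kernel_size=1)   # S61: map image features
        self.txt_proj = nn.Conv1d(channels, channels, kernel_size=1)   # S61: map generated text features
        self.down = nn.AdaptiveAvgPool2d((8, 4))                       # S62: down-sampling (illustrative size)
        self.channel_fc = nn.Sequential(                               # S62: channel enhancement
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.softmax = nn.Softmax(dim=-1)                              # S64: normalization

    def forward(self, img_map, txt_feat):
        """img_map: (B, C, H, W) image features; txt_feat: (B, C, L) generated text features."""
        b, c, _, _ = img_map.shape
        img = self.down(self.img_proj(img_map))                        # (B, C, 8, 4)
        txt = self.txt_proj(txt_feat)                                  # (B, C, L)
        gate = self.channel_fc(img.mean(dim=(2, 3)))                   # (B, C) channel attention
        img = img * gate.view(b, c, 1, 1)
        weights = torch.bmm(txt.transpose(1, 2), img.flatten(2))       # S63: (B, L, 32) weight matrix
        attn = self.softmax(weights)                                   # S64: attention matrix
        attended = torch.bmm(attn, img.flatten(2).transpose(1, 2))     # weighted sum over image positions, (B, L, C)
        # S65: residual connection with the (pooled) original image features
        fused = attended.mean(dim=1) + img_map.flatten(2).max(dim=2).values
        return fused                                                   # (B, C) applicable fusion output
```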
9. The joint token and feature aligned natural language pedestrian retrieval method of claim 1, wherein the step S7 comprises:
S71, obtaining the image features and the text features on the natural language pedestrian retrieval dataset CUHK-PEDES;
and S72, training the natural language pedestrian retrieval model with an Adam neural network optimizer under the supervision of the loss functions of all modules.
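A minimal sketch of the training step of claim 9, assuming the model and the loss sketches given with the earlier claims and a CUHK-PEDES dataloader that yields (images, token_ids, identity labels); the learning rate, epoch count and equal loss weights are illustrative assumptions.

```python
import torch

def train(model, dataloader, cmpm_loss, reconstruction_loss,
          epochs=60, lr=1e-4, device="cuda"):
    """Train the retrieval model with Adam under the supervision of all module losses (claim 9, sketch)."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for images, token_ids, labels in dataloader:
            images, token_ids, labels = images.to(device), token_ids.to(device), labels.to(device)
            img_feat, txt_feat, token_logits, fused_feat = model(images, token_ids)
            loss = (cmpm_loss(img_feat, txt_feat, labels)            # feature-space alignment (S3)
                    + reconstruction_loss(token_logits, token_ids)   # token-space alignment (S5)
                    + cmpm_loss(fused_feat, txt_feat, labels))       # interaction supervision (S6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```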
10. A joint token and feature aligned natural language pedestrian retrieval system, the system comprising:
a pedestrian image feature extraction module, configured to process the image branch of a preset dual-stream feature learning network and to extract features of an input pedestrian image by using a pyramid vision Transformer as the backbone network;
a pedestrian text feature extraction module, configured to process the text branch of the dual-stream feature learning network so as to extract high-level global text features with a preset convolutional neural network;
a feature space alignment module, configured to align the global feature maps extracted from the image branch and the text branch in a preset feature space to obtain aligned global features, and to learn discriminative visual-textual features with the cross-modal projection matching loss function CMPM so as to associate the image modality with the text modality and reduce the distance between them, wherein the feature space alignment module is connected to the pedestrian image feature extraction module and the pedestrian text feature extraction module;
a text generation module, configured to generate the token sequence from the aligned global image features, convert the features of the image modality and the text modality into the same space for measurement, bridge the image and text modalities, and reduce the distance between them with a new string-based optimization paradigm so as to obtain modality-invariant features, and further configured to generate the text description from deep semantic features of the input image, thereby mapping the image features and the text features into the same space, adding token-space supervision on top of the feature-space supervision, reducing the intra-class distance and shortening the distance between the image and text modalities, wherein the text generation module is connected to the pedestrian image feature extraction module, the pedestrian text feature extraction module and the feature space alignment module;
a token space alignment module, configured to use the joint token and feature alignment framework TFAF and to take a cross-entropy loss as the reconstruction loss function, constraining the distance between the generated token sequence and the real token sequence so as to implement token-space alignment, wherein the token space alignment module is connected to the text generation module;
a cross-modal fusion interaction module, configured to perform cross-modal fusion and interaction of the image features and the text features, that is, to map the high-level global image features and the generated text features into their respective feature spaces by convolution, down-sample and enhance them, compute a weight matrix between them, normalize and weight the weight matrix to obtain an applicable attention matrix, process the applicable attention matrix through a residual connection to obtain an applicable fusion output, use a cross-modal projection matching loss function as the interaction loss function to supervise the applicable fusion output together with the high-level global text features extracted by the text branch, and reduce the modal difference by shortening the distance between the image and text modalities, wherein the cross-modal fusion interaction module is connected to the token space alignment module;
a model training module, configured to train the natural language pedestrian retrieval model with an Adam neural network optimizer according to the image features and the text features, wherein the model training module is connected to the pedestrian image feature extraction module, the pedestrian text feature extraction module, the feature space alignment module, the text generation module, the token space alignment module and the cross-modal fusion interaction module;
and a retrieval result acquisition module, configured to test the natural language pedestrian retrieval model so as to obtain the pedestrian retrieval result, wherein the retrieval result acquisition module is connected to the model training module.
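For the retrieval result acquisition module (and step S8 of claim 1), testing typically reduces to ranking gallery pedestrian images by the similarity of their features to the feature of the query description; the cosine-similarity ranking below is an assumed, minimal illustration rather than the patented procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(text_query_feat, gallery_image_feats, top_k=10):
    """Rank pedestrian images for one natural-language query (step S8, sketch).

    text_query_feat:     (C,) feature of the query description
    gallery_image_feats: (M, C) features of the gallery pedestrian images
    """
    sims = F.cosine_similarity(text_query_feat.unsqueeze(0), gallery_image_feats, dim=1)  # (M,)
    scores, indices = sims.topk(top_k)
    return indices, scores   # ranked gallery indices and their similarity scores
```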
CN202210951558.2A 2022-08-09 2022-08-09 Natural language pedestrian retrieval method and system combining token and feature alignment Pending CN115311687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210951558.2A CN115311687A (en) 2022-08-09 2022-08-09 Natural language pedestrian retrieval method and system combining token and feature alignment

Publications (1)

Publication Number Publication Date
CN115311687A true CN115311687A (en) 2022-11-08

Family

ID=83860984

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210951558.2A Pending CN115311687A (en) 2022-08-09 2022-08-09 Natural language pedestrian retrieval method and system combining token and feature alignment

Country Status (1)

Country Link
CN (1) CN115311687A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797655A (en) * 2022-12-13 2023-03-14 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115827954B (en) * 2023-02-23 2023-06-06 中国传媒大学 Dynamic weighted cross-modal fusion network retrieval method, system and electronic equipment
CN116028631A (en) * 2023-03-30 2023-04-28 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116028631B (en) * 2023-03-30 2023-07-14 粤港澳大湾区数字经济研究院(福田) Multi-event detection method and related equipment
CN116226434A (en) * 2023-05-04 2023-06-06 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116226434B (en) * 2023-05-04 2023-07-21 浪潮电子信息产业股份有限公司 Multi-element heterogeneous model training and application method, equipment and readable storage medium
CN116682144A (en) * 2023-06-20 2023-09-01 北京大学 Multi-modal pedestrian re-recognition method based on multi-level cross-modal difference reconciliation
CN116682144B (en) * 2023-06-20 2023-12-22 北京大学 Multi-modal pedestrian re-recognition method based on multi-level cross-modal difference reconciliation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination