CN112949534A - Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Info

Publication number
CN112949534A
CN112949534A (application CN202110276486.1A)
Authority
CN
China
Prior art keywords
training
feature
memory
features
image
Prior art date
Legal status
Pending
Application number
CN202110276486.1A
Other languages
Chinese (zh)
Inventor
韩晓
林通
Current Assignee
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202110276486.1A priority Critical patent/CN112949534A/en
Publication of CN112949534A publication Critical patent/CN112949534A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/047: Probabilistic or stochastic networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a pedestrian re-identification method, an intelligent terminal and a computer readable storage medium. The method comprises the following steps: acquiring a query image; inputting the query image into a trained feature encoder, which encodes the query image to obtain an initial feature corresponding to the query image; performing feature fusion on the initial feature and the memory features in a preset memory store to obtain a query feature corresponding to the query image; and calculating comparison similarities between the query feature and a plurality of preset target features, and determining the target classification corresponding to the query image according to the comparison similarities. For few-sample scenarios, the invention provides a memory-augmented meta-learning framework that utilizes an external memory, so that pedestrian re-identification can still achieve high performance when samples are scarce, promoting the further popularization and application of pedestrian re-identification technology.

Description

Pedestrian re-identification method, intelligent terminal and computer readable storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a pedestrian re-identification method, an intelligent terminal and a computer readable storage medium.
Background
With the continuous advance of technology and growing public safety awareness, large numbers of surveillance cameras have been deployed in busy public places and on key traffic routes, such as railway stations, airports and libraries. These cameras mainly record events occurring in the environment, faithfully capturing the course of accidental events when they occur, and the most important aspect of recording events is recording human behavior. In sparsely populated areas a given target can be checked and monitored effectively, but in densely populated areas people often must be identified one by one, which consumes considerable manpower and material resources, and oversights during such one-by-one identification further reduce efficiency. On the other hand, although surveillance cameras are densely deployed, blind spots remain, so that only partial features of a target may be recorded and effective recognition becomes difficult. Pedestrian re-identification techniques have been developed in response to these problems.
Pedestrian re-identification uses a query image to search and match against a massive database, obtaining discriminative features to distinguish images of the same person. At present, pedestrian re-identification mainly relies on supervised learning; while this improves recognition accuracy, it requires massive amounts of data, and the data annotation work demands substantial manpower and financial resources. In particular, associating pedestrians across images acquired by different cameras is a very cumbersome step.
The pedestrian re-identification pipeline can be divided into two stages: feature extraction and feature matching. In the feature extraction stage, a large number of labeled data samples are used to train a neural network as a pedestrian identity classifier. Because the neural network can extract pedestrian features from an image, the classifier is used as a feature extractor to extract embedded features from the query image and the database images. In the feature matching stage, a distance metric such as the Euclidean distance or the cosine distance measures the similarity between the query image and the database images, and ranking and recall are performed according to that similarity. However, deep learning depends strongly on data volume: when data are scarce, a model may fit the training samples well yet generalize poorly, so its performance in practical application differs greatly from its performance during training. Moreover, the high cost of data annotation makes it difficult to extend supervised methods into real life. How to train a model with a small number of samples and still obtain strong generalization ability and robustness is therefore a key problem in the field of pedestrian re-identification.
Disclosure of Invention
The invention mainly aims to provide a pedestrian re-identification method, an intelligent terminal and a computer readable storage medium, so as to solve the problem that pedestrian re-identification models in the prior art lack generalization ability and robustness.
In order to achieve the above object, the present invention provides a pedestrian re-identification method, including the steps of:
acquiring a query image;
inputting the query image into a trained feature encoder, and encoding the query image through the feature encoder to obtain an initial feature corresponding to the query image;
performing feature fusion on the initial features and all memory features in a preset memory to obtain query features corresponding to the query image;
and calculating comparison similarity between the query features and a plurality of preset target features, and determining target classification corresponding to the query image according to the comparison similarity.
Optionally, the method for re-identifying pedestrians, wherein the performing feature fusion on the initial features and each memory feature in a preset memory to obtain query features corresponding to the query image specifically includes:
calculating memory similarity values between the initial features and all memory features in a preset memory storage;
and fusing the memory features and the initial features according to the memory similarity values corresponding to the memory features to obtain query features corresponding to the query image.
Optionally, the method for re-identifying pedestrians, wherein before the step of inputting the query image into a trained feature encoder and encoding the query image by the feature encoder to obtain an initial feature corresponding to the query image, the method further includes:
acquiring a training image set, wherein the training image set comprises a plurality of training subsets;
inputting the anchor training images and the sample images in the training subsets into a preset initial encoder for each training subset, and encoding the anchor training images and the sample images through the initial encoder to obtain anchor features corresponding to the anchor training images and sample features corresponding to the sample images, wherein the sample images comprise positive training images and/or negative training images;
calculating a training similarity value between the anchor training feature and the sample feature, and determining a prediction label corresponding to the anchor training image based on the training similarity value;
and adjusting parameters of the initial encoder based on the prediction label and the training labels in the training subset until the initial encoder converges to obtain the feature encoder.
Optionally, the pedestrian re-identification method, wherein the encoding the anchor training image by the initial encoder to obtain the anchor feature corresponding to the anchor training image specifically includes:
inputting the training image into a preset first encoder aiming at the training image in each training image set, and encoding the training image through the first encoder to obtain an intermediate feature corresponding to the training image;
inputting the intermediate features into a preset first decoder, and decoding the intermediate features through the first decoder to obtain a prediction label corresponding to the training image;
performing parameter adjustment on the first encoder and the first decoder based on the prediction label and the label feature corresponding to the training image until the first encoder and the first decoder converge to obtain a trained second encoder and a trained second decoder, wherein the label feature is an embedded vector corresponding to the training label corresponding to the training image;
and based on the second encoder, encoding the anchor training image to obtain the anchor characteristics corresponding to the anchor training image.
Optionally, the method for re-identifying pedestrians, wherein after calculating a training similarity value between the anchor training feature and the sample feature and determining a prediction label corresponding to the anchor training image based on the training similarity value, further includes:
calculating an evaluation value between the predicted label and the training label;
determining whether the anchor training image is a low recognition image or not according to the evaluation value and a preset evaluation threshold value;
and if the anchor training image is the low recognition image, taking the anchor feature corresponding to the anchor training image as a memory feature, and writing the memory feature and the training label corresponding to the memory feature into the memory store.
Optionally, the method for re-identifying pedestrians, wherein the performing parameter adjustment on the initial encoder based on the prediction tag and the training tags in the training subset until the initial encoder converges to obtain the feature encoder further includes:
and updating the memory characteristics in the memory storage based on the characteristic encoder.
Optionally, the method for re-identifying pedestrians, wherein the fusing the memory features and the initial features according to the memory similarity values corresponding to the memory features to obtain query features corresponding to the query image specifically includes:
classifying the memory characteristics according to the memory similarity values corresponding to the memory characteristics, and determining related characteristics related to the initial characteristics in the memory characteristics;
inputting the initial features, the relevant features, training labels corresponding to the relevant features and memory similarity values into a trained feature fusion model, and performing feature fusion on the initial features, the relevant features and the training labels corresponding to the relevant features through the feature fusion model according to the memory similarity values to obtain query features corresponding to the query images.
Optionally, the method for re-identifying pedestrians, wherein the inputting the initial feature, the relevant feature, the training label corresponding to the relevant feature, and a memory similarity value into a trained feature fusion model, and performing feature fusion on the initial feature, the relevant feature, and the training label corresponding to the relevant feature according to the memory similarity value through the feature fusion model to obtain the query feature corresponding to the query image specifically includes:
splicing the initial features, the related features and training labels corresponding to the related features to obtain a splicing matrix corresponding to the initial features;
inputting the splicing matrix into a feature decoder in the feature fusion model, and adjusting the feature values of the splicing matrix through the feature decoder to obtain a target matrix corresponding to the splicing matrix; and,
calculating a weight value corresponding to each relevant feature according to the memory similarity value;
and performing an inner product on the target matrix according to the weight values to obtain the query feature corresponding to the query image.
Optionally, in the pedestrian re-identification method, a network structure of the feature decoder is a residual network structure; the residual network structure comprises a plurality of convolution modules and residual modules.
Optionally, in the pedestrian re-identification method, the convolution module includes a multi-head attention layer.
In addition, to achieve the above object, the present invention further provides an intelligent terminal, wherein the intelligent terminal includes: the pedestrian re-identification method comprises a memory, a processor and a pedestrian re-identification program stored on the memory and capable of running on the processor, wherein the pedestrian re-identification program realizes the steps of the pedestrian re-identification method when being executed by the processor.
Further, to achieve the above object, the present invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a pedestrian re-identification program, which when executed by a processor implements the steps of the pedestrian re-identification method as described above.
In the invention, a query image to be queried is first acquired. The query image is then input into a trained feature encoder, which performs feature extraction and compression on the input query image to achieve the encoding effect, thereby obtaining the initial feature corresponding to the query image. The initial feature is then fused with the memory features in the preset memory store; the memory features are the features corresponding to training images that could not be effectively identified while training the model. Fusion with the memory features enhances the expression of the initial feature, improving its subsequent discriminability against the target features in the database and thereby the recognition accuracy.
Drawings
FIG. 1 is a first flowchart of a preferred embodiment of a pedestrian re-identification method of the present invention;
FIG. 2 is a second flowchart of a preferred embodiment of the pedestrian re-identification method of the present invention;
FIG. 3 is a diagram of a ResNet series network structure used by an encoder according to a preferred embodiment of the present invention;
FIG. 4 shows sample sets for meta-learning and meta-testing in a preferred embodiment of the pedestrian re-identification method of the present invention;
FIG. 5 is a process of storing memory characteristics according to a preferred embodiment of the pedestrian re-identification method of the present invention;
FIG. 6 is a process of feature fusion of initial features according to a preferred embodiment of the present invention;
fig. 7 is a schematic operating environment diagram of an intelligent terminal according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The pedestrian re-identification method in the preferred embodiment of the invention can be executed by an intelligent terminal, or realized by software or plug-ins on the intelligent terminal; intelligent terminals include devices such as smart televisions and smartphones. The pedestrian re-identification process is described below taking pedestrian re-identification software installed on an intelligent terminal as an example. As shown in fig. 1 and 2, the pedestrian re-identification method includes the following steps:
step S100, acquiring a query image.
Specifically, the pedestrian re-identification software acquires the query image by scanning, importing or similar means. The query image in this embodiment refers to an image containing a pedestrian acquired by a monitoring device such as a camera. Since the images acquired by a monitoring device include both images with and images without pedestrians, before the query image is acquired, if the raw data is surveillance video shot by the monitoring device, each frame of the video is examined to determine whether it contains a person; if so, that frame is taken as a person image containing a pedestrian. Pedestrians are then extracted from these person images to obtain query images each containing a whole single pedestrian, so that the pedestrians in the surveillance video are extracted. The pedestrian re-identification software then acquires the query images by scanning or similar means.
Step S200, inputting the query image into a trained feature encoder, and encoding the query image through the feature encoder to obtain an initial feature corresponding to the query image.
Specifically, the feature encoder encodes input data and converts it into another form. For feature extraction, the main role of the encoder is to extract and compress the features of the query image, thereby encoding the query image into a low-dimensional feature vector.
In this embodiment, the encoder is a mapping function: for input data x_t, which in application is the query image, the encoder maps x_t into a feature vector e_t of lower dimension, i.e., the initial feature. The encoder can be built with a conventional convolutional network structure, and to achieve a better encoding effect the network structure can be selected and adjusted according to the mapping object. The encoder in this embodiment preferably adopts a residual network structure, for example one of the ResNet series, as shown in fig. 3. When the encoder is built, its neural network can be determined according to the performance of the residual network and the characteristics of the mapping object.
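As an illustrative sketch only (not the patented implementation; the use of ResNet-50 and the embedding dimension of 256 are assumptions), such an encoder can be built from a ResNet backbone whose classification head is replaced by a projection to the embedding:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FeatureEncoder(nn.Module):
    """Maps an input image x_t to a low-dimensional feature vector e_t."""
    def __init__(self, embed_dim: int = 256):  # embed_dim is an assumed value
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()                 # drop the classification head
        self.backbone = backbone
        self.project = nn.Linear(2048, embed_dim)   # compress to a low-dimensional embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(self.backbone(x))       # (B, 3, H, W) -> (B, embed_dim)
```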
Further, in this embodiment, in order to improve learning effectiveness with few samples, few-shot learning based on a meta-learning framework is adopted. The learning process is divided into a meta-learning stage and a meta-test stage, and the corresponding training and test sets are processed accordingly. Few-shot learning is generally formulated as an N-way K-shot problem; in the pedestrian re-identification scenario, N is the number of pedestrian identity (ID) categories, and K means there are K samples under each pedestrian ID. Fig. 4 shows a 6-way 1-shot scenario: the meta-learning training set comprises pedestrian pictures of 6 different ID categories together with a training image of known true category (with a category label), forming a small input batch; in the meta-test stage, pedestrian pictures of 6 new ID categories are given along with a query image (without a category label), and the category of the query image is to be identified. In this embodiment, each training label in the training image set therefore corresponds to several training images; for example, six training labels may be included, named pedestrian A, pedestrian B, pedestrian C, pedestrian D, pedestrian E and pedestrian F. The training process specifically comprises the following steps:
step A10, a training image set is obtained.
First, a large number of training images are obtained; these form a training image set comprising a plurality of training subsets. In this embodiment, training is performed in a meta-training manner, for example with a Triplet Network or a Siamese Network. The triplet network is taken as the example for the specific description. With a triplet network, each training subset comprises three training images: an anchor training image, a positive training image and a negative training image. The anchor training image and the positive training image carry the same training label, for example pedestrian A, while the remaining training image carries a different training label, for example pedestrian B. During training, the training label of the anchor training image is treated as unknown; the positive training image is denoted x+ and the negative training image x−. If a Siamese network is adopted instead, the training subset comprises two image pairs, each containing the anchor training image; in one pair the other image is a positive training image, in the other pair it is a negative training image. The positive and negative training images are thus collectively referred to as sample images: each training subset includes the anchor training image and sample images, and the sample images comprise positive and negative training images.
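As a sketch of how such triplet subsets might be assembled (the dict layout and the requirement of at least two images per label are assumptions, not taken from the patent):

```python
import random

def sample_triplet(images_by_label: dict):
    """Sample an (anchor, positive, negative) training subset.

    images_by_label maps a training label (e.g. 'pedestrian A') to a list
    of its training images; it assumes at least two images per label.
    """
    pos_label, neg_label = random.sample(list(images_by_label), 2)
    anchor, positive = random.sample(images_by_label[pos_label], 2)  # same label
    negative = random.choice(images_by_label[neg_label])             # different label
    return anchor, positive, negative
```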
Step A20, inputting the anchor training image and the sample image in the training subset into a preset initial encoder for each training subset, and encoding the anchor training image and the sample image through the initial encoder to obtain the anchor feature corresponding to the anchor training image and the sample feature corresponding to the sample image.
Specifically, for each training subset, an anchor training image and a sample image in the training subset are input into a preset initial encoder, and the initial encoder encodes the anchor training image and the sample image respectively to obtain an anchor feature corresponding to the anchor training image and a sample feature corresponding to the sample image respectively.
Further, the initial encoder used in this embodiment is obtained by training a stacked encoder-decoder combination; the encoder and decoder are trained as a combination before the initial encoder is used. In this embodiment, to distinguish the encoder and decoder used in this training, they are referred to as the first encoder and the first decoder.
And inputting the training images into a preset first encoder aiming at the training images in each training image set, and encoding the training images through the first encoder to obtain the intermediate features corresponding to the training images.
The intermediate feature is input into a preset first decoder, which decodes the intermediate feature to obtain a prediction feature corresponding to the intermediate feature, and the prediction label corresponding to the training image is determined from the prediction feature.
And then calculating a second loss value between the predicted label and a training label corresponding to the training image based on a preset second loss function. And then, based on a second loss value, performing parameter adjustment on the first encoder and the first decoder until the first encoder and the first decoder converge to obtain a trained second encoder and a trained second decoder.
Finally, the anchor training image and the sample images are encoded with the trained second encoder to obtain the anchor training feature corresponding to the anchor training image and the sample training features corresponding to the sample images.
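A rough sketch of this pretraining loop (the cross-entropy loss and Adam optimizer are assumptions standing in for the "second loss function"; only the structure, first encoder to intermediate feature, first decoder to prediction, loss against the training label, follows the text):

```python
import torch
import torch.nn as nn

def pretrain(encoder: nn.Module, decoder: nn.Module, loader, epochs: int = 10):
    """Train the first encoder/decoder pair; the converged pair gives the second encoder/decoder."""
    loss_fn = nn.CrossEntropyLoss()    # assumed form of the second loss function
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
    for _ in range(epochs):
        for image, label in loader:
            logits = decoder(encoder(image))   # intermediate feature -> prediction
            loss = loss_fn(logits, label)      # second loss value
            opt.zero_grad()
            loss.backward()
            opt.step()                         # parameter adjustment until convergence
    return encoder, decoder                    # trained second encoder and second decoder
```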
Step A30, calculating a training similarity value between the anchor training feature and the sample feature, and determining a prediction label corresponding to the anchor training image based on the training similarity value.
Specifically, the training similarity values between the anchor training feature and the sample training features are then calculated. The training similarity can be measured with methods such as the cosine similarity or the Euclidean distance; through these similarity measures, the similarity between the anchor training feature and each sample training feature is obtained.
Still taking the triplet network as an example, the training similarity value between the anchor training feature and the positive training feature is termed the positive distance, while that between the anchor training feature and the negative training feature is termed the negative distance.
The training similarity values are then compared, and generally the training label of the sample training feature with the largest training similarity value is selected as the prediction label corresponding to the anchor training image.
And A40, adjusting parameters of the initial encoder based on the prediction label and the training labels in the training subset until the initial encoder converges to obtain the feature encoder.
Specifically, a first loss value between the prediction label and the training label is calculated based on a preset first loss function, and the first loss value is then propagated back to the initial encoder to adjust its parameters until the initial encoder converges; the converged initial encoder is used as the feature encoder.
Since the negative training image is not of the same class as the anchor training image, theoretically the negative distance should be large and the positive distance small. An interval threshold (margin) is preset, and the loss value is calculated from the positive distance, the negative distance and the margin. The larger the gap between the negative and positive distances, the better the features mapped by the current encoder distinguish different classification labels, so a larger gap is preferable; the margin is the minimum required gap between the two, generally set to 1.
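A minimal sketch of such a margin loss (the standard triplet formulation; the margin of 1.0 follows the text, and the Euclidean distance is one of the similarity measures named above):

```python
import torch
import torch.nn.functional as F

def triplet_margin_loss(anchor, positive, negative, margin: float = 1.0):
    """Computes mean(max(0, d(a, p) - d(a, n) + margin)) over the batch."""
    d_pos = F.pairwise_distance(anchor, positive)   # positive distance
    d_neg = F.pairwise_distance(anchor, negative)   # negative distance
    return F.relu(d_pos - d_neg + margin).mean()    # margin = interval threshold
```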
During encoder training, each anchor training image in each training subset has a prediction label derived from its positive and negative distances, i.e., the label obtained by classifying the anchor training image with the current encoder. If the prediction label of some anchor training image differs greatly from its real label (the classification label the anchor training image actually carries), then that anchor training image currently cannot be identified accurately. Training on large samples usually optimizes such low-accuracy cases through sheer data volume, but because this embodiment must work with a small number of samples, ignoring the samples that cannot be accurately identified would cost generalization ability.
Therefore, in this embodiment, as shown in fig. 5, where e is the second encoder, d is the second decoder, M is the memory store and C is the controller, the controller determines whether the currently observed anchor training image should be stored in the memory module, and the memory module does not participate in back propagation. During training, when the recognition of some anchor training image has low accuracy, that anchor training image is stored in the memory store.
The recognition accuracy of the anchor training image can be evaluated with an evaluation function, which can be a conventional loss function such as the cross-entropy:
H(y, ŷ) = −Σ_i y_i · log(ŷ_i)

where y is the training label and ŷ is the prediction label. Intuitively, the more accurate the prediction for the anchor training image, the lower the evaluation value given by the evaluation function.
An evaluation threshold σ is preset; it may be set to σ = ln(N), the evaluation value of a tie (uniform) prediction, where N is the number of training-label categories in the training image set, i.e., the number of classification categories above. If the evaluation value is greater than the evaluation threshold, the prediction confidence is lower than that of a tie prediction, i.e., the current model classifies this anchor training image with low accuracy; the anchor training image is then a low recognition image and needs to be stored: the anchor feature corresponding to the anchor training image is taken as a memory feature, and the memory feature and the training label corresponding to it are written into the memory store. If the evaluation value is smaller than the evaluation threshold, the current model classifies the anchor training image with high accuracy, so no storage is needed. The writing process is as follows:
M_key[index] = e_t
M_value[index] = l_t

where e_t denotes a memory feature and l_t the label corresponding to e_t; M_key stores the memory features, while M_value stores the label corresponding to each memory feature. The memory size satisfies M_size > batch size × number of iterations; if index > M_size, then index is reset to 0, i.e., writing restarts from position 0; otherwise index = index + 1.
The storage adopts a key-value scheme: the key represents the anchor training image and the value represents its corresponding training label. Further, in order to improve the utilization efficiency of the memory, the anchor feature corresponding to the anchor training image is stored as the key. After the feature encoder is obtained, the anchor features stored as keys are updated based on the feature encoder, so that the memory features in the finally applied memory store are the latest features.
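A compact sketch of such a key-value memory with a circular write index (class and parameter names are assumptions; the write rule follows the σ = ln(N) criterion above):

```python
import math
import torch

class MemoryStore:
    """Key-value memory: keys hold memory features e_t, values hold their labels l_t."""
    def __init__(self, size: int, dim: int):
        self.keys = torch.zeros(size, dim)                  # M_key
        self.values = torch.zeros(size, dtype=torch.long)   # M_value
        self.size, self.index = size, 0

    def write(self, feature: torch.Tensor, label: int):
        self.keys[self.index] = feature.detach()    # memory takes no part in back propagation
        self.values[self.index] = label
        self.index = (self.index + 1) % self.size   # restart from position 0 when full

def maybe_store(memory: MemoryStore, feature, prob, label: int, num_classes: int):
    """Write only low-recognition samples: cross entropy above sigma = ln(N)."""
    eval_value = -torch.log(prob[label])            # H(y, y_hat) for a one-hot label y
    if eval_value > math.log(num_classes):          # worse than a tie prediction
        memory.write(feature, label)
```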
And step S300, performing feature fusion on the initial features and all memory features in a preset memory to obtain query features corresponding to the query image.
Specifically, a memory store is preset for storing memory features. The memory features in this embodiment refer to training features that encoded poorly while the initial encoder was being trained into the feature encoder, or training features that fused poorly while the feature fusion model was being trained. That is, a memory feature is a training feature that classified poorly during model training.
Since the memory features classify poorly, an initial feature that resembles some memory feature is likely to yield a poor recognition result as well. Therefore, in this embodiment the memory features are fused with the initial feature to enhance those components of the initial feature that resemble the memory features, thereby obtaining the query feature.
In a first implementation of this embodiment, the memory similarity value between the initial feature and each memory feature in the preset memory store is calculated, and for each memory feature, fusion is weighted by the magnitude of that memory similarity value: the larger the similarity value, the more is fused; the smaller the similarity value, the less is fused.
In a second implementation of this embodiment, the memory features are first filtered. Memory features with low similarity contribute little to determining the classification category of the query image, so they are excluded first, leaving the relevant features that help classify the query image. The exclusion can use a classification algorithm, such as the K-nearest-neighbor algorithm, to divide the memory features into features related to the initial feature and features unrelated to it, keeping the related ones as the relevant features for subsequent fusion.
The exclusion can be implemented with the distance between each memory feature and the initial feature: the distance between the initial feature and each memory feature is calculated and recorded as that memory feature's memory similarity. The memory similarity can be computed with the Euclidean distance. The top K memory features ranked by memory similarity value are then selected as the relevant features. This can be formulated as:
{e_1, e_2, …, e_k} = top-k(d(e_i, q_t))

where e_i denotes a memory feature, q_t the initial feature, and d(e_i, q_t) the memory similarity value between them; if the Euclidean distance is adopted, then d(e_i, q_t) = ‖e_i − q_t‖_2 and k = K.
And then inputting the initial features, the relevant features, training labels corresponding to the relevant features and memory similarity values into a trained feature fusion model, and performing feature fusion on the initial features, the relevant features and the training labels corresponding to the relevant features through the feature fusion model according to the memory similarity values to obtain query features corresponding to the query images.
For the set {q_t, e_1, …, e_m, l_1, …, l_m, d_1, …, d_m}, in order to unify the way each initial feature is fused, the initial feature, the relevant features and the label embeddings corresponding to the relevant features are first spliced to obtain a splicing matrix. Here e_t (t ∈ [1, m]) denotes a relevant feature, l_t (t ∈ [1, m]) the embedded feature of the label corresponding to e_t, and d_t (t ∈ [1, m]) the memory similarity value corresponding to e_t. The splicing can be implemented by writing the items into a preset blank matrix in a fixed order: the relevant features, the training label corresponding to each relevant feature, and the initial feature are written according to the positional relationship shown in fig. 6, yielding the splicing matrix E = [q_t, e_1…m, l_1…m].
The splicing matrix is then input into a feature decoder in the feature fusion model, and the feature decoder adjusts the feature values of the splicing matrix to obtain the target matrix corresponding to the splicing matrix. The selected relevant features are all associated with the initial feature, and the main purpose of adjusting the splicing matrix is to enhance the features shared between the relevant features and the initial feature, so that these associated features can be used effectively in the subsequent classification.
Further, in order to reduce information loss and improve model accuracy, the network architecture of the decoder also adopts a residual structure: the decoder comprises several convolution modules and residual modules, and the input of each convolution module includes the output of the previous convolution module plus the residual value output by the previous residual module.
Further, the convolution module in this embodiment may adopt a multi-head attention layer, which attends to the input values over multiple dimensions and thereby improves the effectiveness of the subsequent fusion; a multi-layer perceptron (MLP) is also applied in the convolution module. The splicing matrix is first input into the multi-head attention layer, which compares all elements and returns a new tensor of the same shape; the new tensor then passes through an activation layer to compute the model's nonlinear activation. The subsequent attention layers are repeated several times in residual fashion, with the tensor dimensionality unchanged throughout, finally yielding the target matrix. Obtaining the target matrix is a process of extracting and reprocessing the features of each vector; what finally determines how much each relevant vector is fused into the initial vector is the correlation between the initial vector and that relevant vector, i.e., the memory similarity value described above.
Therefore, while the decoder processes the splicing matrix, the memory similarity values as a whole are converted into weight values. In this embodiment this can be implemented with a function such as softmax, which normalizes the memory similarity values and maps them into (0, 1). Finally, the inner product of the weight values and the target matrix gives the query feature corresponding to the query image.
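A schematic sketch of this fusion step (layer count, dimensions and the exact weighting are assumptions; only the overall structure, residual multi-head attention over the splicing matrix with softmax weights derived from the memory similarity values, follows the text; label embeddings are assumed to share the feature dimension):

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Residual multi-head attention blocks applied to the splicing matrix E."""
    def __init__(self, dim: int = 256, heads: int = 4, depth: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(depth)
        )
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, E: torch.Tensor) -> torch.Tensor:   # E: (1, 2m+1, dim)
        x = E
        for attn in self.blocks:
            out, _ = attn(x, x, x)
            x = x + out                    # residual connection; tensor shape unchanged
        return self.mlp(x)                 # target matrix, same shape as E

def fuse(q_t, rel_feats, rel_labels, dists, decoder: FusionDecoder):
    """Fuse the initial feature q_t with the relevant features into the query feature."""
    E = torch.cat([q_t.unsqueeze(0), rel_feats, rel_labels], dim=0).unsqueeze(0)
    target = decoder(E).squeeze(0)         # (2m+1, dim)
    w = torch.softmax(-dists, dim=0)       # smaller distance -> larger weight (assumed sign)
    m = rel_feats.size(0)
    return (w.unsqueeze(1) * target[1:1 + m]).sum(dim=0)  # weighted inner product
```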
In this embodiment, the feature decoder in the feature fusion model may adopt the aforementioned second decoder; if it does, then while the first decoder is trained on the training images, when the first decoder decodes the intermediate feature, the intermediate feature is fused with the memory features to obtain the prediction label corresponding to the training image.
Step S400, calculating the comparative similarity between the query feature and a plurality of preset target features, and determining the target classification corresponding to the query image according to the comparative similarity.
Specifically, a plurality of target features are stored in advance. The target features are obtained by processing target images in the same way as the query image, including input into the feature encoder and fusion with the memory features, where a target image is an image corresponding to a target person. For example, if the target person is pedestrian G, the target image corresponding to pedestrian G is input into the feature encoder, and the features output by the feature encoder are fused with the memory features to obtain the target feature.
The comparison similarity between the query feature and each target feature is then calculated. Taking the 6-way 1-shot scenario above as an example, there is one query feature and six target features. The similarity value between the query feature and each target feature is calculated as that target feature's comparison similarity. The target feature closest to the query feature is then determined from the comparison similarities, and the label classification corresponding to that target feature is taken as the target classification corresponding to the query image.
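A minimal sketch of this matching step (cosine similarity is one admissible metric per the background discussion; taking the most similar target's label follows the text):

```python
import torch
import torch.nn.functional as F

def classify(query_feat: torch.Tensor, target_feats: torch.Tensor, labels: list):
    """Assign the label of the target feature most similar to the query feature."""
    sims = F.cosine_similarity(query_feat.unsqueeze(0), target_feats)  # one value per target
    return labels[int(torch.argmax(sims))]
```

For the 6-way 1-shot example, target_feats would hold the six fused target features and labels the six pedestrian identities.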
This embodiment proposes a memory-augmented, end-to-end trained pedestrian re-identification framework that enables few-shot learning by exploiting the idea of meta-learning. Training is performed at the task level rather than the data level, so the framework can also be applied to cross-domain data sets and has strong practical application value. In addition, a memory store caches and records past features; its capacity can be set flexibly, which suits the storage of the many-class features found in pedestrian re-identification. This embodiment also constructs a self-attention fusion network to measure the similarity between the retrieved memories and the query-image features, integrating feature extraction and matching in the pedestrian re-identification inference process.
Further, as shown in fig. 7, based on the above pedestrian re-identification method, the present invention further provides an intelligent terminal, where the intelligent terminal includes a processor 10, a memory 20, and a display 30. Fig. 7 shows only some of the components of the smart terminal, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
The memory 20 may be an internal storage unit of the intelligent terminal in some embodiments, such as a hard disk or a memory of the intelligent terminal. The memory 20 may also be an external storage device of the intelligent terminal in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card or a Flash Card provided on the intelligent terminal. Further, the memory 20 may also include both an internal storage unit and an external storage device of the intelligent terminal. The memory 20 is used for storing the application software installed on the intelligent terminal and various data, such as the program code installed on the intelligent terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In one embodiment, the memory 20 stores a pedestrian re-identification program 40, and the pedestrian re-identification program 40 can be executed by the processor 10 to implement the pedestrian re-identification method in the present application.
The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or other data Processing chip, and is configured to execute program codes stored in the memory 20 or process data, such as executing the pedestrian re-identification method.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the intelligent terminal and for displaying a visual user interface. The components 10-30 of the intelligent terminal communicate with each other via a system bus.
In one embodiment, the following steps are implemented when the processor 10 executes the pedestrian re-identification program 40 in the memory 20:
acquiring a query image;
inputting the query image into a trained feature encoder, and encoding the query image through the feature encoder to obtain an initial feature corresponding to the query image;
performing feature fusion on the initial features and all memory features in a preset memory to obtain query features corresponding to the query image;
and calculating comparison similarity between the query features and a plurality of preset target features, and determining target classification corresponding to the query image according to the comparison similarity.
The present invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a pedestrian re-identification program, and the pedestrian re-identification program is executed by a processor to realize the steps of the pedestrian re-identification method.
Of course, it will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by instructing relevant hardware (such as a processor, a controller, etc.) through a computer program, and the program can be stored in a computer readable storage medium, and when executed, the program can include the processes of the embodiments of the methods described above. The computer readable storage medium may be a memory, a magnetic disk, an optical disk, etc.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (12)

1. A pedestrian re-identification method is characterized by comprising the following steps:
acquiring a query image;
inputting the query image into a trained feature encoder, and encoding the query image through the feature encoder to obtain an initial feature corresponding to the query image;
performing feature fusion on the initial features and all memory features in a preset memory to obtain query features corresponding to the query image;
and calculating comparison similarity between the query features and a plurality of preset target features, and determining target classification corresponding to the query image according to the comparison similarity.
2. The pedestrian re-identification method according to claim 1, wherein the feature fusion is performed on the initial features and each memory feature in a preset memory to obtain query features corresponding to the query image, and specifically comprises:
calculating memory similarity values between the initial features and all memory features in a preset memory storage;
and fusing the memory features and the initial features according to the memory similarity values corresponding to the memory features to obtain query features corresponding to the query image.
3. The pedestrian re-identification method according to claim 1, wherein before inputting the query image into a trained feature encoder and encoding the query image by the feature encoder to obtain an initial feature corresponding to the query image, the method further comprises:
acquiring a training image set, wherein the training image set comprises a plurality of training subsets;
inputting the anchor training images and the sample images in the training subsets into a preset initial encoder for each training subset, and encoding the anchor training images and the sample images through the initial encoder to obtain anchor features corresponding to the anchor training images and sample features corresponding to the sample images, wherein the sample images comprise positive training images and/or negative training images;
calculating a training similarity value between the anchor training feature and the sample feature, and determining a prediction label corresponding to the anchor training image based on the training similarity value;
and adjusting parameters of the initial encoder based on the prediction label and the training labels in the training subset until the initial encoder converges to obtain the feature encoder.
4. The pedestrian re-identification method according to claim 3, wherein the encoding the anchor training image by the initial encoder to obtain the anchor features corresponding to the anchor training image specifically comprises:
inputting the training image into a preset first encoder aiming at the training image in each training image set, and encoding the training image through the first encoder to obtain an intermediate feature corresponding to the training image;
inputting the intermediate features into a preset first decoder, and decoding the intermediate features through the first decoder to obtain a prediction label corresponding to the training image;
performing parameter adjustment on the first encoder and the first decoder based on the prediction label and the label feature corresponding to the training image until the first encoder and the first decoder converge to obtain a trained second encoder and a trained second decoder, wherein the label feature is an embedded vector corresponding to the training label corresponding to the training image;
and based on the second encoder, encoding the anchor training image to obtain the anchor characteristics corresponding to the anchor training image.
5. The pedestrian re-identification method according to claim 3, wherein after calculating a training similarity value between the anchor training feature and the sample feature and determining the prediction label corresponding to the anchor training image based on the training similarity value, the method further comprises:
calculating an evaluation value between the predicted label and the training label;
determining whether the anchor training image is a low recognition image or not according to the evaluation value and a preset evaluation threshold value;
and if the anchor training image is the low recognition image, taking the anchor feature corresponding to the anchor training image as a memory feature, and writing the memory feature and the training label corresponding to the memory feature into the memory store.
6. The pedestrian re-identification method according to claim 3, wherein the parameter adjustment of the initial encoder is performed based on the prediction tag and the training tags in the training subset until the initial encoder converges to obtain the feature encoder, and further comprising:
and updating the memory characteristics in the memory storage based on the characteristic encoder.
7. The pedestrian re-identification method according to claim 2, wherein the obtaining of the query feature corresponding to the query image by fusing the memory feature and the initial feature according to the memory similarity value corresponding to each memory feature specifically comprises:
classifying the memory characteristics according to the memory similarity values corresponding to the memory characteristics, and determining related characteristics related to the initial characteristics in the memory characteristics;
inputting the initial features, the relevant features, training labels corresponding to the relevant features and memory similarity values into a trained feature fusion model, and performing feature fusion on the initial features, the relevant features and the training labels corresponding to the relevant features through the feature fusion model according to the memory similarity values to obtain query features corresponding to the query images.
8. The method according to claim 7, wherein the inputting the initial feature, the relevant feature, the training label corresponding to the relevant feature, and a memory similarity value into a trained feature fusion model, and performing feature fusion on the initial feature, the relevant feature, and the training label corresponding to the relevant feature according to the memory similarity value through the feature fusion model to obtain the query feature corresponding to the query image specifically comprises:
splicing the initial features, the related features and training labels corresponding to the related features to obtain a splicing matrix corresponding to the initial features;
inputting the splicing matrix into a feature decoder in the feature fusion model, and adjusting the feature values of the splicing matrix through the feature decoder to obtain a target matrix corresponding to the splicing matrix; and,
calculating a weight value corresponding to each relevant feature according to the memory similarity value;
and performing an inner product on the target matrix according to the weight values to obtain the query feature corresponding to the query image.
9. The pedestrian re-identification method according to claim 8, wherein the network structure of the feature decoder is a residual network structure; the residual network structure comprises a plurality of convolution modules and residual modules.
10. The pedestrian re-identification method of claim 9, wherein the convolution module includes a multi-head attention layer.
11. An intelligent terminal, characterized in that the intelligent terminal comprises: a memory, a processor and a pedestrian re-identification program stored on the memory and executable on the processor, the pedestrian re-identification program when executed by the processor implementing the steps of the pedestrian re-identification method according to any one of claims 1 to 10.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a pedestrian re-identification program, which when executed by a processor implements the steps of the pedestrian re-identification method according to any one of claims 1 to 10.
CN202110276486.1A 2021-03-15 2021-03-15 Pedestrian re-identification method, intelligent terminal and computer readable storage medium Pending CN112949534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110276486.1A CN112949534A (en) 2021-03-15 2021-03-15 Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110276486.1A CN112949534A (en) 2021-03-15 2021-03-15 Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112949534A true CN112949534A (en) 2021-06-11

Family

ID=76229934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110276486.1A Pending CN112949534A (en) 2021-03-15 2021-03-15 Pedestrian re-identification method, intelligent terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112949534A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505861A (en) * 2021-09-07 2021-10-15 广东众聚人工智能科技有限公司 Image classification method and system based on meta-learning and memory network
CN114120367A (en) * 2021-11-29 2022-03-01 中国人民解放军战略支援部队信息工程大学 Pedestrian re-identification method and system based on circle loss measurement under meta-learning framework


Similar Documents

Publication Publication Date Title
CN111814854B (en) Target re-identification method without supervision domain adaptation
Sun et al. TBE-Net: A three-branch embedding network with part-aware ability and feature complementary learning for vehicle re-identification
CN107885764B (en) Rapid Hash vehicle retrieval method based on multitask deep learning
CN107133569B (en) Monitoring video multi-granularity labeling method based on generalized multi-label learning
CN112329659B (en) Weak supervision semantic segmentation method based on vehicle image and related equipment thereof
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
An et al. Fast and incremental loop closure detection with deep features and proximity graphs
CN112016467B (en) Traffic sign recognition model training method, recognition method, system, device and medium
CN114170516B (en) Vehicle weight recognition method and device based on roadside perception and electronic equipment
Varghese et al. An efficient algorithm for detection of vacant spaces in delimited and non-delimited parking lots
CN112949534A (en) Pedestrian re-identification method, intelligent terminal and computer readable storage medium
Choi et al. Face video retrieval based on the deep CNN with RBF loss
CN113065409A (en) Unsupervised pedestrian re-identification method based on camera distribution difference alignment constraint
Zhang et al. Appearance-based loop closure detection via locality-driven accurate motion field learning
CN113743239A (en) Pedestrian re-identification method and device and electronic equipment
CN117132804B (en) Hyperspectral image classification method based on causal cross-domain small sample learning
CN113139452A (en) Method for detecting behavior of using mobile phone based on target detection
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
Sathya et al. A survey on content based image retrieval using convolutional neural networks
CN116108363A (en) Incomplete multi-view multi-label classification method and system based on label guidance
CN115294510A (en) Network training and recognition method and device, electronic equipment and medium
CN115035455A (en) Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation
CN115705756A (en) Motion detection method, motion detection device, computer equipment and storage medium
Saikia et al. Colour neural descriptors for instance retrieval using CNN features and colour models
CN112507912A (en) Method and device for identifying illegal picture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination