CN117789253A - Video pedestrian re-identification method based on double networks - Google Patents

Video pedestrian re-identification method based on double networks

Info

Publication number
CN117789253A
Authority
CN
China
Prior art keywords
attention
pedestrian
feature
output
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410201994.7A
Other languages
Chinese (zh)
Other versions
CN117789253B (en)
Inventor
陈东岳
陈英革
邓诗卓
贾同
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Graduate School Of Innovation Northeastern University
Original Assignee
Foshan Graduate School Of Innovation Northeastern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Graduate School Of Innovation Northeastern University filed Critical Foshan Graduate School Of Innovation Northeastern University
Priority to CN202410201994.7A priority Critical patent/CN117789253B/en
Publication of CN117789253A publication Critical patent/CN117789253A/en
Application granted granted Critical
Publication of CN117789253B publication Critical patent/CN117789253B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of pedestrian recognition, and discloses a double-network-based video pedestrian re-identification method. The invention provides a video pedestrian re-identification model based on a multi-branch diversified feature mining spatio-temporal attention network and an aggregation-redistribution convolutional neural encoding-decoding network. The model alleviates, to a certain extent, the problems of information redundancy, occlusion, background interference, illumination change, viewpoint change and low resolution that commonly exist in the video pedestrian re-identification task. Through these two methods with different emphases, the basic model is optimized and improved in multiple aspects, so that it can complete the video pedestrian re-identification task more effectively.

Description

Video pedestrian re-identification method based on double networks
Technical Field
The invention relates to the technical field of pedestrian recognition, in particular to a double-network-based video pedestrian re-recognition method.
Background
Identifying and understanding the behavior of a particular pedestrian in the surveillance video may help the supervisory personnel track his or her trajectory and early warn of dangerous behavior. The method is not only helpful for detecting and tracking illegal crimes, but also provides technical support for public place safety. In this way, the monitoring device can more effectively assist public safety supervision and provide more comprehensive guarantee for city management and administration. Currently, it is a more common scenario to retrieve a particular pedestrian using monitoring information of a monitoring network. For example, monitoring videos are used to track criminals, find lost children, and so on.
And the pedestrian re-identification is to search and track the specific pedestrians by using monitoring data of different places and different moments. The main task of the method is to take the pedestrian pictures or videos acquired under one camera as a query target, and then take all the pictures or videos acquired under all the cameras as candidate sets, so as to retrieve the pictures or videos with the same identity as the query target. Pedestrian re-recognition is widely regarded as a sub-problem of image retrieval. The technology can be used as an important supplement to the face recognition technology. In most cases, the monitoring camera cannot capture clear front shots of pedestrians, and the pedestrian re-recognition technology can continuously track pedestrians across the camera when clear faces cannot be obtained.
The biggest challenge of pedestrian re-recognition technology is cross-camera retrieval. Because different camera positions, perspectives, scenes, illumination, pedestrian gestures, etc., affect the characteristics of the pedestrian image, an efficient algorithm is needed to find the same pedestrian in the candidate set as the query target identity. In addition, there are often problems of occlusion, posture change, etc. in the pedestrian image, and these factors also increase the difficulty of re-recognition of the pedestrian. To address these problems, past pedestrian re-recognition algorithms have typically employed ways to design new manual features, objective functions, or network structures to improve model performance. In recent years, with the rapid development of deep learning, pedestrian re-recognition technology based on deep learning becomes a new research hotspot. These techniques extract more discriminative pedestrian features through end-to-end training, thereby significantly improving the ability of the model to retrieve across cameras. Meanwhile, the release of a plurality of large pedestrian re-identification data sets promotes the development of deep learning in the field, so that the performance of the model is greatly improved.
At present, the research in the pedestrian re-recognition field is concentrated in the pedestrian re-recognition based on pictures, but the storage of monitoring data in real life adopts a video format, so that the research on the pedestrian re-recognition based on video can reduce the labor cost, simplify the application flow, and enable the research to be closer to the actual scene. In addition, the pedestrian picture sequence provided by the video has more abundant information in time and space, and if the information can be fully utilized, the video pedestrian re-recognition can also obtain more excellent pedestrian retrieval capability in the practical application scene.
The existing pedestrian re-identification methods are mainly divided into two types: picture-based pedestrian re-identification and video-based pedestrian re-identification. At present, most methods are based on pictures, but picture-based research methods cannot make good use of the temporal information of the data, and in real life most monitoring data are stored in video form, so video-based pedestrian re-identification is closer to the actual usage scenario and has more practical application value. Picture-based pedestrian re-identification takes one or more images without temporal continuity as input and focuses on extracting appearance features such as the color and texture of clothing, without considering temporal information between pictures. However, these appearance-based methods find it difficult to re-identify accurately if a given image contains a large amount of noise or large-area occlusion. In contrast, video-based pedestrian re-identification takes a picture sequence with temporal continuity as input and can exploit richer appearance information and additional temporal information; if one frame is occluded, another frame can be selected as a reference. Video-based methods can even use the motion posture or gait change of a pedestrian for identification, thereby reducing the dependence of re-identification on appearance features.
The existing video pedestrian re-identification method also has some problems. Since the simplest method of generating video features is to fuse frame-level features together, and frame-level features are based on two-dimensional convolution operations, the temporal information of the frame sequence is completely ignored, so such a fusion approach may miss a time cue that is critical for re-recognition. Second, videos have different frame rates and different sequence lengths, which makes it difficult to uniformly compare samples to each other. Finally, not all frames provide authentication information, and some outlier frames tend to be detrimental to the ability to promote model feature expression. In many existing research methods, only spatial information of video data is fully mined, and additional time information in the video data is not paid attention in many researches.
A notable characteristic of video data is that one video segment contains a large number of picture frames. Because the time intervals are short, the features of adjacent pictures in the same video segment are very similar, while two frames with a long time interval may show obvious feature differences. For example, the upper-body features may be clear in earlier frames while the lower-body features are clear in later frames, because different frames occlude different parts of the pedestrian. Such feature differences make the feature information at different time steps relatively independent, and the underlying network does not employ any special means to share features and exchange the useful information between them. The problem exists not only in the time dimension; similar problems exist in the spatial dimension and the channel dimension. In the spatial dimension, when part of the torso is occluded, the surrounding non-occluded torso cannot be used effectively to infer the occluded portion. Information sharing between different features is also lacking in the channel dimension. The common feature fusion and sharing methods at present include splicing, addition, multiplication and the like, but these methods all require pre-specifying how features are fused and shared; they have low flexibility and poor adaptability, and it is difficult to share features effectively in complex problems.
Disclosure of Invention
The invention is directed to the video-based pedestrian re-identification task. Aiming at the problems of information redundancy, occlusion, illumination change, viewpoint change, background interference, low resolution and the like in video data, a double-network-based video pedestrian re-identification method is provided, so that the model can better utilize temporal and spatial information and extract more diversified, higher-quality pedestrian features; the multi-frame information in the video is fully utilized, and the features of the whole video are aggregated and then recombined and redistributed, so that each feature can be fused and shared with the other features of the whole video, thereby mining the useful information of the whole video more fully and improving the pedestrian re-identification accuracy.
The technical scheme of the invention is as follows: a double-network-based video pedestrian re-identification method comprises the following steps:
step 1: acquiring a pedestrian video image with a pedestrian identity tag, and preprocessing;
step 2: constructing a video pedestrian re-identification model, obtaining pedestrian characteristics by the preprocessed pedestrian video image data through the video pedestrian re-identification model, and obtaining a pedestrian identity prediction result through characteristic comparison;
the overall architecture of the video pedestrian re-identification model is as follows;
The backbone network of the video pedestrian re-identification model is a 3D-transformed ResNet50 deep convolutional neural network; the 3D transformation of ResNet50 replaces all 2D network layers with corresponding 3D network layers; the temporal hyper-parameters of each 3D network layer take the same values as the hyper-parameters of the other dimensions in the 2D version; the pretrained 2D weight is used as the weight of one time slice, and the weights of the remaining time slices are kept consistent with it;
ResNet50 is composed of conv1, conv2_x, conv3_x, conv4_x, conv5_x and a final fully connected layer; the multi-branch diversified feature mining spatio-temporal attention network is located after conv4_x, and the aggregation-redistribution-based convolutional neural encoder-decoder network is located after conv5_x;
step 3: designing a loss function, and training through the difference between the pedestrian identity prediction result and the data set with the pedestrian identity label until the loss function converges;
step 4: and extracting features of the pedestrian video data by the trained video pedestrian re-recognition model, and sorting the similarity of pedestrians to be searched to obtain a re-recognition result.
The pretreatment is specifically as follows:
fixing the size of the pedestrian video image with the pedestrian identity tag, and uniformly adjusting the size of the pedestrian video image to be consistent with the input requirement of the video pedestrian re-identification model;
The pedestrian video image after the size adjustment is standardized, scaled and translated, so that the input distribution of each layer of the video pedestrian re-identification model has the same statistical characteristics;
x̂ = (x − μ) / σ    (1)
wherein x is the input data pixel value, μ and σ are the channel mean and standard deviation, and x̂ is the standardized output; the input pedestrian video image is RGB three-channel data, different standardization parameters are used for the three channels respectively, and the means and variances of the three channels are obtained by sampling and calculation;
the standardized pedestrian video image is horizontally turned over and randomly erased, and the input pedestrian video image data is downsampled in the time dimension; and uniformly segmenting the input pedestrian video image data into video fragments, randomly selecting a frame from each video fragment, and inputting the selected frame as a video pedestrian re-identification model.
The multi-branch diversified feature mining space-time attention network comprises a multi-branch heterogeneous space attention module and a multi-branch isomorphic time attention module; the output characteristics of the upper layer conv4_x are input to a multi-branch heterogeneous space attention module; the multi-branch heterogeneous space attention module comprises a plurality of heterogeneous space attention sub-modules, wherein the output features of the upper layer conv4_x respectively enter different heterogeneous space attention sub-modules, and different pedestrian features are mined; each heterogeneous spatial attention sub-module uses a soft spatial attention mechanism, namely a spatial attention module based on convolution and a spatial attention module based on mean value calculation, which are distributed in a crossed manner; the soft spatial attention mechanism focuses on different areas to different degrees; and obtaining the final pedestrian characteristic expression by integrating the attribute information output by each heterogeneous space attention submodule.
The input of the convolution-based spatial attention module is the feature X output by the upper-layer network, and the number of feature channels is C; the convolution-based spatial attention weight is obtained by two linear transformations, one nonlinear transformation and one softmax calculation;
the first linear transformation gives Y1, as shown in formula (2):
Y1 = W1·X + b1    (2)
wherein X is the input feature, W1 is the weight, and b1 is the bias value;
a relu activation function, which combines linear and nonlinear characteristics, is applied to Y1 to obtain the activation function output Y2, as shown in formula (3):
Y2 = relu(Y1)    (3)
after the activation function output is obtained, a linear transformation is used again to obtain the initial spatial attention score S, as shown in formula (4):
S = W2·Y2 + b2    (4)
in the above formula W2 is the weight, T is the length of the input data time series, and b2 is the bias value;
after S is obtained, a softmax function is used to obtain the convolution-based spatial attention weights of all positions, as shown in formula (5):
A_{i,j} = exp(S_{i,j}) / ( Σ_{h=1..H} Σ_{w=1..W} exp(S_{h,w}) )    (5)
S_{i,j} is the initial spatial attention score of position (i,j), A_{i,j} is the spatial attention weight of position (i,j), A denotes the convolution-based spatial attention weight matrix of the whole space, and H and W are the boundary values of the spatial positions;
after the convolution-based spatial attention weight matrix is obtained, the convolution-based spatial attention weight of each position is multiplied by its feature intensity; a spatial attention hyper-parameter λ_s is introduced when multiplying the convolution-based spatial attention weight by the feature value of each position to control the feature intensity of the spatial attention output, giving the convolution-based spatial attention feature F1, as shown in formula (6):
F1 = λ_s · A ⊙ X    (6)
the input of the spatial attention module based on mean-value calculation is also the feature X output by the upper-layer network, and the number of feature channels is C; the mean-based spatial attention parameter is obtained by one averaging operation, one nonlinear transformation and one softmax calculation;
first, the mean of the channel dimension of the input feature X is calculated to obtain M, as shown in formula (7):
M_{i,j} = (1/C) Σ_{c=1..C} X_{c,i,j}    (7)
wherein X_c is the channel feature of the c-th channel and M_{i,j} is the current initial spatial attention score; after M is obtained, the softmax function converts M into the mean-based spatial attention weight matrix A′, as shown in formula (8):
A′_{i,j} = exp(M_{i,j}) / ( Σ_{h=1..H} Σ_{w=1..W} exp(M_{h,w}) )    (8)
after the mean-based spatial attention weight is obtained, it is multiplied by the corresponding position of the input feature; the spatial attention hyper-parameter λ_s is introduced to control the feature intensity of the spatial attention based on mean-value calculation, and finally the final output of the spatial attention module based on mean-value calculation, the mean-based spatial attention feature F2, is obtained as shown in formula (9):
F2 = λ_s · A′ ⊙ X    (9)
after feature mining by each heterogeneous spatial attention sub-module, the convolution-based spatial attention features F1 and the mean-based spatial attention features F2, collectively referred to as the output features F, are sent to the multi-branch isomorphic temporal attention module.
The multi-branch isomorphic temporal attention module comprises a plurality of isomorphic but differently weighted convolution-based temporal attention sub-modules; each convolution-based temporal attention sub-module receives the output of a different heterogeneous spatial attention sub-module;
the time attention weight parameter is calculated by two linear transformations, one nonlinear transformation and a softmax function; the time attention calculation mode is the same as the space attention weight based on convolution, but the calculation dimension is different;
the first linear transformation output Z1 is shown in formula (10), where the output feature of one heterogeneous spatial attention sub-module is denoted F:
Z1 = W3·F + b3    (10)
wherein W3 is the weight and b3 is the bias value; after the first linear transformation output is obtained, a relu activation function combining linear and nonlinear characteristics is introduced to obtain Z2, as shown in formula (11):
Z2 = relu(Z1)    (11)
after the activation function output is obtained, a linear transformation is used again to obtain the initial temporal attention score Q, as shown in formula (12):
Q = W4·Z2 + b4    (12)
wherein W4 is the weight and b4 is the bias value, with Q having one score per frame of the input sequence; after Q is obtained, the softmax function is used to obtain the temporal attention weight parameter matrix B, as shown in formula (13):
B_t = exp(Q_t) / Σ_{k=1..T} exp(Q_k)    (13)
after the temporal attention weight parameter matrix is obtained, a temporal attention hyper-parameter λ_t is introduced to control the temporal attention feature intensity; the temporal attention hyper-parameter, the temporal attention weight parameter matrix and the input feature are multiplied to obtain the final temporal attention output feature F_t, as shown in formula (14):
F_t = λ_t · B ⊙ F    (14)
the multi-branch diversified feature mining spatio-temporal attention network performs fusion after the output of each convolution-based temporal attention module; the fusion first compresses the features and then splices them in the channel dimension;
after feature compression, the output of the whole multi-branch diversified feature mining spatio-temporal attention network is obtained in the channel dimension by concatenation;
the feature compression is shown in formula (15):
F′ = W5·F_t + b5    (15)
wherein F_t is the temporal attention output feature of a convolution-based temporal attention sub-module, F′ is the output of this feature compression, W5 is the weight, b5 is the bias value, C′ is the number of channels after compression, and C is the number of channels before compression.
The convolutional neural encoder-decoder network based on aggregation redistribution comprises a multi-stage feature aggregation encoder and a multi-stage feature redistribution decoder;
the multi-stage feature aggregation encoder comprises a spatial sub-encoder, a channel sub-encoder and a temporal sub-encoder, and the input data is feature-compressed by the successive sub-encoders in the order of the spatial, channel and temporal dimensions; the spatial sub-encoder first compresses the features in the spatial dimension to obtain the spatial-dimension feature compression value E_s, as shown in formula (16):
E_s = BN(W_s·X_in + b_s)    (16)
wherein W_s is the weight, b_s is the bias value, H′×W′ is the feature size after spatial-dimension compression, X_in is the input feature of the multi-stage feature aggregation encoder, and BN denotes the batch normalization operation;
the output of each spatial sub-encoder in formula (16) is used both as the input of the next-stage channel sub-encoder and as part of the input of the multi-stage feature redistribution decoder; feature compression is then carried out on the channel dimension, and the channel-dimension feature compression value E_c is obtained using the same compression method as the spatial dimension, as shown in formula (17):
E_c = BN(W_c·E_s + b_c)    (17)
wherein W_c is the weight, b_c is the bias value, and C′ is the number of feature channels after channel compression;
the output of each channel sub-encoder is used as the input of the next temporal sub-encoder and as part of the input to the multi-stage feature redistribution decoder for the recombination and assignment of the multi-stage features; feature compression is then performed in the time dimension, and the time-dimension feature compression value E_t is obtained in the same way, as shown in formula (18):
E_t = BN(W_t·E_c + b_t)    (18)
wherein W_t is the weight, b_t is the bias value, and T′ is the length of the feature sequence after temporal compression; E_t is the final output of the multi-stage feature aggregation encoder and is sent directly to the multi-stage feature redistribution decoder;
the multi-stage feature redistribution decoder recombines and distributes the compressed features in the order of the temporal, channel and spatial dimensions to complete the information fusion;
the multi-stage feature redistribution decoder first directly decodes the last-stage output of the multi-stage feature aggregation encoder in the time dimension to obtain D_t; the decoding is shown in formula (19):
D_t = BN(W_dt·E_t + b_dt)    (19)
wherein W_dt is the weight and b_dt is the bias value; after the output of the first-stage decoder is obtained, second-stage feature decoding is performed in the channel dimension to obtain D_c; the decoding formula of the second-stage decoder is shown in formula (20):
D_c = BN(W_dc·cat(D_t, E_c) + b_dc)    (20)
wherein cat is the feature splicing operation, W_dc is the weight, and b_dc is the bias value; after the output of the second-stage decoder is obtained, it is first fused with the features provided by the spatial sub-encoder and then input to the third-stage decoder for decoding to obtain D_s; the third-stage decoder operates in the spatial dimension, as shown in formula (21):
D_s = BN(W_ds·cat(D_c, E_s) + b_ds)    (21)
wherein W_ds is the weight and b_ds is the bias value; after the output of the final-stage decoder is obtained, it is spliced and fused with the input feature X_in of the encoder to obtain the final output of the whole multi-stage feature redistribution decoder, which is used for subsequent classification prediction.
The loss function comprises a cross-entropy loss function, a triplet loss function and a divergence loss function;
the cross-entropy loss function L_ce is shown in formula (22):
L_ce = −(1/N) Σ_{i=1..N} Σ_{c=1..M} y_{i,c} log(p_{i,c})    (22)
wherein N is the total number of samples, M is the number of categories, y_{i,c} takes 1 if the true category of sample i equals c and 0 otherwise, and p_{i,c} is the probability that sample i belongs to category c;
the triplet loss function L_tri is shown in formula (23):
L_tri = (1/N) Σ_{i=1..N} max( ||f_a^i − f_p^i||_2 − ||f_a^i − f_n^i||_2 + α, 0 )    (23)
wherein N is the total number of samples, a denotes an anchor sample, p denotes a positive sample, n denotes a negative sample, f_a^i denotes the i-th anchor sample feature, f_p^i denotes the i-th positive sample feature, f_n^i denotes the i-th negative sample feature, and α is a hyper-parameter;
the divergence loss function is used in the multi-branch diversified feature mining spatio-temporal attention network, and cosine similarity is used to measure the similarity of the output features of the spatio-temporal attention sub-modules; the formula is shown in formula (24):
sim(f_i, f_j) = (f_i · f_j) / (||f_i|| ||f_j||)    (24)
wherein f_i and f_j respectively represent the output features of the i-th and the j-th multi-branch diversified feature mining spatio-temporal attention networks; formula (25) defines the difference d_{i,j} between the features of two branch diversified feature mining spatio-temporal attention networks:
d_{i,j} = 1 − sim(f_i, f_j)    (25)
the larger d_{i,j} is, the larger the feature difference between the two branch diversified feature mining spatio-temporal attention networks; the total loss L_div of the whole spatio-temporal attention module is expressed by formula (26):
L_div = −(2 / (K(K−1))) Σ_{i=1..K} Σ_{j=i+1..K} d_{i,j}    (26)
wherein K is the total number of branch diversified feature mining spatio-temporal attention networks; when the output features of the branch feature mining spatio-temporal attention networks become more diverse, the value of d_{i,j} becomes larger and L_div becomes smaller, so the feature difference of each branch diversified feature mining spatio-temporal attention network can be enhanced by minimizing the divergence loss function L_div;
the total loss function is shown in formula (27):
L = L_ce + L_tri + γ·L_div    (27)
wherein γ is a hyper-parameter.
The invention has the beneficial effects that: (1) The multi-branch diversified feature mining spatio-temporal attention network designed by the invention consists of multi-branch heterogeneous spatial attention and multi-branch isomorphic temporal attention. The multi-branch heterogeneous spatial attention module enables the model, on the one hand, to pay attention to more important regions and alleviate problems such as occlusion and background interference, and on the other hand, to mine diversified features more fully. The multi-branch isomorphic temporal attention module attends, to different degrees in the time dimension, to the different features mined by each heterogeneous spatial attention sub-module, so that the model can obtain higher-quality feature expression adapted to the characteristics of video data.
(2) A convolutional neural encoder-decoder network based on aggregation and reassignment is presented. By performing multi-stage encoding and multi-stage decoding in a specific order over the three dimensions, features are adaptively fused and shared in all three dimensions, information exchange between different layers of the model is enhanced, and the features finally mined by the model are more complete and of higher quality.
The invention provides a video pedestrian re-identification model based on a multi-branch diversified feature mining spatio-temporal attention network and an aggregation-redistribution convolutional neural encoding-decoding network. The model alleviates, to a certain extent, the problems of information redundancy, occlusion, background interference, illumination change, viewpoint change and low resolution that commonly exist in the video pedestrian re-identification task. Through these two methods with different emphases, the basic model is optimized and improved in multiple aspects, so that it can complete the video pedestrian re-identification task more effectively.
Drawings
FIG. 1 is a diagram of the overall structure of a video pedestrian re-recognition model;
FIG. 2 is a general framework of a multi-branch diversity feature mining spatiotemporal attention network;
FIG. 3 is a convolution-based spatial attention sub-module;
FIG. 4 is a spatial attention sub-module based on mean calculation;
FIG. 5 is a multi-level aggregate reassignment codec overall framework;
FIG. 6 is a multi-level feature aggregation encoder architecture;
fig. 7 is a multi-level feature reassignment decoder structure.
Detailed Description
The video pedestrian re-identification method based on the double networks generally comprises the following steps:
step 1: and acquiring a pedestrian video image with the pedestrian identity tag, and preprocessing.
Step 1-1: the convolutional neural network adopted by the method requires the input image to be fixed in size, and the size of the input image is uniformly adjusted to be consistent with the input requirement of the model. The input image size is all adjusted to a height 256, width 128.
Step 1-2: the input data is scaled and translated according to a certain rule by standardization, so that the input distribution of each layer has the same statistical characteristics, thereby reducing the change of the input distribution, improving the training stability and standardizing the data set;
wherein,for inputting pixel values +.>Is a standardized output. The input video is RGB three-channel data, so that different standardized parameters are respectively used for the three channels, and the mean value and the variance of the standardized parameters are obtained by sampling calculation from the ImageNet.
Step 1-3: the input image is flipped horizontally and erased randomly, and then the input video data is downsampled in the time dimension. Firstly, uniformly segmenting input video data into 6 video clips, then randomly selecting a frame from each video clip, and inputting the selected 6 frames as a model. The sampling mode can not only reserve the time information of the video data on the full scale to the greatest extent, but also greatly reduce the calculation resource requirements of model training and reasoning, and simultaneously reduce the information redundancy of the video data.
Step 2: and constructing a video pedestrian re-identification model based on a space-time attention network and a convolutional neural warp decoding network, obtaining pedestrian characteristics from the preprocessed data through the model, and obtaining pedestrian identity prediction through a classifier.
Step 2-1: the backbone network of the method is based on a ResNet50 deep convolutional neural network after 3D transformation. The innovation of ResNet is the introduction of a residual connection, or referred to as a jump connection, which is a way to introduce a connection directly in the network that bypasses the middle layer.
ResNet is composed of a number of residual basic units, and each basic unit is composed of convolution layers, pooling layers and activation layers. The residual block introduces a skip connection into the feed-forward network: it first performs the feed-forward operations F(x), then adds the original input x of the block directly to the result, and finally applies a relu nonlinear mapping to produce the output, i.e. y = relu(F(x) + x). Through the residual connection, information can be transmitted across layers in the network, which alleviates the vanishing-gradient problem and makes deeper networks easier to train. ResNet comes in a variety of depths; the network structure of ResNet50 is shown in Table 1.
TABLE 1 ResNet50 Structure
Although ResNet50 performs excellently on image problems, its convolution kernels are all 2D kernels, so it can only process two-dimensional picture data and cannot directly process three-dimensional data such as video. Therefore, a 3D modification of ResNet50 is required. The 3D modification mainly replaces all 2D convolutions, 2D pooling layers and the like with corresponding 3D convolutions, 3D pooling layers and the like. The temporal hyper-parameters of the 3D components all use the hyper-parameter values of the other dimensions, such as convolution kernel size and padding size, which are the same as in the 2D version. When using ResNet50-3D, the ResNet50 weights pre-trained on ImageNet are still used, but these are 2D model weights; when used, the 2D weight is taken as the weight of one time slice, and the weights of the remaining time slices are kept consistent with it.
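The following sketch illustrates, for a single convolution layer, one way the 2D-to-3D weight inflation described above could be carried out; it assumes, per the text, that each temporal hyper-parameter reuses the corresponding 2D spatial value and that the pretrained 2D kernel is copied unchanged to every time slice. The helper name and the example shapes are assumptions, and a full conversion would apply the same treatment to every layer of ResNet50.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def inflate_conv(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Build a Conv3d whose temporal kernel size, stride and padding reuse the 2D
    spatial values, and whose pretrained 2D weights are copied to every time slice."""
    kt, st, pt = conv2d.kernel_size[0], conv2d.stride[0], conv2d.padding[0]
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(kt, *conv2d.kernel_size),
        stride=(st, *conv2d.stride),
        padding=(pt, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, kT, kH, kW): same 2D kernel in each time slice,
        # with no rescaling applied here.
        conv3d.weight.copy_(conv2d.weight.data.unsqueeze(2).expand_as(conv3d.weight))
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias.data)
    return conv3d

# Example: inflate the stem convolution of an ImageNet-pretrained ResNet50.
backbone2d = resnet50(weights="IMAGENET1K_V1")
conv1_3d = inflate_conv(backbone2d.conv1)
video = torch.randn(2, 3, 6, 256, 128)   # (batch, channels, frames, height, width)
print(conv1_3d(video).shape)             # torch.Size([2, 64, 3, 128, 64])
```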
As shown in fig. 1, the multi-branch diversity feature mining spatiotemporal attention network designed by the present method is located after conv4_x, while the convolutional neural codec network based on aggregate reassignment is located after conv5_x.
Step 2-2: the multi-branch diversity feature mining spatiotemporal attention network is mainly composed of two major parts, namely multi-branch heterogeneous spatial attention and multi-branch isomorphic temporal attention, as shown in fig. 2. After the multi-branch diversified feature mining space-time attention network receives the output features of the upper layer conv4_x, the multi-branch heterogeneous space attention is firstly carried out. The module is provided with a plurality of heterogeneous space attention sub-modules, input features enter different attention sub-modules respectively, and different pedestrian features are fully excavated. Each heterogeneous spatial attention sub-module uses a soft spatial attention mechanism. Soft attention mechanisms are concerned with different areas to different degrees, but do not ignore others entirely. Each spatial attention sub-module represents a pedestrian attribute mining sub-network, so that potential pedestrian attributes are mined, the model is enabled to more purposefully mine pedestrian features, and finally, the attribute information is integrated to execute final pedestrian feature expression. Where the attention mechanism is used is convolution-based spatial attention and mean-based spatial attention.
The input of the convolution-based spatial attention module is the feature X output by the upper-layer network, and the number of feature channels is C. The spatial attention weight is obtained by two linear transformations, one nonlinear transformation and one softmax calculation; the operation process is shown in fig. 3.
The first linear transformation is shown as follows:
Y1 = W1·X + b1
where X is the input feature, W1 is the weight and b1 is the bias value.
A relu activation function, which combines linear and nonlinear characteristics, is then applied to Y1:
Y2 = relu(Y1)
After the activation function output is obtained, a linear transformation is used again:
S = W2·Y2 + b2
In the above, W2 is the weight, b2 is the bias value, and T is the time series length.
After S is obtained, the attention weights of all spatial positions are obtained using a softmax function:
A_{i,j} = exp(S_{i,j}) / ( Σ_{h=1..H} Σ_{w=1..W} exp(S_{h,w}) )
S_{i,j} is the initial spatial attention score of position (i,j), A_{i,j} is the spatial attention weight of position (i,j), and A is the attention weight matrix of the whole space. After the spatial attention weight is obtained, the spatial attention weight of each position is multiplied by its feature intensity to obtain the spatial attention feature. A spatial attention hyper-parameter λ_s is introduced when multiplying the spatial attention weight by the feature value of each position to control the feature intensity of the spatial attention output. Finally:
F1 = λ_s · A ⊙ X
The input of the spatial attention based on mean-value calculation is also the feature X output by the upper-layer network, and the number of feature channels is C. The mean-based spatial attention parameter is obtained by one averaging operation, one nonlinear transformation and one softmax calculation; the calculation process is shown in fig. 4.
First, the mean of the channel dimension of the input feature X is calculated:
M_{i,j} = (1/C) Σ_{c=1..C} X_{c,i,j}
where X_c is the channel feature of the c-th channel and M_{i,j} is the current initial spatial attention score.
For the same reason as in the convolution-based attention, after M is obtained the softmax function converts M into the mean-based spatial attention weight matrix A′:
A′_{i,j} = exp(M_{i,j}) / ( Σ_{h=1..H} Σ_{w=1..W} exp(M_{h,w}) )
After the mean attention weight is obtained, it is multiplied by the corresponding position of the input feature. The spatial attention hyper-parameter λ_s is introduced to control the feature intensity of the spatial attention based on mean-value calculation, and the final output of this spatial attention sub-module is obtained:
F2 = λ_s · A′ ⊙ X
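For comparison, a sketch of the mean-calculation-based spatial attention under the same (batch, C, T, H, W) layout; it has no learnable parameters apart from the hyper-parameter λ_s, whose default value here is an assumption.

```python
import torch
import torch.nn.functional as F

def mean_spatial_attention(x: torch.Tensor, lambda_s: float = 1.0) -> torch.Tensor:
    """Mean-calculation-based spatial attention: the channel-wise mean gives the
    initial score, softmax over spatial positions gives the weights, and the
    hyper-parameter lambda_s scales the output feature intensity. x: (B, C, T, H, W)."""
    b, c, t, h, w = x.shape
    m = x.mean(dim=1, keepdim=True)                     # (B, 1, T, H, W) channel mean
    a = F.softmax(m.reshape(b, 1, t, h * w), dim=-1)    # softmax over the H*W positions
    a = a.reshape(b, 1, t, h, w)                        # mean-based attention weight matrix A'
    return lambda_s * a * x

out = mean_spatial_attention(torch.randn(2, 1024, 6, 16, 8))
print(out.shape)                                        # torch.Size([2, 1024, 6, 16, 8])
```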
after each spatial attention sub-module performs feature mining, the output features are sent to a plurality of isomorphic but different-weight temporal attention sub-modules. The time attention submodules receive different space attention submodule outputs, and each heterogeneous space attention submodule is connected in series with one convolution-based time attention submodule. These time attention sub-modules, while structurally identical, do not share weights.
The temporal attention parameter is likewise calculated from two linear transformations, one nonlinear transformation and a softmax function. The calculation is similar to the convolution-based spatial attention, but the dimension over which it is calculated is different.
The first linear transformation is shown below, where the output feature of a spatial attention sub-module is denoted F:
Z1 = W3·F + b3
where W3 is the weight and b3 is the bias value. After the first linear transformation output is obtained, a relu activation function is introduced to improve the nonlinear expression capacity of the model and enhance its ability to model complex features:
Z2 = relu(Z1)
After the activation function output is obtained, a linear transformation is used again:
Q = W4·Z2 + b4
where Q is the initial temporal attention score, W4 is the weight and b4 is the bias value, with one score per frame of the input sequence. After Q is obtained, the temporal attention weight matrix B is obtained using the softmax function, for the same reason as in the heterogeneous spatial attention:
B_t = exp(Q_t) / Σ_{k=1..T} exp(Q_k)
After the temporal attention weight matrix is obtained, the temporal attention hyper-parameter λ_t is again introduced to control the temporal attention feature intensity. The temporal attention hyper-parameter, the temporal attention weight and the input feature are multiplied to obtain the final temporal attention output feature:
F_t = λ_t · B ⊙ F
The multi-branch diversified feature mining spatio-temporal attention network performs fusion after each temporal attention output; the fusion first compresses the features and then splices them in the channel dimension. The feature compression of all the spatio-temporal attention sub-networks is identical, and feature fusion is carried out in the channel dimension. Taking the feature compression of one of the spatio-temporal attention sub-networks as an example:
F′ = W5·F_t + b5
where F_t is the output of a temporal attention sub-module, F′ is the output of this feature compression, W5 is the weight, b5 is the bias value, C′ is the number of channels after compression, and C is the number of channels before compression.
After feature compression, the output of the whole multi-branch diversified feature mining space-time attention network is acquired in the channel dimension by using a splicing mode.
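A sketch of one convolution-based temporal attention sub-module and of the compress-then-concatenate branch fusion is given below. The spatial average pooling used to obtain a per-frame descriptor, the hidden width, and the compressed channel count are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Convolution-based temporal attention: two linear transformations + relu,
    softmax over the T frames, scaled by the hyper-parameter lambda_t."""
    def __init__(self, channels: int, hidden: int = 128, lambda_t: float = 1.0):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, 1)
        self.lambda_t = lambda_t

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W); pool space so the score depends on the whole frame (assumption).
        frame_desc = x.mean(dim=(3, 4)).transpose(1, 2)          # (B, T, C)
        q = self.fc2(F.relu(self.fc1(frame_desc)))               # (B, T, 1) initial scores
        b_t = F.softmax(q, dim=1).transpose(1, 2)                # (B, 1, T) temporal weights
        return self.lambda_t * b_t[..., None, None] * x          # re-weight every frame

class BranchFusion(nn.Module):
    """Compress each branch's channels with a 1x1x1 convolution, then
    concatenate all branches in the channel dimension."""
    def __init__(self, channels: int, compressed: int, num_branches: int):
        super().__init__()
        self.compress = nn.ModuleList(
            [nn.Conv3d(channels, compressed, kernel_size=1) for _ in range(num_branches)]
        )

    def forward(self, branch_feats):
        return torch.cat([c(f) for c, f in zip(self.compress, branch_feats)], dim=1)

# Usage: two branches, each already processed by its own spatial + temporal attention.
feats = [TemporalAttention(1024)(torch.randn(2, 1024, 6, 16, 8)) for _ in range(2)]
fused = BranchFusion(channels=1024, compressed=512, num_branches=2)(feats)
print(fused.shape)                                               # torch.Size([2, 1024, 6, 16, 8])
```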
Step 2-3: the output of the multi-branch diversity feature mining spatiotemporal attention network, after passing conv5_x, enters a convolutional neural warp decoding network based on aggregate reassignment. The overall framework of the convolutional neural warp decoding network based on aggregation redistribution proposed by the method is shown in fig. 5.
The convolutional neural warp decoding network based on aggregation redistribution consists of two parts, namely a multi-stage feature aggregation encoder and a multi-stage feature redistribution decoder.
The multi-stage feature aggregation encoder adopts a multi-stage encoding structure, as shown in fig. 6. The encoder first compresses the features in the spatial dimension:
E_s = BN(W_s·X_in + b_s)
where BN denotes the batch normalization operation, which normalizes the features using the mean and variance computed over the batch.
W_s is the weight, b_s is the bias value, X_in is the input feature of the encoder, and H′×W′ is the feature size after compression. The output of this stage of sub-encoder serves not only as the input of the next-stage sub-encoder but also as part of the input of the multi-stage feature reassignment decoder. Feature compression is then carried out on the channel dimension, using the same compression method as the spatial dimension:
E_c = BN(W_c·E_s + b_c)
where W_c is the weight, b_c is the bias value, and C′ is the number of feature channels after compression. The output of this stage of encoder is also fed as part of the input into the decoder for the recombination and assignment of the multi-stage features. Finally, feature compression is performed in the time dimension in the same way:
E_t = BN(W_t·E_c + b_t)
where W_t is the weight, b_t is the bias value, and T′ is the length of the feature sequence after compression. E_t is the last-stage output of the multi-stage feature aggregation encoding network, and this feature is sent directly to the decoder. The multi-stage feature reassignment decoder is shown in fig. 7.
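Before turning to the decoder, the following sketch illustrates the space → channel → time aggregation encoder just described, under the assumption that each sub-encoder is a linear projection of the corresponding dimension followed by batch normalization, with the spatial positions flattened into one axis; all compressed sizes are illustrative.

```python
import torch
import torch.nn as nn

class DimCompress(nn.Module):
    """One sub-encoder: linear compression of a chosen dimension of a (B, C, T, S)
    tensor followed by batch normalization, where S is the flattened H*W axis."""
    def __init__(self, dim, in_size, out_size, bn_channels):
        super().__init__()
        self.dim = dim
        self.proj = nn.Linear(in_size, out_size)
        self.bn = nn.BatchNorm1d(bn_channels)

    def forward(self, x):
        x = x.movedim(self.dim, -1)                  # put the compressed dimension last
        shape = x.shape
        x = self.bn(self.proj(x.reshape(shape[0], -1, shape[-1])))
        return x.reshape(*shape[:-1], -1).movedim(-1, self.dim)

class MultiStageAggregationEncoder(nn.Module):
    """Compress the feature in the order space -> channel -> time; every stage's
    output is also returned so the decoder can use it as a skip connection."""
    def __init__(self, c=2048, t=6, s=32, c2=256, t2=2, s2=16):
        super().__init__()
        self.spatial = DimCompress(dim=3, in_size=s, out_size=s2, bn_channels=c * t)
        self.channel = DimCompress(dim=1, in_size=c, out_size=c2, bn_channels=t * s2)
        self.temporal = DimCompress(dim=2, in_size=t, out_size=t2, bn_channels=c2 * s2)

    def forward(self, x):                            # x: (B, C, T, S)
        e_s = self.spatial(x)                        # (B, C,  T,  S')
        e_c = self.channel(e_s)                      # (B, C', T,  S')
        e_t = self.temporal(e_c)                     # (B, C', T', S')
        return e_s, e_c, e_t

enc = MultiStageAggregationEncoder()
e_s, e_c, e_t = enc(torch.randn(2, 2048, 6, 32))
print(e_s.shape, e_c.shape, e_t.shape)
```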
The decoder first decodes the last-stage output of the encoder directly in the time dimension:
D_t = BN(W_dt·E_t + b_dt)
where W_dt is the weight and b_dt is the bias value. After the output of the first-stage decoder is obtained, the second-stage feature decoding is performed in the channel dimension. The decoding formula of the second-stage sub-decoder is:
D_c = BN(W_dc·cat(D_t, E_c) + b_dc)
where cat is the feature splicing operation, W_dc is the weight, and b_dc is the bias value. After the output of the second-stage decoder is obtained, it is first fused with the input of the second-stage encoder, i.e. the output of the spatial sub-encoder, and then input to the third-stage decoder; the third-stage decoder operates in the spatial dimension:
D_s = BN(W_ds·cat(D_c, E_s) + b_ds)
where W_ds is the weight and b_ds is the bias value. After the last-stage decoder output, there is no further decoder stage; however, in view of the advantages of the residual structure, the output of the last-stage decoder is spliced and fused with the input of the first-stage encoder to serve as the final output of the whole multi-stage feature reassignment decoder. The pedestrian features thus obtained can be used for subsequent classification prediction.
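A matching sketch of the time → channel → space reassignment decoder; the pairing of each decoding stage with its encoder skip connection (the channel sub-encoder output at the second stage, the spatial sub-encoder output at the third stage, and the encoder input for the final splice) follows the reading of the text above and fig. 7, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class DecodeStage(nn.Module):
    """One decoder stage: optionally concatenate an encoder skip connection on the
    channel dimension, then linearly expand the chosen dimension and apply BN."""
    def __init__(self, dim, in_size, out_size, bn_channels):
        super().__init__()
        self.dim = dim
        self.proj = nn.Linear(in_size, out_size)
        self.bn = nn.BatchNorm1d(bn_channels)

    def forward(self, x, skip=None):
        if skip is not None:
            x = torch.cat([x, skip], dim=1)          # fuse with the skip connection (channels)
        x = x.movedim(self.dim, -1)                  # put the decoded dimension last
        shape = x.shape
        x = self.bn(self.proj(x.reshape(shape[0], -1, shape[-1])))
        return x.reshape(*shape[:-1], -1).movedim(-1, self.dim)

# Illustrative sizes: x_in (B, C, T, S0); encoder skips e_s, e_c and last output e_t.
B, C, T, S0, S1, C2, T2 = 2, 2048, 6, 32, 16, 256, 2
x_in = torch.randn(B, C, T, S0)                      # encoder input, used for the final splice
e_s = torch.randn(B, C, T, S1)                       # spatial sub-encoder output
e_c = torch.randn(B, C2, T, S1)                      # channel sub-encoder output
e_t = torch.randn(B, C2, T2, S1)                     # temporal sub-encoder output

time_dec = DecodeStage(dim=2, in_size=T2, out_size=T, bn_channels=C2 * S1)
chan_dec = DecodeStage(dim=1, in_size=2 * C2, out_size=C2, bn_channels=T * S1)
spat_dec = DecodeStage(dim=3, in_size=S1, out_size=S0, bn_channels=(C2 + C) * T)

d_t = time_dec(e_t)                                  # first stage: decode the time dimension
d_c = chan_dec(d_t, skip=e_c)                        # second stage: channel dimension + skip
d_s = spat_dec(d_c, skip=e_s)                        # third stage: spatial dimension + skip
out = torch.cat([d_s, x_in], dim=1)                  # final splice with the encoder input
print(out.shape)                                     # torch.Size([2, 4352, 6, 32])
```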
Step 3: the loss function is designed, and the model is trained by taking the data set with the pedestrian identity label, so that the loss is gradually reduced until convergence.
The losses are calculated from the obtained pedestrian identity prediction, the pedestrian features output by the backbone of the pedestrian re-identification model, and the pedestrian identity labels, and are reduced using stochastic gradient descent to help the model converge. The loss function of the method consists of a cross-entropy loss function, a triplet loss function and a divergence loss function.
The cross-entropy loss function is shown as follows:
L_ce = −(1/N) Σ_{i=1..N} Σ_{c=1..M} y_{i,c} log(p_{i,c})
where N is the total number of samples, M is the number of categories, y_{i,c} takes 1 if the true category of sample i equals c and 0 otherwise, and p_{i,c} is the probability that sample i belongs to category c.
The triplet loss function is shown as follows:
L_tri = (1/N) Σ_{i=1..N} max( ||f_a^i − f_p^i||_2 − ||f_a^i − f_n^i||_2 + α, 0 )
where N is the total number of samples, a denotes an anchor sample, p denotes a positive sample, n denotes a negative sample, f_a^i denotes the i-th anchor sample feature, f_p^i denotes the i-th positive sample feature, f_n^i denotes the i-th negative sample feature, and α is a hyper-parameter.
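The cross-entropy and triplet losses above can be written compactly as below; the margin value and the batch shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.3):
    """max(||f_a - f_p|| - ||f_a - f_n|| + margin, 0), averaged over the batch."""
    d_ap = F.pairwise_distance(anchor, positive, p=2)
    d_an = F.pairwise_distance(anchor, negative, p=2)
    return F.relu(d_ap - d_an + margin).mean()

ce_loss = nn.CrossEntropyLoss()                     # combines log-softmax and NLL loss

# Usage with illustrative shapes: 8 samples, 625 identities, 2048-d features.
logits = torch.randn(8, 625)
labels = torch.randint(0, 625, (8,))
f_a, f_p, f_n = (torch.randn(8, 2048) for _ in range(3))
loss_id = ce_loss(logits, labels)
loss_tri = triplet_loss(f_a, f_p, f_n)
print(loss_id.item(), loss_tri.item())
```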
the divergence loss function number is used in the multi-branch diversified feature mining spatiotemporal attention network, and cosine similarity is used for measuring the similarity of the output features of the spatiotemporal attention sub-network. The formula is shown as follows:
wherein the method comprises the steps ofAnd->Respectively represent->Space-time attention network and first of multiple branch diversity feature miningThe multiple multi-branch diversity feature mines the output features of the spatio-temporal attention network. The variability of the two spatiotemporal attention subnetwork features is defined using the following equation. />
It can be seen that the light source is,the larger represents the greater the feature variability of the two spatiotemporal attention subnetworks. The following equation is used to represent the total loss of the entire spatiotemporal attention network.
Wherein the method comprises the steps ofIs a branchThe diversity feature mines the total number of spatiotemporal attention networks. When the variability of the spatiotemporal attention sub-network output characteristics becomes large, the +.>The value of (2) will also become greater and +.>Will become smaller and thus can be minimizedTo enhance the characteristic variability of each spatiotemporal attention sub-network.
The method uses the above three losses to train the model. The total loss function is designed as follows:
L = L_ce + L_tri + γ·L_div
Because the divergence loss is used to control the feature difference of the spatio-temporal attention sub-networks, a divergence loss that is too strong makes the model pursue feature difference at the expense of feature quality, while a divergence loss that is too weak prevents the model from effectively extracting diverse features; therefore the strength of the divergence loss is controlled for different tasks by the hyper-parameter γ.
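A sketch of the divergence loss and the total objective, under the assumption (consistent with the text above) that the divergence loss averages the negative cosine distance over all branch pairs, so that minimizing it drives the branch features apart; the value of γ is illustrative.

```python
import torch
import torch.nn.functional as F

def divergence_loss(branch_feats):
    """Average of -d_ij = -(1 - cosine similarity) over all branch pairs, so that
    minimizing the loss increases the pairwise feature difference.
    branch_feats: list of K (>= 2) tensors of shape (B, D)."""
    k = len(branch_feats)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            sim = F.cosine_similarity(branch_feats[i], branch_feats[j], dim=1).mean()
            total = total - (1.0 - sim)               # subtract d_ij for this pair
            pairs += 1
    return total / pairs

def total_loss(loss_ce, loss_tri, loss_div, gamma: float = 0.1):
    """Total objective with the divergence strength controlled by gamma."""
    return loss_ce + loss_tri + gamma * loss_div

feats = [torch.randn(8, 2048) for _ in range(4)]      # 4 branches
print(divergence_loss(feats).item())
```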
The method adopts the Adam optimizer in the training process. Adam is an adaptive optimization algorithm for deep learning model optimization; it adds a momentum mechanism on the basis of RMSprop and performs bias correction. The main idea of the Adam algorithm is to maintain an adaptive learning rate for each parameter by iteratively updating the first-moment estimate of the gradient, i.e. the mean of the gradient, and the second-moment estimate, i.e. the mean of the squared gradient.
According to the above steps, the trained and tested network model is obtained, and the video pedestrian re-recognition capability of the model can be effectively improved.
Step 4: and extracting features of the pedestrian video data by using the trained model, and sorting the similarity of pedestrians needing to be searched to obtain a re-identification result. After training is completed, the function of video pedestrian re-identification can be performed. After the video image set to be searched is uniformly processed, inputting a pedestrian video image query to be searched and a video image library gamma to a model to obtain two groups of features, calculating the similarity with a gallery, and sequencing from high to low to obtain a search image list.

Claims (7)

1. The double-network-based video pedestrian re-identification method is characterized by comprising the following steps of:
step 1: acquiring a pedestrian video image with a pedestrian identity tag, and preprocessing;
step 2: constructing a video pedestrian re-identification model, obtaining pedestrian characteristics by the preprocessed pedestrian video image data through the video pedestrian re-identification model, and obtaining a pedestrian identity prediction result through characteristic comparison;
the overall architecture of the video pedestrian re-identification model is as follows;
the backbone network of the video pedestrian re-identification model is a 3D-transformed ResNet50 deep convolutional neural network; the 3D transformation of ResNet50 replaces all 2D network layers with corresponding 3D network layers; the temporal hyper-parameters of each 3D network layer take the same values as the hyper-parameters of the other dimensions in the 2D version; the pretrained 2D weight is used as the weight of one time slice, and the weights of the remaining time slices are kept consistent with it;
ResNet50 is composed of conv1, conv2_x, conv3_x, conv4_x, conv5_x and a final fully connected layer; the multi-branch diversified feature mining spatio-temporal attention network is located after conv4_x, and the aggregation-redistribution-based convolutional neural encoder-decoder network is located after conv5_x;
step 3: designing a loss function, and training through the difference between the pedestrian identity prediction result and the data set with the pedestrian identity label until the loss function converges;
Step 4: and extracting features of the pedestrian video data by the trained video pedestrian re-recognition model, and sorting the similarity of pedestrians to be searched to obtain a re-recognition result.
2. The dual-network-based video pedestrian re-identification method of claim 1, wherein the preprocessing specifically comprises:
fixing the size of the pedestrian video image with the pedestrian identity tag, and uniformly adjusting the size of the pedestrian video image to be consistent with the input requirement of the video pedestrian re-identification model;
the pedestrian video image after the size adjustment is standardized, scaled and translated, so that the input distribution of each layer of the video pedestrian re-identification model has the same statistical characteristics;
x̂ = (x − μ) / σ    (1)
wherein x is the input data pixel value, μ and σ are the channel mean and standard deviation, and x̂ is the standardized output; the input pedestrian video image is RGB three-channel data, different standardization parameters are used for the three channels respectively, and the means and variances of the three channels are obtained by sampling and calculation;
the standardized pedestrian video image is horizontally turned over and randomly erased, and the input pedestrian video image data is downsampled in the time dimension; and uniformly segmenting the input pedestrian video image data into video fragments, randomly selecting a frame from each video fragment, and inputting the selected frame as a video pedestrian re-identification model.
3. The dual network-based video pedestrian re-recognition method of claim 1 wherein the multi-branch diversification feature mining spatio-temporal attention network comprises a multi-branch heterogeneous spatial attention module and a multi-branch isomorphic temporal attention module; the output characteristics of the upper layer conv4_x are input to a multi-branch heterogeneous space attention module; the multi-branch heterogeneous space attention module comprises a plurality of heterogeneous space attention sub-modules, wherein the output features of the upper layer conv4_x respectively enter different heterogeneous space attention sub-modules, and different pedestrian features are mined; each heterogeneous spatial attention sub-module uses a soft spatial attention mechanism, namely a spatial attention module based on convolution and a spatial attention module based on mean value calculation, which are distributed in a crossed manner; the soft spatial attention mechanism focuses on different areas to different degrees; and obtaining the final pedestrian characteristic expression by integrating the attribute information output by each heterogeneous space attention submodule.
4. The dual-network-based video pedestrian re-identification method of claim 3, wherein the input of the convolution-based spatial attention module is the feature X output by the upper-layer network, and the number of feature channels is C; the convolution-based spatial attention weight is obtained by two linear transformations, one nonlinear transformation and one softmax calculation;
the first linear transformation gives Y1, as shown in formula (2):
Y1 = W1·X + b1    (2)
wherein X is the input feature, W1 is the weight, and b1 is the bias value;
a relu activation function, which combines linear and nonlinear characteristics, is applied to Y1 to obtain the activation function output Y2, as shown in formula (3):
Y2 = relu(Y1)    (3)
after the activation function output is obtained, a linear transformation is used again to obtain the initial spatial attention score S, as shown in formula (4):
S = W2·Y2 + b2    (4)
in the above formula W2 is the weight, T is the length of the input data time series, and b2 is the bias value;
after S is obtained, a softmax function is used to obtain the convolution-based spatial attention weights of all positions, as shown in formula (5):
A_{i,j} = exp(S_{i,j}) / ( Σ_{h=1..H} Σ_{w=1..W} exp(S_{h,w}) )    (5)
S_{i,j} is the initial spatial attention score of position (i,j), A_{i,j} is the spatial attention weight of position (i,j), A denotes the convolution-based spatial attention weight matrix of the whole space, and H and W are the boundary values of the spatial positions;
after the convolution-based spatial attention weight matrix is obtained, the convolution-based spatial attention weight of each position is multiplied by its feature intensity; a spatial attention hyper-parameter λ_s is introduced when multiplying the convolution-based spatial attention weight by the feature value of each position to control the feature intensity of the spatial attention output, giving the convolution-based spatial attention feature F1, as shown in formula (6):
F1 = λ_s · A ⊙ X    (6)
the input of the spatial attention module based on mean calculation is also the feature $X$ output by the upper-layer network, with $C$ feature channels; the mean-based spatial attention parameter is obtained by one averaging transformation, one nonlinear transformation and one softmax calculation;
first, the mean of the input feature $X$ over the channel dimension is computed to obtain $m_{i,j}$, as shown in formula (7):

$m_{i,j} = \dfrac{1}{C}\sum_{c=1}^{C} x_{c,i,j}$    (7)

where $x_{c,i,j}$ is the feature of the $c$-th channel at position $(i,j)$ and $m_{i,j}$ is the current initial spatial attention score;
after $m_{i,j}$ is obtained, the softmax function converts the scores into the mean-based spatial attention weight matrix $A_{mean}$, as shown in formula (8):

$a_{i,j} = \dfrac{\exp(m_{i,j})}{\sum_{h=1}^{H}\sum_{w=1}^{W}\exp(m_{h,w})}$    (8)

after the mean-based spatial attention weight is obtained, it is multiplied with the corresponding position of the input feature; the spatial attention hyper-parameter $\alpha$ is introduced to control the feature intensity of the mean-based spatial attention, finally giving the output of the spatial attention module based on mean calculation, namely the mean-based spatial attention feature $F_{mean}$, as shown in formula (9):

$F_{mean} = \alpha \cdot A_{mean} \odot X$    (9)
after feature mining by each heterogeneous spatial attention sub-module, the convolution-based spatial attention feature $F_{conv}$ and the mean-based spatial attention feature $F_{mean}$, collectively referred to as the output feature $F_s$, are sent to the multi-branch isomorphic temporal attention module.
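For illustration only, the following minimal PyTorch-style sketch shows one way the two heterogeneous spatial attention sub-modules of formulas (2)–(9) could be realised; the class names, the per-tracklet (T, C, H, W) tensor layout, the hidden width and the default value of the hyper-parameter α are assumptions of this sketch, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSpatialAttention(nn.Module):
    """Convolution-based spatial attention: two linear maps and a ReLU produce one
    score per spatial position, a softmax over all H*W positions gives the weights,
    and a hyper-parameter alpha scales the attended feature (formulas (2)-(6))."""
    def __init__(self, channels: int, hidden: int = 128, alpha: float = 1.0):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)   # first linear transformation, formula (2)
        self.fc2 = nn.Linear(hidden, 1)          # second linear transformation, formula (4)
        self.alpha = alpha                       # spatial attention hyper-parameter

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, C, H, W) features from conv4_x for one tracklet
        t, c, h, w = x.shape
        tokens = x.permute(0, 2, 3, 1).reshape(t, h * w, c)   # one token per position
        scores = self.fc2(F.relu(self.fc1(tokens)))           # formulas (2)-(4), (T, H*W, 1)
        weights = torch.softmax(scores, dim=1)                # formula (5): softmax over positions
        return self.alpha * weights.reshape(t, 1, h, w) * x   # formula (6)

class MeanSpatialAttention(nn.Module):
    """Mean-based spatial attention: channel-wise mean scores, softmax over positions,
    then scaling by the same hyper-parameter alpha (formulas (7)-(9))."""
    def __init__(self, alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t, c, h, w = x.shape
        scores = x.mean(dim=1)                                          # formula (7), (T, H, W)
        weights = torch.softmax(scores.reshape(t, -1), dim=1)           # formula (8)
        return self.alpha * weights.reshape(t, 1, h, w) * x             # formula (9)
```

In a multi-branch module, several such sub-modules would be instantiated and alternated across the branches, matching the "distributed in a crossed manner" arrangement described above.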
5. The dual-network-based video pedestrian re-recognition method of claim 4, wherein the multi-branch isomorphic temporal attention module comprises a plurality of isomorphic but differently weighted convolution-based temporal attention sub-modules; each convolution-based temporal attention sub-module receives the output of a different heterogeneous spatial attention sub-module;
the temporal attention weight parameter is calculated by two linear transformations, one nonlinear transformation and a softmax function; the temporal attention is calculated in the same way as the convolution-based spatial attention weight, but over a different dimension;
the first linear transformation output $U_2$ is shown in formula (10), where the output feature of one heterogeneous spatial attention sub-module is $F_s$:

$U_2 = W_3 F_s + b_3$    (10)

where $W_3$ is the weight and $b_3$ is the bias; after the first linear transformation output is obtained, a ReLU activation function, which combines linear and nonlinear characteristics, is introduced to obtain $A_2$, as shown in formula (11):

$A_2 = \mathrm{ReLU}(U_2)$    (11)

after the activation output is obtained, a linear transformation is used again to obtain the initial temporal attention score $S_t$, as shown in formula (12):

$S_t = W_4 A_2 + b_4$    (12)

where $W_4$ is the weight and $b_4$ is the bias; after $S_t$ is obtained, the softmax function is used to obtain the temporal attention weight parameter matrix $A_{time}$, as shown in formula (13):

$a_t = \dfrac{\exp(s_t)}{\sum_{k=1}^{T}\exp(s_k)}$    (13)

where $s_t$ is the initial temporal attention score of the $t$-th frame, $a_t$ is its temporal attention weight, and $T$ is the number of frames; after the temporal attention weight parameter matrix is obtained, a temporal attention hyper-parameter $\beta$ is introduced to control the temporal attention feature intensity; the temporal attention hyper-parameter, the temporal attention weight parameter matrix and the input feature are multiplied to obtain the final temporal attention output feature $F_t$, as shown in formula (14):

$F_t = \beta \cdot A_{time} \odot F_s$    (14)
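A corresponding sketch of the convolution-based temporal attention sub-module of formulas (10)–(14) is given below; the per-frame average pooling used to obtain a frame descriptor, the hidden width and the default β are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttention(nn.Module):
    """Convolution-based temporal attention: two linear maps and a ReLU produce one
    score per frame, a softmax over the T frames gives the temporal weights, and a
    hyper-parameter beta scales the output (formulas (10)-(14))."""
    def __init__(self, channels: int, hidden: int = 128, beta: float = 1.0):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)   # formula (10)
        self.fc2 = nn.Linear(hidden, 1)          # formula (12)
        self.beta = beta                         # temporal attention hyper-parameter

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        # f_s: (T, C, H, W) output of one heterogeneous spatial attention sub-module
        t = f_s.shape[0]
        frame_desc = f_s.mean(dim=(2, 3))                   # (T, C) descriptor per frame (assumed pooling)
        scores = self.fc2(F.relu(self.fc1(frame_desc)))     # formulas (10)-(12), (T, 1)
        weights = torch.softmax(scores, dim=0)              # formula (13): softmax over the T frames
        return self.beta * weights.view(t, 1, 1, 1) * f_s   # formula (14)
```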
the multi-branch diversified feature mining spatio-temporal attention network fuses the outputs of the convolution-based temporal attention sub-modules; the fusion first performs feature compression and then splices in the channel dimension;
after feature compression, the output of the whole multi-branch diversified feature mining spatio-temporal attention network is obtained by concatenation in the channel dimension;
the feature compression is as shown in formula (15):

$F_c^{(i)} = W_5 F_t^{(i)} + b_5$    (15)

where $F_t^{(i)}$ is the temporal attention output feature of the $i$-th convolution-based temporal attention sub-module, $F_c^{(i)}$ is the output of this feature compression, $W_5$ is the weight, $b_5$ is the bias, $C'$ is the number of channels after compression and $C$ is the number of channels before compression.
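One possible reading of this branch fusion step (formula (15) followed by channel-wise splicing) is sketched below; using a 1×1 convolution as the linear compression $W_5 F_t^{(i)} + b_5$ and the class name are assumptions.

```python
import torch
import torch.nn as nn

class BranchFusion(nn.Module):
    """Compress each branch's temporal attention output along the channel axis
    (formula (15)) and then splice the branches in the channel dimension."""
    def __init__(self, in_channels: int, compressed_channels: int, num_branches: int):
        super().__init__()
        # one 1x1 convolution per branch acts as the linear compression W*F + b
        self.compress = nn.ModuleList(
            [nn.Conv2d(in_channels, compressed_channels, kernel_size=1)
             for _ in range(num_branches)]
        )

    def forward(self, branch_feats):
        # branch_feats: list of (T, C, H, W) temporal attention outputs, one per branch
        compressed = [conv(f) for conv, f in zip(self.compress, branch_feats)]
        return torch.cat(compressed, dim=1)   # splice along the channel dimension
```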
6. The dual-network-based video pedestrian re-recognition method of claim 5, wherein the aggregation-redistribution-based convolutional neural encoding-decoding network comprises a multi-level feature aggregation encoder and a multi-level feature redistribution decoder;
the multi-level feature aggregation encoder comprises a spatial sub-encoder, a channel sub-encoder and a temporal sub-encoder, and the input data is feature-compressed along the spatial, channel and temporal dimensions in turn, one dimension per sub-encoder stage; the spatial sub-encoder first compresses the feature in the spatial dimension to obtain the spatial-dimension feature compression value $E_s$, as shown in formula (16):

$E_s = \mathrm{BN}(W_s F_{in} + b_s)$    (16)

where $W_s$ is the weight, $b_s$ is the bias, the compressed spatial feature size is $H' \times W'$, $F_{in}$ is the input feature of the multi-level feature aggregation encoder, and $\mathrm{BN}(\cdot)$ denotes the batch normalization operation;
the output of each spatial sub-encoder in formula (16) is used both as the input of the next channel sub-encoder and as part of the input to the multi-level feature redistribution decoder; feature compression is then performed on the channel dimension, and the channel-dimension feature compression value $E_c$ is obtained with the same compression form as the spatial dimension, as shown in formula (17):

$E_c = \mathrm{BN}(W_c E_s + b_c)$    (17)

where $W_c$ is the weight, $b_c$ is the bias, and $C_e$ is the number of feature channels after channel compression;
the output of each channel sub-encoder is used as the input of the next temporal sub-encoder and as part of the input to the multi-level feature redistribution decoder for the recombination and assignment of multi-level features; feature compression is then performed in the temporal dimension, and the temporal-dimension feature compression value $E_t$ is obtained in the same compression form, as shown in formula (18):

$E_t = \mathrm{BN}(W_t E_c + b_t)$    (18)

where $W_t$ is the weight, $b_t$ is the bias, and $T'$ is the length of the feature sequence after temporal compression; $E_t$ is the final output of the multi-level feature aggregation encoder and is sent directly to the multi-level feature redistribution decoder;
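The multi-level feature aggregation encoder of formulas (16)–(18) could be sketched as follows; the concrete layer types used for each compression (a linear map over flattened spatial positions, a 1×1 convolution over channels, a linear map over frames) and the shape bookkeeping are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MultiLevelAggregationEncoder(nn.Module):
    """Compress the input feature in the spatial, then channel, then temporal
    dimension, each stage a learned linear map followed by batch normalization
    (formulas (16)-(18)). All three stage outputs are returned so the decoder can
    reuse the spatial and channel outputs as skip inputs."""
    def __init__(self, t, c, h, w, h2, w2, c2, t2):
        super().__init__()
        self.shapes = (h2, w2, c2, t2)
        self.spatial_fc = nn.Linear(h * w, h2 * w2)          # formula (16): W_s * F_in + b_s
        self.spatial_bn = nn.BatchNorm2d(c)
        self.channel_fc = nn.Conv2d(c, c2, kernel_size=1)    # formula (17): channel compression
        self.channel_bn = nn.BatchNorm2d(c2)
        self.temporal_fc = nn.Linear(t, t2)                  # formula (18): temporal compression
        self.temporal_bn = nn.BatchNorm2d(c2)

    def forward(self, x):
        # x: (T, C, H, W) input feature F_in of the encoder
        t, c, h, w = x.shape
        h2, w2, c2, t2 = self.shapes
        e_s = self.spatial_fc(x.reshape(t, c, h * w)).reshape(t, c, h2, w2)
        e_s = self.spatial_bn(e_s)                            # spatial sub-encoder output E_s
        e_c = self.channel_bn(self.channel_fc(e_s))           # channel sub-encoder output E_c
        flat = e_c.permute(1, 2, 3, 0).reshape(-1, t)         # (C2*H2*W2, T)
        e_t = self.temporal_fc(flat).reshape(c2, h2, w2, t2).permute(3, 0, 1, 2)
        e_t = self.temporal_bn(e_t)                           # temporal sub-encoder output E_t
        return e_s, e_c, e_t
```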
the multi-level feature redistribution decoder recombines and distributes the compressed features in the order of time, channel and space to complete the information fusion;
the multi-level feature redistribution decoder first directly decodes the last-stage output of the multi-level feature aggregation encoder in the temporal dimension to obtain $D_1$; the decoding is shown in formula (19):

$D_1 = \mathrm{BN}(W_{d1} E_t + b_{d1})$    (19)

where $W_{d1}$ is the weight and $b_{d1}$ is the bias; after the output of the first-stage decoder is obtained, a second-stage feature decoding is performed in the channel dimension to obtain $D_2$; the decoding formula of the second-stage decoder is shown in formula (20):

$D_2 = \mathrm{BN}(W_{d2}\,\mathrm{cat}(D_1, E_c) + b_{d2})$    (20)

where $\mathrm{cat}$ is the feature splicing operation, $W_{d2}$ is the weight and $b_{d2}$ is the bias; after the output of the second-stage decoder is obtained, it is first fused with the spatial sub-encoder feature and then input to the third-stage decoder for decoding to obtain $D_3$; the third-stage decoding is performed in the spatial dimension, as shown in formula (21):

$D_3 = \mathrm{BN}(W_{d3}\,\mathrm{cat}(D_2, E_s) + b_{d3})$    (21)

where $W_{d3}$ is the weight and $b_{d3}$ is the bias; after the output of the final-stage decoder is obtained, it is spliced and fused with the input feature $F_{in}$ of the encoder to obtain the final output of the whole multi-level feature redistribution decoder, which is used for subsequent classification prediction.
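A matching sketch of the multi-level feature redistribution decoder of formulas (19)–(21) follows; which encoder outputs are spliced in at each stage, and the final concatenation with $F_{in}$, follow the reading given above and are assumptions of this sketch rather than the exact patented layout.

```python
import torch
import torch.nn as nn

class MultiLevelRedistributionDecoder(nn.Module):
    """Decode in the temporal, channel and spatial dimensions in turn, splicing in
    the corresponding encoder outputs (formulas (19)-(21)), and finally fuse with
    the original input feature F_in for subsequent classification."""
    def __init__(self, t, c, h, w, h2, w2, c2, t2):
        super().__init__()
        self.shapes = (t, c, h, w, h2, w2, c2)
        self.time_fc = nn.Linear(t2, t)                         # formula (19): restore T
        self.time_bn = nn.BatchNorm2d(c2)
        self.channel_fc = nn.Conv2d(2 * c2, c, kernel_size=1)   # formula (20): cat(D1, E_c) -> C
        self.channel_bn = nn.BatchNorm2d(c)
        self.spatial_fc = nn.Linear(2 * h2 * w2, h * w)         # formula (21): cat(D2, E_s) -> H*W
        self.spatial_bn = nn.BatchNorm2d(c)

    def forward(self, f_in, e_s, e_c, e_t):
        t, c, h, w, h2, w2, c2 = self.shapes
        t2 = e_t.shape[0]
        # stage 1: decode the temporal dimension from the last encoder output, formula (19)
        flat = e_t.permute(1, 2, 3, 0).reshape(-1, t2)
        d1 = self.time_fc(flat).reshape(c2, h2, w2, t).permute(3, 0, 1, 2)
        d1 = self.time_bn(d1)                                            # (T, C2, H2, W2)
        # stage 2: decode the channel dimension, splicing the channel skip, formula (20)
        d2 = self.channel_bn(self.channel_fc(torch.cat([d1, e_c], dim=1)))   # (T, C, H2, W2)
        # stage 3: decode the spatial dimension, splicing the spatial skip, formula (21)
        cat_s = torch.cat([d2.reshape(t, c, h2 * w2), e_s.reshape(t, c, h2 * w2)], dim=2)
        d3 = self.spatial_bn(self.spatial_fc(cat_s).reshape(t, c, h, w))     # (T, C, H, W)
        # final fusion with the original input feature
        return torch.cat([d3, f_in], dim=1)                               # (T, 2C, H, W)
```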
7. The dual-network-based video pedestrian re-recognition method of claim 6, wherein the loss functions include a cross-entropy loss function, a triplet loss function and a divergence loss function;
the cross-entropy loss function $L_{ce}$ is shown in formula (22):

$L_{ce} = -\dfrac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\,\log p_{ij}$    (22)

where $N$ is the total number of samples, $M$ is the number of categories, $y_{ij}$ takes 1 if the true category of sample $i$ equals $j$ and 0 otherwise, and $p_{ij}$ is the probability that sample $i$ belongs to category $j$;
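For reference, formula (22) is the standard multi-class cross entropy and, under the usual logits-plus-labels convention, can be computed directly; the variable names below are illustrative.

```python
import torch.nn as nn

# Given classifier logits of shape (N, M) and identity labels of shape (N,):
criterion_ce = nn.CrossEntropyLoss()
# loss_ce = criterion_ce(logits, labels)   # formula (22)
```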
the triplet loss function $L_{tri}$ is shown in formula (23):

$L_{tri} = \dfrac{1}{N}\sum_{i=1}^{N}\max\big(\lVert f(a_i)-f(p_i)\rVert_2 - \lVert f(a_i)-f(n_i)\rVert_2 + m,\ 0\big)$    (23)

where $N$ is the total number of samples, $a$ denotes an anchor sample, $p$ denotes a positive sample, $n$ denotes a negative sample, $p_i$ denotes the $i$-th positive sample, $f(p_i)$ denotes the $i$-th positive sample feature, $f(a_i)$ denotes the $i$-th anchor sample feature, $f(n_i)$ denotes the $i$-th negative sample feature, and $m$ is a hyper-parameter (the margin);
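A minimal sketch of the triplet loss of formula (23), assuming Euclidean distance; the margin value is illustrative, not taken from the claim.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin: float = 0.3):
    """Formula (23): mean over the batch of max(d(a,p) - d(a,n) + margin, 0)."""
    d_ap = F.pairwise_distance(anchor, positive)   # ||f(a_i) - f(p_i)||_2
    d_an = F.pairwise_distance(anchor, negative)   # ||f(a_i) - f(n_i)||_2
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```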
the divergence loss function is used in the multi-branch diversified feature mining spatio-temporal attention network, and cosine similarity is used to measure the similarity of the output features of the spatio-temporal attention sub-modules, as shown in formula (24):

$\mathrm{sim}(F_i, F_j) = \dfrac{F_i \cdot F_j}{\lVert F_i\rVert\,\lVert F_j\rVert}$    (24)

where $F_i$ and $F_j$ denote the output features of the $i$-th and $j$-th branch diversified feature mining spatio-temporal attention networks; formula (25) defines the variability $d(F_i, F_j)$ of the features of two branch diversified feature mining spatio-temporal attention networks:

$d(F_i, F_j) = 1 - \mathrm{sim}(F_i, F_j)$    (25)

the larger $d(F_i, F_j)$ is, the larger the feature difference between the two branch diversified feature mining spatio-temporal attention networks; formula (26) represents the total divergence loss $L_{div}$ over the whole set of spatio-temporal attention sub-modules:

$L_{div} = -\dfrac{2}{B(B-1)}\sum_{i=1}^{B-1}\sum_{j=i+1}^{B} d(F_i, F_j)$    (26)

where $B$ is the total number of branch diversified feature mining spatio-temporal attention networks; when the output features of the branches become more diverse, $d(F_i, F_j)$ also becomes larger, so minimizing the divergence loss function $L_{div}$ enhances the feature variability of each branch diversified feature mining spatio-temporal attention network;
the total loss function is shown in formula (27):

$L = L_{ce} + L_{tri} + \lambda\,L_{div}$    (27)

where $\lambda$ is a hyper-parameter.
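Finally, a sketch of the divergence loss of formulas (24)–(26) and the total loss of formula (27); the pairwise normalisation, the sign convention (chosen so that minimising the loss drives the branch features apart, as described above) and the weight value are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def divergence_loss(branch_feats):
    """Formulas (24)-(26): average pairwise variability d = 1 - cos(F_i, F_j) over
    all branch pairs, negated so that minimizing the loss increases diversity."""
    b = len(branch_feats)
    total, pairs = 0.0, 0
    for i in range(b):
        for j in range(i + 1, b):
            sim = F.cosine_similarity(branch_feats[i].flatten(1),
                                      branch_feats[j].flatten(1), dim=1).mean()  # formula (24)
            total = total + (1.0 - sim)                                           # formula (25)
            pairs += 1
    return -total / max(pairs, 1)                                                 # formula (26)

def total_loss(loss_ce, loss_tri, loss_div, weight: float = 0.1):
    """Formula (27): cross entropy + triplet + weighted divergence loss."""
    return loss_ce + loss_tri + weight * loss_div
```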
CN202410201994.7A 2024-02-23 2024-02-23 Video pedestrian re-identification method based on double networks Active CN117789253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410201994.7A CN117789253B (en) 2024-02-23 2024-02-23 Video pedestrian re-identification method based on double networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410201994.7A CN117789253B (en) 2024-02-23 2024-02-23 Video pedestrian re-identification method based on double networks

Publications (2)

Publication Number Publication Date
CN117789253A true CN117789253A (en) 2024-03-29
CN117789253B CN117789253B (en) 2024-05-03

Family

ID=90402004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410201994.7A Active CN117789253B (en) 2024-02-23 2024-02-23 Video pedestrian re-identification method based on double networks

Country Status (1)

Country Link
CN (1) CN117789253B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110223304A (en) * 2019-05-20 2019-09-10 山东大学 A kind of image partition method, device and computer readable storage medium based on multipath polymerization
CN112183881A (en) * 2020-10-19 2021-01-05 中国人民解放军国防科技大学 Public opinion event prediction method and device based on social network and storage medium
CN112861988A (en) * 2021-03-04 2021-05-28 西南科技大学 Feature matching method based on attention-seeking neural network
CN113255954A (en) * 2020-12-03 2021-08-13 广西大学 Charging space-time prediction method of hybrid isomorphic depth map neural network
CN113516012A (en) * 2021-04-09 2021-10-19 湖北工业大学 Pedestrian re-identification method and system based on multi-level feature fusion
CN114254552A (en) * 2021-11-10 2022-03-29 中国矿业大学 Complex product deconstruction recombination modularization method based on graph neural network
US11810366B1 (en) * 2022-09-22 2023-11-07 Zhejiang Lab Joint modeling method and apparatus for enhancing local features of pedestrians

Non-Patent Citations (1)

Title
PENGPENG LIANG et al.: "Learning local descriptors with multi-level feature aggregation and spatial context pyramid", Neurocomputing, 19 July 2021 (2021-07-19) *

Also Published As

Publication number Publication date
CN117789253B (en) 2024-05-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant