CN111680602A - Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction

Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction

Info

Publication number
CN111680602A
CN111680602A
Authority
CN
China
Prior art keywords
features
feature
appearance
level
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010486379.7A
Other languages
Chinese (zh)
Inventor
高英
林文根
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN202010486379.7A
Publication of CN111680602A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

A pedestrian re-identification method based on dual-stream hierarchical feature correction comprises the following steps. Step one: input an RGB sequence and an optical flow sequence at the inputs of a dual-stream feature extractor, and extract appearance features and optical flow features respectively. Step two: feed the appearance and optical flow features extracted in step one into a frame-level feature corrector, which corrects the information frame by frame along the video stream to obtain frame-level correction features. Step three: obtain segment-level feature representations of appearance continuity and motion pattern. Step four: fuse the frame-level correction features and the segment-level correction features into the final video representation, and classify the videos. A corresponding pedestrian re-identification system based on dual-stream hierarchical feature correction is composed of a dual-stream feature extractor, an appearance segment-level feature corrector, an optical-flow segment-level feature corrector, a frame-level feature corrector and a channel fusion module, connected to one another.

Description

Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction
Technical Field
The invention relates to the technical field of image processing, and in particular to a pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction.
Background
Pedestrian re-identification refers to accurately querying and matching the same pedestrian across multiple cameras. Pedestrian re-identification methods mainly comprise image-based and video-based pedestrian re-identification; video-based pedestrian re-identification is characterized by continuous frames and complex motion.
The existing video pedestrian re-identification technology has the following defects:
1. most existing video re-recognition architectures lack the consideration of multi-modal fusion. With the wide application of time sequence feature pooling and space-time attention mechanisms, most of the re-identification architectures capture important time sequence information by selecting key frames and extracting key features of the key frames. The idea can obtain good effect under a single input mode, but also limits the upper limit of the effect of rich information of the video. The video serves as a data source with rich space-time information, the content of the video can be analyzed from multiple angles, multiple data modalities are built, and the characteristics of different aspects of the video content are described, so that people in the video can be modeled more comprehensively and carefully.
2. Existing multi-modal fusion architectures lack cross-modal learning and feature correction. Among the few architectures that do consider multi-modal fusion, most adopt a dual-stream network in which the two modalities serve as the two input streams for feature extraction, mainly in either a pre-fusion or a post-fusion arrangement. Pre-fusion concatenates the two modalities along the channel dimension before the feature extraction network and uses the result as the extraction input; post-fusion adds the two modalities element-wise during model extraction and outputs modality-fused features (both schemes are sketched in the code after this list). However, both approaches lack information interaction between the two modalities, i.e., the effect of appearance information on capturing motion features and the effect of motion information on distinguishing appearance features.
3. Existing models based on spatio-temporal attention mechanisms lack the contextual connections within and between frames of the video features. During motion, the parts of the body act in coordination: key features extracted by the same attention mechanism have temporal context relations across different frames, and these two kinds of association can effectively distinguish different people's coordination patterns and make the features more discriminative. Most methods based on spatio-temporal attention, however, cannot achieve this effect.
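For concreteness, the two fusion schemes named in defect 2 can be sketched in a few lines of PyTorch; the single convolution layers below stand in for the full extraction network, and all shapes and names are illustrative assumptions rather than any particular prior architecture:

```python
import torch
import torch.nn as nn

rgb = torch.randn(8, 3, 256, 128)   # batch of RGB frames (B, C, H, W)
flow = torch.randn(8, 2, 256, 128)  # batch of optical-flow frames (x/y components)

# Pre-fusion: concatenate the modalities on the channel dimension,
# then run a single extraction network on the fused input.
pre_net = nn.Conv2d(3 + 2, 64, kernel_size=7, stride=2, padding=3)
pre_features = pre_net(torch.cat([rgb, flow], dim=1))

# Post-fusion: extract each modality separately, then add element-wise.
rgb_net = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
flow_net = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3)
post_features = rgb_net(rgb) + flow_net(flow)
```

Neither variant lets one stream's computation condition on the other stream's per-frame features, which is precisely the interaction that the frame-level corrector of the invention introduces.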
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a dual-stream, cross-modal, multi-stage pedestrian re-identification algorithm that captures the feature correlations among the multi-dimensional information of a video stream and increases the discriminative power and robustness of the features: a pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction, for video pedestrian re-identification and pedestrian retrieval. The specific technical scheme is as follows:
a pedestrian re-identification method based on double-current classification feature correction comprises the following steps:
Step one: input an RGB sequence and an optical flow sequence at the inputs of the dual-stream feature extractor, and extract appearance features and optical flow features respectively;
Step two: feed the appearance features and optical flow features extracted in step one into a frame-level feature corrector, which corrects the information frame by frame along the video stream to obtain frame-level correction features;
Step three: taking the appearance and optical flow features extracted in step one together with the corrected appearance and optical flow features from step two, use an attention mechanism to extract the associations among all features while jointly considering appearance and motion information, capture the associations between the appearance features and the motion pattern across all frames of the whole video segment, and correct the features by means of weight coefficients to obtain segment-level feature representations of appearance continuity and motion pattern;
Step four: fuse the frame-level correction features and the segment-level correction features to obtain the final video representation, and classify the videos.
Preferably, step one is specified as follows: the dual-stream feature extractor extracts features with a deep convolutional model pre-trained on a large-scale dataset. The initial inputs of the RGB images and the optical flow images have different dimensions, the RGB input having three channels and the optical flow two, and the dimensions of the two are aligned.
A pedestrian re-identification system based on dual-stream hierarchical feature correction is provided with a dual-stream feature extractor consisting of an appearance feature extractor and a motion feature extractor, whose inputs are connected to the RGB sequence and optical flow sequence input ports respectively;
the first output of the appearance feature extractor is connected to the appearance segment-level feature corrector, and the first output of the motion feature extractor is connected to the optical-flow segment-level feature corrector; the second outputs of the appearance and motion feature extractors are each connected to the input of the frame-level feature corrector; the RGB feature output of the frame-level feature corrector is connected to the appearance segment-level feature corrector and to the channel fusion module; the optical-flow feature output of the frame-level feature corrector is connected to the optical-flow segment-level feature corrector and to the channel fusion module; and the outputs of the appearance segment-level and optical-flow segment-level feature correctors are each connected to the channel fusion module.
The invention has the following beneficial effects:
1. The dual-stream model processes two input modalities, an RGB sequence and an optical flow sequence, so the appearance and motion characteristics of a moving person are considered simultaneously; this multi-modal information fusion makes the model more robust.
2. Feature correction is a further supplement to feature learning: stronger feature representations are obtained on top of the features extracted by the basic deep model.
3. The hierarchical feature processing can extract feature relations over different temporal lengths. Frame-by-frame information transfer within one modality captures the relations between preceding and following frames, retaining representative information and removing redundant noise; frame-by-frame cross-modal interactive learning captures the relations between corresponding frames of the two modalities, measuring both the salient characterization that appearance information gives to the motion pattern and the key observation positions that motion information highlights on the overall appearance. This collaborative learning breaks the information barrier between the modalities and achieves genuine cross-modal learning.
4. Hierarchical feature processing also plays an important role in capturing relations between the frames of a long sequence. Traditional long-sequence feature extraction mostly builds a temporal representation by selecting important frames and extracting important information from single frames, ignoring the spatio-temporal associations of features across multiple consecutive frames; for example, an action arises from the body parts cooperating over a span of several frames, so different key features of different frames may be mutually associated. By learning the feature associations of the long sequence and correcting on top of the original features, a more discriminative long-sequence feature representation is obtained.
5. The attention mechanism extracts important feature positions and association relations and achieves a good feature alignment effect.
Drawings
Fig. 1 is a schematic diagram of the framework of the present invention.
Fig. 2 is a schematic diagram of the initial network-layer processing of the dual-stream input in the present invention.
Fig. 3 is a schematic diagram of the cross-modal frame-level feature corrector, designed on the basis of LSTM, in the present invention.
Fig. 4 is a schematic structural diagram of the segment-level feature corrector in the present invention.
Detailed Description
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, so that the advantages and features of the invention are easier for those skilled in the art to understand, and the scope of protection of the invention is defined more clearly.
As shown in figs. 1 to 4, a pedestrian re-identification method based on dual-stream hierarchical feature correction comprises the following steps:
Step one: input an RGB sequence and an optical flow sequence at the inputs of the dual-stream feature extractor, and extract appearance features and optical flow features respectively;
specifically, the basic parallel double-current feature extractor is used for respectively extracting features of an RGB image sequence and an optical flow sequence through two parallel basic networks to realize the representation of the features of the basic image of each frame, and the basic feature extractor is used for extracting the features by adopting a deep convolution model pre-trained in a large-scale data set, such as ImageNet and the like. For an RGB image and an optical flow image, initial inputs of the RGB image and the optical flow image have different scales, the RGB input has three dimensions, and the optical flow has two dimensions, and the present embodiment performs processing in the manner of fig. 2, so that the dimensions of the RGB image and the optical flow image are aligned. The basic feature extraction network of the RGB mode is ImageNet pre-trained ResNet50, the basic feature extraction network of the optical flow sequence is ImageNet pre-trained ResNet50 with an input convolution layer modified, for the RGB image sequence, the most initial network layer input of the original network is kept unchanged, for the optical flow image sequence, the convolution layer with one input dimension being two output dimensions and the same dimension is adopted to replace the initial convolution layer of the extraction network, and the unification of double-current input dimensions is achieved.
Step two: feed the appearance features and optical flow features extracted in step one into a frame-level feature corrector, which corrects the information frame by frame along the video stream to obtain frame-level correction features;
specifically, at this stage, we input the two streams into the features obtained by the feature extractor in the previous stage to modify the information frame by frame according to the video stream. The characteristics of each frame are modified under the characteristics of the previous modified frame and the corresponding frame of the other mode, and the modified characteristics of the bimodal information and the continuous frame context information are fused. Notably, for an initial frame, we take the frame itself as its context continuation frame since there is no previous frame. The frame-level feature modifier of fig. 2 is an RNN-type tandem structure, which is a design structure improved on the basis of LSTM, the detailed structure of each time stamp unit is shown in fig. 3, and it is assumed that the RGB feature and the optical flow feature X of the current time stampt,FtIntermediate cellular state CtThe former cell state is Ct-1Cell state for context information memory and transfer, hidden state HtIs the correction information per timestamp, is the information optimized under supervision of the context information and cross-modal learning, Ht-1The state is a hidden state of the last timestamp, the memory and the transmission of the time sequence information are realized, and the sigma is a Sigmoid function. The unit of the frame level modifier can also be divided into three gates, a forgetting gate, an input gate and an output gate. Taking RGB mode as an example, the information of the last cell state is retained and discarded at the forgetting gate, and the learning parameter of the forgetting gate is WfThe input is the previous hidden state and the characteristics of the current RGB frame,the weight obtained by forgetting the gate is
ft=σ(Wf*[Ht-1,Xt]+bf)
The forgetting gate is consistent with the LSTM, and the input gate and the output gate are different from the LSTM in that the feature of the current frame of another modality is added, and for the RGB model, the feature corresponding to the optical flow modality is added. The input gate is based on the previous hidden state Ht-1Current two modal characteristics Xt,FtLearn which information to update, using the last hidden state Ht-1And feature X of the current modality frametObtaining new candidate cell states
Figure RE-GDA0002591841330000051
Representing updated candidate information.
The calculation of the two steps is
it=σ(Wi*[Ht-1,Xt,Ft]+bi)
Figure RE-GDA0002591841330000061
By forgetting the gate to select the last cell state plus entering the gate's update information, a new cell state C can be obtained that contains context information and cross-modal learningt
Figure RE-GDA0002591841330000062
The output gate thereafter also differs from the output gate of LSTM in that it is required to have a previous hidden state H according to the previous hidden statet-1Current two modal characteristics Xt,FtDetermining final information of output cells, i.e. generating determination weights for output information of current state
ot=σ(Wo*[Ht-1,Xt,Ft]+b0)
The final hidden state of the current timestamp is output as
ht=ot*tanh(Ct)
The frame-level cross-modal feature corrector thus adds multi-modal supervision on top of LSTM, giving it the double information-optimization effect of context-information perception and multi-modal interactive learning, and realizing frame-by-frame information correction.
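The unit defined by the equations above can be written down directly. The sketch below shows the RGB side of one timestamp unit; treating the per-frame features as pooled vectors and parameterizing each gate with a single linear layer are assumptions, as are all names:

```python
import torch
import torch.nn as nn

class CrossModalLSTMCell(nn.Module):
    """One timestamp unit of the frame-level corrector (RGB side).

    The forget gate and candidate state see only the current modality's
    frame X_t, while the input and output gates also see the other
    modality's frame F_t, exactly as in the equations above."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.w_f = nn.Linear(hidden_dim + feat_dim, hidden_dim)      # forget gate
        self.w_c = nn.Linear(hidden_dim + feat_dim, hidden_dim)      # candidate state
        self.w_i = nn.Linear(hidden_dim + 2 * feat_dim, hidden_dim)  # input gate
        self.w_o = nn.Linear(hidden_dim + 2 * feat_dim, hidden_dim)  # output gate

    def forward(self, x_t, f_mod_t, h_prev, c_prev):
        uni = torch.cat([h_prev, x_t], dim=-1)             # [H_{t-1}, X_t]
        cross = torch.cat([h_prev, x_t, f_mod_t], dim=-1)  # [H_{t-1}, X_t, F_t]
        f_t = torch.sigmoid(self.w_f(uni))                 # forget-gate weight
        i_t = torch.sigmoid(self.w_i(cross))               # input-gate weight
        c_tilde = torch.tanh(self.w_c(uni))                # candidate cell state
        c_t = f_t * c_prev + i_t * c_tilde                 # new cell state C_t
        o_t = torch.sigmoid(self.w_o(cross))               # output-gate weight
        h_t = o_t * torch.tanh(c_t)                        # hidden state H_t
        return h_t, c_t
```

The optical flow side is the mirror image with the roles of $X_t$ and $F_t$ swapped; per the description above, the chain is initialized at the first frame by reusing that frame as its own context frame.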
Step three: taking the appearance and optical flow features extracted in step one together with the corrected appearance and optical flow features from step two, use an attention mechanism to extract the associations among all features while jointly considering appearance and motion information, capture the associations between the appearance features and the motion pattern across all frames of the whole video segment, and correct the features by means of weight coefficients to obtain segment-level feature representations of appearance continuity and motion pattern;
specifically, at this stage, features extracted by an original double-current image sequence in a first-stage parallel feature extractor and features modified by a second-stage frame-level are input, the features obtained at the first stage are regarded as original features representing appearance and motion, the features obtained at the second stage are regarded as features considering appearance continuity and motion correlation of each frame, the third stage is characterized by using attention mechanism to extract the association relations between all the features in the segment under the condition of considering appearance and motion information through feature supervision at the second stage, capturing the association relations between the appearance features and all the frames of the whole video segment in the motion mode, and modifying the features in a weight coefficient mode to obtain segment-level appearance continuity and feature representation of the motion mode. The segment-level feature corrector in fig. 2 is independent as shown in fig. 4, which is a double-current spatial attention structure improved according to an attention mechanism, the frame-level correction features are used as input, two convolutional neural networks are used to generate weight Mask matrixes corresponding to the modal segment level, the weight Mask matrixes are respectively an RGB Mask and an optical flow Mask, the basic features are multiplied by masks after passing through the convolutional neural networks, and the segment-level optimization features of the modal can be obtained.
Step four: fuse the frame-level correction features and the segment-level correction features to obtain the final video representation, and classify the videos. Specifically, the frame-level and segment-level correction features are fused along the channel dimension to obtain the final video representation, and pedestrian re-identification classification is performed through a fully connected layer, achieving good performance.
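A minimal sketch of this head follows; the patent specifies channel-wise fusion followed by a fully connected classification layer, while the temporal average pooling and all dimensions here are assumptions:

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Concatenate frame-level and segment-level correction features along
    the channel dimension, pool over the segment, and classify identities
    with a fully connected layer."""

    def __init__(self, frame_dim: int, segment_dim: int, num_identities: int):
        super().__init__()
        self.classifier = nn.Linear(frame_dim + segment_dim, num_identities)

    def forward(self, frame_feats, segment_feats):
        # frame_feats: (B, T, frame_dim); segment_feats: (B, T, segment_dim)
        video_repr = torch.cat([frame_feats, segment_feats], dim=-1).mean(dim=1)
        return self.classifier(video_repr)
```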
A pedestrian re-identification system based on dual-stream hierarchical feature correction is provided with a dual-stream feature extractor consisting of an appearance feature extractor and a motion feature extractor, whose inputs are connected to the RGB sequence and optical flow sequence input ports respectively;
the first output of the appearance feature extractor is connected to the appearance segment-level feature corrector, and the first output of the motion feature extractor is connected to the optical-flow segment-level feature corrector; the second outputs of the appearance and motion feature extractors are each connected to the input of the frame-level feature corrector; the RGB feature output of the frame-level feature corrector is connected to the appearance segment-level feature corrector and to the channel fusion module; the optical-flow feature output of the frame-level feature corrector is connected to the optical-flow segment-level feature corrector and to the channel fusion module; and the outputs of the appearance segment-level and optical-flow segment-level feature correctors are each connected to the channel fusion module.
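Reading the connection topology above as code, the system might be wired as below; every module interface is an assumption, with the submodules standing for the sketches given earlier (or any equivalent implementations):

```python
import torch.nn as nn

class DualStreamReID(nn.Module):
    """Schematic wiring of the described system (interfaces are assumptions)."""

    def __init__(self, rgb_net, flow_net, frame_corrector,
                 appearance_seg_corrector, flow_seg_corrector, fusion_head):
        super().__init__()
        self.rgb_net = rgb_net           # appearance feature extractor
        self.flow_net = flow_net         # motion feature extractor
        self.frame_corrector = frame_corrector
        self.appearance_seg = appearance_seg_corrector
        self.flow_seg = flow_seg_corrector
        self.head = fusion_head          # channel fusion module + classifier

    def forward(self, rgb_seq, flow_seq):
        # First outputs of the extractors feed the segment-level correctors;
        # second outputs feed the frame-level corrector.
        rgb_base, flow_base = self.rgb_net(rgb_seq), self.flow_net(flow_seq)
        # The frame-level corrector's RGB and optical-flow feature outputs go
        # both to their segment-level correctors and to the fusion module.
        rgb_frame, flow_frame = self.frame_corrector(rgb_base, flow_base)
        rgb_seg = self.appearance_seg(rgb_base, rgb_frame)
        flow_seg = self.flow_seg(flow_base, flow_frame)
        return self.head(rgb_frame, flow_frame, rgb_seg, flow_seg)
```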

Claims (3)

1. A pedestrian re-identification method based on dual-stream hierarchical feature correction, characterized by comprising the following steps:
Step one: input an RGB sequence and an optical flow sequence at the inputs of the dual-stream feature extractor, and extract appearance features and optical flow features respectively;
Step two: feed the appearance features and optical flow features extracted in step one into a frame-level feature corrector, which corrects the information frame by frame along the video stream to obtain frame-level correction features;
Step three: taking the appearance and optical flow features extracted in step one together with the corrected appearance and optical flow features from step two, use an attention mechanism to extract the associations among all features while jointly considering appearance and motion information, capture the associations between the appearance features and the motion pattern across all frames of the whole video segment, and correct the features by means of weight coefficients to obtain segment-level feature representations of appearance continuity and motion pattern;
Step four: fuse the frame-level correction features and the segment-level correction features to obtain the final video representation, and classify the videos.
2. The pedestrian re-identification method based on dual-stream hierarchical feature correction according to claim 1, characterized in that step one is specified as follows: the dual-stream feature extractor extracts features with a deep convolutional model pre-trained on a large-scale dataset; the initial inputs of the RGB images and the optical flow images have different dimensions, the RGB input having three channels and the optical flow two, and the dimensions of the two are aligned.
3. A pedestrian re-identification system based on dual-stream hierarchical feature correction, characterized by comprising a dual-stream feature extractor consisting of an appearance feature extractor and a motion feature extractor, whose inputs are connected to the RGB sequence and optical flow sequence input ports respectively;
the first output of the appearance feature extractor is connected to the appearance segment-level feature corrector, and the first output of the motion feature extractor is connected to the optical-flow segment-level feature corrector;
the second outputs of the appearance feature extractor and the motion feature extractor are each connected to the input of the frame-level feature corrector;
the RGB feature output of the frame-level feature corrector is connected to the appearance segment-level feature corrector and to the channel fusion module;
the optical-flow feature output of the frame-level feature corrector is connected to the optical-flow segment-level feature corrector and to the channel fusion module;
and the outputs of the appearance segment-level feature corrector and the optical-flow segment-level feature corrector are each connected to the channel fusion module.
CN202010486379.7A 2020-06-01 2020-06-01 Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction Pending CN111680602A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010486379.7A CN111680602A (en) 2020-06-01 2020-06-01 Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010486379.7A CN111680602A (en) 2020-06-01 2020-06-01 Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction

Publications (1)

Publication Number Publication Date
CN111680602A (en) 2020-09-18

Family

ID=72453439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010486379.7A Pending CN111680602A (en) Pedestrian re-identification method and model architecture based on dual-stream hierarchical feature correction

Country Status (1)

Country Link
CN (1) CN111680602A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740419A (en) * 2018-11-22 2019-05-10 东南大学 Video behavior recognition method based on an Attention-LSTM network
CN109961034A (en) * 2019-03-18 2019-07-02 西安电子科技大学 Video object detection method based on convolutional gated recurrent neural units
CN110135386A (en) * 2019-05-24 2019-08-16 长沙学院 Human motion recognition method and system based on deep learning

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998995A (en) * 2022-06-13 2022-09-02 西安电子科技大学 Cross-view-angle gait recognition method based on metric learning and space-time double-flow network
CN117612711A (en) * 2024-01-22 2024-02-27 神州医疗科技股份有限公司 Multi-mode prediction model construction method and system for analyzing liver cancer recurrence data
CN117612711B (en) * 2024-01-22 2024-05-03 神州医疗科技股份有限公司 Multi-mode prediction model construction method and system for analyzing liver cancer recurrence data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20200918)