CN112418127A - Video sequence coding and decoding method for video pedestrian re-identification - Google Patents

Video sequence coding and decoding method for video pedestrian re-identification

Info

Publication number
CN112418127A
Authority
CN
China
Prior art keywords
video
feature extraction
extraction module
generator
frame
Prior art date
Legal status
Granted
Application number
CN202011378786.2A
Other languages
Chinese (zh)
Other versions
CN112418127B (en)
Inventor
潘啸 (Pan Xiao)
罗浩 (Luo Hao)
姜伟 (Jiang Wei)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011378786.2A priority Critical patent/CN112418127B/en
Publication of CN112418127A publication Critical patent/CN112418127A/en
Application granted granted Critical
Publication of CN112418127B publication Critical patent/CN112418127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components


Abstract

The invention discloses a video sequence coding and decoding method for video pedestrian re-identification. In the training stage, the label picture features and the video features are fused and input into a generator, the label picture is used as the reconstruction target, and the key frame generated by the generator is constrained by an image reconstruction loss; the generated key frame is then sent to an image feature extraction module to recover the video features, and a feature reconstruction loss constrains the recovered video features to perform consistently with the original video features. In the application stage, K frames are selected with an HSV-Top-K method to generate a key frame, and the generated key frame is stored on the device to reduce storage cost. When retrieval is needed, the image feature extraction module recovers the video features from the generated key frame; the recovered features retain the performance of the video features and are used for pedestrian retrieval and matching.

Description

Video sequence coding and decoding method for video pedestrian re-identification
Technical Field
The invention belongs to the field of computer vision image retrieval, and particularly relates to a video sequence coding and decoding method for video pedestrian re-identification.
Background
Pedestrian re-identification aims to retrieve pedestrians specified by a user from a series of surveillance videos across cameras; it is widely applied in smart cities and security monitoring.
Depending on the form of the input, pedestrian re-identification can be divided into video-based and image-based pedestrian re-identification. Compared with image pedestrian re-identification, which uses a single-frame image as input, video pedestrian re-identification uses a video sequence as input and is more robust to environmental interference. However, video pedestrian re-identification needs to store a large number of video sequences, which causes huge storage overhead in practical applications and increases the application cost. Meanwhile, in the application stage the lengths of the video sequences differ, which makes batch processing difficult and the computation overhead large.
Disclosure of Invention
The present invention is directed to a method for encoding and decoding a video sequence for video pedestrian re-identification.
The purpose of the invention is realized by the following technical scheme: a video sequence encoding and decoding method for video pedestrian re-identification, comprising the steps of:
(1) building a neural network:
(11) building a video feature extraction module:
(111) the step size of the last down-sampling of the first convolutional network is set to 1.
(112) A time average pooling module, a first spatial average pooling module and a first batch normalization module are sequentially added behind the first convolutional network.
(12) Building a generator: the generator consists of several upsampling convolution layers followed by a convolution layer whose input and output feature maps have the same size; the number of upsampling layers is the same as the number of downsampling layers of the first convolutional network.
(13) Constructing an image feature extraction module:
(131) the step size of the last down-sampling of the second convolutional network is set to 1.
(132) The second convolutional network is sequentially connected with a second spatial average pooling module and a second batch normalization module.
(2) Taking K frames of the video sequence, taking one of them as the label frame, and using them to train the neural network built in step one; the input of the video feature extraction module is the K frames of the video sequence, the output of the time average pooling module is the video feature, which is then passed through the first spatial average pooling module and the first batch normalization module; the input of the generator is the video feature output by the time average pooling module together with the output of the label frame in the first convolutional network, and the output is a key frame; the input of the image feature extraction module is the key frame, and the output is the video feature recovered from the key frame.
(3) Taking K frames of the video sequence to be identified, designating one frame as the label frame, inputting the K frames into the video feature extraction module and the generator trained in step two, and storing the key frame output by the generator; when retrieval is needed, the stored key frame is input into the image feature extraction module trained in step two to recover the video features in the key frame for pedestrian retrieval.
Further, the step (2) includes the sub-steps of:
(21) Randomly selecting K pictures from the video sequence and inputting them into the video feature extraction module.
(22) Selecting one frame from the K selected pictures as the label frame, fusing the video features and the label frame features, sending the fused features into the generator for upsampling, outputting the generated key frame, and using an image reconstruction loss function L_irec to guide the reconstruction of the key frame.
(23) Sending the key frame generated in step (22) to the image feature extraction module. In the image feature extraction module, the features before and after batch normalization are denoted f_ibfr and f_iaft respectively. f_ibfr is used to compute a triplet loss function L_itri, and f_iaft is sent into the fully connected layer to compute the Softmax classification loss L_iid.
(24) The video feature of the last downsampling layer output by the time average pooling module in the video feature extraction module is sent to the first spatial average pooling module, which outputs the feature f_vbfr; f_vbfr is then sent to the first batch normalization module, which outputs the feature f_vaft. f_vbfr is used to compute a triplet loss function L_vtri, and f_vaft is sent into the fully connected layer to compute the Softmax classification loss function L_vid.
(25) An L1 loss is applied between the batch-normalized feature f_iaft of step (23) and the batch-normalized video feature f_vaft of the video feature extraction module in step (24) as a feature reconstruction constraint; the feature reconstruction loss function is denoted L_frec.
(26) Both the video feature extraction module and the image feature extraction module are trained for discriminative ability with the classification losses and the triplet losses, while the image reconstruction loss L_irec and the feature reconstruction loss L_frec are optimized simultaneously. Finally the entire neural network is trained according to the total loss function L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec.
Further, the step (22) comprises the sub-steps of:
(221) Sending the K randomly selected pictures of the video sequence into the first convolutional network of the video feature extraction module to obtain the video feature set of each picture,
F_i = {f_i^1, f_i^2, ..., f_i^J},
where f_i^j denotes the video feature of the i-th picture output by the j-th downsampling layer of the first convolutional network, i = 1 to K, and J is the number of downsampling layers of the first convolutional network.
(222) Randomly selecting one picture L from the K pictures as the label frame, whose features are
F_L = {f_L^1, f_L^2, ..., f_L^J}.
(223) Sending the video feature sets F_i of the K pictures into the time average pooling module to obtain the video features of all downsampling layers,
F_avg = {f_avg^1, f_avg^2, ..., f_avg^J}, where f_avg^j is the average of f_1^j, ..., f_K^j over the K frames.
(224) Splicing F_L and F_avg in the channel dimension and sending them into the generator to generate a key frame.
(225) For the generated key frame, using the image L as the label frame and using an L1 loss as the image reconstruction loss function L_irec to reconstruct the image.
Further, the step (224) is specifically: the generator has J layers in total, of which the first J-1 layers perform upsampling and the last layer keeps the feature map size unchanged. The set of all layers of the generator is denoted
{G_p | p = J-1, J-2, ..., 1, 0},
where p = J-1 to 1 correspond in order to the 1st to (J-1)-th layers of the generator and p = 0 corresponds to the last layer. The input I_p of each layer of the generator is the channel-dimension splice of the previous layer's output with the label-frame feature and the video feature of the corresponding downsampling layer, the first layer taking the spliced label-frame and video features of the deepest downsampling layer. Here G_p(I_p) is the output of each layer of the generator, G_0(I_0) is the key frame generated by the generator, and [·] denotes splicing in the channel dimension.
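Since the equation image defining I_p is not reproduced in this text, one plausible explicit form of the layer-input recursion, written under the assumption that each generator layer receives the label-frame and video features of the matching downsampling layer, is:

I_p = \begin{cases} \left[\, f_L^{J},\ f_{avg}^{J} \,\right], & p = J-1, \\ \left[\, G_{p+1}(I_{p+1}),\ f_L^{\,p+1},\ f_{avg}^{\,p+1} \,\right], & p = J-2, \dots, 1, 0, \end{cases}

with the generated key frame given by G_0(I_0) and [·] denoting concatenation in the channel dimension.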
Further, the step (3) includes the sub-steps of:
(31) Through an HSV-Top-K method, K pictures are selected in advance from the video sequence to be identified, then video feature extraction and key frame generation are carried out, and the generated key frame is stored on the device, comprising the following substeps:
(311) Calculating the HSV histogram feature of each picture of the video sequence, then calculating the feature center of the video sequence, and selecting the K pictures closest to the feature center to represent the whole video sequence; one of them is optionally taken as the label frame.
(312) Sending the selected K pictures into the video feature extraction module trained in step two to obtain the video features and the label frame features, which are then sent together into the generator to generate the key frame.
(313) Storing the generated key frame on the device.
(32) When retrieval is needed, the video features in the key frame are recovered with the image feature extraction module trained in step two, and the recovered features are used for retrieval and matching in video pedestrian re-identification.
Further, in the step (311), the feature center is an average value of HSV histogram features of each picture.
Further, in the step (311), the distance refers to the L2 Euclidean distance.
The invention has the beneficial effects that:
(1) The invention replaces the whole video sequence with one generated key frame embedded with the video features, which reduces the storage overhead while preserving the performance of the video features.
(2) The invention fuses the label features after each downsampling with the video features before sending them into the generator, so that the generated key frame is embedded with the video features while keeping high imaging quality.
(3) The invention uses the image feature extraction network to recover the video features from the key frame and uses the feature reconstruction loss to constrain the recovered video features to perform consistently with the original video features, which reduces the performance loss of the recovered features.
(4) In the application stage, the K most representative pictures are selected with the HSV-Top-K method to replace the original whole video sequence for key frame generation. Compared with using all frames of the video sequence, fewer pictures are used and batch processing is easier, which reduces the computation overhead.
Drawings
FIG. 1 is a schematic diagram of the overall structure of a network during a training phase;
fig. 2 is a schematic flow diagram of the application phase.
Detailed Description
The invention relates to a video sequence coding and decoding method for video pedestrian re-identification. In the training stage, the label picture features and the video features are fused and input into a generator, the label picture is used as the reconstruction target, and the key frame generated by the generator is constrained by an image reconstruction loss. The generated key frame is then sent to an image feature extraction module for video feature recovery, and a feature reconstruction loss constrains the recovered video features to be consistent with the original video features. In the application stage, K frames are selected with an HSV-Top-K method to generate a key frame, and the generated key frame is stored on the device to reduce storage cost. When retrieval is needed, the image feature extraction module recovers the video features from the generated key frame; the recovered features retain the performance of the video features and are used for pedestrian retrieval and matching. The method specifically comprises the following steps:
step one, building a neural network for training, specifically comprising the following steps:
(11) Constructing the video feature extraction module, specifically:
(111) The step size of the last downsampling of ResNet50 is set to 1.
(112) A time average pooling module, a spatial average pooling module and a batch normalization module are sequentially added behind the ResNet50 to form the video feature extraction module.
(12) A generator composed of upsampling convolutions is built and used as the encoder, specifically: the generator consists of several upsampling convolution layers followed by a convolution layer whose input and output feature maps have the same size; the number of upsampling layers is the same as the number of ResNet50 downsampling layers.
(13) The image feature extraction module is constructed and used as a decoder, and specifically comprises the following steps:
(131) the step size of the last down-sampling of ResNet50 is set to 1.
(132) The ResNet50 (which may be replaced by another convolutional network) is followed by a spatial average pooling module and a batch normalization module to form the image feature extraction module. A minimal sketch of these modules is given below.
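The following PyTorch sketch illustrates how the two feature extraction modules of step one could be assembled. It is a minimal sketch, not the patent's implementation: the class names, feature dimension handling and the specific way of setting the last downsampling stride to 1 are illustrative assumptions, and for brevity only the final-stage feature map is returned, whereas the patent's generator additionally consumes the intermediate downsampling features.

```python
# Minimal sketch (assumed names and shapes) of the video and image feature extraction
# modules of step one: ResNet50 with the last downsampling stride set to 1, followed by
# temporal/spatial average pooling and batch normalization.
import torch
import torch.nn as nn
import torchvision


def resnet50_last_stride_1():
    backbone = torchvision.models.resnet50()
    # Set the stride of the last downsampling stage to 1 (steps (111)/(131)).
    backbone.layer4[0].conv2.stride = (1, 1)
    backbone.layer4[0].downsample[0].stride = (1, 1)
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
        backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4)


class VideoFeatureExtractor(nn.Module):
    """K frames -> temporally pooled map, pooled vector f_vbfr, normalized f_vaft."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.backbone = resnet50_last_stride_1()
        self.bn = nn.BatchNorm1d(feat_dim)            # first batch normalization module

    def forward(self, clip):                          # clip: (B, K, 3, H, W)
        b, k, c, h, w = clip.shape
        maps = self.backbone(clip.reshape(b * k, c, h, w))
        maps = maps.reshape(b, k, *maps.shape[1:])
        video_map = maps.mean(dim=1)                  # temporal average pooling
        f_vbfr = video_map.mean(dim=(2, 3))           # spatial average pooling
        f_vaft = self.bn(f_vbfr)                      # batch normalization
        return video_map, f_vbfr, f_vaft


class ImageFeatureExtractor(nn.Module):
    """Generated key frame -> recovered video feature (f_ibfr before BN, f_iaft after BN)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.backbone = resnet50_last_stride_1()
        self.bn = nn.BatchNorm1d(feat_dim)            # second batch normalization module

    def forward(self, key_frame):                     # key_frame: (B, 3, H, W)
        maps = self.backbone(key_frame)
        f_ibfr = maps.mean(dim=(2, 3))                # spatial average pooling
        f_iaft = self.bn(f_ibfr)
        return f_ibfr, f_iaft
```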
Step two, as shown in fig. 1, training the neural network built in step one, wherein the training stage specifically comprises:
(21) K pictures are randomly selected from the video sequence and input into the video feature extraction module.
(22) One frame is randomly selected from the K pictures as the label frame; the video features and the label frame features are fused and sent into the generator for upsampling, the generated key frame is output, and an image reconstruction loss function guides the reconstruction of the key frame. The specific steps are as follows:
(221) The K randomly selected pictures of the video sequence are sent into the ResNet50 of the video feature extraction module to obtain the video feature set of each picture,
F_i = {f_i^1, f_i^2, ..., f_i^5},
where f_i^j denotes the video feature of the i-th picture output by the j-th downsampling layer of ResNet50, j = 1 to 5.
(222) One picture L is randomly selected from the K pictures as the label frame; the label frame features output by the first convolutional network (ResNet50) are
F_L = {f_L^1, f_L^2, ..., f_L^5}.
(223) The video feature sets F_i of the K pictures are sent into the time average pooling module to obtain the video features of all downsampling layers,
F_avg = {f_avg^1, f_avg^2, ..., f_avg^5}, where f_avg^j is the average of f_1^j, ..., f_K^j over the K frames.
(224) F_L and F_avg are spliced in the channel dimension and sent into the generator to generate a key frame. The generator has 5 layers, of which the first 4 perform upsampling and the last keeps the feature map size unchanged. The set of all layers of the generator is denoted
{G_p | p = 4, 3, 2, 1, 0},
where p = 4 to 1 correspond in order to the 1st to 4th layers of the generator and p = 0 corresponds to the last layer. The input I_p of each layer is the channel-dimension splice of the previous layer's output with the label-frame feature f_L and the video feature f_avg of the corresponding downsampling layer, the first layer taking the spliced label-frame and video features of the deepest downsampling layer. Here G_p(I_p), p = 0 to 4, is the output of each generator layer and [·] denotes splicing in the channel dimension.
(225) For the generated key frame G_0(I_0), the image L is used as the label and an L1 loss is used as the image reconstruction loss function L_irec to reconstruct the image. (A sketch of this fused generation is given below.)
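The following sketch shows one way the 5-layer generator of step (224) could be realized. It is a hedged sketch, not the patent's implementation: the layer widths, the use of interpolation plus convolution for "upsampling convolution", the resizing of skip features, and the exact per-layer fusion order are assumptions made only for illustration.

```python
# Minimal sketch of the generator of step (224): the first 4 layers upsample, the last
# keeps the size, and each layer's input is the channel-wise splice of the previous output
# with the label-frame feature f_L^j and temporally averaged video feature f_avg^j of the
# matching scale. Channel widths and fusion order are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    def __init__(self, skip_channels=(64, 256, 512, 1024, 2048), width=256):
        # skip_channels[j-1] = assumed channel count of f_L^j / f_avg^j from the backbone.
        super().__init__()
        J = len(skip_channels)
        self.layers = nn.ModuleList()
        in_ch = 2 * skip_channels[-1]                       # [f_L^J, f_avg^J] for the first layer
        for p in range(J - 1, -1, -1):                      # p = J-1, ..., 1, 0
            out_ch = 3 if p == 0 else width
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch) if p != 0 else nn.Identity(),
                nn.ReLU(inplace=True) if p != 0 else nn.Tanh()))
            if p > 0:                                       # next layer sees previous output + skips
                in_ch = out_ch + 2 * skip_channels[p - 1]

    def forward(self, f_L, f_avg):
        # f_L, f_avg: lists of J feature maps; index j-1 holds the j-th downsampling layer.
        J = len(f_L)
        x = torch.cat([f_L[-1], f_avg[-1]], dim=1)
        for i, layer in enumerate(self.layers):
            p = J - 1 - i
            if p > 0:                                       # first J-1 layers upsample x2
                x = F.interpolate(layer(x), scale_factor=2,
                                  mode='bilinear', align_corners=False)
                skip_l = F.interpolate(f_L[p - 1], size=x.shape[-2:],
                                       mode='bilinear', align_corners=False)
                skip_v = F.interpolate(f_avg[p - 1], size=x.shape[-2:],
                                       mode='bilinear', align_corners=False)
                x = torch.cat([x, skip_l, skip_v], dim=1)   # channel-dimension splice
            else:                                           # last layer keeps the size
                x = layer(x)
        return x                                            # generated key frame G_0(I_0)
```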
(23) The key frame generated in step (224) is sent to the image feature extraction module. In the image feature extraction module, the features before and after batch normalization are denoted f_ibfr and f_iaft respectively. f_ibfr is used to compute a triplet loss function L_itri, and f_iaft is sent into the fully connected layer to compute the Softmax classification loss L_iid.
(24) The video feature f_avg^5 of the last downsampling layer output by the time average pooling module in step (223) is sent to the spatial average pooling module, which outputs the feature f_vbfr; f_vbfr is then sent to the batch normalization module, which outputs the feature f_vaft. f_vbfr is used to compute a triplet loss function L_vtri, and f_vaft is sent into the fully connected layer to compute the Softmax classification loss function L_vid.
(25) An L1 loss is applied between the batch-normalized feature f_iaft of step (23) and the batch-normalized video feature f_vaft of the video feature extraction module in step (24) as a feature reconstruction constraint; the feature reconstruction loss function is denoted L_frec.
(26) Both the video feature extraction module and the image feature extraction module are trained for discriminative ability with the classification losses and the triplet losses, while the image reconstruction loss L_irec and the feature reconstruction loss L_frec are optimized simultaneously. The final total loss function is L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec. (A sketch of the loss computation is given below.)
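A minimal sketch of the total training loss of step (26) follows, assuming standard PyTorch loss modules. The number of identities, the triplet margin, the classifier layers, the triplet-index arguments and the detach on the video feature are illustrative assumptions and not specified by the patent.

```python
# Sketch of the total loss L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec.
import torch
import torch.nn as nn

num_ids = 625                                  # assumed number of pedestrian identities
cls_video = nn.Linear(2048, num_ids)           # fully connected layer after f_vaft
cls_image = nn.Linear(2048, num_ids)           # fully connected layer after f_iaft
triplet = nn.TripletMarginLoss(margin=0.3)     # margin is an assumed value
ce = nn.CrossEntropyLoss()                     # Softmax classification loss
l1 = nn.L1Loss()


def total_loss(f_vbfr, f_vaft, f_ibfr, f_iaft, key_frame, label_frame, labels,
               anchors, positives, negatives):
    """anchors/positives/negatives index the batch for the triplet terms (mining assumed)."""
    L_vtri = triplet(f_vbfr[anchors], f_vbfr[positives], f_vbfr[negatives])
    L_itri = triplet(f_ibfr[anchors], f_ibfr[positives], f_ibfr[negatives])
    L_vid = ce(cls_video(f_vaft), labels)      # video-branch classification loss
    L_iid = ce(cls_image(f_iaft), labels)      # image-branch classification loss
    L_irec = l1(key_frame, label_frame)        # image reconstruction loss (label frame as target)
    L_frec = l1(f_iaft, f_vaft.detach())       # feature reconstruction loss (detach is an assumption)
    return L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec
```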
Step three, as shown in fig. 2, the application stage specifically is:
(31) Through an HSV-Top-K method, K pictures are selected in advance from the video sequence to be identified, then video feature extraction and key frame generation are carried out, and the generated key frame is stored on the device. The specific steps are as follows:
(311) The HSV histogram feature of each picture of the video sequence is calculated, then the feature center of the video sequence is calculated, and the K pictures closest to the feature center are selected to represent the whole video sequence; one of them is optionally taken as the label frame. The feature center is the average of the HSV histogram features of the pictures; the distance is the L2 Euclidean distance. (A sketch of this selection is given after this step.)
(312) The selected K pictures are sent into the video feature extraction module trained in step two to obtain the video features and the label frame features, which are then sent together into the generator to generate the key frame.
(313) The generated key frame is stored on the device.
(32) When retrieval is needed, the video features in the key frame are recovered with the image feature extraction module trained in step two, and the recovered features are used for retrieval and matching in video pedestrian re-identification.
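The sketch below illustrates the HSV-Top-K selection of step (311): per-frame HSV histogram features, the feature center as their mean, and the K frames closest to the center in L2 distance. The histogram bin counts, the normalization and the function names are assumptions for illustration only.

```python
# Minimal sketch of HSV-Top-K frame selection (step (311)).
import cv2
import numpy as np


def hsv_histogram(frame_bgr, bins=(8, 8, 8)):
    """HSV histogram feature of one frame; bin counts and normalization are assumed."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-8)          # normalize so frames of any size are comparable


def hsv_top_k(frames_bgr, k):
    """Return indices of the K frames whose HSV histograms are closest to the feature center."""
    feats = np.stack([hsv_histogram(f) for f in frames_bgr])   # (N, 512)
    center = feats.mean(axis=0)                                # feature center
    dists = np.linalg.norm(feats - center, axis=1)             # L2 Euclidean distance
    top_k = np.argsort(dists)[:k]
    return sorted(top_k.tolist())              # keep temporal order; any one may serve as label frame
```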
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (7)

1. A video sequence encoding and decoding method for video pedestrian re-identification, comprising the steps of:
(1) building a neural network:
(11) building a video feature extraction module:
(111) the step size of the last down-sampling of the first convolutional network is set to 1.
(112) A time average pooling module, a first spatial average pooling module and a first batch normalization module are sequentially added behind the first convolutional network.
(12) Building a generator: the generator consists of several upsampling convolution layers followed by a convolution layer whose input and output feature maps have the same size; the number of upsampling layers is the same as the number of downsampling layers of the first convolutional network.
(13) Constructing an image feature extraction module:
(131) the step size of the last down-sampling of the second convolutional network is set to 1.
(132) The second convolutional network is sequentially connected with a second spatial average pooling module and a second batch normalization module.
(2) Taking K frames of the video sequence, taking one of them as the label frame, and using them to train the neural network built in step one; the input of the video feature extraction module is the K frames of the video sequence, the output of the time average pooling module is the video feature, which is then passed through the first spatial average pooling module and the first batch normalization module; the input of the generator is the video feature output by the time average pooling module together with the output of the label frame in the first convolutional network, and the output is a key frame; the input of the image feature extraction module is the key frame, and the output is the video feature recovered from the key frame.
(3) Taking K frames of the video sequence to be identified, designating one frame as the label frame, inputting the K frames into the video feature extraction module and the generator trained in step two, and storing the key frame output by the generator; when retrieval is needed, the stored key frame is input into the image feature extraction module trained in step two to recover the video features in the key frame for pedestrian retrieval.
2. The method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 1, characterized in that said step (2) comprises the sub-steps of:
(21) Randomly selecting K pictures from the video sequence and inputting them into the video feature extraction module.
(22) Selecting one frame from the K selected pictures as the label frame, fusing the video features and the label frame features, sending the fused features into the generator for upsampling, outputting the generated key frame, and using an image reconstruction loss function L_irec to guide the reconstruction of the key frame.
(23) Sending the key frame generated in step (22) to the image feature extraction module; in the image feature extraction module, the features before and after batch normalization are denoted f_ibfr and f_iaft respectively; f_ibfr is used to compute a triplet loss function L_itri, and f_iaft is sent into the fully connected layer to compute the Softmax classification loss L_iid.
(24) The video feature of the last downsampling layer output by the time average pooling module in the video feature extraction module is sent to the first spatial average pooling module, which outputs the feature f_vbfr; f_vbfr is then sent to the first batch normalization module, which outputs the feature f_vaft; f_vbfr is used to compute a triplet loss function L_vtri, and f_vaft is sent into the fully connected layer to compute the Softmax classification loss function L_vid.
(25) An L1 loss is applied between the batch-normalized feature f_iaft of step (23) and the batch-normalized video feature f_vaft of the video feature extraction module in step (24) as a feature reconstruction constraint; the feature reconstruction loss function is denoted L_frec.
(26) Both the video feature extraction module and the image feature extraction module are trained for discriminative ability with the classification losses and the triplet losses, while the image reconstruction loss L_irec and the feature reconstruction loss L_frec are optimized simultaneously; finally the entire neural network is trained according to the total loss function L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec.
3. A method for coding and decoding a video sequence for pedestrian video re-identification according to claim 2, characterized in that said step (22) comprises the sub-steps of:
(221) Sending the K randomly selected pictures of the video sequence into the first convolutional network of the video feature extraction module to obtain the video feature set of each picture,
F_i = {f_i^1, f_i^2, ..., f_i^J},
where f_i^j denotes the video feature of the i-th picture output by the j-th downsampling layer of the first convolutional network, i = 1 to K, and J is the number of downsampling layers of the first convolutional network.
(222) Randomly selecting one picture L from the K pictures as the label frame, whose features are
F_L = {f_L^1, f_L^2, ..., f_L^J}.
(223) Sending the video feature sets F_i of the K pictures into the time average pooling module to obtain the video features of all downsampling layers,
F_avg = {f_avg^1, f_avg^2, ..., f_avg^J}, where f_avg^j is the average of f_1^j, ..., f_K^j over the K frames.
(224) Splicing F_L and F_avg in the channel dimension and sending them into the generator to generate a key frame.
(225) For the generated key frame, using the image L as the label frame and using an L1 loss as the image reconstruction loss function L_irec to reconstruct the image.
4. A method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 3, characterized in that the step (224) is specifically: the generator has J layers in total, of which the first J-1 layers perform upsampling and the last layer keeps the feature map size unchanged; the set of all layers of the generator is denoted {G_p | p = J-1, J-2, ..., 1, 0}, where p = J-1 to 1 correspond in order to the 1st to (J-1)-th layers of the generator and p = 0 corresponds to the last layer; the input I_p of each layer of the generator is the channel-dimension splice of the previous layer's output with the label-frame feature and the video feature of the corresponding downsampling layer, where G_p(I_p) is the output of each layer of the generator, G_0(I_0) is the key frame generated by the generator, and [·] denotes splicing in the channel dimension.
5. The method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 1, characterized in that said step (3) comprises the sub-steps of:
(31) Through an HSV-Top-K method, K pictures are selected in advance from the video sequence to be identified, then video feature extraction and key frame generation are carried out, and the generated key frame is stored on the device, comprising the following substeps:
(311) Calculating the HSV histogram feature of each picture of the video sequence, then calculating the feature center of the video sequence, and selecting the K pictures closest to the feature center to represent the whole video sequence; one of them is optionally taken as the label frame.
(312) Sending the selected K pictures into the video feature extraction module trained in step two to obtain the video features and the label frame features, which are then sent together into the generator to generate the key frame.
(313) Storing the generated key frame on the device.
(32) When retrieval is needed, the video features in the key frame are recovered with the image feature extraction module trained in step two, and the recovered features are used for retrieval and matching in video pedestrian re-identification.
6. The method according to claim 5, wherein in the step (311), the feature center is an average value of HSV histogram features of each picture.
7. The method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 5, wherein in said step (311), said distance is the L2 Euclidean distance.
CN202011378786.2A 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification Active CN112418127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011378786.2A CN112418127B (en) 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011378786.2A CN112418127B (en) 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification

Publications (2)

Publication Number Publication Date
CN112418127A true CN112418127A (en) 2021-02-26
CN112418127B CN112418127B (en) 2022-05-03

Family

ID=74828951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011378786.2A Active CN112418127B (en) 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification

Country Status (1)

Country Link
CN (1) CN112418127B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033697A (en) * 2021-04-15 2021-06-25 浙江大学 Automatic model evaluation method and device based on batch normalization layer
CN116563895A (en) * 2023-07-11 2023-08-08 四川大学 Video-based animal individual identification method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008804A (en) * 2018-12-12 2019-07-12 浙江新再灵科技股份有限公司 Elevator monitoring key frame based on deep learning obtains and detection method
US20200098085A1 (en) * 2018-09-20 2020-03-26 Robert Bosch Gmbh Monitoring apparatus for person recognition and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098085A1 (en) * 2018-09-20 2020-03-26 Robert Bosch Gmbh Monitoring apparatus for person recognition and method
CN110008804A (en) * 2018-12-12 2019-07-12 浙江新再灵科技股份有限公司 Elevator monitoring key frame based on deep learning obtains and detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAPENG CHEN et al.: "Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
李梦静 (Li Mengjing) et al.: "Research Progress on Video-based Person Re-identification" (视频行人重识别研究进展), Journal of Nanjing Normal University (Natural Science Edition) (南京师大学报(自然科学版)) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033697A (en) * 2021-04-15 2021-06-25 浙江大学 Automatic model evaluation method and device based on batch normalization layer
CN113033697B (en) * 2021-04-15 2022-10-04 浙江大学 Automatic model evaluation method and device based on batch normalization layer
CN116563895A (en) * 2023-07-11 2023-08-08 四川大学 Video-based animal individual identification method

Also Published As

Publication number Publication date
CN112418127B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111062892B (en) Single image rain removing method based on composite residual error network and deep supervision
CN109087258B (en) Deep learning-based image rain removing method and device
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN112418127B (en) Video sequence coding and decoding method for video pedestrian re-identification
CN111815509B (en) Image style conversion and model training method and device
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN115311720B (en) Method for generating deepfake based on transducer
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN112818955B (en) Image segmentation method, device, computer equipment and storage medium
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN112581409A (en) Image defogging method based on end-to-end multiple information distillation network
CN112241939A (en) Light-weight rain removing method based on multi-scale and non-local
CN113379858A (en) Image compression method and device based on deep learning
CN116434241A (en) Method and system for identifying text in natural scene image based on attention mechanism
CN114255456A (en) Natural scene text detection method and system based on attention mechanism feature fusion and enhancement
CN115496919A (en) Hybrid convolution-transformer framework based on window mask strategy and self-supervision method
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN115331083B (en) Image rain removing method and system based on gradual dense feature fusion rain removing network
WO2023159765A1 (en) Video search method and apparatus, electronic device and storage medium
CN112801912B (en) Face image restoration method, system, device and storage medium
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN112784838A (en) Hamming OCR recognition method based on locality sensitive hashing network
CN113920317A (en) Semantic segmentation method based on visible light image and low-resolution depth image
Pei et al. UVL: A Unified Framework for Video Tampering Localization
Chen et al. A video key frame extraction method based on multiview fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant