CN112418127A - Video sequence coding and decoding method for video pedestrian re-identification - Google Patents

Video sequence coding and decoding method for video pedestrian re-identification

Info

Publication number
CN112418127A
Authority
CN
China
Prior art keywords
video
feature extraction
extraction module
generator
frame
Prior art date
Legal status
Granted
Application number
CN202011378786.2A
Other languages
Chinese (zh)
Other versions
CN112418127B (en)
Inventor
潘啸 (Pan Xiao)
罗浩 (Luo Hao)
姜伟 (Jiang Wei)
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202011378786.2A priority Critical patent/CN112418127B/en
Publication of CN112418127A publication Critical patent/CN112418127A/en
Application granted granted Critical
Publication of CN112418127B publication Critical patent/CN112418127B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components


Abstract

The invention discloses a video sequence coding and decoding method for video pedestrian re-identification. In the training stage, the label picture features and the video features are fused and input into a generator, the label picture is used as the reconstruction target, and the key frame generated by the generator is constrained by an image reconstruction loss; the generated key frame is then sent to an image feature extraction module to recover the video features, and a feature reconstruction loss constrains the recovered video features to perform consistently with the original video features. In the application stage, K frames are selected with an HSV-Top-K method to generate a key frame, and the generated key frame is stored on the device to reduce storage cost. When retrieval is needed, the image feature extraction module recovers the video features from the generated key frame; the recovered features retain the performance of the video features and are used for pedestrian retrieval and matching.

Description

Video sequence coding and decoding method for video pedestrian re-identification
Technical Field
The invention belongs to the field of computer vision image retrieval, and particularly relates to a video sequence coding and decoding method for video pedestrian re-identification.
Background
Pedestrian re-identification aims to retrieve pedestrians specified by a user from a series of surveillance videos across cameras; it is widely applied in smart cities and security monitoring.
Depending on the form of the input, pedestrian re-identification can be divided into video-based and image-based pedestrian re-identification. Compared with image pedestrian re-identification, which uses a single-frame image as input, video pedestrian re-identification uses a video sequence as input and is more robust to environmental interference. However, video pedestrian re-identification needs to store a large number of video sequences, which causes huge storage overhead in practical applications and increases the application cost. Meanwhile, in the application stage the lengths of the video sequences differ, which makes batch processing difficult and the computation overhead large.
Disclosure of Invention
The present invention is directed to a method for encoding and decoding a video sequence for video pedestrian re-identification.
The purpose of the invention is realized by the following technical scheme: a video sequence encoding and decoding method for video pedestrian re-identification, comprising the steps of:
(1) building a neural network:
(11) building a video feature extraction module:
(111) the step size of the last down-sampling of the first convolutional network is set to 1.
(112) A time average pooling module, a first spatial average pooling module and a first batch normalization module are sequentially added behind the first convolutional network.
(12) Building a generator: the generator consists of several upsampling convolution layers followed by a convolution layer whose input and output feature maps have the same size; the number of upsampling layers is the same as the number of downsampling layers of the first convolutional network.
(13) Constructing an image feature extraction module:
(131) the step size of the last down-sampling of the second convolutional network is set to 1.
(132) The second convolutional network is sequentially connected with a second spatial average pooling module and a second batch normalization module.
(2) Taking K frames of the video sequence, taking one of them as the label frame, and using them to train the neural network built in step one; the input of the video feature extraction module is the K frames of the video sequence, the output of the time average pooling module is the video feature, which is then passed through the first spatial average pooling module and the first batch normalization module; the input of the generator is the video feature output by the time average pooling module together with the output of the label frame in the first convolutional network, and the output is a key frame; the input of the image feature extraction module is the key frame, and the output is the video feature recovered from the key frame.
(3) Taking K frames of the video sequence to be identified, designating one frame as the label frame, inputting the K frames into the video feature extraction module and the generator trained in step two, and storing the key frame output by the generator; when retrieval is needed, the stored key frame is input into the image feature extraction module trained in step two to recover the video features in the key frame for pedestrian retrieval.
Further, the step (2) includes the sub-steps of:
(21) Randomly selecting K pictures from the video sequence and inputting them into the video feature extraction module.
(22) Selecting one frame from the K selected pictures as the label frame, fusing the video features and the label frame features, sending the fused features into the generator for upsampling, outputting the generated key frame, and using an image reconstruction loss function L_irec to guide the reconstruction of the key frame.
(23) Sending the key frame generated in step (22) to the image feature extraction module. In the image feature extraction module, the features before and after batch normalization are denoted f_ibfr and f_iaft respectively. f_ibfr is used to compute a triplet loss function L_itri, and f_iaft is sent into the fully connected layer to compute the Softmax classification loss L_iid.
(24) The video feature of the last downsampling layer output by the time average pooling module in the video feature extraction module is sent to the first spatial average pooling module, which outputs the feature f_vbfr; f_vbfr is then sent to the first batch normalization module, which outputs the feature f_vaft. f_vbfr is used to compute a triplet loss function L_vtri, and f_vaft is sent into the fully connected layer to compute the Softmax classification loss function L_vid.
(25) An L1 loss is applied between the batch-normalized feature f_iaft of step (23) and the batch-normalized video feature f_vaft of the video feature extraction module in step (24) as a feature reconstruction constraint; the feature reconstruction loss function is denoted L_frec.
(26) Both the video feature extraction module and the image feature extraction module are trained for discriminative ability with the classification losses and the triplet losses, while the image reconstruction loss L_irec and the feature reconstruction loss L_frec are optimized simultaneously. Finally the entire neural network is trained according to the total loss function L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec.
Further, the step (22) comprises the sub-steps of:
(221) Sending the K randomly selected pictures of the video sequence into the first convolutional network of the video feature extraction module to obtain the video feature set of each picture,
F_i = {f_i^1, f_i^2, ..., f_i^J},
where f_i^j denotes the video feature of the i-th picture output by the j-th downsampling layer of the first convolutional network, i = 1 to K, and J is the number of downsampling layers of the first convolutional network.
(222) Randomly selecting one picture L from the K pictures as the label frame, whose features are
F_L = {f_L^1, f_L^2, ..., f_L^J}.
(223) Sending the video feature sets F_i of the K pictures into the time average pooling module to obtain the video features of all downsampling layers,
F_avg = {f_avg^1, f_avg^2, ..., f_avg^J}, where f_avg^j is the average of f_1^j, ..., f_K^j over the K frames.
(224) Splicing F_L and F_avg in the channel dimension and sending them into the generator to generate a key frame.
(225) For the generated key frame, using the image L as the label frame and using an L1 loss as the image reconstruction loss function L_irec to reconstruct the image.
Further, the step (224) is specifically: the generator has J layers in total, of which the first J-1 layers perform upsampling and the last layer keeps the feature map size unchanged. The set of all layers of the generator is denoted
{G_p | p = J-1, J-2, ..., 1, 0},
where p = J-1 to 1 correspond in order to the 1st to (J-1)-th layers of the generator and p = 0 corresponds to the last layer. The input I_p of each layer of the generator is the channel-dimension splice of the previous layer's output with the label-frame feature and the video feature of the corresponding downsampling layer, the first layer taking the spliced label-frame and video features of the deepest downsampling layer. Here G_p(I_p) is the output of each layer of the generator, G_0(I_0) is the key frame generated by the generator, and [·] denotes splicing in the channel dimension.
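Since the equation image defining I_p is not reproduced in this text, one plausible explicit form of the layer-input recursion, written under the assumption that each generator layer receives the label-frame and video features of the matching downsampling layer, is:

I_p = \begin{cases} \left[\, f_L^{J},\ f_{avg}^{J} \,\right], & p = J-1, \\ \left[\, G_{p+1}(I_{p+1}),\ f_L^{\,p+1},\ f_{avg}^{\,p+1} \,\right], & p = J-2, \dots, 1, 0, \end{cases}

with the generated key frame given by G_0(I_0) and [·] denoting concatenation in the channel dimension.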
Further, the step (3) includes the sub-steps of:
(31) Through an HSV-Top-K method, K pictures are selected in advance from the video sequence to be identified, then video feature extraction and key frame generation are carried out, and the generated key frame is stored on the device, comprising the following substeps:
(311) Calculating the HSV histogram feature of each picture of the video sequence, then calculating the feature center of the video sequence, and selecting the K pictures closest to the feature center to represent the whole video sequence; one of them is optionally taken as the label frame.
(312) Sending the selected K pictures into the video feature extraction module trained in step two to obtain the video features and the label frame features, which are then sent together into the generator to generate the key frame.
(313) Storing the generated key frame on the device.
(32) When retrieval is needed, the video features in the key frame are recovered with the image feature extraction module trained in step two, and the recovered features are used for retrieval and matching in video pedestrian re-identification.
Further, in the step (311), the feature center is an average value of HSV histogram features of each picture.
Further, in the step (311), the distance refers to the L2 Euclidean distance.
The invention has the beneficial effects that:
(1) The invention replaces the whole video sequence with one generated key frame embedded with the video features, which reduces the storage overhead while preserving the performance of the video features.
(2) The invention fuses the label features after each downsampling with the video features before sending them into the generator, so that the generated key frame is embedded with the video features while keeping high imaging quality.
(3) The invention uses the image feature extraction network to recover the video features from the key frame and uses the feature reconstruction loss to constrain the recovered video features to perform consistently with the original video features, which reduces the performance loss of the recovered features.
(4) In the application stage, the K most representative pictures are selected with the HSV-Top-K method to replace the original whole video sequence for key frame generation. Compared with using all frames of the video sequence, fewer pictures are used and batch processing is easier, which reduces the computation overhead.
Drawings
FIG. 1 is a schematic diagram of the overall structure of a network during a training phase;
fig. 2 is a schematic flow diagram of the application phase.
Detailed Description
The invention relates to a video sequence coding and decoding method for video pedestrian re-identification. In the training stage, the label picture features and the video features are fused and input into a generator, the label picture is used as the reconstruction target, and the key frame generated by the generator is constrained by an image reconstruction loss. The generated key frame is then sent to an image feature extraction module for video feature recovery, and a feature reconstruction loss constrains the recovered video features to be consistent with the original video features. In the application stage, K frames are selected with an HSV-Top-K method to generate a key frame, and the generated key frame is stored on the device to reduce storage cost. When retrieval is needed, the image feature extraction module recovers the video features from the generated key frame; the recovered features retain the performance of the video features and are used for pedestrian retrieval and matching. The method specifically comprises the following steps:
step one, building a neural network for training, specifically comprising the following steps:
(11) Constructing the video feature extraction module, specifically:
(111) The step size of the last downsampling of ResNet50 is set to 1.
(112) A time average pooling module, a spatial average pooling module and a batch normalization module are sequentially added behind the ResNet50 to form the video feature extraction module.
(12) A generator composed of upsampling convolutions is built and used as the encoder, specifically: the generator consists of several upsampling convolution layers followed by a convolution layer whose input and output feature maps have the same size; the number of upsampling layers is the same as the number of ResNet50 downsampling layers.
(13) The image feature extraction module is constructed and used as a decoder, and specifically comprises the following steps:
(131) the step size of the last down-sampling of ResNet50 is set to 1.
(132) The ResNet50 (which may be replaced by another convolutional network) is followed by a spatial average pooling module and a batch normalization module to form the image feature extraction module. A minimal sketch of these modules is given below.
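The following PyTorch sketch illustrates how the two feature extraction modules of step one could be assembled. It is a minimal sketch, not the patent's implementation: the class names, feature dimension handling and the specific way of setting the last downsampling stride to 1 are illustrative assumptions, and for brevity only the final-stage feature map is returned, whereas the patent's generator additionally consumes the intermediate downsampling features.

```python
# Minimal sketch (assumed names and shapes) of the video and image feature extraction
# modules of step one: ResNet50 with the last downsampling stride set to 1, followed by
# temporal/spatial average pooling and batch normalization.
import torch
import torch.nn as nn
import torchvision


def resnet50_last_stride_1():
    backbone = torchvision.models.resnet50()
    # Set the stride of the last downsampling stage to 1 (steps (111)/(131)).
    backbone.layer4[0].conv2.stride = (1, 1)
    backbone.layer4[0].downsample[0].stride = (1, 1)
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
        backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4)


class VideoFeatureExtractor(nn.Module):
    """K frames -> temporally pooled map, pooled vector f_vbfr, normalized f_vaft."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.backbone = resnet50_last_stride_1()
        self.bn = nn.BatchNorm1d(feat_dim)            # first batch normalization module

    def forward(self, clip):                          # clip: (B, K, 3, H, W)
        b, k, c, h, w = clip.shape
        maps = self.backbone(clip.reshape(b * k, c, h, w))
        maps = maps.reshape(b, k, *maps.shape[1:])
        video_map = maps.mean(dim=1)                  # temporal average pooling
        f_vbfr = video_map.mean(dim=(2, 3))           # spatial average pooling
        f_vaft = self.bn(f_vbfr)                      # batch normalization
        return video_map, f_vbfr, f_vaft


class ImageFeatureExtractor(nn.Module):
    """Generated key frame -> recovered video feature (f_ibfr before BN, f_iaft after BN)."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.backbone = resnet50_last_stride_1()
        self.bn = nn.BatchNorm1d(feat_dim)            # second batch normalization module

    def forward(self, key_frame):                     # key_frame: (B, 3, H, W)
        maps = self.backbone(key_frame)
        f_ibfr = maps.mean(dim=(2, 3))                # spatial average pooling
        f_iaft = self.bn(f_ibfr)
        return f_ibfr, f_iaft
```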
Step two, as shown in fig. 1, training the neural network built in step one, wherein the training stage specifically comprises:
(21) K pictures are randomly selected from the video sequence and input into the video feature extraction module.
(22) One frame is randomly selected from the K pictures as the label frame; the video features and the label frame features are fused and sent into the generator for upsampling, the generated key frame is output, and an image reconstruction loss function guides the reconstruction of the key frame. The specific steps are as follows:
(221) The K randomly selected pictures of the video sequence are sent into the ResNet50 of the video feature extraction module to obtain the video feature set of each picture,
F_i = {f_i^1, f_i^2, ..., f_i^5},
where f_i^j denotes the video feature of the i-th picture output by the j-th downsampling layer of ResNet50, j = 1 to 5.
(222) One picture L is randomly selected from the K pictures as the label frame; the label frame features output by the first convolutional network (ResNet50) are
F_L = {f_L^1, f_L^2, ..., f_L^5}.
(223) The video feature sets F_i of the K pictures are sent into the time average pooling module to obtain the video features of all downsampling layers,
F_avg = {f_avg^1, f_avg^2, ..., f_avg^5}, where f_avg^j is the average of f_1^j, ..., f_K^j over the K frames.
(224) F_L and F_avg are spliced in the channel dimension and sent into the generator to generate a key frame. The generator has 5 layers, of which the first 4 perform upsampling and the last keeps the feature map size unchanged. The set of all layers of the generator is denoted
{G_p | p = 4, 3, 2, 1, 0},
where p = 4 to 1 correspond in order to the 1st to 4th layers of the generator and p = 0 corresponds to the last layer. The input I_p of each layer is the channel-dimension splice of the previous layer's output with the label-frame feature f_L and the video feature f_avg of the corresponding downsampling layer, the first layer taking the spliced label-frame and video features of the deepest downsampling layer. Here G_p(I_p), p = 0 to 4, is the output of each generator layer and [·] denotes splicing in the channel dimension.
(225) For the generated key frame G_0(I_0), the image L is used as the label and an L1 loss is used as the image reconstruction loss function L_irec to reconstruct the image. (A sketch of this fused generation is given below.)
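The following sketch shows one way the 5-layer generator of step (224) could be realized. It is a hedged sketch, not the patent's implementation: the layer widths, the use of interpolation plus convolution for "upsampling convolution", the resizing of skip features, and the exact per-layer fusion order are assumptions made only for illustration.

```python
# Minimal sketch of the generator of step (224): the first 4 layers upsample, the last
# keeps the size, and each layer's input is the channel-wise splice of the previous output
# with the label-frame feature f_L^j and temporally averaged video feature f_avg^j of the
# matching scale. Channel widths and fusion order are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Generator(nn.Module):
    def __init__(self, skip_channels=(64, 256, 512, 1024, 2048), width=256):
        # skip_channels[j-1] = assumed channel count of f_L^j / f_avg^j from the backbone.
        super().__init__()
        J = len(skip_channels)
        self.layers = nn.ModuleList()
        in_ch = 2 * skip_channels[-1]                       # [f_L^J, f_avg^J] for the first layer
        for p in range(J - 1, -1, -1):                      # p = J-1, ..., 1, 0
            out_ch = 3 if p == 0 else width
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch) if p != 0 else nn.Identity(),
                nn.ReLU(inplace=True) if p != 0 else nn.Tanh()))
            if p > 0:                                       # next layer sees previous output + skips
                in_ch = out_ch + 2 * skip_channels[p - 1]

    def forward(self, f_L, f_avg):
        # f_L, f_avg: lists of J feature maps; index j-1 holds the j-th downsampling layer.
        J = len(f_L)
        x = torch.cat([f_L[-1], f_avg[-1]], dim=1)
        for i, layer in enumerate(self.layers):
            p = J - 1 - i
            if p > 0:                                       # first J-1 layers upsample x2
                x = F.interpolate(layer(x), scale_factor=2,
                                  mode='bilinear', align_corners=False)
                skip_l = F.interpolate(f_L[p - 1], size=x.shape[-2:],
                                       mode='bilinear', align_corners=False)
                skip_v = F.interpolate(f_avg[p - 1], size=x.shape[-2:],
                                       mode='bilinear', align_corners=False)
                x = torch.cat([x, skip_l, skip_v], dim=1)   # channel-dimension splice
            else:                                           # last layer keeps the size
                x = layer(x)
        return x                                            # generated key frame G_0(I_0)
```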
(23) The key frame generated in step (224) is sent to the image feature extraction module. In the image feature extraction module, the features before and after batch normalization are denoted f_ibfr and f_iaft respectively. f_ibfr is used to compute a triplet loss function L_itri, and f_iaft is sent into the fully connected layer to compute the Softmax classification loss L_iid.
(24) The video feature f_avg^5 of the last downsampling layer output by the time average pooling module in step (223) is sent to the spatial average pooling module, which outputs the feature f_vbfr; f_vbfr is then sent to the batch normalization module, which outputs the feature f_vaft. f_vbfr is used to compute a triplet loss function L_vtri, and f_vaft is sent into the fully connected layer to compute the Softmax classification loss function L_vid.
(25) An L1 loss is applied between the batch-normalized feature f_iaft of step (23) and the batch-normalized video feature f_vaft of the video feature extraction module in step (24) as a feature reconstruction constraint; the feature reconstruction loss function is denoted L_frec.
(26) Both the video feature extraction module and the image feature extraction module are trained for discriminative ability with the classification losses and the triplet losses, while the image reconstruction loss L_irec and the feature reconstruction loss L_frec are optimized simultaneously. The final total loss function is L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec. (A sketch of the loss computation is given below.)
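A minimal sketch of the total training loss of step (26) follows, assuming standard PyTorch loss modules. The number of identities, the triplet margin, the classifier layers, the triplet-index arguments and the detach on the video feature are illustrative assumptions and not specified by the patent.

```python
# Sketch of the total loss L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec.
import torch
import torch.nn as nn

num_ids = 625                                  # assumed number of pedestrian identities
cls_video = nn.Linear(2048, num_ids)           # fully connected layer after f_vaft
cls_image = nn.Linear(2048, num_ids)           # fully connected layer after f_iaft
triplet = nn.TripletMarginLoss(margin=0.3)     # margin is an assumed value
ce = nn.CrossEntropyLoss()                     # Softmax classification loss
l1 = nn.L1Loss()


def total_loss(f_vbfr, f_vaft, f_ibfr, f_iaft, key_frame, label_frame, labels,
               anchors, positives, negatives):
    """anchors/positives/negatives index the batch for the triplet terms (mining assumed)."""
    L_vtri = triplet(f_vbfr[anchors], f_vbfr[positives], f_vbfr[negatives])
    L_itri = triplet(f_ibfr[anchors], f_ibfr[positives], f_ibfr[negatives])
    L_vid = ce(cls_video(f_vaft), labels)      # video-branch classification loss
    L_iid = ce(cls_image(f_iaft), labels)      # image-branch classification loss
    L_irec = l1(key_frame, label_frame)        # image reconstruction loss (label frame as target)
    L_frec = l1(f_iaft, f_vaft.detach())       # feature reconstruction loss (detach is an assumption)
    return L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec
```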
Step three, as shown in fig. 2, the application stage specifically is:
(31) Through an HSV-Top-K method, K pictures are selected in advance from the video sequence to be identified, then video feature extraction and key frame generation are carried out, and the generated key frame is stored on the device. The specific steps are as follows:
(311) The HSV histogram feature of each picture of the video sequence is calculated, then the feature center of the video sequence is calculated, and the K pictures closest to the feature center are selected to represent the whole video sequence; one of them is optionally taken as the label frame. The feature center is the average of the HSV histogram features of the pictures; the distance is the L2 Euclidean distance. (A sketch of this selection is given after this step.)
(312) The selected K pictures are sent into the video feature extraction module trained in step two to obtain the video features and the label frame features, which are then sent together into the generator to generate the key frame.
(313) The generated key frame is stored on the device.
(32) When retrieval is needed, the video features in the key frame are recovered with the image feature extraction module trained in step two, and the recovered features are used for retrieval and matching in video pedestrian re-identification.
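The sketch below illustrates the HSV-Top-K selection of step (311): per-frame HSV histogram features, the feature center as their mean, and the K frames closest to the center in L2 distance. The histogram bin counts, the normalization and the function names are assumptions for illustration only.

```python
# Minimal sketch of HSV-Top-K frame selection (step (311)).
import cv2
import numpy as np


def hsv_histogram(frame_bgr, bins=(8, 8, 8)):
    """HSV histogram feature of one frame; bin counts and normalization are assumed."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-8)          # normalize so frames of any size are comparable


def hsv_top_k(frames_bgr, k):
    """Return indices of the K frames whose HSV histograms are closest to the feature center."""
    feats = np.stack([hsv_histogram(f) for f in frames_bgr])   # (N, 512)
    center = feats.mean(axis=0)                                # feature center
    dists = np.linalg.norm(feats - center, axis=1)             # L2 Euclidean distance
    top_k = np.argsort(dists)[:k]
    return sorted(top_k.tolist())              # keep temporal order; any one may serve as label frame
```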
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (7)

1. A video sequence encoding and decoding method for video pedestrian re-identification, comprising the steps of:
(1) building a neural network:
(11) building a video feature extraction module:
(111) the step size of the last down-sampling of the first convolutional network is set to 1.
(112) A time average pooling module, a first spatial average pooling module and a first batch normalization module are sequentially added behind the first convolutional network.
(12) Building a generator: the generator consists of several upsampling convolution layers followed by a convolution layer whose input and output feature maps have the same size; the number of upsampling layers is the same as the number of downsampling layers of the first convolutional network.
(13) Constructing an image feature extraction module:
(131) the step size of the last down-sampling of the second convolutional network is set to 1.
(132) The second convolutional network is sequentially connected with a second spatial average pooling module and a second batch normalization module.
(2) Taking K frames of the video sequence, taking one of them as the label frame, and using them to train the neural network built in step one; the input of the video feature extraction module is the K frames of the video sequence, the output of the time average pooling module is the video feature, which is then passed through the first spatial average pooling module and the first batch normalization module; the input of the generator is the video feature output by the time average pooling module together with the output of the label frame in the first convolutional network, and the output is a key frame; the input of the image feature extraction module is the key frame, and the output is the video feature recovered from the key frame.
(3) Taking K frames of the video sequence to be identified, designating one frame as the label frame, inputting the K frames into the video feature extraction module and the generator trained in step two, and storing the key frame output by the generator; when retrieval is needed, the stored key frame is input into the image feature extraction module trained in step two to recover the video features in the key frame for pedestrian retrieval.
2. The method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 1, characterized in that said step (2) comprises the sub-steps of:
(21) Randomly selecting K pictures from the video sequence and inputting them into the video feature extraction module.
(22) Selecting one frame from the K selected pictures as the label frame, fusing the video features and the label frame features, sending the fused features into the generator for upsampling, outputting the generated key frame, and using an image reconstruction loss function L_irec to guide the reconstruction of the key frame.
(23) Sending the key frame generated in step (22) to the image feature extraction module; in the image feature extraction module, the features before and after batch normalization are denoted f_ibfr and f_iaft respectively; f_ibfr is used to compute a triplet loss function L_itri, and f_iaft is sent into the fully connected layer to compute the Softmax classification loss L_iid.
(24) The video feature of the last downsampling layer output by the time average pooling module in the video feature extraction module is sent to the first spatial average pooling module, which outputs the feature f_vbfr; f_vbfr is then sent to the first batch normalization module, which outputs the feature f_vaft; f_vbfr is used to compute a triplet loss function L_vtri, and f_vaft is sent into the fully connected layer to compute the Softmax classification loss function L_vid.
(25) An L1 loss is applied between the batch-normalized feature f_iaft of step (23) and the batch-normalized video feature f_vaft of the video feature extraction module in step (24) as a feature reconstruction constraint; the feature reconstruction loss function is denoted L_frec.
(26) Both the video feature extraction module and the image feature extraction module are trained for discriminative ability with the classification losses and the triplet losses, while the image reconstruction loss L_irec and the feature reconstruction loss L_frec are optimized simultaneously; finally the entire neural network is trained according to the total loss function L_loss = L_vtri + L_vid + L_itri + L_iid + L_irec + L_frec.
3. A method for coding and decoding a video sequence for pedestrian video re-identification according to claim 2, characterized in that said step (22) comprises the sub-steps of:
(221) Sending the K randomly selected pictures of the video sequence into the first convolutional network of the video feature extraction module to obtain the video feature set of each picture,
F_i = {f_i^1, f_i^2, ..., f_i^J},
where f_i^j denotes the video feature of the i-th picture output by the j-th downsampling layer of the first convolutional network, i = 1 to K, and J is the number of downsampling layers of the first convolutional network.
(222) Randomly selecting one picture L from the K pictures as the label frame, whose features are
F_L = {f_L^1, f_L^2, ..., f_L^J}.
(223) Sending the video feature sets F_i of the K pictures into the time average pooling module to obtain the video features of all downsampling layers,
F_avg = {f_avg^1, f_avg^2, ..., f_avg^J}, where f_avg^j is the average of f_1^j, ..., f_K^j over the K frames.
(224) Splicing F_L and F_avg in the channel dimension and sending them into the generator to generate a key frame.
(225) For the generated key frame, using the image L as the label frame and using an L1 loss as the image reconstruction loss function L_irec to reconstruct the image.
4. A method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 3, characterized in that the step (224) is specifically: the generator has J layers in total, of which the first J-1 layers perform upsampling and the last layer keeps the feature map size unchanged; the set of all layers of the generator is denoted {G_p | p = J-1, J-2, ..., 1, 0}, where p = J-1 to 1 correspond in order to the 1st to (J-1)-th layers of the generator and p = 0 corresponds to the last layer; the input I_p of each layer of the generator is the channel-dimension splice of the previous layer's output with the label-frame feature and the video feature of the corresponding downsampling layer, where G_p(I_p) is the output of each layer of the generator, G_0(I_0) is the key frame generated by the generator, and [·] denotes splicing in the channel dimension.
5. The method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 1, characterized in that said step (3) comprises the sub-steps of:
(31) Through an HSV-Top-K method, K pictures are selected in advance from the video sequence to be identified, then video feature extraction and key frame generation are carried out, and the generated key frame is stored on the device, comprising the following substeps:
(311) Calculating the HSV histogram feature of each picture of the video sequence, then calculating the feature center of the video sequence, and selecting the K pictures closest to the feature center to represent the whole video sequence; one of them is optionally taken as the label frame.
(312) Sending the selected K pictures into the video feature extraction module trained in step two to obtain the video features and the label frame features, which are then sent together into the generator to generate the key frame.
(313) Storing the generated key frame on the device.
(32) When retrieval is needed, the video features in the key frame are recovered with the image feature extraction module trained in step two, and the recovered features are used for retrieval and matching in video pedestrian re-identification.
6. The method according to claim 5, wherein in the step (311), the feature center is an average value of HSV histogram features of each picture.
7. The method for encoding and decoding a video sequence for video pedestrian re-identification as claimed in claim 5, wherein in said step (311), said distance is the L2 Euclidean distance.
CN202011378786.2A 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification Active CN112418127B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011378786.2A CN112418127B (en) 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011378786.2A CN112418127B (en) 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification

Publications (2)

Publication Number Publication Date
CN112418127A true CN112418127A (en) 2021-02-26
CN112418127B CN112418127B (en) 2022-05-03

Family

ID=74828951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011378786.2A Active CN112418127B (en) 2020-11-30 2020-11-30 Video sequence coding and decoding method for video pedestrian re-identification

Country Status (1)

Country Link
CN (1) CN112418127B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033697A (en) * 2021-04-15 2021-06-25 浙江大学 Automatic model evaluation method and device based on batch normalization layer
CN116563895A (en) * 2023-07-11 2023-08-08 四川大学 Video-based animal individual identification method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008804A (en) * 2018-12-12 2019-07-12 浙江新再灵科技股份有限公司 Elevator monitoring key frame based on deep learning obtains and detection method
US20200098085A1 (en) * 2018-09-20 2020-03-26 Robert Bosch Gmbh Monitoring apparatus for person recognition and method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200098085A1 (en) * 2018-09-20 2020-03-26 Robert Bosch Gmbh Monitoring apparatus for person recognition and method
CN110008804A (en) * 2018-12-12 2019-07-12 浙江新再灵科技股份有限公司 Elevator monitoring key frame based on deep learning obtains and detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DAPENG CHEN et al.: "Video Person Re-identification with Competitive Snippet-similarity Aggregation and Co-attentive Snippet Embedding", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition *
李梦静 (Li Mengjing) et al.: "Research Progress on Video-based Person Re-identification" (视频行人重识别研究进展), Journal of Nanjing Normal University (Natural Science Edition) (南京师大学报(自然科学版)) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033697A (en) * 2021-04-15 2021-06-25 浙江大学 Automatic model evaluation method and device based on batch normalization layer
CN113033697B (en) * 2021-04-15 2022-10-04 浙江大学 Automatic model evaluation method and device based on batch normalization layer
CN116563895A (en) * 2023-07-11 2023-08-08 四川大学 Video-based animal individual identification method

Also Published As

Publication number Publication date
CN112418127B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111062892B (en) Single image rain removing method based on composite residual error network and deep supervision
CN109087258B (en) Deep learning-based image rain removing method and device
CN110569814B (en) Video category identification method, device, computer equipment and computer storage medium
CN112418127B (en) Video sequence coding and decoding method for video pedestrian re-identification
CN111815509B (en) Image style conversion and model training method and device
CN114936605A (en) Knowledge distillation-based neural network training method, device and storage medium
CN115311720B (en) Method for generating deepfake based on transducer
CN111429466A (en) Space-based crowd counting and density estimation method based on multi-scale information fusion network
CN112818955B (en) Image segmentation method, device, computer equipment and storage medium
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN112581409A (en) Image defogging method based on end-to-end multiple information distillation network
CN112241939A (en) Light-weight rain removing method based on multi-scale and non-local
CN113379858A (en) Image compression method and device based on deep learning
CN116434241A (en) Method and system for identifying text in natural scene image based on attention mechanism
CN114255456A (en) Natural scene text detection method and system based on attention mechanism feature fusion and enhancement
CN115496919A (en) Hybrid convolution-transformer framework based on window mask strategy and self-supervision method
CN114943937A (en) Pedestrian re-identification method and device, storage medium and electronic equipment
CN115331083B (en) Image rain removing method and system based on gradual dense feature fusion rain removing network
WO2023159765A1 (en) Video search method and apparatus, electronic device and storage medium
CN112801912B (en) Face image restoration method, system, device and storage medium
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN112784838A (en) Hamming OCR recognition method based on locality sensitive hashing network
CN113920317A (en) Semantic segmentation method based on visible light image and low-resolution depth image
Pei et al. UVL: A Unified Framework for Video Tampering Localization
Chen et al. A video key frame extraction method based on multiview fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant