CN116343265A - Full-supervision video pedestrian re-identification method, system, equipment and medium - Google Patents

Full-supervision video pedestrian re-identification method, system, equipment and medium

Info

Publication number
CN116343265A
CN116343265A
Authority
CN
China
Prior art keywords
frame
video
pedestrian
target person
person
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310327791.8A
Other languages
Chinese (zh)
Inventor
王乐
仵鹏飞
周三平
陈仕韬
辛景民
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Shun'an Artificial Intelligence Research Institute
Xian Jiaotong University
Original Assignee
Ningbo Shun'an Artificial Intelligence Research Institute
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Shun'an Artificial Intelligence Research Institute, Xian Jiaotong University filed Critical Ningbo Shun'an Artificial Intelligence Research Institute
Priority to CN202310327791.8A priority Critical patent/CN116343265A/en
Publication of CN116343265A publication Critical patent/CN116343265A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a full-supervision video pedestrian re-identification method, system, equipment and medium. The method comprises the following steps: acquiring a video clip containing a target person and a video clip on which pedestrian re-identification is to be performed; and, based on the obtained video clips, performing pedestrian re-identification processing with a pre-trained pedestrian re-identification model and outputting a pedestrian re-identification result. The disclosed method is, in particular, a video pedestrian re-identification method based on temporal correlation decomposition: it removes the features of non-target persons by exploiting the different relative states of the target person and the shielding (occluding) person, so that the influence of the shielding person on the model's feature learning can be eliminated under occlusion; in addition, misaligned video frames are realigned with a correlation filtering algorithm to recover the semantic consistency of the video segment, so that retrieval accuracy can be improved.

Description

Full-supervision video pedestrian re-identification method, system, equipment and medium
Technical Field
The invention belongs to the technical field of computer vision, relates to the field of pedestrian re-identification, and in particular relates to a full-supervision video pedestrian re-identification method, system, equipment and medium.
Background
The goal of the video pedestrian re-identification task is to retrieve, from a large gallery of video clips, the clips that contain a target person, given a query video of that person. The task has many application scenarios of practical significance, such as intelligent video surveillance systems, intelligent security, and cross-camera target tracking.
At present, the existing video pedestrian re-identification method has the following technical defects:
(1) Occlusion between people occurs in crowded scenes, and when the target person is severely shielded it is difficult for the model to learn the features of the target person accurately, which causes retrieval errors; in particular, when the shielding person and the target person have similar appearance, the model may attend to the wrong region, so that retrieval fails;
(2) Because of imperfect person-detection results, the video segments of the same pedestrian may be inaccurately localized, so that the same spatial position in consecutive frames carries different semantics; when the features of the video clip are fused, the misaligned portions can corrupt the final video feature and reduce retrieval accuracy.
Disclosure of Invention
The invention aims to provide a full-supervision video pedestrian re-identification method, system, equipment and medium, so as to solve one or more of the above technical problems. The disclosed method is, in particular, a video pedestrian re-identification method based on temporal correlation decomposition: it removes the features of non-target persons by exploiting the different relative states of the target person and the shielding person, so that the influence of the shielding person on the model's feature learning can be eliminated under occlusion; in addition, misaligned video frames are realigned with a correlation filtering algorithm to recover the semantic consistency of the video segment, so that retrieval accuracy can be improved.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A first aspect of the invention provides a full-supervision video pedestrian re-identification method, comprising the following steps:
acquiring a video clip containing a target person and a video clip on which pedestrian re-identification is to be performed;
based on the obtained video clip containing the target person and the video clip on which pedestrian re-identification is to be performed, performing pedestrian re-identification processing with a pre-trained pedestrian re-identification model, and outputting a pedestrian re-identification result; wherein,
the pedestrian re-recognition result at least comprises whether the video clip to be subjected to pedestrian re-recognition contains a target person or not;
the pedestrian re-recognition model includes:
the encoder module is used for inputting an original video frame to perform feature extraction and outputting a frame level feature map; wherein the encoder module is based on a classical Vision Transformer architecture;
the characteristic alignment module is used for performing deviation calculation processing on an input original video frame by adopting a kernel correlation filtering algorithm and outputting the cross-frame position deviation of a target person;
the decoder module is used for inputting the frame level feature map and the cross-frame position deviation of the target person, carrying out feature alignment processing and obtaining an aligned frame level feature map; carrying out local characteristic removal processing on the aligned frame level characteristic images by utilizing different relative states of the target person and the shielding person to obtain a frame level characteristic image with local characteristics of the shielding person removed; based on a multi-head self-attention mechanism, carrying out feature interaction and fusion processing on the frame-level feature map with the local features of the shielding person removed, and outputting video-level features; wherein the decoder module is a multi-headed self-attention mechanism based decoder.
The method of the invention is further improved in that the training step of the pre-trained pedestrian re-recognition model comprises the following steps:
acquiring a training sample set; each training sample in the training sample set comprises a sampled video clip containing pedestrians, and the video clip contains ID numbers of the pedestrians;
during training, for a selected training sample, the sampled video segment containing a pedestrian in the training sample is input into the pedestrian re-identification model, and the ID number of the pedestrian in the video segment is predicted as the prediction result; the prediction result is compared with the ID number of the pedestrian contained in the video clip of the training sample, supervised training is performed with cross-entropy, triplet and mutual-information loss functions, the parameters are updated, and the pre-trained pedestrian re-identification model is obtained after a preset convergence condition is reached.
The method is further improved in that, in the feature alignment module, the step of performing deviation calculation on the input original video frames with a kernel correlation filtering algorithm and outputting the cross-frame position deviation of the target person comprises the following steps:
calculating the cross-frame position deviation of the target person in each video segment with a correlation filtering algorithm; wherein each frame X_t in the video clip is averaged along the channel dimension and converted into X'_t ∈ R^(H×W);
initializing a correlation filter, comprising: initializing the correlation filter with the first frame X'_1, the expressions being
k_{1,1} = IDFT(DFT(X'_1)* ⊙ DFT(X'_1));
α_1 = DFT(y) / (DFT(k_{1,1}) + λ);
where DFT(·) denotes the discrete Fourier transform, IDFT(·) denotes the inverse discrete Fourier transform, y is the Gaussian regression target, λ is the regularization coefficient, DFT(X'_1)* is the complex conjugate of DFT(X'_1), ⊙ denotes element-wise multiplication, and α_1 is the correlation filter computed on the first frame;
calculating the cross-frame position deviation on the next frame, and then updating the filter parameters; wherein the cross-frame position deviation on the 2nd frame is calculated with the correlation filter and the filter parameters are updated with an exponential moving average, the expressions being
k_{1,2} = IDFT(DFT(X'_1)* ⊙ DFT(X'_2));
M_2 = IDFT(DFT(k_{1,2}) ⊙ α_1);
where M_2 ∈ R^(H×W) is the response map of the 2nd frame on the correlation filter, and the cross-frame position deviation of the target person on the 2nd frame is obtained by computing the distance of the maximum response point of M_2 from the center;
rolling the pixels of the 2nd frame into alignment according to the cross-frame position deviation of the target person, and updating the filter parameters with the aligned 2nd frame, the expressions being
k_{2,2} = IDFT(DFT(X'_2)* ⊙ DFT(X'_2));
α_2 = DFT(y) / (DFT(k_{2,2}) + λ);
α_2 = βα_1 + (1-β)α_2;
where X'_2 here denotes the 2nd frame after pixel-roll alignment and β is the exponential-moving-average step size by which the filter parameters are updated;
repeating the above steps until the cross-frame position deviation of the target person on all frames has been calculated; wherein the filter α_2 updated on the 2nd frame is used to calculate the cross-frame position deviation of the target person on the 3rd frame, and the 3rd frame after pixel-roll alignment is used to update the parameters of filter α_2; the cross-frame position deviation of the target person is calculated and the filter parameters are updated frame by frame in this way until the cross-frame position deviation of the target person on all frames has been calculated.
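By way of non-limiting illustration only, the following NumPy sketch shows how the correlation-filter initialization, response map and cross-frame deviation described above could be computed, assuming single-channel frames of size H×W, a Gaussian regression target centered on the frame, and a linear kernel; all function names (gaussian_target, init_filter, response_and_offset) and parameter values are illustrative and do not appear in the original disclosure.

import numpy as np

def gaussian_target(h, w, sigma=2.0):
    # 2-D Gaussian regression target y, peaked at the frame center.
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def init_filter(x, y, lam=1e-4):
    # alpha = DFT(y) / (DFT(k_xx) + lambda), with the linear-kernel
    # autocorrelation k_xx = IDFT(DFT(x)* ⊙ DFT(x)).
    X = np.fft.fft2(x)
    k_xx = np.real(np.fft.ifft2(np.conj(X) * X))
    return np.fft.fft2(y) / (np.fft.fft2(k_xx) + lam)

def response_and_offset(x_prev, x_cur, alpha):
    # Response map M = IDFT(DFT(k_prev_cur) ⊙ alpha); the offset of its maximum
    # from the frame center is taken as the cross-frame position deviation.
    k = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x_prev)) * np.fft.fft2(x_cur)))
    m = np.real(np.fft.ifft2(np.fft.fft2(k) * alpha))
    h, w = m.shape
    py, px = np.unravel_index(np.argmax(m), m.shape)
    return py - h // 2, px - w // 2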
The method is further improved in that the steps of inputting the frame-level feature map and the cross-frame position deviation of the target person into the decoder module, performing feature alignment to obtain an aligned frame-level feature map, removing the local features of the shielding person from the aligned frame-level feature map by exploiting the different relative states of the target person and the shielding person, and performing feature interaction and fusion on the frame-level feature map with the local features of the shielding person removed based on a multi-head self-attention mechanism to output the video-level feature, comprise:
based on the obtained cross-frame position deviation of the target person in adjacent frames, letting the horizontal and vertical deviations of the maximum response point of the response map M_t from the center be Δx_t and Δy_t, and rolling the feature map Z_t by these deviations to align it, thereby obtaining the aligned feature map Z'_t;
averaging the aligned feature maps Z'_t along the time dimension, so as to attend to the relatively stationary part of the sequence, to obtain the feature map Z_avg, the expression being
Z_avg = (1/T) Σ_{t=1}^{T} Z'_t;
computing the cosine similarity between the frame-level feature map Z'_t and Z_avg, the expression being
c_t = <Z'_t, Z_avg> / (||Z'_t|| · ||Z_avg||);
where c_t takes values between 0 and 1 and represents, for each pixel block, the cosine similarity between the feature of that block in Z'_t and the corresponding block of the averaged feature map Z_avg, and <·,·> denotes the vector inner product;
eliminating the features of the shielding person according to a threshold generated adaptively from the cosine similarity, by taking the mean of c_t along the time dimension and along the time-space dimensions respectively to generate the parameters γ and δ;
eliminating the local features whose cosine similarity is smaller than the adaptively generated threshold, expressed as
m_t = 1[c_t ≥ τ(γ, δ)];
where τ(γ, δ) denotes the threshold adaptively generated from γ and δ and 1[·] is the indicator function; multiplying the mask m_t and the feature map Z'_t element-wise yields the feature map Ẑ_t from which the local features of the shielding person have been eliminated;
successively performing temporal interaction and spatial interaction on the feature map Ẑ_t; wherein local features having the same spatial position interact with each other along the time dimension, expressed as
Ẑ'^(n) = Softmax((Ẑ^(n) W_q)(Ẑ^(n) W_k)^T / √D)(Ẑ^(n) W_v);
where W_q, W_k and W_v are the learnable weights of the query, key and value, Ẑ^(n) collects the features at spatial position n across all frames, and Ẑ' is the feature map after temporal interaction;
performing spatial interaction on the feature map Ẑ', where all local features within the same frame interact with each other, expressed as
S_t = Softmax((Ẑ'_t W_q)(Ẑ'_t W_k)^T / √D)(Ẑ'_t W_v);
where S is the feature map after the spatio-temporal interaction of the decoder;
fusing the feature map S after spatio-temporal interaction with global average pooling to obtain the final video-level feature.
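As a non-limiting illustration of the decoder described above, the PyTorch sketch below applies one multi-head self-attention layer along the time dimension, one along the spatial dimension, and global average pooling to a masked frame-level feature map of shape (T, N, D); residual connections, layer normalization and the class tokens are omitted for brevity, and the class name and hyper-parameter values are assumptions rather than taken from the disclosure.

import torch
import torch.nn as nn

class TemporalSpatialDecoder(nn.Module):
    # First lets features at the same spatial position interact across frames,
    # then lets all features within the same frame interact, and finally fuses
    # the result by global average pooling into one video-level feature.
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z):                         # z: (T, N, D) masked frame-level features
        zt = z.permute(1, 0, 2)                   # (N, T, D): one sequence per spatial position
        zt, _ = self.temporal_attn(zt, zt, zt)
        zs = zt.permute(1, 0, 2)                  # (T, N, D): one sequence per frame
        zs, _ = self.spatial_attn(zs, zs, zs)
        return zs.mean(dim=(0, 1))                # (D,) video-level feature via global average pooling

decoder = TemporalSpatialDecoder(dim=768, heads=12)
video_feature = decoder(torch.randn(8, 128, 768))  # an 8-frame clip with 128 patch tokens per frame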
The method is further improved in that, in the training step of the pre-trained pedestrian re-identification model, the loss function adopted is expressed as
L = L_v + λL_f;
where L is the overall loss function of the model and λ is a hyper-parameter that balances the weights of the frame-level feature loss function L_f and the video-level loss function L_v;
L_v = L_x(v) + L_i(v);
where L_x denotes the triplet loss function, L_i denotes the cross-entropy loss function, and v denotes the class token of the video-level feature;
the frame-level loss L_f is computed on the class tokens of the frame-level features and combines the triplet loss L_x, the cross-entropy loss L_i and the mutual-information loss L_m, where L_m denotes the mutual-information loss function and Z_{t[cls]} denotes the class token of the t-th frame-level feature.
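For illustration, one possible way to assemble the supervision described above is sketched below in PyTorch, assuming that video-level and per-frame logits and features are already available; the batch-hard triplet variant, the weighting, and the pluggable mutual_info_loss callable are assumptions, since the exact formulations are not reproduced here.

import torch
import torch.nn.functional as F

def batch_hard_triplet(features, labels, margin=0.3):
    # Batch-hard triplet loss on Euclidean distances (a common choice; the
    # patent does not fix the exact variant).
    dist = torch.cdist(features, features, p=2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=features.device)
    hardest_pos = dist.masked_fill(~same, 0.0).max(dim=1).values
    hardest_neg = dist.masked_fill(same | eye, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def total_loss(video_logits, video_feat, frame_logits, frame_feat, labels,
               mutual_info_loss, lam=0.5):
    # L = L_v + lambda * L_f: the video-level loss combines cross-entropy and
    # triplet terms; the frame-level loss adds the mutual-information term.
    l_v = F.cross_entropy(video_logits, labels) + batch_hard_triplet(video_feat, labels)
    num_frames = frame_logits.shape[1]            # frame_logits: (B, T, num_ids)
    l_f = sum(F.cross_entropy(frame_logits[:, i], labels) +
              batch_hard_triplet(frame_feat[:, i], labels)
              for i in range(num_frames)) / num_frames
    l_f = l_f + mutual_info_loss(frame_feat, video_feat)
    return l_v + lam * l_f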
A second aspect of the invention provides a full-supervision video pedestrian re-identification system, comprising:
the data acquisition module is used for acquiring a video clip containing a target person and a video clip on which pedestrian re-identification is to be performed;
the pedestrian re-recognition result acquisition module is used for carrying out pedestrian re-recognition processing by utilizing a pre-trained pedestrian re-recognition model based on the acquired video segment containing the target person and the video segment to be subjected to pedestrian re-recognition and outputting a pedestrian re-recognition result; the pedestrian re-recognition result at least comprises whether the video clip to be subjected to pedestrian re-recognition contains a target person or not;
The pedestrian re-recognition model includes:
the encoder module is used for inputting an original video frame to perform feature extraction and outputting a frame level feature map; wherein the encoder module is based on a classical Vision Transformer architecture;
the characteristic alignment module is used for performing deviation calculation processing on an input original video frame by adopting a kernel correlation filtering algorithm and outputting the cross-frame position deviation of a target person;
the decoder module is used for inputting the frame level feature map and the cross-frame position deviation of the target person, carrying out feature alignment processing and obtaining an aligned frame level feature map; carrying out local characteristic removal processing on the aligned frame level characteristic images by utilizing different relative states of the target person and the shielding person to obtain a frame level characteristic image with local characteristics of the shielding person removed; based on a multi-head self-attention mechanism, carrying out feature interaction and fusion processing on the frame-level feature map with the local features of the shielding person removed, and outputting video-level features; wherein the decoder module is a multi-headed self-attention mechanism based decoder.
An electronic device provided in a third aspect of the present invention includes:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the full-supervision video pedestrian re-identification method according to any one of the first aspect of the invention.
A fourth aspect of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the full-supervision video pedestrian re-identification method according to any one of the first aspect of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a full-supervision video pedestrian re-recognition method, in particular to a novel video pedestrian re-recognition method based on time correlation decomposition, which can effectively distinguish target people and shielding people and accurately conduct pedestrian re-recognition. Further specifically explaining, aiming at the technical problems that the influence of the shielding person on the feature learning, the inconsistent semantics caused by insufficient detection results and the like in the existing pedestrian re-recognition method are difficult to eliminate, the method provided by the invention firstly provides that the shielding person and the target person are distinguished through correlation, the local features of the shielding person are removed before the space-time interaction fusion, and the related filtering algorithm is utilized to realign the video frames to recover the semantic consistency among frames, so that the retrieval accuracy can be improved.
In the model, a classical encoder-decoder structure is adopted: video-level features are modeled globally through the Vision Transformer, frame-level features are extracted with a ViT-Base encoder, and multi-head self-attention layers are then used for spatio-temporal interaction. The relative states are used to distinguish the target person from the shielding person, and the correlation filter is used to recover semantic consistency, so no additional learnable parameters are introduced and the training burden of the model is not increased; the two modules are complementary, which strengthens the final video-level representation and improves the accuracy of pedestrian re-identification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the description of the embodiments or of the prior art are briefly introduced below; it will be apparent to those of ordinary skill in the art that the drawings in the following description show only some embodiments of the invention, and that other drawings may be derived from them without inventive effort.
Fig. 1 is a schematic flow chart of a full-supervision video pedestrian re-identification method provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a training process of a model in an embodiment of the invention;
FIG. 3 is a schematic block diagram of a model in an embodiment of the invention;
FIG. 4 is a schematic diagram of the visual results of removing the local features of the shielding person by relative state in an embodiment of the present invention;
FIG. 5 is a diagram of a visual result of recovering inter-frame semantic consistency using a correlation filter in an embodiment of the present invention;
FIG. 6 is a GradCAM heat-map visualization compared with the baseline model in an embodiment of the invention;
fig. 7 is a schematic diagram of a full-supervision video pedestrian re-identification system provided by an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention is described in further detail below with reference to the attached drawing figures:
referring to fig. 1, the full-supervision video pedestrian re-identification method provided by the embodiment of the invention comprises the following steps:
step 1, acquiring a video clip containing a target person and a video clip on which pedestrian re-identification is to be performed;
step 2, based on the obtained video clips containing the target person and the video clips to be re-identified by the pedestrian, performing pedestrian re-identification processing by utilizing a pre-trained pedestrian re-identification model, and outputting a pedestrian re-identification result; the pedestrian re-recognition result at least comprises a judgment result of whether the video clip to be subjected to pedestrian re-recognition contains a target person or not;
in the pre-trained pedestrian re-recognition model of the embodiment of the invention, a model architecture comprises:
the encoder module is based on a classical Vision Transformer architecture and is used for inputting an original video frame to perform feature extraction and outputting a frame level feature map;
the characteristic alignment module adopts a kernel correlation filtering algorithm and is used for inputting an original video frame to perform deviation calculation processing and outputting the cross-frame position deviation of a target person;
the decoder module is used for firstly inputting the frame level characteristic image output by the encoder module and performing characteristic alignment processing on the target person cross-frame position deviation output by the characteristic alignment module; then, the local characteristics of the shielding person are removed by utilizing different relative states of the target person and the shielding person and the aligned frame level characteristic images; and finally, based on a multi-head self-attention mechanism, carrying out feature interaction and fusion processing on the frame-level feature map from which the local features of the shielding person are removed, and finally outputting video-level features.
In the embodiment of the invention, the training steps specifically comprise:
acquiring a training sample set; each training sample in the training sample set comprises a sampled video clip containing pedestrians, and the video clip contains ID numbers of the pedestrians;
during training, for a selected training sample, the video clip in the sample is input into the model, and the ID number of the pedestrian in the video clip is predicted; the prediction result is compared with the pedestrian ID number in the sample set, supervised training is performed with cross-entropy, triplet and mutual-information loss functions, the parameters are updated with an AdamW optimizer, and the pre-trained pedestrian re-identification model is obtained after the model has been trained for 90 epochs.
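A training loop consistent with the procedure described above (AdamW, combined cross-entropy/triplet/mutual-information supervision, 90 epochs, cosine learning-rate schedule, weight decay 0.0005) might look as follows; the model, data loader, criterion and learning-rate value are placeholders rather than part of the original disclosure.

import torch

def train(model, loader, criterion, epochs=90, lr=3e-4, weight_decay=5e-4, device="cuda"):
    # criterion(outputs, labels) is assumed to return the combined
    # cross-entropy + triplet + mutual-information loss described above.
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        model.train()
        for clips, labels in loader:              # clips: (B, T, C, H, W); labels: pedestrian IDs
            clips, labels = clips.to(device), labels.to(device)
            loss = criterion(model(clips), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()                          # cosine learning-rate schedule
    return model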
In summary, the invention discloses a full-supervision video pedestrian re-identification method comprising an encoder based on the Vision Transformer, a feature alignment module based on a correlation filtering algorithm, and a decoder based on a multi-head self-attention mechanism. The encoder models the frame-level features; before the spatio-temporal interaction and fusion, the feature alignment module effectively restores the semantic consistency between video frames without introducing additional learnable parameters, and provides the aligned frame-level feature maps to the subsequent decoder module; the decoder effectively exploits the fine-grained features of the target person in the feature map and, using the relative state between the target person and the shielding person, erases the local features of the shielding person in the aligned frame-level feature map, so that the spatio-temporal interaction and fusion are carried out free of the influence of occlusion and misalignment and the final video-level representation is obtained.
Referring to fig. 2 and fig. 3, in a specific exemplary embodiment of the invention, training the model in the full-supervision video pedestrian re-identification method comprises the following steps:
step 1, collecting pedestrian video clips and pedestrian video clip ID numbers, including:
1.1 Uniformly sampling video clips with different lengths to obtain video clips with the same length and different ID numbers;
1.2 For the video segment sampled in step 1.1), save its ID number as its tag.
Step 2, calculating the cross-frame position deviation of the target person in each video segment by using a correlation filtering algorithm, wherein the step comprises the following steps:
2.1 Initializing a correlation filter on frame 1;
2.2) Calculating the cross-frame position deviation of the target person on the 2nd frame using the correlation filter, and updating the filter parameters using an exponential moving average;
2.3) Calculating the cross-frame position deviation of the target person on the 3rd frame with the updated filter in the same way, and updating the filter parameters again;
2.4) Repeating the above steps until the cross-frame position deviation of the target person in all frames has been calculated.
Step 3, learning frame level features with a self-attention encoder, comprising:
3.1 Constructing a self-attention encoder based on the Vision Transformer structure;
3.2 A sampled video segment is input and frame level features are learned with an encoder.
Step 4, performing space-time interaction and feature fusion with a self-attention decoder:
4.1 Constructing a decoder based on a multi-head self-attention mechanism;
4.2) Aligning the frame-level feature map based on the target-person cross-frame position deviations calculated in step 2.4), and recovering semantic consistency;
4.3 Removing the characteristics of the shielding person in the frame level characteristic map through different relative states of the target person and the shielding person, and eliminating the influence of the shielding person on the characteristic map;
4.4 Performing interaction of time dimension on the frame level features, and then performing interaction of space dimension;
4.5 Using global average pooling to fuse the frame level features to obtain final video level features.
Step 5, calculating a neural network loss function:
5.1 Inputting ID numbers of the pedestrian video clips, and calculating a cross entropy loss function according to the video level characteristics obtained in the step 4;
5.2 A triplet loss function and other loss functions are calculated.
Step 6, optimizing network parameters, and improving the accuracy of video pedestrian re-identification:
6.1 Iteratively optimizing the neural network parameters according to the loss function obtained in the step 5;
6.2) After the preset number of iterations is reached, the encoder and decoder obtained in step 3 and step 4 realize video pedestrian re-identification.
In summary, aiming at the problems of inconsistent semantics caused by imperfect detection results and of person occlusion caused by crowded scenes, a correlation filtering algorithm is introduced to restore semantic consistency between video frames, the features of the shielding person are effectively removed through the relative state between the target person and the shielding person, and the model is thereby driven to realize accurate video pedestrian re-identification under occlusion and imperfect detection results.
More specifically, the video pedestrian re-identification method based on temporal correlation decomposition in the embodiment of the invention comprises the following steps:
step 1, collecting pedestrian video clips and pedestrian video clip ID numbers, including:
1.1) Uniformly sampling video clips of different lengths into video clips with a length of 8 frames; wherein each video clip is uniformly divided into 8 sub-segments of equal length, one frame is randomly sampled from each sub-segment, and each video clip is thus sampled into a clip of length 8, expressed as V = {X_1, X_2, ..., X_8}, where X_t ∈ R^(H×W×C).
1.2) Collecting the ID number of each video clip as its label.
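The restricted random sampling of step 1.1) can be illustrated by the short NumPy routine below, which divides a clip of arbitrary length into 8 equal sub-segments and draws one frame from each; the function name and array layout are illustrative assumptions.

import numpy as np

def sample_clip(frames, num_out=8, rng=None):
    # frames: array of shape (L, H, W, C) with L >= num_out.
    # Divide the clip into num_out equal sub-segments and draw one frame from each.
    rng = rng if rng is not None else np.random.default_rng()
    bounds = np.linspace(0, len(frames), num_out + 1).astype(int)
    picks = [int(rng.integers(lo, hi)) for lo, hi in zip(bounds[:-1], bounds[1:])]
    return frames[np.array(picks)]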
Step 2, calculating the cross-frame position deviation of the target person in each video segment with a correlation filtering algorithm; wherein each frame X_t in the video clip is averaged along the channel dimension and converted into X'_t ∈ R^(H×W).
2.1) Initializing a correlation filter, comprising: initializing the correlation filter with the first frame X'_1:
k_{1,1} = IDFT(DFT(X'_1)* ⊙ DFT(X'_1));
α_1 = DFT(y) / (DFT(k_{1,1}) + λ);
wherein DFT(·) denotes the discrete Fourier transform, IDFT(·) denotes the inverse discrete Fourier transform, y is the Gaussian regression target, λ is the regularization coefficient, DFT(X'_1)* is the complex conjugate of DFT(X'_1), ⊙ denotes element-wise (matrix element) multiplication, and α_1 is the correlation filter computed on the first frame.
2.2) Calculating the cross-frame position deviation on the next frame, and then updating the filter parameters; wherein the cross-frame position deviation on the 2nd frame is calculated with the correlation filter and the filter parameters are updated with an exponential moving average:
k_{1,2} = IDFT(DFT(X'_1)* ⊙ DFT(X'_2));
M_2 = IDFT(DFT(k_{1,2}) ⊙ α_1);
wherein M_2 ∈ R^(H×W) is the response map of the 2nd frame on the correlation filter; the cross-frame position deviation of the target person on the 2nd frame is obtained by computing the distance of the maximum response point of M_2 from the center. The pixels of the 2nd frame are rolled into alignment according to this deviation, and the filter parameters are updated with the aligned 2nd frame:
k_{2,2} = IDFT(DFT(X'_2)* ⊙ DFT(X'_2));
α_2 = DFT(y) / (DFT(k_{2,2}) + λ);
α_2 = βα_1 + (1-β)α_2;
where X'_2 here denotes the 2nd frame after pixel-roll alignment and β is the exponential-moving-average step size by which the filter parameters are updated.
2.3) Repeating the above steps until the cross-frame position deviation of the target person on all frames has been calculated; wherein the filter α_2 updated on the 2nd frame is used to calculate the cross-frame position deviation of the target person on the 3rd frame, and the 3rd frame after pixel-roll alignment is used to update the parameters of filter α_2; and so on, until the cross-frame position deviation of the target person on all frames has been calculated.
Step 3, learning frame level features with a self-attention encoder, comprising:
3.1) Constructing a self-attention encoder based on the Vision Transformer structure; wherein,
the Vision Transformer (ViT) proposed in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" achieves strong results across computer vision tasks and is used here as the backbone network for the video pedestrian re-identification task; it employs a multi-head self-attention mechanism, which enables fine-grained representations to be learned over a global receptive field;
3.2) Inputting the sampled video clip and learning frame-level features with the encoder; wherein,
feature extraction is performed on the frames of the video with the ViT encoder: each video frame X_t is divided into 16×16 pixel blocks, feature extraction is performed block by block, and a learnable vector X_{t[cls]} is generated as the class token of the t-th frame. Using 12 self-attention layers with 12 heads, frame X_t is mapped into the feature map Z_t ∈ R^(N×D) and the class token X_{t[cls]} is mapped to the global feature Z_{t[cls]}, where N denotes the number of pixel blocks in a frame, N = (H×W)/(16×16), and D denotes the channel dimension of each pixel block after feature extraction, D = 16×16×C.
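To make the tokenization concrete, the toy PyTorch module below mimics the interface of the ViT encoder of step 3.2), namely 16×16 patch embedding, a class token and a stack of self-attention layers, but with illustrative sizes, no positional embeddings and no pretrained weights; in practice a pretrained ViT-Base backbone would be used, and all names here are assumptions.

import torch
import torch.nn as nn

class ToyFrameEncoder(nn.Module):
    # Maps a frame of shape (C, H, W) to N patch features Z_t (N x D) plus a
    # class-token feature Z_t[cls], with N = (H*W)/(16*16); positional
    # embeddings are omitted for brevity.
    def __init__(self, in_ch=3, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frames):                    # frames: (B, C, H, W)
        x = self.patch_embed(frames)              # (B, D, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)          # (B, N, D)
        cls = self.cls_token.expand(len(x), -1, -1)
        x = self.blocks(torch.cat([cls, x], dim=1))
        return x[:, 0], x[:, 1:]                  # Z_t[cls]: (B, D), Z_t: (B, N, D)

encoder = ToyFrameEncoder()
z_cls, z = encoder(torch.randn(8, 3, 256, 128))   # 8 frames of a 256x128 person crop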
Step 4, performing space-time interaction and feature fusion by using a self-attention decoder, including:
4.1) Constructing a decoder based on a multi-head self-attention mechanism; wherein, following the structure of the ViT encoder in step 3.1), 2 self-attention layers with 12 heads are used as the decoder to perform spatio-temporal interaction and fusion on the feature map Z_t;
4.2) Aligning the frame-level feature map based on the target-person cross-frame position deviation calculated in step 2.3), and recovering semantic consistency;
due to imperfect detection results, adjacent frames X_t and X_{t+1} may carry inconsistent semantic information at the same spatial location. Using the cross-frame position deviation of the target person in adjacent frames calculated in step 2.3), let the horizontal and vertical deviations of the maximum response point of the response map M_t from the center be Δx_t and Δy_t; the feature map Z_t is rolled by these deviations to align it, and the aligned feature map is denoted Z'_t.
4.3) Removing the features of the shielding person from the frame-level feature map through the different relative states of the target person and the shielding person, and eliminating the influence of the shielding person on the feature map; wherein,
the aligned feature maps Z'_t ∈ R^(N×D) are averaged along the time dimension so as to attend to the relatively stationary part of the sequence, yielding the feature map Z_avg:
Z_avg = (1/T) Σ_{t=1}^{T} Z'_t.
The cosine similarity between the frame-level feature map Z'_t and Z_avg is then computed:
c_t = <Z'_t, Z_avg> / (||Z'_t|| · ||Z_avg||);
where c_t takes values between 0 and 1 and represents, for each pixel block, the cosine similarity between the feature of that block in Z'_t and the corresponding block of the averaged feature map Z_avg, and <·,·> denotes the vector inner product.
In video pedestrian re-identification datasets the target person is always located at the center of the video frame, so even if the target person moves in the video, the target person is relatively stationary; in contrast, when occlusion occurs, the shielding person passes the target person, disappearing from one side of the target person to the other, and is therefore relatively moving in the video. Since the feature map Z_avg attends to the relatively stationary part of the video clip, the cosine similarity of the relatively stationary part of the frame-level feature map will be high, while the cosine similarity of the relatively moving part of the frame-level feature map will be low. The target person can thereby be distinguished from the shielding person, and the influence of the shielding person on the subsequent spatio-temporal fusion is eliminated.
The features of the shielding person are eliminated according to a threshold generated adaptively from the cosine similarity, by taking the mean of c_t along the time dimension and along the time-space dimensions respectively to generate the parameters γ and δ:
γ = (1/T) Σ_{t=1}^{T} c_t;    δ = (1/(T·N)) Σ_{t=1}^{T} Σ_{n=1}^{N} c_t^(n).
Local features whose cosine similarity is smaller than the adaptively generated threshold are then eliminated:
m_t = 1[c_t ≥ τ(γ, δ)];
where τ(γ, δ) denotes the threshold adaptively generated from γ and δ and 1[·] is the indicator function. The mask m_t generated above is multiplied element-wise with the feature map Z'_t to obtain the feature map Ẑ_t from which the local features of the shielding person have been eliminated.
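The relative-state filtering of step 4.3) can be sketched as follows in PyTorch; combining the per-position mean γ and the global mean δ with a minimum is only one plausible reading of the adaptively generated threshold, since the disclosure does not reproduce the exact rule, and the function name is an assumption.

import torch
import torch.nn.functional as F

def occlusion_mask(z_aligned):
    # z_aligned: (T, N, D) aligned frame-level features.
    z_avg = z_aligned.mean(dim=0, keepdim=True).expand_as(z_aligned)  # time-averaged map
    c = F.cosine_similarity(z_aligned, z_avg, dim=-1)                 # (T, N) similarity c_t
    gamma = c.mean(dim=0, keepdim=True)                               # per-position mean over time
    delta = c.mean()                                                  # mean over time and space
    threshold = torch.minimum(gamma, delta)                           # illustrative combination of gamma and delta
    mask = (c >= threshold).float().unsqueeze(-1)                     # (T, N, 1)
    return mask * z_aligned                                           # occluder features zeroed out

masked = occlusion_mask(torch.randn(8, 128, 768))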
4.4) Performing interaction along the time dimension on the frame-level features, followed by interaction along the spatial dimension;
the decoder, which consists of 2 self-attention layers with 12 heads, successively performs temporal interaction and spatial interaction on the feature map Ẑ_t. The invention adopts a standard self-attention mechanism for feature interaction. The first layer performs temporal interaction on Ẑ_t: local features having the same spatial position interact with each other along the time dimension,
Ẑ'^(n) = Softmax((Ẑ^(n) W_q)(Ẑ^(n) W_k)^T / √D)(Ẑ^(n) W_v);
where W_q, W_k and W_v are the learnable weights of the query, key and value, Ẑ^(n) collects the features at spatial position n across the T frames, and Ẑ' is the feature map after temporal interaction. The second multi-head self-attention layer then performs spatial interaction on the feature map Ẑ': all local features within the same frame interact with each other,
S_t = Softmax((Ẑ'_t W_q)(Ẑ'_t W_k)^T / √D)(Ẑ'_t W_v);
from which the feature map S after the spatio-temporal interaction of the decoder is obtained.
4.5) Fusing the feature map S after spatio-temporal interaction with global average pooling to obtain the final video-level feature.
The global features S_[cls] of the feature map S and the intermediate-layer global features Z_[cls] obtained in step 3.2) are each pooled over time and then concatenated to obtain the final video representation v:
v = Concat(Pool(S_[cls]), Pool(Z_[cls])).
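Assuming "Pool" denotes temporal average pooling, the fusion of step 4.5), i.e. pooling the decoder class tokens S_[cls] and the encoder class tokens Z_[cls] over time and concatenating them, corresponds to the short illustrative sketch below.

import torch

def video_representation(s_cls, z_cls):
    # s_cls, z_cls: (T, D) class tokens from the decoder and the encoder.
    # Temporal average pooling of each, then concatenation into the final
    # video representation v of dimension 2*D.
    return torch.cat([s_cls.mean(dim=0), z_cls.mean(dim=0)], dim=0)

v = video_representation(torch.randn(8, 768), torch.randn(8, 768))    # v has shape (1536,)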
Step 5, calculating a neural network loss function:
5.1) Calculating the cross-entropy loss function and the triplet loss function from the video-level feature obtained in step 4.5), using L_x to denote the triplet loss function and L_i to denote the cross-entropy loss function; the video-level loss is L_v = L_x(v) + L_i(v).
5.2) Calculating the triplet loss function, the cross-entropy loss function and the mutual-information loss function from the frame-level global features obtained in step 3.2), using L_m to denote the mutual-information loss function and Z_{t[cls]} to denote the class token of the t-th frame; these terms together form the frame-level loss L_f.
The overall loss function of the model consists of L_v and L_f: L = L_v + λL_f, where λ is a hyper-parameter that balances the weights of the frame-level loss function and the video-level loss function.
Step 6, optimizing network parameters, improving accuracy of video pedestrian re-identification, comprising the following steps:
6.1) Iteratively optimizing the neural-network parameters according to the loss function obtained in step 5; wherein an AdamW optimizer is used for 90 training epochs, with a cosine learning-rate schedule and a weight decay of 0.0005;
6.2) After the preset number of iterations is reached, the encoder and decoder obtained in step 3 and step 4 realize video pedestrian re-identification.
In summary, the invention provides an encoder-decoder network based on temporal correlation decomposition for video pedestrian re-identification: the local features of the shielding person in the frame-level feature map are removed before entering the decoder through the different relative states of the target person and the shielding person, so that video-level features unaffected by non-target pedestrians are generated; in addition, a correlation filter is used to recover the semantic consistency between frames. For fairness of comparison, the invention uses only the cross-entropy and triplet loss functions to guide model training, and the method is compared qualitatively and quantitatively with existing methods on the two public datasets Mars and LS-VID to verify its effectiveness.
TABLE 1 Comparison of experimental results (%) on the Mars and LS-VID datasets
mAP, Rank-1 and Rank-5 are commonly used metrics for measuring pedestrian re-identification accuracy; the larger the value, the higher the re-identification accuracy. As can be seen from Table 1, the proposed method achieves the highest re-identification performance on the Mars dataset and improves on the best existing method by 0.4% in the Rank-1 metric. On the LS-VID dataset, mAP is improved by 1.7% over the best existing method. The Rank-1 metric is 0.4% lower than that of CAViT, because CAViT fuses information at three scales, which greatly enhances its video-level feature representation capability; however, its mAP metric has a considerable gap from the proposed method. The method provided by the invention therefore achieves the best overall results on the existing datasets, which fully demonstrates its effectiveness and superiority.
Figs. 4, 5 and 6 are analyses of the visual results of the invention. Fig. 4 shows the results of removing the local features of the shielding person when occlusion by a non-target person occurs on the Mars and iLIDS-VID datasets: the invention can effectively remove the features of the shielding person and, even under severe occlusion, effectively retains the fine-grained features of the target person. Fig. 5 is the visual result of restoring inter-frame semantic consistency with the correlation filtering algorithm: the first row is the original video sequence, in which several frames are misaligned because of imperfect detection results, so that the same region has inconsistent semantics across frames; the second row is the video sequence after semantic consistency has been restored, where it can be seen that the misaligned video frames have been realigned and the semantic consistency of the video sequence recovered. Fig. 6 is the GradCAM heat-map visualization of the final video-level features of the method: the first row is the original input video sequence, and the second row is the baseline model, from which it can be seen that when occlusion occurs the model learns the features of the shielding person rather than the features of the target person. As shown in the third row, when the proposed method is used the model is able to focus on the fine-grained features of the target person even under severe occlusion. In summary, the proposed temporal correlation decomposition network successfully removes the features of the shielding person and retains the fine-grained features of the target person through the relative state, uses the correlation filter to solve the problem of inconsistent semantics caused by imperfect detection results, and successfully recovers semantic consistency between frames. The qualitative analysis and the visual result analysis fully demonstrate the effectiveness and superiority of the invention in realizing accurate video pedestrian re-identification.
The following are device embodiments of the present invention that may be used to perform method embodiments of the present invention. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present invention.
Referring to fig. 7, a full-supervision video pedestrian re-identification system provided by an embodiment of the invention includes:
the data acquisition module is used for acquiring a video clip containing a target person and a video clip on which pedestrian re-identification is to be performed;
the pedestrian re-recognition result acquisition module is used for carrying out pedestrian re-recognition processing by utilizing a pre-trained pedestrian re-recognition model based on the acquired video segment containing the target person and the video segment to be subjected to pedestrian re-recognition and outputting a pedestrian re-recognition result; the pedestrian re-recognition result at least comprises whether the video clip to be subjected to pedestrian re-recognition contains a target person or not;
the pedestrian re-recognition model includes:
the encoder module is used for inputting an original video frame to perform feature extraction and outputting a frame level feature map; wherein the encoder module is based on a classical Vision Transformer architecture;
the characteristic alignment module is used for performing deviation calculation processing on an input original video frame by adopting a kernel correlation filtering algorithm and outputting the cross-frame position deviation of a target person;
The decoder module is used for inputting the frame level feature map and the cross-frame position deviation of the target person, carrying out feature alignment processing and obtaining an aligned frame level feature map; carrying out local characteristic removal processing on the aligned frame level characteristic images by utilizing different relative states of the target person and the shielding person to obtain a frame level characteristic image with local characteristics of the shielding person removed; based on a multi-head self-attention mechanism, carrying out feature interaction and fusion processing on the frame-level feature map with the local features of the shielding person removed, and outputting video-level features; wherein the decoder module is a multi-headed self-attention mechanism based decoder.
In yet another embodiment of the present invention, a computer device is provided that includes a processor and a memory for storing a computer program including program instructions, the processor for executing the program instructions stored by the computer storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc., which are the computational core and control core of the terminal adapted to implement one or more instructions, in particular to load and execute one or more instructions within a computer storage medium to implement a corresponding method flow or a corresponding function; the processor provided by the embodiment of the invention can be used for the operation of the full-surveillance video pedestrian re-identification method.
In yet another embodiment of the present invention, a storage medium, specifically a computer readable storage medium (Memory), is a Memory device in a computer device, for storing a program and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps of the method for full surveillance video pedestrian re-recognition in the above embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions may still be made to the specific embodiments of the invention; any such modifications or equivalent substitutions that do not depart from the spirit and scope of the invention are intended to be covered by the scope of the claims.

Claims (8)

1. A full-supervision video pedestrian re-identification method, characterized by comprising the following steps:
acquiring a video clip containing a target person and a video clip to be subjected to pedestrian re-identification;
based on the acquired video clip containing the target person and the video clip to be subjected to pedestrian re-identification, performing pedestrian re-identification processing by utilizing a pre-trained pedestrian re-identification model, and outputting a pedestrian re-identification result; wherein,
the pedestrian re-identification result at least comprises whether the video clip to be subjected to pedestrian re-identification contains the target person or not;
the pedestrian re-identification model comprises:
the encoder module, used for receiving an original video frame as input, performing feature extraction, and outputting a frame-level feature map; wherein the encoder module is based on the classical Vision Transformer architecture;
the feature alignment module, used for performing deviation calculation processing on the input original video frames by adopting a kernel correlation filtering algorithm, and outputting the cross-frame position deviation of the target person;
the decoder module, used for receiving the frame-level feature map and the cross-frame position deviation of the target person as input, performing feature alignment processing, and obtaining an aligned frame-level feature map; performing local feature removal processing on the aligned frame-level feature map by utilizing the different relative states of the target person and the occluding person, to obtain a frame-level feature map with the local features of the occluding person removed; and, based on a multi-head self-attention mechanism, performing feature interaction and fusion processing on the frame-level feature map with the local features of the occluding person removed, and outputting a video-level feature; wherein the decoder module is a decoder based on the multi-head self-attention mechanism.
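By way of illustration only, the following minimal PyTorch sketch shows one way the three modules recited in claim 1 could be wired together for inference; the class and function names, tensor shapes, and the cosine-similarity matching threshold are assumptions of this sketch, not part of the claimed method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PedestrianReIDModel(nn.Module):
    """Illustrative wiring of the three claimed modules (module implementations are assumed)."""
    def __init__(self, encoder: nn.Module, alignment, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder      # ViT-based frame-level feature extractor
        self.alignment = alignment  # kernel correlation filter, runs on the raw frames
        self.decoder = decoder      # multi-head self-attention decoder

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, C, H, W) video clip
        frame_feats = self.encoder(frames)               # frame-level feature maps
        offsets = self.alignment(frames)                 # cross-frame position deviations
        video_feat = self.decoder(frame_feats, offsets)  # fused video-level feature
        return video_feat

def is_target_person(model, target_clip, query_clip, threshold=0.7):
    """Compare video-level features by cosine similarity (threshold value is an assumption)."""
    with torch.no_grad():
        f1 = model(target_clip)
        f2 = model(query_clip)
    sim = F.cosine_similarity(f1, f2, dim=0)
    return bool(sim > threshold)
```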
2. The full-supervision video pedestrian re-identification method according to claim 1, wherein the training step of the pre-trained pedestrian re-identification model comprises:
acquiring a training sample set; wherein each training sample in the training sample set comprises a sampled video clip containing a pedestrian and the ID number of the pedestrian contained in the video clip;
during training, for a selected training sample, inputting the sampled video clip containing the pedestrian into the pedestrian re-identification model, and predicting the ID number of the pedestrian in the video clip as a prediction result; comparing the prediction result with the ID number of the pedestrian contained in the video clip of the training sample, performing supervised training by adopting cross-entropy, triplet and mutual-information loss functions, updating the model parameters, and obtaining the pre-trained pedestrian re-identification model after a preset convergence condition is reached.
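By way of illustration only, a minimal training-loop sketch for the procedure of claim 2 is given below; the dataloader format, the optimizer and its settings, and the compute_loss helper (standing in for the cross-entropy, triplet and mutual-information supervision) are assumptions of this sketch.

```python
import torch

def train_reid(model, dataloader, compute_loss, num_epochs=120, lr=3e-4):
    """dataloader yields (clips, pids): sampled pedestrian video clips and their ID numbers."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for epoch in range(num_epochs):
        for clips, pids in dataloader:
            outputs = model(clips)              # predicted ID scores / features for the clips
            loss = compute_loss(outputs, pids)  # cross-entropy + triplet + mutual-information terms
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    # in practice, training stops once the preset convergence condition is reached
    return model
```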
3. The full-supervision video pedestrian re-identification method according to claim 1, wherein the step in which the feature alignment module performs deviation calculation processing on the input original video frames by using the kernel correlation filtering algorithm and outputs the cross-frame position deviation of the target person comprises the following steps:
calculating the cross-frame position deviation of the target person in each video clip by using the kernel correlation filtering algorithm; wherein each frame X_t in the video clip is averaged along the channel dimension and converted into a single-channel image X'_t;
initializing a correlation filter, comprising: initializing the correlation filter using the first frame X'_1, expressed as,
k_{1,1} = IDFT(DFT(X'_1) ⊙ DFT(X'_1)*),
α_1 = DFT(y) / (DFT(k_{1,1}) + λ),
where DFT(·) represents the discrete Fourier transform, IDFT(·) represents the inverse discrete Fourier transform, ⊙ represents element-wise multiplication, y is the Gaussian regression target, λ is the regularization coefficient, DFT(X'_1)* is the complex conjugate of DFT(X'_1), and α_1 is the correlation filter calculated on the first frame;
calculating the cross-frame position deviation on the next frame, and then updating the filter parameters; wherein the cross-frame position deviation on the 2nd frame is calculated using the correlation filter, and the filter parameters are then updated using an exponential moving average; the response map is expressed as,
k_{1,2} = IDFT(DFT(X'_1) ⊙ DFT(X'_2)*),
M_2 = IDFT(DFT(k_{1,2}) ⊙ α_1),
where M_2 is the response map of the 2nd frame on the correlation filter; by calculating the distance between the maximum response point of M_2 and the center, the cross-frame position deviation of the target person on the 2nd frame is obtained;
the 2 nd frame pixel is aligned in a rolling way according to the cross-frame position deviation of the target person, the filter parameters are updated by using the aligned 2 nd frame, the expression is,
Figure FDA0004153846630000031
Figure FDA0004153846630000032
α 2 =βα 1 +(1-β)α 2
Wherein, beta is the index moving average step length, and the filter parameters are updated through the index moving average;
repeating the above steps until the cross-frame position deviation of the target person on all frames is calculated; wherein the filter α_2 updated on the 2nd frame is used to calculate the cross-frame position deviation of the target person on the 3rd frame, and the filter parameters are updated using the 3rd frame after pixel roll alignment; the cross-frame position deviation of the target person is calculated and the filter parameters are updated frame by frame in this way, until the cross-frame position deviation of the target person on all frames has been calculated.
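By way of illustration only, the correlation-filter steps of claim 3 can be sketched in NumPy as follows; the linear kernel correlation, the Gaussian regression target supplied by the caller, the sign convention of the offsets, and all variable names are assumptions of this sketch rather than the exact published formulation.

```python
import numpy as np

def init_filter(x, y, lam=1e-4):
    # x: a channel-averaged frame, shape (H, W); y: Gaussian regression target, shape (H, W)
    X = np.fft.fft2(x)
    k = np.fft.ifft2(X * np.conj(X))                       # correlation of the frame with itself
    return np.fft.fft2(y) / (np.fft.fft2(np.real(k)) + lam)

def cross_frame_offset(x_prev, x_curr, alpha):
    # response map of the current frame on the correlation filter
    k = np.fft.ifft2(np.fft.fft2(x_prev) * np.conj(np.fft.fft2(x_curr)))
    M = np.real(np.fft.ifft2(np.fft.fft2(np.real(k)) * alpha))
    H, W = M.shape
    py, px = np.unravel_index(np.argmax(M), M.shape)
    # offset of the maximum response point from the center
    return py - H // 2, px - W // 2

def track_offsets(frames, y, lam=1e-4, beta=0.9):
    # frames: (T, H, W) channel-averaged frames; returns per-frame (dy, dx) offsets
    alpha = init_filter(frames[0], y, lam)
    prev, offsets = frames[0], [(0, 0)]
    for t in range(1, len(frames)):
        dy, dx = cross_frame_offset(prev, frames[t], alpha)
        offsets.append((dy, dx))
        aligned = np.roll(frames[t], shift=(-dy, -dx), axis=(0, 1))  # pixel roll alignment
        alpha = beta * alpha + (1 - beta) * init_filter(aligned, y, lam)  # exponential moving average
        prev = aligned
    return offsets
```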
4. The full-supervision video pedestrian re-identification method according to claim 1, wherein the step in which the decoder module receives the frame-level feature map and the cross-frame position deviation of the target person as input, performs feature alignment processing to obtain an aligned frame-level feature map, performs local feature removal processing on the aligned frame-level feature map by utilizing the different relative states of the target person and the occluding person to obtain a frame-level feature map with the local features of the occluding person removed, and, based on the multi-head self-attention mechanism, performs feature interaction and fusion processing on the frame-level feature map with the local features of the occluding person removed and outputs a video-level feature, comprises:
based on the obtained cross-frame position deviation of the target person in adjacent frames, letting the horizontal and vertical deviations of the maximum response point of the response map M_t from the center be Δw_t and Δh_t, and rolling the frame-level feature map Z_t according to the deviations so as to align the feature maps, obtaining the aligned feature map Z'_t;
averaging the aligned feature maps Z'_t along the time dimension so as to focus on the relatively stationary parts of the sequence, obtaining the feature map Z̄, expressed as,
Z̄ = (1/T) Σ_{t=1}^{T} Z'_t;
calculating the cosine similarity between the frame-level feature map Z'_t and Z̄, expressed as,
c_t = <Z'_t, Z̄> / (‖Z'_t‖·‖Z̄‖),
where c_t takes a value between 0 and 1 and represents the cosine similarity, computed pixel block by pixel block, between Z'_t and the averaged feature map Z̄; <·,·> represents the vector inner product operation;
eliminating the features of the occluding person according to an adaptively generated threshold of the cosine similarity; wherein the parameters γ and δ are generated by taking the average value of c_t along the time dimension and along the spatio-temporal dimensions, respectively;
the local features whose cosine similarity is less than the adaptively generated threshold are eliminated, expressed as,
m_t = 1(c_t ≥ τ(γ, δ)),
where τ(γ, δ) is the threshold adaptively generated from the parameters γ and δ, and 1(·) is the indicator function;
multiplying the mask m_t element-wise with the feature map Z'_t to obtain the feature map Ẑ_t with the local features of the occluding person removed;
successively performing temporal interaction and spatial interaction on the feature map Ẑ_t; wherein the local features having the same spatial position interact with each other along the time dimension through multi-head self-attention, expressed as,
Q = Ẑ W_q, K = Ẑ W_k, V = Ẑ W_v,
Z^time = softmax(QK^T/√d)·V,
where W_q, W_k and W_v are the learnable weights of the query, key and value respectively, d is the feature dimension, and Z^time_t is the feature map after temporal interaction;
performing spatial interaction on the feature map Z^time_t, wherein all the local features within the same frame interact with each other spatially, expressed as,
Q = Z^time W_q, K = Z^time W_k, V = Z^time W_v,
Z^st = softmax(QK^T/√d)·V,
where Z^st_t is the feature map after the spatio-temporal interaction of the decoder;
fusing the feature maps Z^st_t after the spatio-temporal interaction by global average pooling to obtain the final video-level feature.
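By way of illustration only, the decoder-side processing of claim 4 (feature alignment by rolling, cosine-similarity-based occlusion masking, temporal and spatial self-attention, and global average pooling) can be sketched in PyTorch as follows; the way the adaptive threshold combines γ and δ, the tensor layout, and the use of torch.nn.MultiheadAttention modules are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def decode_video_feature(Z, offsets, attn_t, attn_s):
    """
    Z:       (T, H, W, D) frame-level feature maps from the encoder
    offsets: list of (dy, dx) cross-frame deviations from the correlation filter
    attn_t, attn_s: torch.nn.MultiheadAttention modules for temporal / spatial interaction
    """
    T, H, W, D = Z.shape
    # 1) roll each frame-level feature map by its offset to align it
    Z_aligned = torch.stack(
        [torch.roll(Z[t], shifts=(-dy, -dx), dims=(0, 1)) for t, (dy, dx) in enumerate(offsets)]
    )
    # 2) temporal mean and per-block cosine similarity
    Z_mean = Z_aligned.mean(dim=0, keepdim=True)                 # (1, H, W, D)
    c = F.cosine_similarity(Z_aligned, Z_mean, dim=-1)           # (T, H, W)
    # 3) adaptive threshold from temporal / spatio-temporal means (their combination is an assumption)
    gamma = c.mean(dim=0, keepdim=True)                          # (1, H, W)
    delta = c.mean()                                             # scalar
    mask = (c >= torch.minimum(gamma, delta)).float()            # drop occluder blocks
    Z_masked = Z_aligned * mask.unsqueeze(-1)
    # 4) temporal interaction: attend over T at each spatial location
    #    (nn.MultiheadAttention with the default batch_first=False expects (seq, batch, dim))
    x = Z_masked.reshape(T, H * W, D)                            # seq = T, batch = H*W
    x, _ = attn_t(x, x, x)
    # 5) spatial interaction: attend over the H*W blocks within each frame
    x = x.permute(1, 0, 2)                                       # seq = H*W, batch = T
    x, _ = attn_s(x, x, x)
    # 6) global average pooling over space and time -> video-level feature
    return x.mean(dim=(0, 1))                                    # (D,)
```

The attention modules would be constructed by the caller, for example attn_t = torch.nn.MultiheadAttention(embed_dim=D, num_heads=8), with the default (seq, batch, dim) layout assumed above.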
5. The full-supervision video pedestrian re-identification method according to claim 2, wherein the loss function used in the training step of the pre-trained pedestrian re-identification model is expressed as,
L = L_v + λL_f,
where L is the overall model loss function, and λ is a hyper-parameter balancing the weights of the frame-level feature loss function and the video-level loss function;
L_v = L_x(v) + L_i(v),
where L_x represents the triplet loss function, L_i represents the cross-entropy loss function, and v represents the class token of the video-level feature;
L_f = (1/T) Σ_{t=1}^{T} L_m(Z_t[cls], v),
where L_m represents the mutual information loss function, and Z_t[cls] represents the class token of the frame-level features.
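By way of illustration only, the composite loss of claim 5 could be assembled as sketched below; the triplet and mutual-information terms are passed in as callables because their exact formulations are not reproduced here, and the averaging of the mutual-information term over frames is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def total_loss(video_feat, video_logits, frame_cls_tokens, labels,
               triplet_fn, mutual_info_fn, lam=1.0):
    """
    video_feat:       (B, D) video-level features
    video_logits:     (B, num_ids) classifier outputs on the video-level class token
    frame_cls_tokens: (B, T, D) frame-level class tokens
    labels:           (B,) pedestrian ID numbers
    triplet_fn / mutual_info_fn: loss callables supplied by the caller (assumptions)
    """
    l_x = triplet_fn(video_feat, labels)             # triplet loss on the video-level feature
    l_i = F.cross_entropy(video_logits, labels)      # cross-entropy (ID) loss
    l_v = l_x + l_i                                  # video-level loss L_v
    # frame-level loss: mutual information between frame-level class tokens and the video-level feature
    l_f = torch.stack(
        [mutual_info_fn(frame_cls_tokens[:, t], video_feat)
         for t in range(frame_cls_tokens.size(1))]
    ).mean()
    return l_v + lam * l_f                           # L = L_v + λ L_f
```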
6. A full-supervision video pedestrian re-identification system, comprising:
the data acquisition module, used for acquiring a video clip containing a target person and a video clip to be subjected to pedestrian re-identification;
the pedestrian re-identification result acquisition module, used for performing pedestrian re-identification processing by utilizing a pre-trained pedestrian re-identification model based on the acquired video clip containing the target person and the video clip to be subjected to pedestrian re-identification, and outputting a pedestrian re-identification result; wherein the pedestrian re-identification result at least comprises whether the video clip to be subjected to pedestrian re-identification contains the target person or not;
wherein the pedestrian re-identification model comprises:
the encoder module, used for receiving an original video frame as input, performing feature extraction, and outputting a frame-level feature map; wherein the encoder module is based on the classical Vision Transformer architecture;
the feature alignment module, used for performing deviation calculation processing on the input original video frames by adopting a kernel correlation filtering algorithm, and outputting the cross-frame position deviation of the target person;
the decoder module, used for receiving the frame-level feature map and the cross-frame position deviation of the target person as input, performing feature alignment processing, and obtaining an aligned frame-level feature map; performing local feature removal processing on the aligned frame-level feature map by utilizing the different relative states of the target person and the occluding person, to obtain a frame-level feature map with the local features of the occluding person removed; and, based on a multi-head self-attention mechanism, performing feature interaction and fusion processing on the frame-level feature map with the local features of the occluding person removed, and outputting a video-level feature; wherein the decoder module is a decoder based on the multi-head self-attention mechanism.
7. An electronic device, comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the full-supervision video pedestrian re-identification method of any one of claims 1 to 5.
8. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the full-supervision video pedestrian re-identification method of any one of claims 1 to 5.
CN202310327791.8A 2023-03-29 2023-03-29 Full-supervision video pedestrian re-identification method, system, equipment and medium Pending CN116343265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310327791.8A CN116343265A (en) 2023-03-29 2023-03-29 Full-supervision video pedestrian re-identification method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310327791.8A CN116343265A (en) 2023-03-29 2023-03-29 Full-supervision video pedestrian re-identification method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN116343265A true CN116343265A (en) 2023-06-27

Family

ID=86880339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310327791.8A Pending CN116343265A (en) 2023-03-29 2023-03-29 Full-supervision video pedestrian re-identification method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN116343265A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912633A (en) * 2023-09-12 2023-10-20 深圳须弥云图空间科技有限公司 Training method and device for target tracking model
CN116912633B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Training method and device for target tracking model
CN117173221A (en) * 2023-09-19 2023-12-05 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery
CN117173221B (en) * 2023-09-19 2024-04-19 浙江大学 Multi-target tracking method based on authenticity grading and occlusion recovery
CN117612112A (en) * 2024-01-24 2024-02-27 山东科技大学 Method for re-identifying reloading pedestrians based on semantic consistency
CN117612112B (en) * 2024-01-24 2024-04-30 山东科技大学 Method for re-identifying reloading pedestrians based on semantic consistency

Similar Documents

Publication Publication Date Title
CN116343265A (en) Full-supervision video pedestrian re-identification method, system, equipment and medium
Medel et al. Anomaly detection in video using predictive convolutional long short-term memory networks
Porikli et al. Traffic congestion estimation using HMM models without vehicle tracking
Cong et al. Sparse reconstruction cost for abnormal event detection
KR102288645B1 (en) Machine learning method and system for restoring contaminated regions of image through unsupervised learning based on generative adversarial network
CN109460787B (en) Intrusion detection model establishing method and device and data processing equipment
Qin et al. Etdnet: An efficient transformer deraining model
CN114724060A (en) Method and device for unsupervised video anomaly detection based on mask self-encoder
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN111259919B (en) Video classification method, device and equipment and storage medium
Ermis et al. Motion segmentation and abnormal behavior detection via behavior clustering
CN114898416A (en) Face recognition method and device, electronic equipment and readable storage medium
Vijayan et al. A fully residual convolutional neural network for background subtraction
Nayak et al. Video anomaly detection using convolutional spatiotemporal autoencoder
Varma et al. Object detection and classification in surveillance system
Bober et al. A hough transform based hierarchical algorithm for motion segmentation
CN112487961A (en) Traffic accident detection method, storage medium and equipment
CN113936175A (en) Method and system for identifying events in video
CN112149596A (en) Abnormal behavior detection method, terminal device and storage medium
Kumar et al. Efficient Video Anomaly Detection using Residual Variational Autoencoder
Pun et al. A real-time detector for parked vehicles based on hybrid background modeling
CN115147457A (en) Memory enhanced self-supervision tracking method and device based on space-time perception
Chae et al. Siamevent: Event-based object tracking via edge-aware similarity learning with siamese networks
Fu et al. Foreground gated network for surveillance object detection
Wei et al. Pedestrian anomaly detection method using autoencoder

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination