CN113420697B - Reloading video pedestrian re-identification method and system based on appearance and shape characteristics - Google Patents

Reloading video pedestrian re-identification method and system based on appearance and shape characteristics

Info

Publication number
CN113420697B
CN113420697B CN202110748180.1A CN202110748180A CN113420697B
Authority
CN
China
Prior art keywords
pedestrian
shape
video
features
apparent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110748180.1A
Other languages
Chinese (zh)
Other versions
CN113420697A (en)
Inventor
Wang Liang
Huang Yan
Shan Caifeng
Han Ke
Wang Haibin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cas Artificial Intelligence Research Qingdao Co ltd
Original Assignee
Cas Artificial Intelligence Research Qingdao Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cas Artificial Intelligence Research Qingdao Co ltd filed Critical Cas Artificial Intelligence Research Qingdao Co ltd
Priority to CN202110748180.1A priority Critical patent/CN113420697B/en
Publication of CN113420697A publication Critical patent/CN113420697A/en
Application granted granted Critical
Publication of CN113420697B publication Critical patent/CN113420697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

An acquired query video to be identified and the search library (gallery) videos are input together into a trained deep neural network model, and pedestrian features are extracted; the Euclidean distances between the features of the query video and the features of the search library videos are calculated, and identity matches are ranked according to these distances. The feature fusion module of the deep neural network model adaptively fuses the appearance features and the shape features into more discriminative pedestrian features, so that the model generalizes to different degrees of clothing change and achieves a better recognition effect.

Description

Reloading video pedestrian re-identification method and system based on appearance and shape characteristics
Technical Field
The disclosure belongs to the technical field of machine learning, and particularly relates to a reloading video pedestrian re-identification method and system based on appearance and shape features.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Video pedestrian re-identification aims, given a query video of a pedestrian under one camera, to match videos of the same pedestrian in the video library (gallery) of another camera. As street cameras are deployed ever more widely, pedestrian re-identification shows broad application prospects in security and related fields; for example, scenarios such as finding lost children and tracking suspects can be assisted by pedestrian re-identification technology.
Reloading (clothes-changing) video pedestrian re-identification refers to the situation in which the same pedestrian wears different clothes in the query video and the video library (gallery). Many previous methods discard the appearance information of pedestrians (the colors and styles of clothes and shoes, etc.) and use only shape information (human body contours, etc.) to identify pedestrians. Such methods suit only drastic clothing changes, for example a person changing from a red dress into a white shirt and black pants. In practical application scenarios, however, a person may well change clothes only slightly, for example replacing gray short sleeves with black short sleeves, and in this case the appearance information of pedestrians can still provide useful identity cues.
Disclosure of Invention
The present disclosure can adaptively learn coarse-grained appearance information and fine-grained shape information of pedestrians, perform adaptive feature fusion, and construct richer pedestrian features, so as to achieve a better pedestrian recognition effect.
In order to achieve the above purpose, the present disclosure is achieved by the following technical solutions:
In a first aspect, the present disclosure provides a reloading video pedestrian re-identification method based on appearance and shape features, including:
establishing a deep neural network model;
based on the deep neural network model, extracting pedestrian apparent characteristics and pedestrian shape characteristics of the pedestrian video in the training set;
fusing the pedestrian appearance characteristic and the pedestrian shape characteristic to obtain a final pedestrian characteristic;
carrying out deep neural network model training by adopting the pedestrian appearance characteristic, the pedestrian shape characteristic and the fused final pedestrian characteristic to obtain a trained model;
inputting the query video and the search library videos into the trained model to extract pedestrian features;
calculating the Euclidean distances between the pedestrian features of the query video and those of the search library videos, and ranking identity matches according to these distances.
Further, the deep neural network model comprises an appearance encoder for extracting coarse-grained appearance features of the pedestrian video, a shape encoder for extracting fine-grained shape features of the pedestrian video, and a feature fusion module for adaptively fusing the appearance features and the shape features into the final pedestrian features.
Further, extracting the pedestrian appearance features and pedestrian shape features of the pedestrian videos in the training set comprises:
inputting the pedestrian video into the appearance encoder and extracting pedestrian appearance features frame by frame to obtain appearance feature maps; average-pooling the appearance feature maps corresponding to the frames and aggregating them into an appearance feature map of the video; then applying global average pooling to the appearance feature map of the video to obtain the appearance feature vector of the video; and defining an appearance loss function using the identity label data in the dataset;
simultaneously performing human body segmentation on the pedestrian video to obtain a binary pedestrian segmentation video; inputting the segmentation video into the shape encoder and extracting pedestrian shape features frame by frame to obtain shape feature maps; max-pooling the shape feature maps corresponding to the frames, aggregating them into a shape feature map of the pedestrian, and evenly dividing it horizontally into a plurality of sub-feature maps; performing max pooling within each sub-feature map and passing the pooled feature vectors through a plurality of fully connected layers respectively, obtaining a plurality of feature vectors that respectively represent a plurality of horizontal regions of the pedestrian; concatenating the plurality of feature vectors to form the final feature vector representing the pedestrian's shape; and defining a shape loss function using the identity label data in the dataset. A sketch of these two branches follows.
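By way of non-limiting illustration, the two extraction branches described above might be sketched in PyTorch as follows; the module names, the stand-in encoder architectures, the stripe count J and all tensor shapes are assumptions for illustration only, not the patented implementation.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class AppearanceBranch(nn.Module):
        """Frame-wise appearance maps -> temporal average pooling -> global average pooling."""
        def __init__(self):
            super().__init__()
            backbone = models.resnet50(weights=None)  # assumed ResNet50-style backbone
            self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # keep conv feature maps

        def forward(self, video):                        # video: (B, T, 3, H, W)
            B, T = video.shape[:2]
            maps = self.encoder(video.flatten(0, 1))     # (B*T, 2048, h, w)
            maps = maps.view(B, T, *maps.shape[1:]).mean(dim=1)  # average-pool over frames
            return maps.mean(dim=(2, 3))                 # global average pooling -> (B, 2048)

    class ShapeBranch(nn.Module):
        """Frame-wise shape maps -> temporal max pooling -> J horizontal stripes,
        each max-pooled and passed through its own fully connected layer."""
        def __init__(self, in_dim=256, out_dim=128, J=8):
            super().__init__()
            self.encoder = nn.Sequential(                # stand-in for the shape encoder CNN
                nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
                nn.Conv2d(64, in_dim, 3, padding=1), nn.ReLU())
            self.fcs = nn.ModuleList(nn.Linear(in_dim, out_dim) for _ in range(J))
            self.J = J

        def forward(self, masks):                        # masks: (B, T, 1, H, W) binary silhouettes
            B, T = masks.shape[:2]
            maps = self.encoder(masks.flatten(0, 1))
            maps = maps.view(B, T, *maps.shape[1:]).max(dim=1).values  # max-pool over frames
            stripes = maps.chunk(self.J, dim=2)          # split evenly along the height axis
            parts = [fc(s.amax(dim=(2, 3))) for fc, s in zip(self.fcs, stripes)]
            return torch.cat(parts, dim=1)               # concatenated shape feature vector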
Further, fusing the pedestrian appearance features and the pedestrian shape features comprises:
normalizing the feature vectors of the pedestrian appearance features and the pedestrian shape features respectively;
performing weight prediction: concatenating the normalized feature vectors and generating two weight vectors through two convolution layers with kernel size 1 × 2 respectively, then obtaining an appearance weight vector and a shape weight vector after a Softmax function;
performing feature transformation: inputting the normalized feature vectors into two convolution layers with kernel size 1 × 1 respectively, and obtaining the transformed appearance feature vector and shape feature vector through a Sigmoid function;
the feature vector that ultimately represents the pedestrian is a weighted sum of the appearance and shape feature vectors, as sketched below.
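A minimal sketch of this fusion step, assuming both input vectors have already been projected to a common dimension d; the way the two vectors are stacked so that a 1 × 2 kernel mixes them is an illustrative reading of the description, not necessarily the patented layout.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureIntegration(nn.Module):
        def __init__(self):
            super().__init__()
            self.conv_wa = nn.Conv2d(1, 1, kernel_size=(1, 2))  # weight prediction, 1x2 kernels
            self.conv_ws = nn.Conv2d(1, 1, kernel_size=(1, 2))
            self.conv_ta = nn.Conv2d(1, 1, kernel_size=1)       # feature transformation, 1x1 kernels
            self.conv_ts = nn.Conv2d(1, 1, kernel_size=1)

        def forward(self, f_a, f_s):                     # each: (B, d)
            f_a = F.normalize(f_a, p=2, dim=1)           # L2 normalization
            f_s = F.normalize(f_s, p=2, dim=1)
            pair = torch.stack([f_a, f_s], dim=-1).unsqueeze(1)  # (B, 1, d, 2)
            w_a = self.conv_wa(pair).squeeze(-1).squeeze(1)      # (B, d)
            w_s = self.conv_ws(pair).squeeze(-1).squeeze(1)
            w = torch.softmax(torch.stack([w_a, w_s]), dim=0)    # appearance/shape weights
            t_a = torch.sigmoid(self.conv_ta(f_a[:, None, :, None]).squeeze(-1).squeeze(1))
            t_s = torch.sigmoid(self.conv_ts(f_s[:, None, :, None]).squeeze(-1).squeeze(1))
            return w[0] * t_a + w[1] * t_s               # weighted (Hadamard) sum of the two vectors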
Further, the deep neural network model is trained in two stages:
in stage I, the appearance encoder and the shape encoder are optimized only through the appearance loss function and the shape loss function, learning discriminative coarse-grained appearance features and fine-grained shape features respectively;
in stage II, the entire network is trained jointly through a weighted sum of the appearance loss function, the shape loss function and the fusion loss function.
Further, the query video under test and the search library videos are input together into the model to extract features; in the feature space, the Euclidean distance between the features of the query video and each search library video is calculated to measure similarity.
Further, search library videos with higher similarity are ranked higher in the identity matching result.
In a second aspect, the present disclosure further provides a reloading video pedestrian re-identification system based on appearance and shape features, which includes a model establishing module, a feature extracting module, a model optimizing module and a testing module;
the model building module configured to: establishing a deep neural network model;
the feature extraction module configured to: based on a deep neural network model, extracting pedestrian appearance characteristics and pedestrian shape characteristics of pedestrian videos in a training set; fusing the pedestrian appearance characteristic and the pedestrian shape characteristic to obtain a final pedestrian characteristic;
the model optimization module configured to: carrying out deep neural network model training by adopting the pedestrian appearance characteristic, the pedestrian shape characteristic and the fused final pedestrian characteristic to obtain a trained model;
the testing module configured to: inputting the query video and the search library videos into the trained model to extract pedestrian features; calculating the Euclidean distances between the pedestrian features of the query video and those of the search library videos, and ranking identity matches according to these distances.
In a third aspect, the present disclosure also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the above-mentioned reloading video pedestrian re-identification method based on appearance and shape features.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the above-mentioned reloading video pedestrian re-identification method based on appearance and shape features.
Compared with the prior art, the beneficial effects of the present disclosure are:
1. The coarse-grained appearance branch extracts global coarse-grained appearance features to handle cases where pedestrians change their outfits slightly, and the fine-grained shape branch extracts part-based fine-grained shape features to handle cases where pedestrians change their outfits drastically.
2. The extracted appearance and shape features are fused adaptively, and the fused features generalize better to clothing changes of different degrees, thereby improving the recognition of pedestrians who change clothes.
Drawings
The accompanying drawings, which form a part hereof, are included to provide a further understanding of the present embodiments, and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the present embodiments and together with the description serve to explain the present embodiments without unduly limiting the present embodiments.
Fig. 1 is a flow chart of embodiment 1 of the present disclosure.
Detailed Description of the Embodiments
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
The term "change of clothing" as used in this disclosure means that the same pedestrian is considered to have changed in clothing (such as a change in color, pattern or style of upper garment, lower garment or shoes) which is recognizable to the human eye in two video sequences.
Example 1:
the embodiment provides a reloading video pedestrian re-identification method based on appearance and shape characteristics; the method specifically comprises the following steps:
inputting the acquired query video to be identified and the search library videos into a trained deep neural network model, and extracting pedestrian features;
calculating the Euclidean distances between the pedestrian features of the query video and the features of the search library videos, and ranking identity matches according to these distances.
Specifically, in this embodiment, the establishing of the deep neural network model and the training process of the model specifically include:
establishing a deep neural network model, and setting a corresponding module network structure; as shown in FIG. 1, the model includes three parts, namely an Appearance Encoder (Appearance Encoder), a Shape Encoder (Shape Encoder) and a Feature Integration Module (Feature Integration Module).
Suppose a pedestrian video in the training set is denoted $v_i = \{I_t\}_{t=1}^{T}$, i.e. the video consists of $T$ frames, where $I_t$ ($t = 1, 2, \ldots, T$) denotes the $t$-th frame. $v_i$ is input into the appearance encoder, which extracts pedestrian appearance features frame by frame to obtain per-frame appearance feature maps. The appearance feature maps corresponding to the frames are average-pooled and aggregated into an appearance feature map of the video; global average pooling is then applied to this map to obtain the appearance feature vector $f_i^a$ of the video. Using the identity label data in the dataset, the appearance loss $L_a$ is defined as:

$$L_a = L_{ce}^{a} + \alpha L_{tri}^{a}$$

where $L_{ce}^{a}$ is the cross-entropy loss, $L_{tri}^{a}$ is the triplet loss, and $\alpha$ is a weighting factor. A sketch of this loss follows.
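The two terms of $L_a$ can be sketched with standard PyTorch losses as follows; the batch-hard mining used for the triplet term is a common choice and an assumption here, since the description does not fix the mining scheme.

    import torch
    import torch.nn.functional as F

    def triplet_loss(features, labels, margin=0.3):
        """Batch-hard triplet loss over a batch of feature vectors."""
        dist = torch.cdist(features, features)                 # pairwise Euclidean distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)
        hardest_pos = (dist * same.float()).max(dim=1).values  # farthest same-identity sample
        masked = dist.masked_fill(same, float("inf"))
        hardest_neg = masked.min(dim=1).values                 # closest different-identity sample
        return F.relu(hardest_pos - hardest_neg + margin).mean()

    def appearance_loss(logits, features, labels, alpha=1.0):
        """L_a = cross-entropy + alpha * triplet, as defined above."""
        return F.cross_entropy(logits, labels) + alpha * triplet_loss(features, labels)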
$v_i$ is simultaneously input into a human body segmentation module to obtain a binary pedestrian segmentation video $s_i$. $s_i$ is input into the shape encoder, which extracts pedestrian shape features frame by frame to obtain per-frame shape feature maps. The shape feature maps corresponding to the frames are max-pooled and aggregated into a shape feature map of the pedestrian, which is evenly divided horizontally into $J$ sub-feature maps $\{f_{i,j}^{s}\}_{j=1}^{J}$. Max pooling is performed within each sub-feature map, and the pooled vectors are passed through $J$ fully connected layers respectively, yielding $J$ feature vectors that respectively represent $J$ horizontal regions of the pedestrian. The $J$ feature vectors are concatenated to form the final feature vector $f_i^s$ representing the pedestrian's shape. Using the identity label data in the dataset, the shape loss $L_s$ is defined as:

$$L_s = \sum_{j=1}^{J} \left( L_{ce}^{s,j} + \beta L_{tri}^{s,j} \right)$$

where $L_{ce}^{s,j}$ is the cross-entropy loss corresponding to the $j$-th shape feature vector, $L_{tri}^{s,j}$ is the corresponding triplet loss, and $\beta$ is a weighting factor. A sketch follows.
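Correspondingly, $L_s$ sums a cross-entropy and a triplet term over the $J$ part-level vectors; a sketch reusing the triplet_loss helper above, where the per-part classifier logits are an assumed input:

    def shape_loss(part_logits, part_features, labels, beta=1.0):
        """L_s: sum over the J horizontal parts of CE + beta * triplet."""
        return sum(F.cross_entropy(lg, labels) + beta * triplet_loss(ft, labels)
                   for lg, ft in zip(part_logits, part_features))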
Feature fusion is performed on the pedestrian appearance feature $f_i^a$ and shape feature $f_i^s$. The feature fusion comprises two steps, weight prediction and feature transformation. First, $L_2$ normalization is applied to the two feature vectors, yielding $\hat{f}_i^a$ and $\hat{f}_i^s$. For weight prediction, the $L_2$-normalized feature vectors are concatenated and passed through two convolution layers with kernel size $1 \times 2$ respectively, generating two weight vectors $w_a'$ and $w_s'$; after a Softmax function, the appearance weight vector $w_a$ and the shape weight vector $w_s$ are obtained. This process can be expressed as:

$$w_a' = \mathrm{Conv}_{a}\left(\left[\hat{f}_i^a; \hat{f}_i^s\right]\right), \qquad w_s' = \mathrm{Conv}_{s}\left(\left[\hat{f}_i^a; \hat{f}_i^s\right]\right)$$

$$\left[w_a, w_s\right] = \mathrm{Softmax}\left(\left[w_a', w_s'\right]\right)$$
For feature transformation, the $L_2$-normalized feature vectors are input into two convolution layers with kernel size $1 \times 1$ respectively, followed by a Sigmoid function, to obtain the transformed appearance feature vector $\tilde{f}_i^a$ and shape feature vector $\tilde{f}_i^s$. This process can be expressed as:

$$\tilde{f}_i^a = \mathrm{Sigmoid}\left(\mathrm{Conv}_{1\times1}^{a}(\hat{f}_i^a)\right), \qquad \tilde{f}_i^s = \mathrm{Sigmoid}\left(\mathrm{Conv}_{1\times1}^{s}(\hat{f}_i^s)\right)$$

The feature vector $f_i$ that ultimately represents the pedestrian is a weighted sum of the appearance and shape feature vectors, expressed as:

$$f_i = w_a \odot \tilde{f}_i^a + w_s \odot \tilde{f}_i^s$$
where $\odot$ denotes the Hadamard product. Using the identity label data in the dataset, the fusion loss $L_c$ is defined as:

$$L_c = L_{ce}^{c} + \alpha L_{tri}^{c}$$

where $L_{ce}^{c}$ is the cross-entropy loss, $L_{tri}^{c}$ is the triplet loss, and $\alpha$ is a weighting factor.
The total loss $L_{all}$ when training the model is:

$$L_{all} = L_a + \lambda_1 L_s + \lambda_2 L_c$$

where $\lambda_1$ and $\lambda_2$ are weighting factors. The objective function is minimized using the error back-propagation algorithm, thereby optimizing the parameters of the model. A sketch of one optimization step follows.
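One stage-II optimization step might look as follows, reusing the branch and fusion modules sketched earlier; the optimizer, learning rate, lambda values, data-loader fields and the loss helper names are illustrative assumptions.

    import torch

    lambda1, lambda2 = 1.0, 1.0                      # assumed weighting factors
    model_params = (list(appearance_branch.parameters())
                    + list(shape_branch.parameters())
                    + list(fusion.parameters()))
    optimizer = torch.optim.Adam(model_params, lr=3e-4)

    for videos, masks, labels in train_loader:       # hypothetical training loader
        f_a = appearance_branch(videos)              # appearance feature vectors
        f_s = shape_branch(masks)                    # shape feature vectors
        f = fusion(f_a, f_s)                         # fused pedestrian features
        loss = (appearance_loss_term(f_a, labels)          # L_a (hypothetical helper)
                + lambda1 * shape_loss_term(f_s, labels)   # lambda_1 * L_s
                + lambda2 * fusion_loss_term(f, labels))   # lambda_2 * L_c
        optimizer.zero_grad()
        loss.backward()                              # error back-propagation
        optimizer.step()                             # update model parameters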
In this embodiment, a two-stage training strategy is used to train the model. In stage I, the feature fusion module is not used; discriminative $f_i^a$ and $f_i^s$ are learned separately through $L_a$ and $L_s$ alone, optimizing the parameters of the appearance encoder and the shape encoder. In stage II, the whole model (except the human body segmentation module) is trained jointly through the total loss $L_{all}$, and the model parameters are optimized.
During testing, the query video and the search library (gallery) videos are input into the trained model to extract features; in the feature space, similarity is measured by their Euclidean distance: the smaller the Euclidean distance between two features, the higher the similarity, and the higher the similarity of a gallery video, the higher it ranks in the identity matching result for the query video. A sketch of this matching step follows.
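A minimal sketch of this matching step; the variable names are illustrative.

    import torch

    def rank_gallery(query_feat, gallery_feats):
        """query_feat: (d,), gallery_feats: (N, d) -> gallery indices, best match first."""
        dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)  # (N,) Euclidean
        order = torch.argsort(dists)     # smaller distance = higher similarity = earlier rank
        return order, dists[order]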
To explain the specific embodiment of the present disclosure in detail, the clothes-changing video pedestrian re-identification dataset COCV is taken as an example. In this embodiment, the appearance encoder is a ResNet50 pre-trained on the ImageNet dataset, the pedestrian segmentation module uses a JPPNet model pre-trained on the human parsing dataset LIP, and the shape encoder uses the GaitSet model. The procedure is as follows:
Stage I training is performed: without the feature fusion module, discriminative $f_i^a$ and $f_i^s$ are learned separately through the appearance loss $L_a$ and the shape loss $L_s$, optimizing the parameters of the appearance encoder and the shape encoder; the number of training epochs for this pre-training is set to 400.
In stage II, with the weighted sum $L_{all}$ of the appearance loss $L_a$, the shape loss $L_s$ and the fusion loss $L_c$ as the total loss function, the whole model (except the human body segmentation module) is trained and the model parameters are optimized; the number of training epochs is set to 400.
During testing, the query video and the search library (gallery) videos are both input into the trained model to extract features; in the feature space, similarity is measured by the Euclidean distance between features: the smaller the Euclidean distance between two features, the higher the similarity, and the higher the similarity, the higher a gallery video ranks in the identity matching result for the query video.
Example 2:
the embodiment provides a reloading video pedestrian re-identification system based on appearance and shape characteristics, which comprises a model establishing module, a characteristic extracting module, a model optimizing module and a testing module;
the model building module configured to: establishing a deep neural network model;
the feature extraction module configured to: based on the deep neural network model, extracting pedestrian apparent characteristics and pedestrian shape characteristics of the pedestrian video in the training set; fusing the pedestrian appearance characteristic and the pedestrian shape characteristic to obtain a final pedestrian characteristic;
the model optimization module configured to: carrying out deep neural network model training by adopting the pedestrian appearance characteristic, the pedestrian shape characteristic and the fused final pedestrian characteristic to obtain a trained model;
the test module configured to: inputting the query video and the search library videos into the trained model to extract pedestrian features; and calculating the Euclidean distances between the pedestrian features of the query video and those of the search library videos, and ranking identity matches according to these distances.
Example 3:
the present embodiment provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements the reloaded video pedestrian re-identification method based on appearance and shape features as described in embodiment 1.
It is understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, and so on.
A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The memory may include both read-only memory and random access memory and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
Example 4:
the present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the reloading video pedestrian re-identification method based on appearance and shape features as described in embodiment 1.
The reloading video pedestrian re-identification method based on appearance and shape features described in embodiment 1 can be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM or EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, the details are not described again here.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and those skilled in the art can make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present embodiment should be included in the protection scope of the present embodiment.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (9)

1. The reloading video pedestrian re-identification method based on appearance and shape features is characterized by comprising the following steps of:
establishing a deep neural network model;
based on the deep neural network model, extracting pedestrian apparent characteristics and pedestrian shape characteristics of the pedestrian video in the training set;
fusing the pedestrian appearance characteristic and the pedestrian shape characteristic to obtain a final pedestrian characteristic;
carrying out deep neural network model training by adopting the pedestrian appearance characteristic, the pedestrian shape characteristic and the fused final pedestrian characteristic to obtain a trained model;
inputting the inquiry video and the search library video into the trained model to extract the pedestrian characteristics;
calculating Euclidean distances between the pedestrian features of the query video and the pedestrian features of the search library video, and sequencing identity matching according to the Euclidean distances;
fusing the pedestrian appearance features and the pedestrian shape features includes:
respectively normalizing the feature vectors of the pedestrian apparent features and the pedestrian shape features;
performing weight prediction, connecting the normalized feature vectors, and generating two weight vectors through two convolution layers with convolution kernel size of 1 × 2 respectively; obtaining an apparent weight vector and a shape weight vector after passing through a Softmax function;
performing feature conversion, respectively inputting the normalized feature vectors into two convolution layers with convolution kernel sizes of 1 × 1, and obtaining converted apparent feature vectors and shape feature vectors through a Sigmoid function;
the feature vector that ultimately represents the pedestrian is a weighted sum of the apparent and shape feature vectors.
2. The method of claim 1, wherein the deep neural network model comprises an appearance encoder for extracting appearance features of pedestrian video coarse granularity, a shape encoder for extracting shape features of pedestrian video fine granularity, and a feature fusion module for adaptively fusing the appearance features and the shape features into final pedestrian features.
3. The reloading video pedestrian re-identification method of claim 2, wherein extracting pedestrian appearance features and pedestrian shape features of the pedestrian videos in the training set comprises:
inputting the pedestrian video into an apparent encoder, and extracting the pedestrian apparent features frame by frame to obtain an apparent feature map; carrying out average pooling on the apparent feature maps corresponding to each frame, and aggregating the apparent feature maps into an apparent feature map of the video; then carrying out global average pooling on the apparent feature map of the video to obtain an apparent feature vector of the video; defining an apparent loss function by using the identity tag data in the data set;
simultaneously carrying out human body segmentation on the pedestrian video to obtain a binary pedestrian segmentation video; inputting the human segmentation video into a shape encoder, and extracting the shape features of the pedestrians frame by frame to obtain a shape feature map; performing maximum pooling on the shape characteristic graph corresponding to each frame, aggregating the shape characteristic graphs into a shape characteristic graph of the pedestrian, and uniformly and horizontally dividing the shape characteristic graph into a plurality of sub-characteristic graphs; performing maximum pooling in each sub-feature map, and allowing the pooled feature vectors to pass through a plurality of full-connection layers respectively to obtain a plurality of feature vectors respectively representing a plurality of horizontal areas of the pedestrians; connecting a plurality of feature vectors to form a final feature vector representing the shape of the pedestrian; a shape loss function is defined using the identity tag data in the dataset.
4. The method of claim 1, wherein the deep neural network model training process is performed in two stages, including:
in stage I, an apparent encoder and a shape encoder are optimized only through an apparent loss function and a shape loss function, and discriminative coarse-grained apparent features and fine-grained shape features are learned respectively;
in stage II, the entire network is jointly trained by a weighted sum of the apparent loss function, the shape loss function, and the fusion loss function.
5. The method for pedestrian re-identification of reloading video based on appearance and shape features as claimed in claim 1, wherein the query video of the test is input to the model together with the search library video to extract features; in the feature space, the Euclidean distance between the features of the query video and the search library video is calculated to measure the similarity.
6. The reloading video pedestrian re-identification method based on appearance and shape features as recited in claim 5, wherein search library videos with higher similarity are ranked higher in the identity matching result.
7. The reloading video pedestrian re-identification system based on appearance and shape features is characterized by comprising a model establishing module, a feature extraction module, a model optimization module and a testing module;
the model building module configured to: establishing a deep neural network model;
the feature extraction module configured to: based on the deep neural network model, extracting pedestrian apparent characteristics and pedestrian shape characteristics of the pedestrian video in the training set; fusing the pedestrian appearance characteristic and the pedestrian shape characteristic to obtain a final pedestrian characteristic;
the model optimization module configured to: performing deep neural network model training by using the pedestrian appearance characteristics, the pedestrian shape characteristics and the fused final pedestrian characteristics to obtain a trained model;
the test module configured to: inputting the inquiry video and the search library video into the trained model to extract the pedestrian characteristics; calculating Euclidean distances between the pedestrian features of the inquiry video and the pedestrian features of the search library video, and sequencing identity matching according to the Euclidean distances;
fusing the pedestrian appearance features and the pedestrian shape features includes:
respectively normalizing the feature vectors of the pedestrian apparent features and the pedestrian shape features;
performing weight prediction, connecting the normalized feature vectors, and generating two weight vectors through two convolution layers with convolution kernel size of 1 × 2 respectively; obtaining an apparent weight vector and a shape weight vector after passing through a Softmax function;
performing feature conversion, respectively inputting the normalized feature vectors into two convolution layers with convolution kernel sizes of 1 × 1, and obtaining converted apparent feature vectors and shape feature vectors through a Sigmoid function;
the feature vector that ultimately represents the pedestrian is a weighted sum of the apparent and shape feature vectors.
8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the reloading video pedestrian re-identification method based on appearance and shape features of any one of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method for pedestrian re-identification based on reloaded video of apparent and shape features according to any one of claims 1 to 6.
CN202110748180.1A 2021-07-01 2021-07-01 Reloading video pedestrian re-identification method and system based on appearance and shape characteristics Active CN113420697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110748180.1A CN113420697B (en) 2021-07-01 2021-07-01 Reloading video pedestrian re-identification method and system based on appearance and shape characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110748180.1A CN113420697B (en) 2021-07-01 2021-07-01 Reloading video pedestrian re-identification method and system based on appearance and shape characteristics

Publications (2)

Publication Number Publication Date
CN113420697A CN113420697A (en) 2021-09-21
CN113420697B true CN113420697B (en) 2022-12-09

Family

ID=77720004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110748180.1A Active CN113420697B (en) 2021-07-01 2021-07-01 Reloading video pedestrian re-identification method and system based on appearance and shape characteristics

Country Status (1)

Country Link
CN (1) CN113420697B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821689A (en) * 2021-09-22 2021-12-21 沈春华 Pedestrian retrieval method and device based on video sequence and electronic equipment
CN114758362B (en) * 2022-06-15 2022-10-11 山东省人工智能研究院 Clothing changing pedestrian re-identification method based on semantic perception attention and visual shielding

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942502A (en) * 2019-11-29 2020-03-31 中山大学 Voice lip fitting method and system and storage medium
CN111191526A (en) * 2019-12-16 2020-05-22 汇纳科技股份有限公司 Pedestrian attribute recognition network training method, system, medium and terminal

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11443165B2 (en) * 2018-10-18 2022-09-13 Deepnorth Inc. Foreground attentive feature learning for person re-identification
CN109816701B (en) * 2019-01-17 2021-07-27 北京市商汤科技开发有限公司 Target tracking method and device and storage medium
CN110046599A (en) * 2019-04-23 2019-07-23 东北大学 Intelligent control method based on depth integration neural network pedestrian weight identification technology
CN110070066B (en) * 2019-04-30 2022-12-09 福州大学 Video pedestrian re-identification method and system based on attitude key frame
CN111259786B (en) * 2020-01-14 2022-05-03 浙江大学 Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111339908B (en) * 2020-02-24 2023-08-15 青岛科技大学 Group behavior identification method based on multi-mode information fusion and decision optimization
CN111523470B (en) * 2020-04-23 2022-11-18 苏州浪潮智能科技有限公司 Pedestrian re-identification method, device, equipment and medium
CN111967408B (en) * 2020-08-20 2022-06-21 中科人工智能创新技术研究院(青岛)有限公司 Low-resolution pedestrian re-identification method and system based on prediction-recovery-identification
CN112507953B (en) * 2020-12-21 2022-10-14 重庆紫光华山智安科技有限公司 Target searching and tracking method, device and equipment
CN112926396B (en) * 2021-01-28 2022-05-13 杭州电子科技大学 Action identification method based on double-current convolution attention

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942502A (en) * 2019-11-29 2020-03-31 中山大学 Voice lip fitting method and system and storage medium
CN111191526A (en) * 2019-12-16 2020-05-22 汇纳科技股份有限公司 Pedestrian attribute recognition network training method, system, medium and terminal

Also Published As

Publication number Publication date
CN113420697A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
Li et al. Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios
Li et al. Person search with natural language description
Matsukawa et al. Person re-identification using CNN features learned from combination of attributes
Liu et al. Matching-cnn meets knn: Quasi-parametric human parsing
CN109325952B (en) Fashionable garment image segmentation method based on deep learning
Johnson et al. Clustered pose and nonlinear appearance models for human pose estimation.
Yamaguchi et al. Paper doll parsing: Retrieving similar styles to parse clothing items
CN108520226B (en) Pedestrian re-identification method based on body decomposition and significance detection
Miksik et al. Efficient temporal consistency for streaming video scene analysis
CN113420697B (en) Reloading video pedestrian re-identification method and system based on appearance and shape characteristics
Aurangzeb et al. Human behavior analysis based on multi-types features fusion and Von Nauman entropy based features reduction
Chai et al. Boosting palmprint identification with gender information using DeepNet
CN112215180A (en) Living body detection method and device
CN112001353A (en) Pedestrian re-identification method based on multi-task joint supervised learning
CN112669343A (en) Zhuang minority nationality clothing segmentation method based on deep learning
CN112084998A (en) Pedestrian re-identification method based on attribute information assistance
CN115205903A (en) Pedestrian re-identification method for generating confrontation network based on identity migration
Galiyawala et al. Person retrieval in surveillance videos using deep soft biometrics
Kurnianggoro et al. Identification of pedestrian attributes using deep network
CN117333901A (en) Clothing changing pedestrian re-identification method based on uniform and various fusion of clothing
Khan et al. Deep semantic pyramids for human attributes and action recognition
Galiyawala et al. Person retrieval in surveillance videos using attribute recognition
CN113807200B (en) Multi-row person identification method and system based on dynamic fitting multi-task reasoning network
Li et al. SAT-Net: Self-attention and temporal fusion for facial action unit detection
Oh et al. Visual adversarial attacks and defenses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant