CN107832672B

CN107832672B - Pedestrian re-identification method for designing multi-loss function by utilizing attitude information

Info

Publication number: CN107832672B
Application number: CN201710946443.3A
Authority: CN
Inventors: 周忠; 吴威; 姜那; 刘俊琦; 孙晨新
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2017-10-12
Filing date: 2017-10-12
Publication date: 2020-07-07
Anticipated expiration: 2037-10-12
Also published as: CN107832672A

Abstract

The invention discloses a pedestrian re-identification method for designing a multi-loss function by utilizing attitude information, which can effectively solve the difficulties caused by frequent pedestrian shielding, large video illumination difference and variable non-rigid pedestrian attitude in a monitoring video and is widely applied to the fields of security monitoring and the like. The method is mainly divided into two stages, namely an off-line stage and an on-line stage. The off-line stage is responsible for training and learning a deep learning network model with high accuracy, and comprises preprocessing, joint point information extraction, local feature extraction and feature fusion with global features extracted by a main network framework, and finally training is completed by utilizing a quintuple loss function for the fused features. And in the online stage, the trained deep learning network model is used for feature extraction, so that the pedestrian re-identification between the target to be analyzed and the stored target picture library is realized through similarity calculation.

Description

Pedestrian re-identification method for designing multi-loss function by utilizing attitude information

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a pedestrian re-identification method for designing a multi-loss function by utilizing attitude information, which is an accurate pedestrian re-identification method capable of resisting pedestrian shielding and having variable attitude and applied to an intelligent monitoring analysis system.

Background

The pedestrian re-recognition technology is used for searching a given target in a plurality of cameras and performing association matching on the search results. The technology provides basic support for the application of the video monitoring field, such as pedestrian retrieval, cross-camera tracking, man-machine interaction and the like. For the figure searching task of massive video data, the pedestrian re-identification can greatly liberate manpower. However, the pedestrian re-identification problem is very challenging due to different camera shooting visual angles, complex lighting conditions, frequent shielding, variable non-rigid pedestrian postures and the like. To overcome these difficulties, researchers have proposed many different solutions over the past 20 years. The algorithm principle can be roughly divided into two algorithms of design expression characteristics and optimized distance measurement.

Designing expression features refers to finding features that are robust to changes in the appearance of the image. The feature expression-based method focuses on how to design a feature description having a degree of recognition for pedestrians and stability against image changes. The method comprises the following steps of low-level visual features such as color histograms, texture features and local feature points, and middle-level features with semantic attributes.

In order to effectively utilize spatial information, the existing method generally divides the image into different areas, such as the Zheng Weishi from 2006 to 2013, which divides the image of a pedestrian into a plurality of horizontal stripes from top to bottom. Farenzena et al in 2010 divide the image of a pedestrian into three parts, namely the head, the trunk and the legs of the body by using image symmetry and asymmetric prior theories so as to extract feature combinations among different regions. Thanks to the advent of the large-data-volume pedestrian re-recognition data set Marker1501, MARS, researchers began using deep learning-based methods to represent image features. Cheng et al in 2016 proposed a multi-channel deep neural network framework based on local blocks that simultaneously extract global and local features using horizontally partitioned local stripes with artwork. However, due to the variation of different camera view angles and the pedestrian postures, the horizontal segmentation may generate misalignment, and may adversely affect the accuracy of the model. Based on the consideration, the invention adopts pedestrian joint point detection to obtain more accurate local position, achieves alignment based on semantics, and provides key conditions for the complementary fusion of global and local features.

Optimizing distance metrics refers to learning a distance space such that feature distances between images belonging to the same person are close, and feature distances between images belonging to different persons are far. In 2009 Weinberger et al proposed a large-interval nearest neighbor classification (LMNN), which employs a triplet constraint to make the k nearest neighbors of each sample belong to the same class in the new metric space. In 2012, Kostiger et al proposed a distance metric learning algorithm that remained simple and direct (Kiss). After that, gradually, the learners combine the distance measurement with deep learning to establish a verification model for pedestrian re-identification. The model takes the image pair as network input, and simultaneously calculates the distance between the features after the image features are extracted, and finally outputs the similarity between the images. Integrating extracted features and similarity metrics into one framework is a major advantage of this type of model. However, only with the verification model, only the features of dissimilarity between pairs of pictures can be extracted. The salient features that each picture has by itself are often ignored. Therefore, the invention considers the combined classification model and the verification model to carry out training, simultaneously calculates the classification loss and the verification loss, and weights the classification loss and the verification loss to achieve model complementation.

With the widespread application of deep learning to multiple subproblems in the field of computer vision, the method for accurately extracting joint point information in a complex scene, which is proposed by Wei and the like, provides possibility for re-identifying accurate local information acquisition for pedestrians. The algorithm for automatically extracting the joint point information based on the deep learning method can be applied to the pedestrian re-recognition problem in consideration of the fact that the pedestrian data posture change in the monitoring video presents a certain rule and rarely has abnormal postures. Therefore, the method calculates the local position of the human body by using the joint information obtained by the method and conjectures the pedestrian attitude orientation, wherein the local position information can be used for extracting local features and fusing global features, the pedestrian attitude orientation can be used for designing a quintuple loss function, and the information can improve the accuracy of pedestrian re-identification in a complex monitoring environment.

Disclosure of Invention

The purpose of the invention is: the pedestrian re-identification method based on the attitude information can deal with the situations that pedestrians are frequently sheltered, the illumination difference is large, the non-rigid pedestrians are changeable in attitude and the like in a monitoring video, and can be integrated into any intelligent monitoring system to realize the basic analysis of the pedestrians.

The technical scheme adopted by the invention is as follows: a pedestrian re-identification method for designing a multi-loss function by utilizing attitude information comprises the following steps: extracting two parts of main contents of feature network model training and pedestrian re-identification on line off line;

step (1), an off-line extraction characteristic network model training stage:

(m1) preprocessing all pictures, original picture rI_iAfter treatment with I_iRepresents;

(m2) Joint Point information is detected for each picture, and the obtained 18 pieces of Joint Point information have P_Ii＝{x₁，y₁，……，x₁₈，y₁₈In the four points, a corresponding Boolean array label indicates whether different joint points are detected or not, and the label_i＝(True or False)；

(m3) estimating a height high of each pedestrian based on the joint point information extracted in the step (m2)_iCalculating local region information of head, trunk and leg respectively

(m4) estimating the orientation of the pedestrian target from the joint point information extracted in the step (m2), and recording the orientation as dir_i(1or2or 3), where equal to 1 denotes a forward sample, 2 denotes a lateral sample, and 3 denotes a backward sample;

(m5) extracting global features according to the designed backbone network, extracting local features according to the local region position information extracted in the step (m3) and the branch network structure, and fusing the global features and the local features of each picture to form expressive feature vectors together;

(m6) calculating a multi-classification loss function and a first triplet loss function of the present invention based on the data true tags, while calculating a second triplet loss function based on the pedestrian attitude heading presumed in step (m 4);

(m7) training the current feature extraction network by combining the plurality of loss function errors calculated in the step (m6), analyzing the influence of different loss function weights on the network, and selecting the optimal weight lambda₁And λ₂To complete the joint training;

step (2), an online pedestrian re-identification stage:

(s1) preprocessing all pictures I in the Picture library_galleryAnd (2) extracting features by using the network model obtained by the off-line stage training in the step (1), and storing the extracted features one by one according to the identification information corresponding to the pictures to form a feature library F_gallery；

(s2) preprocessing the picture I to be analyzed_queryExtracting features by using the network model obtained by the off-line stage training in the step (1), and obtaining a final feature vector f_queryUnique valid information used as similarity measure for the subsequent step (s 3);

(s3) calculating f extracted in the step (s2)_queryAnd feature library F_galleryCarrying out normalization and sorting operation, and selecting the pictures with similarity greater than 0.7 and ranked at the top M as the retrieval result of pedestrian re-identification, wherein the numerical value of M is dynamically selected according to the quantity in the current picture library;

(s4) periodically updating the picture library and its corresponding feature library, with emphasis on both static library and dynamic library modalities detected and captured by the dynamic video library.

Further, the step (m3) comprises the following steps:

(m3.1) the joint point information P extracted according to the step (m2)_IiRemoving label_iAll False, i.e. a sample of failure to detect the joint; removing a sample which represents that the joint points of the trunk part are mostly False;

(m3.2) step (m2) of extracting samples of which the joint information meets the requirements of the invention to participate in training, and according to the existing joint information P_IiPresume pedestrian's height high_i；

(m3.3) calculating head region information from the joint points of the left and right ears, nose, and the like

(m3.4) calculating trunk area information according to the position information of the joint points related to the left shoulder, the right shoulder, the left crotch and the right crotch

(m3.5) calculating leg region information based on the waist position, ankle, height, etc

Because the detected bounding box often does not contain feet, the bounding box is scaled down proportionally according to the height;

and (m3.6) generating the interest region according to the local region position information obtained by calculation in the steps (m3.3) to (m3.5), and entering a branch network for local feature extraction by utilizing an improved interest region feature extraction layer.

Further, the step (m4) comprises the following steps:

(m4.1) after screening according to the step (m3.1), determining the posture orientation of the sample participating in training, and judging the sample without the left shoulder or the right shoulder as the lateral dir_i＝2；

(m4.2) left and right shoulder vectors are calculated for samples where both left and right shoulders are present

(m4.3) left and right shoulder vectors obtained according to step (m4.2)

Calculating the included angle dir _ angle with the vertical line_Ii；

(m4.4) judging the included angle dir _ angle calculated in the step (m4.3)_IiWithin the range of included angle dir _ angle_IiIn the range of [260 °,280 ° ]]The inner direction is marked as forward dir_iIf not, the included angle is judged to be in the range of [80 degrees ] and 100 degrees DEG when the included angle is 1]Within this range this is marked as back dir_i3 if not aboveWithin the two ranges, the sample is marked as lateral dir_i＝2。

Further, the step (m5) comprises the following steps:

(m5.1) extracting global features, labeled fglobal (I), from the backbone network in the network framework proposed by the present invention_i)；

(m5.2) connecting the three local features (I) extracted in step (m3.6)_i) Which are respectively labeled fh (I)_i)、ft(I_i)、fl(I_i)；

(m5.3) implementing the global feature extracted in the step (m5.1) and the local feature f (I) extracted in the step (m5.2) by using a full connection layer_i)。

Further, the step (m6) comprises the following steps:

(m6.1) calculating a multi-classification loss function error;

(m6.2) calculate the first triple constraint of the invention:

D_id(I_i ^a,I_i ^p,I_i ⁿ)＝d(f(I_i ^a)-f(I_i ^p))-d(f(I_i ^a)-f(I_i ⁿ))＜α

wherein, I^aIs any one reference pedestrian image in the data set, I_i ^pRepresenting another image representing the same person as the reference pedestrian, i.e. a positive sample, I_i ⁿFor the images of other people, namely negative samples, the triple input is subjected to network calculation to obtain respective feature vectors { f (I)_i ^a),f(I_i ^p),f(I_i ⁿ)}，d(f(I_i ^a)-f(I_i ^p) D (I) is the distance between the reference map and the positive sample pair_i ^a)-f(I_i ⁿ) Distance between the reference map and the negative sample pairs, α a threshold for the triplet constraint;

(m6.3) calculating the second quintuple constraint of the invention:

D_pose(I_i ^a,I_i ^ps,I_i ^pd)＝d(f(I_i ^a)-f(I_i ^ps))-d(f(I_i ^a)-f(I_i ^pd))＜β

wherein the content of the first and second substances,

is shown and

a positive sample with the same posture is taken,

is shown and

positive samples with different poses, β, are thresholds for quintuple double constraints.

Further, the step (m7) comprises the following steps:

(m7.1) calculating a back-propagated combined error value from the multi-loss function error obtained in step (m 6):

Loss₃(I,w)＝λ₁Loss₁(I,w)+λ₂Loss₂(I,w)

therein, Loss₁Representing a multi-class Loss function, Loss₂Representing the quintuple Loss function, Loss₃Representing a joint loss function, λ₁And λ₂To balance the weight of the joint loss function, λ is the weight of the balanced triplet and quintet constraints, w is the network parameter,

representing the probability of prediction, p_iIs the target probability, N is the number of pedestrian species, N is the quinaryNumber of groups.

(m7.2) analyzing the error weight parameter λ in step (m7.1)₁And λ₂Determining the optimal loss function distribution weight used in the off-line stage.

Further, the step (s3) includes the steps of:

(s3.1) dynamically selecting the value of M according to the number in the current picture library;

(s3.2) calculating f extracted in step (s2) in sequence_queryAnd feature library F_galleryA characteristic distance therebetween;

(s3.3) carrying out normalization and sorting operation on all the characteristic distances calculated in the step (s3.2), and selecting the picture with the similarity larger than 0.7 and ranked at the top M as a retrieval result of pedestrian re-identification;

(s3.4) visualizing the pedestrian re-identification search result obtained in step (s3.3) and displaying I for the static photo gallery_queryAnd after sorting I_resultsFor the dynamic video library, the method is based on I_resultsAnd restoring the real condition of the result in the video at the moment in the database by using the camera ID, the pedestrian ID, the bounding box position information, the frame number time and the like.

Further, the step (s4) includes the steps of:

(s4.1) setting a time t for periodic updating;

(s4.2) continuously adding query pictures I into the static picture library within the time t range_queryInformation and characteristics of; at the moment t, replacing or updating the picture library according to requirements, re-extracting the changed picture characteristics, and establishing a new characteristic library;

(s4.3) continuously adding the detected new target to the dynamic video library within the time t range, and storing the camera ID, the pedestrian ID, the position information of the surrounding frame, the frame number, the time, the place and other world information in a database; and after the moment t is reached, clearing half of pedestrian data information in the current database according to time, adding new detection results frame by frame, and simultaneously extracting the characteristics of the new detection results as a main attribute stored in the database.

The principle of the invention is as follows:

the invention provides a pedestrian re-identification method for learning features by calculating various loss functions by utilizing human posture information. The design of the invention firstly derives from the increase of the number of monitoring cameras and bayonets and the enhancement of the storage capacity, and provides resource guarantee for pedestrian big data. The pedestrian data with different magnitudes provides a good data base for the pedestrian re-identification technology based on deep learning. Secondly, the method considers the presenting rule of the pedestrian in the monitoring video, and adjusts the aspect ratio of each picture in the preprocessing stage so as to keep good spatial information characteristics when extracting the characteristics in the subsequent deep network framework. In addition, in order to process the situation that the background in the surveillance video is noisy and frequently shielded, the invention introduces local features to make up the deficiency of global features. Namely, joint point information is introduced to calculate the position of a local area of the pedestrian and the orientation posture of the local area relative to the camera. And then extracting the local features of the human body according to the local position information, and fusing the local features with the global features. Finally, the invention also considers the improvement of the expression capability of the deep learning network model from the aspect of the training strategy, designs the quintuple loss function by utilizing the orientation information, and completes the training by combining with the cross entropy loss function, thereby obtaining the efficient and robust feature extraction model in the off-line stage.

In the face of the huge data volume of surveillance videos, it has become impractical to manually complete pedestrian re-identification. The automatic pedestrian re-identification technology can promote the development of various applications such as video analysis, security and the like. The main reason for the low efficiency of manual pedestrian re-identification is that the number of targets to be analyzed is large, and a large number of observed target features cannot be stored in the human brain in a short time. Therefore, after the feature extraction model is obtained, the technical route for completing pedestrian re-identification on line is designed. In the process, firstly, the existing picture library is updated regularly according to the monitoring content, and relevant characteristics are pre-fetched, so that the retrieval time is shortened. And then, after the target to be analyzed is obtained, the target to be analyzed is quickly matched, and necessary pedestrian re-identification is completed.

Specifically, the pedestrian re-identification method for designing the multi-loss function by utilizing the attitude information is divided into an off-line stage and an on-line stage. In an off-line stage, the invention firstly provides a feature extraction depth network framework for keeping the pedestrian aspect ratio; secondly, joint point information is introduced to calculate the position of the pedestrian local area and the orientation posture of the pedestrian local area relative to the camera; then, extracting local features of the human body according to the local position information, and fusing the local features with the global features; and finally, designing a quintuple loss function by using the orientation information, and training the quintuple loss function together with the cross entropy loss function. In the on-line stage, firstly, a feature extraction model obtained by off-line stage training is used for extracting and storing features of a preprocessed picture library; secondly, adjusting the aspect ratio of the target picture to be analyzed, and extracting features after adjustment; according to the extracted target features to be analyzed, similarity measurement is carried out in storage features of a picture library, the calculated similarities are subjected to normalization sorting, and pictures in the library which meet similarity conditions and are ranked in front are selected as retrieval results; and finally, integrating the information such as the camera and the ID matched with the retrieval results, outputting the information in a visual mode, and simultaneously storing the information in a query library to provide input for analysis of other applications. In addition, for surveillance videos or pedestrian data acquired currently, the picture library and the characteristics thereof need to be updated regularly to ensure that the most accurate pedestrian re-identification result is obtained.

In the off-line stage, the specific steps are as follows:

firstly, the invention preprocesses all pictures (pictures to be analyzed and picture library), adjusts the aspect ratio to 1:2, and adjusts the size to 107 × 219 before training, so as to ensure that effective spatial information can be reserved in the next feature extraction stage, and simultaneously, the invention also plays a role of reducing network parameters.

Secondly, the network structure proposed by the present invention consists of the following parts: the system comprises a joint point detection network with 1 fixed parameter, 1 main network, 3 local branch networks, 3 connection layers with integrated characteristics and two loss layers. The knuckle detection network provides pedestrian knuckle information. The main network, the branch network and the connection layer are responsible for extracting the global features and the local features and fusing the global features and the local features. The loss layer is responsible for combining the two loss functions and performing metric learning.

The joint point detection network mainly extracts 18 joint points of a human body, including a neck, a nose, left and right shoulders, elbows, wrists, knees, ankles, glutes, shoulders, eyes and ears, and allows partial joint points to be lost. After the coordinates of all the joint points are obtained, the pedestrian height of the pedestrian is estimated by using the coordinates, the height is used as the assistance, the region boundary is calculated through the maximum value and the minimum value of the joint coordinates of all the regions, and the position information is provided for the subsequent network extraction of the local features.

Meanwhile, the invention also uses the obtained joint point information to estimate the pedestrian posture orientation. And discarding the pedestrian targets with joint point information acquisition failure or without joint points of the trunk part, and not utilizing the defective samples for training so as to avoid polluting the feature extraction model. And preferentially judging whether the left shoulder joint point or the right shoulder joint point exists for the samples participating in the training, and locking the standard lateral samples. A left shoulder to right shoulder direction vector is then calculated, the angle between this vector and the vertical line being used as the primary credential for orientation discrimination. Those with included angles in the range of [80 °,100 ° ] are labeled as back-facing samples, and those with included angles in the range of [260 °,280 ° ] are labeled as front-facing samples.

Thirdly, the global features and the local features required by pedestrian re-identification are extracted by utilizing the backbone network, the branch network and the local area information designed by the invention. The backbone network structure is based on the idea of initiation _ v3, but the different cost lies in that the structure of the invention comprises 5 convolution modules, each module has a plurality of branches, and each branch is formed by stacking convolution layers with various scales and pooling layers. Such a structure can increase the width of the network and reduce the network parameters, and can also enhance the adaptability to the scale. The network of the present invention uses relus to introduce non-linearity capability and uses batch regularization before each ReLU to speed up convergence and mitigate the effects of parameter distribution variation, with a 50% Dropout set at the last fully connected layer to prevent overfitting. The branch network structure shares the parameters before conv5_ x with the backbone network. And the position information is added through the pooling layer to extract the local characteristics of the respective areas. The branch network is similar to the main network in structure, except that the output number of the last pooling layer and the full connection layer is less than that of the main network, thereby playing a role in adjusting the weight. And the network rear end combines the local features and the global features by utilizing a full connection layer to be used as feature vectors of the pedestrians.

Finally, the pedestrian similarity rule is obtained according to long-term experimental observation, namely the characteristic distance between different pedestrians is larger than that between the same pedestrians, and the characteristic distance between different postures of the same person is also larger than that between the same postures of the same person. According to the rule, the invention designs a quintuple loss function with double constraints and provides a strategy for joint training by using the loss function and a multi-classification loss function. The new loss function can correct the cognitive error that the network considers that the negative sample with the same posture is more similar to the positive sample with different postures in appearance, and fundamentally enables the network to learn the expression characteristics for overcoming the posture change. And the joint training strategy can increase the expression capability of the network under the condition of not changing the network structure, so that the obtained network model has better mobility, and the pedestrian re-recognition feature extraction network model required by the invention is obtained.

After the feature extraction network model obtained in the above steps is obtained in the off-line stage, pedestrian re-identification is carried out in the on-line stage, and the specific steps are as follows:

firstly, preprocessing all pictures in a picture library, adjusting the pictures to be in accordance with the unified size input by a feature extraction model, then extracting features of the preprocessed picture library by using the feature extraction model obtained by off-line stage training, and storing feature vectors in a strip mode according to key information of the preprocessed picture library to form a feature library;

secondly, preprocessing a target picture to be analyzed and extracting a feature vector with expression capability;

and thirdly, carrying out similarity measurement on the extracted target feature vector to be analyzed in a stored feature vector library. Carrying out normalization sorting on the calculated similarities, and selecting the images in the library which accord with the similarity conditions and are ranked in front as the retrieval results;

and finally, integrating the information such as the camera and the ID matched with the retrieval results, outputting the information in a visual mode, and simultaneously storing the information in a query library to provide input for other application analysis. In addition, for surveillance videos or pedestrian data acquired currently, the picture library and the feature vector library thereof need to be updated regularly, so that the most accurate pedestrian re-identification result is ensured.

Compared with the prior art, the invention has the advantages that:

1. the invention provides a deep neural network framework consisting of a main network and three sub-networks, wherein the main network is used for extracting global features, and the three sub-networks are used for extracting local features of the head, the trunk and the legs of a human body by utilizing joint point information. And finally, fusing the global and local features to improve the retrieval accuracy and effectively resist frequent occlusion in the surveillance video.

2. Quintuple constraint is designed by utilizing the deduced pedestrian orientation information, the metric learning ability is enhanced, and a strategy of training a network model by using joint classification loss and verification loss is used. It is ensured that the characteristic distance between images belonging to the same person is smaller than the characteristic distance between images belonging to different persons, and the characteristic distance between images of the same pose of the same person is smaller than the characteristic distance between images of different poses. The difficulty brought to the re-identification of the pedestrian by the variable target postures of the non-rigid pedestrian is fundamentally overcome.

3. The invention is separated from the common video analysis modules for detection, tracking and the like. The method can be integrated into any intelligent monitoring system as an independent module, provides accurate input for upper layer analysis, and is convenient to use and robust.

Drawings

FIG. 1 is a general diagram of a pedestrian re-identification method using attitude information to design a multi-loss function according to the present invention;

FIG. 2 is a schematic diagram of the present invention for estimating the orientation of a pedestrian and calculating the local region information to extract the local fine features of the head, the trunk and the legs according to the attitude information;

fig. 3 is a diagram illustrating a comparison between the position information of the extracted joint point and the local area of the present invention and the conventional striped local area division, where the first group is the original image, the second group is the extraction effect of the present invention, and the third group is the striped division effect, and it can be found that the method of the present invention can align the local area of the pedestrian target more effectively and can eliminate part of the background interference by comparing the second group and the third group of images;

FIG. 4 is a flow chart of estimating a pedestrian target attitude heading;

FIG. 5 is an exemplary diagram illustrating the method of estimating the orientation of a pedestrian using joint information, wherein the method is generally divided into three orientations, i.e., a side orientation, a front orientation and a back orientation;

FIG. 6 is a schematic diagram of the design of quintuple loss function according to the present invention.

Detailed Description

The specific steps of the present invention will be described in detail with reference to the accompanying drawings and examples.

The invention provides a pedestrian re-identification method for designing a multi-loss function by utilizing attitude information, which firstly introduces the pedestrian re-identification processing process in detail by combining the general schematic diagram of figure 1. The method comprises an off-line stage and an on-line stage, wherein the off-line stage comprises preprocessing, rough feature extraction, fine feature extraction, feature fusion, quintuple similarity measurement, multi-class loss function calculation, network parameter learning and the like; the online stage comprises four parts of feature extraction, similarity measurement, picture library updating and result visualization.

Stage (1) off-line stage: and training and learning a network model for extracting features.

A. The data preprocessing steps are as follows: note that in real video, the pedestrian bounding box is mostly rectangular, and the aspect ratio is about 0.5. Most of the existing pedestrian re-identification methods based on deep learning use square network inputs, and are not beneficial to keeping the spatial characteristics of pedestrians. Therefore, the input size of the network is changed into 107x219, the input size is consistent with the actual aspect ratio of the pedestrian image, effective feature extraction is facilitated, and meanwhile the effect of reducing network parameters is achieved. For image list LI₁,I₂,…,I_nAny picture I in_iThe above pretreatment was carried out.

B. Step of coarse feature extractionComprises the following steps: the network architecture designed by the invention mainly comprises a main network, a branch network and a plurality of important parts of a loss function layer. The parts of the backbone network and the branch network before the Conv5_ x use shared network parameters, the parts are mainly responsible for extracting rough features of pictures, and the features comprise information such as semantics and the like, are close to global features, and therefore are also basic features for global feature extraction. By picture I_iFor example, the partially extracted feature icon is labeled

Is the input for the next part of fine feature extraction.

C. The fine feature extraction steps are as follows: after the coarse features are obtained, the main network further extracts the fine features, and the branch network extracts the local fine features according to the local feature extraction schematic diagram shown in fig. 2. The specific process is as follows:

1) detecting a pedestrian joint point: as shown in fig. 2, joint detection is performed on the preprocessed pictures. The invention mainly extracts 18 joint points (allowed to be lost) of the human body, including neck, nose, left and right shoulders, elbows, wrists, knees, ankles, crotch, shoulders, eyes and ears. The joint point detection fails, and the sample without the trunk information is not involved in training. After the coordinates of each joint point are obtained, the position information of the left shoulder and the right shoulder of the sample meeting the conditions can be used as 2) the main basis for estimating the posture and the orientation of the pedestrian, and the height prediction can be used as 3) the auxiliary for extracting the local area information.

2) Presume pedestrian's posture orientation: after removing the sample that failed to detect the joint and had no torso, the presence of the left and right shoulders was observed, as shown in fig. 4, to determine a standard lateral sample. And calculating shoulder vectors of the samples existing in the left shoulder and the right shoulder, calculating included angles between the obtained vectors and the vertical line, and judging whether the samples are in the forward direction, the back direction or the side direction according to the included angle range. An example of the orientation of a pedestrian estimated by the method of the invention is shown in fig. 5.

3) Calculating pedestrian local area information: the traditional strip-type local area division can not eliminate the interference of a complex background and faces toIn the example of fig. 3, local feature region alignment cannot be achieved, and such errors may drive the network model to learn wrong features. Therefore, after the coordinates of each joint point are obtained, the height of the pedestrian is estimated, the height is taken as the assistance, the region boundary is calculated through the maximum value and the minimum value of the joint coordinates of each region, and the position information is provided for the subsequent network extraction of the local features. Also in picture I_iFor the purpose of explaining the formulation in the present invention, three local area information are expressed as

Each local region position information is composed of four-tuple (x)ⁱ,yⁱ,wⁱ,hⁱ) Composition, where x, y, w, h represent the upper left coordinates (x, y) of a region and the length and height of the region, respectively.

4) Extracting local fine features: after 3) obtaining the local area information, extracting local fine features using a branching network within the network structure. The partial branch network parameters are not shared.

D. The feature fusion mode is as follows: the invention analyzes a plurality of characteristic fusion modes in experiments, and mainly compares an ElementWise mode and a Concat mode. The results show that: the Concat mode of complementing the global characteristic and the local characteristic can obtain the most effective characteristic vector according to the design principle of the invention.

E. The design and the construction steps of the quintuple are as follows: the quintuple loss function is improved by adding the attitude constraint by the triplet loss function. The commonly used triplet loss function is mathematically expressed as a triplet { I }_i ^a,I_i ^p,I_i ⁿIn which I^aIs any one reference pedestrian image in the data set, I_i ^pRepresenting another image representing the same person as the reference pedestrian, i.e. a positive sample, I_i ⁿAn image of another person, i.e. a negative example. The triple inputs are subjected to network calculation to obtain respective feature vectors { f (I)_i ^a),f(I_i ^p),f(I_i ⁿ) And there is a triple constraint:

wherein d (f (I)_i ^a)-f(I_i ^p) D (I) is the distance between the reference map and the positive sample pair_i ^a)-f(I_i ⁿ) The inequality is meaningful by learning a metric in which the feature distance between the same person must be less than the feature distance between different persons in the distance space, i.e., the image features of the same person are more similar than those between different persons, on this basis, the present invention introduces a pose double constraint that is known from the pose orientation of the pedestrian obtained in C2), the present invention classifies the sample into three classes, forward, lateral, and backward.

Wherein the content of the first and second substances,

is shown and

a positive sample with the same posture is taken,

is shown and

the objective of the loss function is a measure that the distance of image features in the same pose of the same person in the distance space is smaller than the distance of image features in different poses of the same person in the distance spaceThe distance between them. Such constraints ensure that the distance between positive samples with the same posture is smaller, and the influence caused by posture change can be reduced.

The method takes the original triple constraint as the first constraint, designs the improved posture constraint as the second constraint, combines the two as the quintuple structure, calculates the loss of the quintuple structure and can realize the verification model training. The loss function is calculated as follows:

F. the implementation scheme of the multi-class loss function calculation and network joint training is as follows: the invention uses two loss functions jointly, one is a softmax loss function, and focuses on classifying the images. The other is a quintuple loss function added with posture constraint, and focuses on verifying whether the two images are the same person or not. FIG. 6 is a schematic diagram of the design of quintuple loss function according to the present invention. The classification model generally uses a softmax layer with an output k after the network total feature layer, where k is the number of classes in the training set. The training of the classification network consists in minimizing the cross-entropy losses, i.e. the classification losses together with the quintuple loss function described in step E) above can jointly train the network model. The combined loss function is calculated in the following way:

Loss₃(I,w)＝λ₁Loss₁(I,w)+λ₂Loss₂(I,w)

stage (2) on-line stage: and carrying out re-identification on the specified pedestrian in the pedestrian database.

A. Extraction of expressive features: the invention utilizes the feature extraction network obtained by off-line stage training to extract the expressive features of the picture to be analyzed and the existing pedestrian picture library. And simultaneously storing the feature vectors corresponding to the current picture library one by one before the next updating.

B. Similarity measure with pedestrian picture library and feature library: and after comparing and analyzing the Euclidean distance and the cosine distance, selecting the cosine distance as a standard measurement mode. And sequentially carrying out similarity measurement on the feature vector of the picture to be analyzed and the feature vector of the picture library. And carrying out normalization and sequencing treatment on the obtained similarity. The similarity is more than 0.7, and the picture of M before the ranking is taken as the retrieval result. Wherein M is dynamically set according to the total number of the current pictures.

C. Regular updating mode of pedestrian picture library: and aiming at the static picture library, continuously adding the picture to be analyzed each time and storing the characteristic vector of the picture. And updating the pedestrian information obtained by inspection every 30 minutes according to the pedestrian data generated by the dynamic video, continuously adding a pedestrian target judged as a new person before the next updating, and extracting features later in the system so as to complete pedestrian re-identification on line after obtaining the target to be analyzed.

D. Visualization scheme of pedestrian re-identification result: the invention records the result of each inquiry and stores the result into the database. Displaying the pedestrian re-identification result in two modes, and displaying the target to be analyzed and no more than M pictures which are determined as the same pedestrian aiming at the static picture library; and for the dynamic video, firstly, locking the picture of the retrieval result, and visualizing the picture into a corresponding video picture according to the camera ID, the pedestrian ID, the frame number, the position in the video and other information stored in the database. The stored entry information can be used for visualization, and can also be used for upper-layer applications such as camera topology analysis and video content analysis.

Claims

1. A pedestrian re-identification method for designing a multi-loss function by utilizing attitude information is characterized by comprising the following steps of: the method comprises two parts of main contents of off-line extraction characteristic network model training and on-line pedestrian re-identification;

step (1), an off-line extraction characteristic network model training stage:

(ml) preprocessing all pictures, the original picture rI_iAfter treatment with I_iRepresents;

(m2) Joint Point information is detected for each picture, and the obtained 18 pieces of Joint Point information have P_Ii＝{x₁，y₁，......，x₁₈，y₁₈In the four points, a corresponding Boolean array label indicates whether different joint points are detected or not, and the label_i＝(True or False)；

(m6) calculating a multi-classification loss function and a triplet according to the data true label, and simultaneously designing a quintet according to the pedestrian posture orientation presumed in the step (m4) and calculating a quintet loss function; the step (m6) comprises the following steps:

(m6.1) calculating a multi-classification loss function error;

(m6.2) computing the triplet constraints:

D_id(I_i ^a，I_i ^p，I_i ⁿ)＝d(f(I_i ^a)-f(I_i ^p))-d(f(I_i ^a)-f(I_i ⁿ))＜α

wherein the content of the first and second substances,

is any one reference pedestrian image in the data set, I_i ^pRepresenting another image representing the same person as the reference pedestrian, i.e. a positive sample, I_i ⁿFor the images of other people, namely negative samples, the triple input is subjected to network calculation to obtain respective feature vectors { f (I)_i ^a)，f(I_i ^p)，f(I_i ⁿ)}，d(f(I_i ^a)-f(I_i ^p) D (I) is the distance between the reference map and the positive sample pair_i ^a)-f(I_i ⁿ) Distance between the reference map and the negative sample pairs, α a threshold for the triplet constraint;

(m6.3) computing the triplet constraints:

D_pose(I_i ^a，I_i ^ps，I_i ^pd)＝d(f(I_i ^a)-f(I_i ^ps))-d(f(I_i ^a)-f(I_i ^pd))＜β

wherein the content of the first and second substances,

is shown and

a positive sample with the same posture is taken,

is shown and

positive samples with different poses, β threshold for triple constraints;

(m7) training the current feature extraction network by combining the plurality of loss function errors calculated in the step (m6), analyzing the influence of different loss function weights on the network, and selecting the optimal weight lambda₁And λ₂To complete the joint training; the step (m7) comprises the following steps:

Loss₃(I，w)＝λ₁Loss₁(I，w)+λ₂Loss₂(I，w)

therein, Loss₁Representing a multi-class Loss function, Loss₂Representing the quintuple Loss function, Loss₃Representing a joint loss function, λ₁And λ₂To balance the weight of the combined loss function, and λ is D in the balanced quintuple loss function_poseThe weight value of the triplet, w being a network parameter,

representing the probability of prediction, p_iIs the target probability, N is the number of pedestrian species, N is the number of quintuple;

(m7.2) analyzing the error weight parameter λ in step (m7.1)₁And λ₂Determining the optimal loss function distribution weight used in the off-line stage;

step (2), an online pedestrian re-identification stage:

2. The method of claim 1, wherein the pedestrian re-identification method using attitude information to design multi-loss function is characterized in that: the step (m3) comprises the following steps:

(m3.1) the joint point information P extracted according to the step (m2)_IiRemoving label_iAll False, i.e. a sample of failure to detect the joint;

(m3.2) step (m2) of extracting samples with joint information meeting the requirements to participate in training, and according to the existing joint information P_IiPresume pedestrian's height high_i；

(m3.3) calculating head region information from the left and right ear and nose joint points

(m3.5) calculating leg region information based on the waist position, ankle, height information

3. The pedestrian re-identification method for designing a multi-loss function by using attitude information as claimed in claim 2, wherein: the step (m4) comprises the following steps:

(m4.1) after screening according to the step (m3.1), determining samples participating in training to analyze the posture orientation of the samples, and judging the samples without a left shoulder or a right shoulder as samplesLateral dir_i＝2；

(m4.3) left and right shoulder vectors obtained according to step (m4.2)

Calculating the included angle dir _ angle with the vertical line_Ii；

(m4.4) judging the included angle dir _ angle calculated in the step (m4.3)_IiWithin the range of included angle dir _ angle_IiIn the range of [260 °,280 ° ]]The inner direction is marked as forward dir_iIf not, the included angle is judged to be in the range of [80 degrees ] and 100 degrees DEG when the included angle is 1]Within this range, the symbol is back dir_iIf not, label the sample as lateral dir_i＝2。

4. The pedestrian re-identification method for designing a multi-loss function by using attitude information as claimed in claim 2, wherein: the step (m5) comprises the following steps:

(m5.1) extracting global features according to the backbone network in the network framework, and marking as fglobal (I)_i)；

5. The method of claim 1, wherein the pedestrian re-identification method using attitude information to design multi-loss function is characterized in that: the step (s3) includes the steps of:

(s3.4) visualizing the pedestrian re-identification search result obtained in step (s3.3) and displaying I for the static photo gallery_queryAnd after sorting I_resultsFor the dynamic video library, the method is based on I_resultsAnd recovering the real condition of the result in the video at the moment in time from the camera ID, the pedestrian ID, the bounding box position information and the frame number time stored in the database.

6. The method of claim 1, wherein the pedestrian re-identification method using attitude information to design multi-loss function is characterized in that: the step (s4) includes the steps of:

(s4.1) setting a time t for periodic updating;

(s4.3) continuously adding the detected new target to the dynamic video library within the time t range, and storing the camera ID, the pedestrian ID, the position information of the surrounding frame, the frame number, the time and the place world information in a database; and after the moment t is reached, clearing half of pedestrian data information in the current database according to time, adding new detection results frame by frame, and simultaneously extracting the characteristics of the new detection results as a main attribute stored in the database.