CN114863505B - Pedestrian re-identification method based on trigeminal convolutional neural network - Google Patents

Publication number: CN114863505B (granted publication of application CN114863505A)
Application number: CN202210215993.9A
Applicant and current assignee: Wuhan Textile University
Inventors: 熊明福, 高志於, 李家辉, 胡新荣, 陈佳, 张俊杰
Original language: Chinese (zh)
Legal status: Active (granted)

Classifications: G06V40/173 (classification/identification; face re-identification across face tracks); G06N3/045 (combinations of neural networks); G06N3/084 (backpropagation, e.g. using gradient descent); G06V10/462 (salient features, e.g. scale-invariant feature transform [SIFT]); G06V10/761 (proximity, similarity or dissimilarity measures); G06V10/82 (image or video recognition using neural networks); Y02T10/40 (engine management systems)


Abstract

The invention discloses a pedestrian re-identification method based on a trigeminal (trident-shaped, three-branch) multi-branch convolutional neural network. Three branches are designed: a convolutional neural network based on color, one based on spatial position information, and one based on high-level semantic information. The resulting model has a shallower network structure and fewer network parameters than conventional deep networks, which makes mobile-device pedestrian re-identification feasible. The designed network model acquires hierarchical depth features of pedestrians, which better matches the human mechanism of recognizing things from coarse to fine and from shallow to deep. In addition, the unified training of the whole network eases the optimization of network parameters; compared with conventional training schemes that ignore the information of intermediate network layers, the designed structure better combines the low-level visual characteristics of pedestrians with their high-level semantic attributes, yielding more discriminative pedestrian descriptors and more efficient pedestrian re-identification.

Description

Pedestrian re-identification method based on trigeminal convolutional neural network
Technical Field
The invention relates to pedestrian re-identification technology, in particular to a pedestrian re-identification method based on a trigeminal (three-branch) convolutional neural network.
Background
Pedestrian re-identification is the task of matching whether two pedestrian images captured by cameras at different spatial positions depict the same target. It has attracted wide attention and application in academia and industry (artificial intelligence, public-security criminal investigation, etc.). However, pedestrian re-identification remains a challenging problem because of objective environmental factors such as illumination, viewing angle and scale. Practical research on pedestrian re-identification mainly comprises three steps: feature extraction (appearance representation of pedestrian targets), distance measurement (similarity comparison of pedestrian targets) and feedback optimization (refinement of the ranking results), and researchers have devoted specific work to each of these links. In recent years, with the development of deep learning, deep models have become the mainstream approach to pedestrian re-identification and have achieved good results. In practice, however, methods based on deep neural network models incur huge computational cost and hardware resources for parameter storage, network optimization and the like, which limits their practical operability and wide applicability, and in particular creates a severe technical bottleneck for applications on intelligent mobile terminal devices.
Disclosure of Invention
Aiming at the problem that conventional deep neural networks are time- and labor-intensive in model design and network optimization, so that pedestrian re-identification algorithms cannot be effectively deployed on mobile terminal devices, the invention designs an efficient and lightweight deep neural network model oriented to the requirements of practical intelligent mobile terminal devices: the depth features of pedestrians are extracted by a trigeminal (three-branch) convolutional neural network structure. Pedestrian appearance is represented by cascade learning of hierarchical pedestrian features, realizing an accurate expression of pedestrians and completing the re-identification task. The applicability of the invention to intelligent mobile terminal devices is demonstrated through an analysis of model complexity, while the accuracy of pedestrian identification is preserved.
The invention provides a pedestrian re-identification method based on a light and simple trigeminal convolutional neural network structure: hierarchical depth features of pedestrians are extracted by jointly training a multi-branch network of comparatively simple structure, and L2 normalization is applied to these depth features to solve the pedestrian re-identification problem. The multi-branch convolutional neural network structure designed by the invention specifically executes the following steps:
The invention designs a trigeminal multi-branch network structure that mainly comprises three convolutional neural networks: a semantic convolutional neural network (Semantic CNN, S-CNN), a color convolutional neural network (Color CNN, C-CNN) and a location convolutional neural network (Location CNN, L-CNN). The S-CNN acquires pedestrian semantic information features, chiefly the final high-level abstract features of a deep network structure; the C-CNN acquires low-level visual characteristics of pedestrians (such as color and texture); and the L-CNN extracts pedestrian spatial position information. Hierarchical depth feature extraction for pedestrians is realized by training the trigeminal convolutional neural network structure as a whole, and the specific steps comprise:
Step 1, acquiring pedestrian image data under different cameras as input data of the network, the data comprising triplet images each consisting of an anchor sample, a positive sample image and a negative sample image;
Step 2, preprocessing the pedestrian images obtained in step 1: whitening each pedestrian image sample, normalizing it and subtracting the mean value;
Step 3, designing and training the network structure: inputting the pedestrian image data preprocessed in step 2 into the trigeminal network structure for optimization training; first the trigeminal network structure is designed, then a triplet-loss scheme is adopted to optimize it so as to acquire hierarchical depth features of pedestrians;
The trigeminal network structure comprises three convolutional neural networks: a semantic convolutional neural network (Semantic CNN, S-CNN), a color convolutional neural network (Color CNN, C-CNN) and a location convolutional neural network (Location CNN, L-CNN); the S-CNN acquires pedestrian semantic information features, including the final high-level abstract features of a deep network structure; the C-CNN acquires low-level visual characteristics of pedestrians, and the L-CNN extracts pedestrian spatial position information;
Step 4, applying L2 normalization to the different depth-level features of pedestrians acquired in step 3 to obtain the final hierarchical depth features describing pedestrians;
Step 5, for different pedestrian samples, measuring pedestrian similarity over the hierarchical depth features obtained in step 4 by means of a distance metric, so as to obtain the final pedestrian re-identification result.
Further, in step 1, the specific steps of acquiring pedestrian image data under different cameras include:
Step 1.1, taking an image I of a pedestrian as the anchor sample; the positive sample image, denoted I+, is an image of the same person as the anchor sample, and the negative sample image, denoted I−, is an image of a different person from the anchor sample;
Step 1.2, for the selection of triplet images, the invention adopts a data-enhancement method to generate triplet pedestrian image data, so as to obtain pedestrian triplet image data suited to training the network structure.
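As a minimal sketch of steps 1.1–1.2 (the function names and the horizontal-flip augmentation are illustrative assumptions; the patent does not specify its enhancement scheme), triplet selection from a labelled image set can be expressed as:

```python
import numpy as np

def sample_triplet(labels, rng):
    """Pick indices (anchor, positive, negative) for triplet training.

    The positive shares the anchor's identity; the negative does not.
    Assumes every identity has at least two images.
    """
    labels = np.asarray(labels)
    a = rng.integers(len(labels))                         # anchor sample I
    same = np.flatnonzero(labels == labels[a])
    p = rng.choice(same[same != a])                       # I+: same person, different image
    n = rng.choice(np.flatnonzero(labels != labels[a]))   # I-: different person
    return a, p, n

def augment(image, rng):
    """Data enhancement used to enlarge the triplet pool: here just a
    random horizontal flip, a stand-in for the patent's unspecified method."""
    return image[:, ::-1] if rng.random() < 0.5 else image
```

Calling `sample_triplet` repeatedly over the training labels yields the triplet pool <I, I+, I−> fed to the network.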
Further, in step 2 the pedestrian images are preprocessed; the specific steps include:
Step 2.1, adopting a combined whitening and dimension-reduction process so that the covariance matrix of the input data becomes the identity matrix I. Specifically, if R is any orthogonal matrix, i.e. it satisfies RR^T = R^T R = I (R being a rotation or reflection matrix), the defined ZCA whitening result is x_ZCAwhite = U x_PCAwhite, where U is the eigenvector matrix of the covariance matrix of the data, x_PCAwhite is the PCA-whitened data and x_ZCAwhite is the ZCA-whitened data. This is equivalent to transforming the PCA-whitened data back into the original space, so the ZCA whitening result stays as close as possible to the original input data x;
Step 2.2, after obtaining the ZCA-whitened data, normalizing the data and subtracting its mean value, so that the data better conforms to the input form expected by the trigeminal network structure.
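A minimal numpy sketch of the ZCA whitening of step 2.1 (the function name and the small epsilon regularizer are assumptions; the patent does not give an implementation). The covariance of the output is the identity matrix, up to the epsilon term:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten row-vector samples X (n_samples x n_features).

    Centers the data, forms x_PCAwhite = diag(1/sqrt(s+eps)) U^T x,
    then rotates back: x_ZCAwhite = U x_PCAwhite, so the result keeps
    the orientation of the original data while its covariance becomes I.
    """
    Xc = X - X.mean(axis=0)                         # subtract the mean
    cov = Xc.T @ Xc / Xc.shape[0]                   # data covariance matrix
    U, s, _ = np.linalg.svd(cov)                    # eigenvectors U, eigenvalues s
    W = U @ np.diag(1.0 / np.sqrt(s + eps)) @ U.T   # ZCA whitening matrix
    return Xc @ W.T
```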
Further, the network structure of the color convolutional neural network is as follows: an original pedestrian surveillance image is decomposed into its RGB component images; using a network model based on CaffeNet, the color features of each component are extracted through three convolution layers (convolution layer 1, convolution layer 2 and convolution layer 3) and then combined, i.e. the color features of pedestrians are obtained through sub-convolution layer 1 and three sub-fully-connected layers;
The network structure of the location convolutional neural network is as follows: pedestrian convolution features are extracted through four convolution layers (convolution layers 1 to 4); a multi-scale spatial pyramid pooling structure is designed to obtain pedestrian features at different scales; sub-convolution layer 2 extracts the features related to spatial position information, after which a spatial pyramid layer with pooling structures of 16×256, 4×256 and 256 dimensions, followed by three sub-fully-connected layers, cascades the features to form the spatial features of pedestrians;
The network structure of the semantic convolutional neural network is based on the CaffeNet network structure: three convolution layers (convolution layers 1 to 3) extract convolution features of the input image; these convolution features are locally partitioned by a segmentation algorithm, i.e. the head, trunk, legs and shoes are obtained through convolution layer 4, convolution layer 5, convolution layer 6 and convolution layer 7 respectively to obtain the regions of interest; finally, the semantic attributes corresponding to pedestrians are obtained through a global average pooling operation and three fully-connected layers;
All convolution layers except convolution layer 3 and convolution layer 4 are followed by a pooling layer.
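The multi-scale spatial pyramid pooling of the location branch can be sketched as follows, assuming max pooling and 4×4, 2×2 and 1×1 pyramid levels, so that a C=256 feature map yields the 16×256-, 4×256- and 256-dimensional structures described above (the function name and the pooling operator are assumptions):

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(4, 2, 1)):
    """Multi-scale spatial pyramid pooling over a C x H x W feature map.

    Levels (4, 2, 1) give 16*C-, 4*C- and C-dimensional max-pooled
    vectors, concatenated into one fixed-length descriptor regardless
    of the spatial size H x W of the input map.
    """
    C, H, W = fmap.shape
    out = []
    for n in levels:
        # split height and width into n roughly equal bins, max-pool each bin
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                out.append(cell.max(axis=(1, 2)))
    return np.concatenate(out)
```

The fixed output length is what lets the pyramid layer feed fully-connected layers independently of the input image scale.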
Further, in step 3 the optimization training of the trigeminal convolutional neural network structure comprises the following specific steps:
Step 3.1, inputting the data into the convolutional neural networks of different structures. Specifically, the semantic convolutional neural network comprises 7 convolution layers, 5 pooling layers and 3 fully-connected layers, with the final fully-connected layer outputting the pedestrian semantic information; the color convolutional neural network comprises 4 convolution layers, 3 pooling layers and 3 fully-connected layers, with the last fully-connected layer outputting the pedestrian color information; the location convolutional neural network comprises 5 convolution layers, 2 pooling layers, a spatial pyramid layer and 3 fully-connected layers, and outputs the spatial position information of pedestrians;
Step 3.2, acquiring the hierarchical pedestrian features according to the network structure designed in step 3.1: the last-layer features of each branch network are taken as pedestrian features, and the penultimate-layer features are also taken as pedestrian descriptors, so as to obtain more discriminative hierarchical pedestrian features;
Step 3.3, for an input pedestrian image triplet <I, I+, I−>, optimizing the network structure according to the inherent characteristic of pedestrian re-identification that the distance between images of the same person is smaller than the distance between images of different persons, i.e. d(I, I+) < d(I, I−), where d(I, I+) denotes the distance between the same person and d(I, I−) the distance between different persons. Specifically, for the input pedestrian triplet data <I, I+, I−>, the depth features obtained after training through the trigeminal network are expressed as <g_w(I), g_w(I+), g_w(I−)>, where g_w(I) is the feature of the anchor pedestrian, g_w(I+) the feature of the positive-sample pedestrian, g_w(I−) the feature of the negative-sample pedestrian, and w the network parameters. The aim of the network training process is to make
‖g_w(I) − g_w(I+)‖ < ‖g_w(I) − g_w(I−)‖ (1)
hold everywhere; to facilitate the calculation of the error, equation (1) can equivalently be expressed as:
‖g_w(I) − g_w(I+)‖² < ‖g_w(I) − g_w(I−)‖² (2)
and the derivative is then calculated on the transformed cost function.
Furthermore, the back-propagation idea is adopted to compute the derivative of the transformed cost function; the specific steps are as follows:
Step 3.3.1, first calculating the similarity distances between different pedestrians; the loss function represented by formula (2) is converted into:
L = Σ_{j=1..N} max{ C + d(W, I_j), 0 } (3)
with d(W, I_j) = ‖g_w(I_j) − g_w(I_j+)‖² − ‖g_w(I_j) − g_w(I_j−)‖²,
where L is the joint loss of the three networks in the structure, d(W, I_j) is the difference between the distance among images of the same pedestrian and the distance between different people, N is the total number of pedestrian samples, and C is the margin constraining the positive and negative samples;
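Formula (3) amounts to the standard hinge-style triplet loss; a minimal numpy sketch over a batch of embedded features (the function name and batch layout are assumptions):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin):
    """Triplet loss of Eq. (3): sum over the batch of
    max(C + ||g(I) - g(I+)||^2 - ||g(I) - g(I-)||^2, 0).

    Each argument is an (N, dim) array of embeddings g_w(.)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to same person
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to different person
    return float(np.maximum(margin + d_pos - d_neg, 0.0).sum())
```

When the negative is already farther than the positive by more than the margin C, that triplet contributes zero loss and hence no gradient.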
Step 3.3.2, in the overall network structure, convolution layer 1, convolution layer 2 and convolution layer 3 use convolution kernels of sizes 11×11, 5×5 and 3×3 respectively, and the forward convolution operation is expressed as:
x_i^(l) = ReLU( x_i^(l−1) * k^(l) + b_k^(l) ) (4)
where x_i^(l) and x_i^(l−1) represent the outputs of the i-th neuron of layer l and of layer l−1 respectively, k^(l) is the convolution kernel between layer l and layer l−1, b_k^(l) is the bias term of the k-th feature map of layer l, and ReLU (Rectified Linear Unit) is the rectified-linear activation function applied between the two convolution layers;
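A naive numpy sketch of the per-map forward convolution of Eq. (4), for a single input map and a single kernel, in the cross-correlation form used by common deep-learning frameworks (names are illustrative):

```python
import numpy as np

def conv2d_relu(x, kernel, bias=0.0):
    """Valid 2-D convolution of one input map with one kernel, followed
    by ReLU, as in Eq. (4). Real layers sum over many input maps; this
    minimal version shows one map for clarity."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # slide the kernel over the map and accumulate, then add bias
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + bias
    return np.maximum(out, 0.0)  # ReLU activation
```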
Step 3.3.3, calculating the error after one forward propagation and performing back propagation; the partial derivative of formula (3) is computed as:
∂L/∂W = Σ_{j=1..N} ∂max{ C + d(W, I_j), 0 }/∂W = Σ_{j : C + d(W, I_j) > 0} ∂d(W, I_j)/∂W (5)
where W represents the network parameters and I_j represents the j-th pedestrian image;
By the definition of d(W, I_j), the gradient can be calculated as follows:
∂d(W, I_j)/∂W = 2( g_w(I_j) − g_w(I_j+) )ᵀ ( ∂g_w(I_j)/∂W − ∂g_w(I_j+)/∂W ) − 2( g_w(I_j) − g_w(I_j−) )ᵀ ( ∂g_w(I_j)/∂W − ∂g_w(I_j−)/∂W ) (6)
Step 3.3.4: calculating the values of ∂g_w(I_j)/∂W, ∂g_w(I_j+)/∂W and ∂g_w(I_j−)/∂W respectively from the results derived in step 3.3.3 to obtain the final loss;
the partial derivative formulas are then substituted into a gradient-descent algorithm to minimize L, thereby computing the back-propagated loss of each layer and completing the final network optimization process.
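As an illustration of the gradient-descent update that minimizes L, the sketch below uses finite differences in place of the analytic back-propagation of Eqs. (5)–(6); all names are assumptions, and the numerical gradient is only a verification device, not the patent's method:

```python
import numpy as np

def numerical_grad(loss_fn, W, eps=1e-5):
    """Central-difference gradient of a scalar loss w.r.t. parameters W.
    A stand-in for the analytic gradients of Eqs. (5)-(6)."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * eps)
    return g

def sgd_step(loss_fn, W, lr=0.1):
    """One gradient-descent update W <- W - lr * dL/dW minimizing L."""
    return W - lr * numerical_grad(loss_fn, W)
```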
Further, in step 4 the L2 normalization of the hierarchical depth features specifically includes:
Step 4.1, according to the network structure optimized in step 3, obtaining the depth features of the different levels (color, spatial position information and high-level semantic information) and applying L2 normalization to each input feature, computed as:
y = f / ‖f‖₂ = f / ( Σ_{i=1..p} f_i² )^{1/2}
where f = [f_1, f_2, …, f_p] is a p-dimensional network output feature;
Step 4.2, applying PCA to the L2-normalized feature y of step 4.1 to obtain a highly discriminative hierarchical pedestrian descriptor.
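The L2 normalization and PCA compression of steps 4.1–4.2 can be sketched in numpy as follows (the function names and the SVD-based PCA are assumptions):

```python
import numpy as np

def l2_normalize(f, eps=1e-12):
    """L2-normalize a feature vector: y = f / sqrt(sum_i f_i^2)."""
    return f / (np.linalg.norm(f) + eps)

def pca_reduce(F, dim):
    """Project row-vector features F (n_samples x p) onto their top
    `dim` principal components, a plain-PCA stand-in for the
    descriptor-compression step of 4.2."""
    Fc = F - F.mean(axis=0)                       # center the features
    U, s, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:dim].T                        # coordinates in PC basis
```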
Further, in step 5 the specific steps of the similarity measurement include:
judging pedestrian similarity on the basis of the Euclidean distance, derived from the distance between two points x_1, x_2 in an N-dimensional Euclidean space:
d(x_1, x_2) = ( Σ_{i=1..N} (x_{1,i} − x_{2,i})² )^{1/2}
where x_1 and x_2 represent the pedestrian features of two people under different viewing angles; the distance is computed between the query and each of the M gallery samples, where M is the total number of samples, and the final pedestrian re-identification result is obtained by ranking the measurement results.
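A minimal numpy sketch of the Euclidean-distance ranking of step 5 (function name and layout are assumptions):

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank M gallery features by Euclidean distance to a query feature.

    Returns the indices sorted from nearest to farthest together with
    the distances; the nearest identity is the re-identification result.
    """
    d = np.sqrt(((gallery - query) ** 2).sum(axis=1))  # distance to each gallery sample
    order = np.argsort(d)                              # ascending: best match first
    return order, d
```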
The invention has the following positive effects and advantages:
1) The invention realizes the application of pedestrian re-identification technology on mobile terminal devices by designing the trigeminal multi-branch neural network model. Specifically, a convolutional neural network based on color, one based on spatial position information and one based on high-level semantic information are designed. The resulting model has a shallower network structure and fewer network parameters, making mobile pedestrian re-identification feasible.
2) The designed network model acquires hierarchical depth features of pedestrians, which better matches the human mechanism of recognizing things from coarse to fine and from shallow to deep. In addition, the unified training of the network eases the optimization of network parameters; compared with conventional training schemes that ignore the information of intermediate network layers, the designed structure better combines the low-level visual characteristics of pedestrians with their high-level semantic attributes and yields more discriminative pedestrian descriptors.
Drawings
FIG. 1 is a flow chart of an example of the present invention.
FIG. 2 is a schematic diagram of a model of a trigeminal multi-branch convolutional neural network.
Fig. 3 shows the processing flow of the color convolutional neural network.
Fig. 4 shows the processing flow of the location (spatial) convolutional neural network.
Fig. 5 shows the processing flow of the semantic convolutional neural network.
Detailed Description
In order to further clarify the technical means and effects adopted by the present invention, the technical solution of the invention is described below with reference to the accompanying drawings and specific embodiments.
The invention provides a pedestrian re-identification method based on a trigeminal multi-branch network structure, which comprises the following steps:
Step 1, data acquisition. Pedestrians under several different camera views (the invention takes two cameras, C_a and C_b, as an example) are denoted {a_i, i = 1..n} and {b_j, j = 1..m} respectively, where n and m are the numbers of people under each camera view (in our case typically n = m). The corresponding pedestrian images are taken from the views of cameras C_a and C_b respectively as the input data of the network;
Step 2, preprocessing. The pedestrian images obtained in step 1 are preprocessed: each pedestrian image sample is whitened, the mean value is subtracted and the image is normalized, adapting it to the network structure and facilitating the training on subsequent data;
Step 3, network-structure training. The pedestrian image data preprocessed in step 2 are input into the trigeminal network structure for optimization training; in this step the network structure is optimized with a triplet training scheme to acquire hierarchical depth features of pedestrians;
Step 4, L2 normalization of the hierarchical features. L2 normalization is applied to the different depth-level features of pedestrians acquired in step 3 to obtain the final features describing pedestrians;
Step 5, similarity measurement. For different pedestrian samples, the hierarchical depth features obtained in step 4 are compared with a distance metric (the invention uses the Euclidean distance and the XQDA distance respectively, described in detail later) to obtain the final pedestrian re-identification result.
In the above pedestrian re-identification method based on the trigeminal convolutional neural network structure, the steps of acquiring pedestrian image data under different cameras in step 1 specifically include:
Step 1.1: because the network structure proposed by the invention is built and trained on a triplet loss function, the image data it takes as input are organized as triplet images. An image I of a pedestrian is taken as the anchor sample; the positive sample image is denoted I+ (an image of the same person as the anchor sample) and the negative sample image is denoted I− (an image of a different person from the anchor sample);
Step 1.2: for the selection of triplet images, the invention adopts a data-enhancement method to generate triplet pedestrian image data, so as to obtain pedestrian triplet image data suited to training the network structure.
In the above pedestrian re-identification method based on the trigeminal convolutional neural network structure, the preprocessing of the obtained triplet image data in step 2 specifically includes:
Step 2.1: performing ZCA whitening on the pedestrian image pairs. The invention adopts a combined whitening and dimension-reduction process so that the covariance matrix of the input data becomes the identity matrix I. Specifically, if R is any orthogonal matrix (i.e. it satisfies RR^T = R^T R = I, R being a rotation or reflection matrix), the defined ZCA whitening result is x_ZCAwhite = U x_PCAwhite, where U is the eigenvector matrix of the covariance matrix of the data, x_PCAwhite is the PCA-whitened data and x_ZCAwhite is the ZCA-whitened data. This is equivalent to transforming the PCA-whitened data back into the original space, so the ZCA whitening result stays as close as possible to the original input data x;
Step 2.2: after obtaining the ZCA-whitened data, normalizing the data and subtracting its mean value, so that the data better conforms to the input form expected by the trigeminal network structure.
In the above pedestrian re-identification method based on the trigeminal convolutional neural network structure, the optimization training of the trigeminal structure in step 3 comprises the following specific steps:
Step 3.1: the data are input into the convolutional neural networks of different structures (color, spatial position information and semantics). Specifically, the semantic convolutional neural network comprises 7 convolution layers, 5 pooling layers and 3 fully-connected layers, the final fully-connected layer being considered to output the pedestrian semantic information; the color convolutional neural network comprises 4 convolution layers, 3 pooling layers and 3 fully-connected layers, the last fully-connected layer being considered to output the pedestrian color information; the location convolutional neural network comprises 5 convolution layers, 2 pooling layers, a spatial pyramid layer and 3 fully-connected layers, and outputs the spatial position information of pedestrians;
Step 3.1.1: within the trigeminal structure designed above, the color convolutional neural network is designed as follows: first the original pedestrian surveillance image is decomposed into its RGB component images; the color features of each component are extracted by a network model based on CaffeNet, i.e. through convolution layer 1, convolution layer 2 and convolution layer 3 in Fig. 2; the color components are then combined, and the color features of pedestrians are obtained through sub-convolution layer 1 and the sub-fully-connected layers;
Step 3.1.2: the location convolutional neural network is designed to extract pedestrian convolution features and applies a multi-scale spatial pyramid pooling structure to obtain pedestrian features at different scales. In Fig. 2, after the shared convolution layer 1, convolution layer 2 and convolution layer 3, the features related to spatial position information are extracted through sub-convolution layer 2 and then passed through a spatial pyramid layer whose pyramid structure consists of 16×256-, 4×256- and 256-dimensional pooling structures; finally the features are cascaded to form the spatial features of pedestrians;
Step 3.1.3: the semantic convolutional neural network is designed on the basis of the CaffeNet network structure: the convolution features produced by convolution layer 1, convolution layer 2 and convolution layer 3 in Fig. 2 are locally partitioned by a segmentation algorithm; the head, trunk, legs and shoes are obtained through convolution layer 4, convolution layer 5, convolution layer 6 and convolution layer 7 respectively to obtain the regions of interest; finally the semantic attributes corresponding to pedestrians are obtained through a global average pooling operation (i.e. a pooling layer) and a fully-connected layer;
Step 3.2: the network structure designed in step 3.1 is used to acquire hierarchical pedestrian features: the last-layer features of each branch network are taken as pedestrian features, and the penultimate-layer features are also taken as pedestrian descriptors, so as to obtain more discriminative hierarchical pedestrian features;
Step 3.3: for the input pedestrian image triplet <I, I+, I->, the network structure is optimized according to the inherent property of pedestrian re-identification that the distance between images of the same person is smaller than the distance between images of different persons, namely D(I, I+) < D(I, I-), where D(I, I+) denotes the distance between the same person and D(I, I-) the distance between different persons. Specifically, in the training of the network structure of the present invention, for the input pedestrian triplet data <I, I+, I->, the depth features obtained after training through the trigeminal network are expressed as <g_w(I), g_w(I+), g_w(I-)>, where g_w(I) is the feature of the anchor pedestrian, g_w(I+) is the feature of the positive-sample pedestrian, g_w(I-) is the feature of the negative-sample pedestrian, and w is the network parameter. In the specific network training process, the goal is to make
‖g_w(I) - g_w(I+)‖ < ‖g_w(I) - g_w(I-)‖  (1)
hold for all triplets. To facilitate the calculation of the error, equation (1) can be expressed as:
‖g_w(I) - g_w(I+)‖² < ‖g_w(I) - g_w(I-)‖²  (2)
The derivative of the transformed cost function is then calculated using the back propagation idea, with the following specific steps:
Step 3.3.1: in the pedestrian re-identification problem, to facilitate the optimization of the network parameters, the similarity distances between different pedestrians are calculated, and the loss function represented by equation (2) is transformed into:

L = (1/N) Σ_{i=1}^{N} max{0, C + d(w, I_i)},  with  d(w, I_i) = ‖g_w(I_i) - g_w(I_i+)‖² - ‖g_w(I_i) - g_w(I_i-)‖²  (3)

where L is the loss of the network structure (in the present invention there are three networks; their three losses are collectively referred to as L), d(w, I) is the difference between the distance within the same pedestrian and the distance between different persons, N is the total number of pedestrian samples, and C is the boundary margin constraining positive and negative samples;
Step 3.3.2: in the overall structure, the present invention uses convolution kernels of sizes 11 × 11, 5 × 5 and 3 × 3 for convolution layers 1, 2 and 3 in figure 2, respectively, and the forward convolution operation is expressed as:

x_k^(l) = ReLU( Σ_i x_i^(l-1) * k^(l) + b_k^(l) )  (4)

where x_i^(l) and x_i^(l-1) represent the output of the i-th neuron of the l-th layer and of the (l-1)-th layer, respectively, k^(l) is the convolution kernel between the l-th layer and the (l-1)-th layer, b_k^(l) is the bias term of the k-th feature map of the l-th layer, and ReLU (Rectified Linear Unit) is the activation function between the two convolution layers;
Step 3.3.3: forward propagation produces an output result, the error between the output and the actual result is calculated, and this error is propagated backwards; during back propagation, the parameter values are adjusted according to the error, and the process is iterated until convergence. The partial derivative of equation (3) is computed as:

∂L/∂W = (1/N) Σ_{i=1}^{N} 1[C + d(W, I_i) > 0] · ∂d(W, I_i)/∂W  (5)

where W represents the network parameters, I_i denotes the i-th pedestrian image, and 1[·] indicates that only triplets violating the margin contribute.
By the definition of d(W, I_i), the gradient can be calculated as follows:

∂d(W, I_i)/∂W = 2(g_W(I_i) - g_W(I_i+))^T (∂g_W(I_i)/∂W - ∂g_W(I_i+)/∂W) - 2(g_W(I_i) - g_W(I_i-))^T (∂g_W(I_i)/∂W - ∂g_W(I_i-)/∂W)  (6)
Step 3.3.4: the values and are calculated separately from the calculation derived in step 3.3.3 to obtain the final loss.
With the partial derivative formulas (5) and (6) above, we can substitute it into the gradient descent method algorithm to minimize L, thereby solving the loss of back propagation of each layer and realizing the final network optimization process.
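The loss of equation (3) and the gradients of equations (5) and (6) can be sketched as follows. This is an illustrative NumPy implementation that differentiates with respect to the embedding vectors only (the chain through the network weights W is what a deep-learning framework would supply), not the exact training code of the invention:

```python
import numpy as np

def triplet_loss_and_grads(a, p, n, C=0.2):
    """Hinge triplet loss of Eq. (3) with squared-L2 distances (Eq. (2)),
    plus gradients w.r.t. the three embeddings, mirroring Eqs. (5)-(6).
    a, p, n: N x D anchor / positive / negative features g_w(I), g_w(I+), g_w(I-).
    """
    d = np.sum((a - p) ** 2, axis=1) - np.sum((a - n) ** 2, axis=1)
    active = (C + d > 0).astype(a.dtype)       # hinge: only violated triplets count
    loss = np.mean(np.maximum(0.0, C + d))
    s = (active / len(a))[:, None]
    grad_a = s * 2 * ((a - p) - (a - n))       # d loss / d g_w(I)
    grad_p = s * -2 * (a - p)                  # d loss / d g_w(I+)
    grad_n = s * 2 * (a - n)                   # d loss / d g_w(I-)
    return loss, grad_a, grad_p, grad_n

loss, ga, gp, gn = triplet_loss_and_grads(
    np.array([[0.0, 0.0]]), np.array([[0.0, 0.0]]), np.array([[0.1, 0.0]]))
print(round(loss, 4))  # 0.19
```

The margin C = 0.2 and the single two-dimensional triplet are illustrative values only; in practice the gradients would be pushed back through each branch of the trigeminal network.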
In the above pedestrian re-identification method based on a trigeminal convolutional neural network structure, the L2 normalization of the hierarchical features in step 4 comprises the following specific steps:
Step 4.1: depth features of different levels (color, spatial position information and high-level semantic attributes) are obtained from the network structure optimized in step 3, and L2 normalization preprocessing is applied to the output features of different dimensions, calculated as:

y = f / ‖f‖_2 = f / sqrt(f_1² + f_2² + … + f_p²)  (7)

where f = [f_1, f_2, …, f_p] is the network output feature with p dimensions;
Step 4.2: the feature y produced by the L2 processing in step 4.1 is further processed with PCA (Principal Component Analysis), the most widely used data dimensionality reduction algorithm, to obtain a highly discriminative hierarchical pedestrian descriptor.
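Steps 4.1 and 4.2 can be sketched in NumPy as follows; the PCA-via-SVD routine and the output dimension of 64 are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def l2_normalize(F, eps=1e-12):
    """Step 4.1: y = f / ||f||_2, applied row-wise to an N x p feature matrix."""
    return F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)

def pca_reduce(F, k):
    """Step 4.2: plain PCA via SVD of the mean-centred features, keeping
    the top-k principal components."""
    mu = F.mean(axis=0)
    U, S, Vt = np.linalg.svd(F - mu, full_matrices=False)
    return (F - mu) @ Vt[:k].T

Y = l2_normalize(np.random.rand(100, 512))  # 100 pedestrian features, 512-dim
Z = pca_reduce(Y, 64)                       # reduced hierarchical descriptors
print(Z.shape)  # (100, 64)
```

Normalizing before PCA keeps features of different branches on a comparable scale, so no single branch dominates the principal components.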
In the above method for re-identifying pedestrians based on the trigeminal convolutional neural network structure, in the step 5, the specific steps of measuring the similarity include:
Step 5.1: pedestrian similarity is judged using methods based on the Euclidean distance and on XQDA (Cross-view Quadratic Discriminant Analysis);
Step 5.2: the Euclidean distance derives from the distance between two points x_1, x_2 in an N-dimensional Euclidean space, and the distance formula can be expressed as:

d(x_1, x_2) = sqrt( Σ_{k=1}^{M} (x_1k - x_2k)² )  (8)

where M is expressed as the total number of samples. In the invention, x_1 and x_2 respectively represent the features of a pedestrian under the two different views, and the final measurement result is obtained through Euclidean distance calculation. XQDA (Cross-view Quadratic Discriminant Analysis) computes the similarity between samples at different view angles; its distance formula can be expressed as:

d(x_i, x_j) = (x_i - x_j)^T W (Σ_I^(-1) - Σ_E^(-1)) W^T (x_i - x_j)  (9)

where x_i, x_j respectively represent two cross-view samples, W is the learned subspace projection, and Σ_I and Σ_E are the intra-personal and extra-personal sample covariance matrices. Pedestrian similarity is measured with both distance calculations, and the measurement results are ranked to obtain the final pedestrian re-identification result.
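The Euclidean branch of the similarity measurement and the final ranking can be sketched as follows (the XQDA metric additionally requires the learned projection W and the covariance matrices Σ_I, Σ_E, and is omitted here):

```python
import numpy as np

def euclidean_dist(query, gallery):
    """Pairwise Euclidean distance between query (Nq x D) and
    gallery (Ng x D) pedestrian feature matrices."""
    diff = query[:, None, :] - gallery[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

q = np.array([[0.0, 0.0]])                   # one query feature
g = np.array([[3.0, 4.0], [1.0, 0.0]])       # two gallery features
D = euclidean_dist(q, g)
rank = np.argsort(D, axis=1)                 # ascending distance = best match first
print(D)     # [[5. 1.]]
print(rank)  # [[1 0]] -> gallery image 1 is the top match
```

The two-dimensional toy features are illustrative; in the method the inputs would be the hierarchical descriptors produced by step 4.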
Example 1
Preparation:
1. Assume that C_a and C_b are two camera views in different spatial and regional environments (the invention uses two cameras as an example), with pedestrian data sets A = {a_1, …, a_n} and B = {b_1, …, b_m} respectively, where n and m denote the number of pedestrians under each camera view;
2. The pedestrian data obtained from actual surveillance are preprocessed, mainly by ZCA whitening, normalization, mean removal and related operations, to obtain denoised pedestrian images;
3. Triplet data are acquired, mainly in an online-generation manner, by selecting anchor pedestrians, positive samples and negative samples to form the corresponding input data.
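The ZCA whitening of preparation step 2 can be sketched as follows; the epsilon regularizer added to the eigenvalues is an illustrative assumption for numerical stability:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA whitening x_ZCAwhite = U x_PCAwhite as described in step 2:
    rotate into the PCA basis, scale by 1/sqrt(eigenvalue), rotate back,
    so the covariance of the result is (approximately) the identity.
    X: N x D matrix of flattened image vectors."""
    X = X - X.mean(axis=0)                      # remove the per-feature mean
    cov = X.T @ X / X.shape[0]
    U, S, _ = np.linalg.svd(cov)                # eigenvectors U, eigenvalues S
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return X @ W.T

Xw = zca_whiten(np.random.rand(200, 16))
cov = Xw.T @ Xw / Xw.shape[0]
print(np.allclose(cov, np.eye(16), atol=1e-2))  # covariance ~ identity
```

Unlike plain PCA whitening, the extra rotation back through U keeps the whitened images as close as possible to the original input, which is why ZCA is preferred for image preprocessing here.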
The specific implementation is as follows:
1. Design of the trigeminal multi-branch convolutional neural network
Specifically, the trigeminal convolutional neural network model designed by the invention mainly comprises a semantic convolutional neural network, a color convolutional neural network and a spatial-position-information convolutional neural network, as shown in figure 2. The semantic convolutional neural network comprises 7 convolution layers and 5 pooling layers, and the output features of its final fully connected layer are regarded as the semantic information of the pedestrian; the color convolutional neural network comprises 4 convolution layers and 3 pooling layers, and its final fully connected layer is regarded as the color information of the pedestrian; the position convolutional neural network comprises 5 convolution layers, 2 pooling layers and a spatial pyramid layer, and outputs the spatial position information of the pedestrian. The last two fully connected layers of each sub-network serve as the hierarchical pedestrian feature description;
2. training of trigeminal multi-branch convolutional neural networks
The convolutional neural network model designed by the invention is trained by back propagation, with the triplet loss of equation (3) as the cost function, i.e. L = (1/N) Σ_{i=1}^{N} max{0, C + d(w, I_i)}. The loss is calculated by back propagation as described above to realize the training of the network model.
3. Depth level depth feature acquisition
After the network structure is optimized, the test image is propagated forward; the forward convolution operation described in step 3.3.2 yields the corresponding output features of the pedestrian. In the invention, the last two fully connected layers of each convolutional neural network serve as the pedestrian features, giving the hierarchical features of each network structure. Compared with traditional deep-learning-based methods, the depth features acquired in this way are more discriminative.
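The forward convolution operation of step 3.3.2 can be sketched for a single channel as follows; the 'valid' padding and the scalar bias are illustrative simplifications of the multi-channel layers in figure 2:

```python
import numpy as np

def conv2d_relu(x, kernel, bias):
    """Single-channel 'valid' convolution followed by ReLU, i.e. the
    forward step x^(l) = ReLU(x^(l-1) * k^(l) + b^(l)) of each branch."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + bias
    return np.maximum(out, 0.0)  # ReLU activation

x = np.arange(16, dtype=float).reshape(4, 4)
y = conv2d_relu(x, np.ones((3, 3)) / 9.0, bias=-5.0)
print(y)  # [[0. 1.] [4. 5.]] -- the negative pre-activation is clipped to 0
```

A real layer stacks many such kernels (e.g. the 11 × 11, 5 × 5 and 3 × 3 kernels of convolution layers 1 to 3) and sums over input channels; this sketch only shows the arithmetic of one kernel.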
4. Normalization of depth level features
For the hierarchical features output by each sub deep neural network, the invention applies L2 normalization followed by PCA (Principal Component Analysis). This performs dimensionality reduction and noise reduction on the hierarchical features, yielding more discriminative pedestrian features.
5. Similarity measure for pedestrian features
According to the processed pedestrian features obtained in step 4, pedestrian similarity is measured with the Euclidean distance and the XQDA metric (the specific principles are detailed in step 5 above).
The above examples merely illustrate embodiments of the invention in greater detail and concreteness, and are not to be construed as limiting its scope. It should be noted that those skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of protection defined by the appended claims.

Claims (8)

1. A pedestrian re-identification method based on a trigeminal multi-branch network structure, characterized by comprising the following steps:
Step 1, pedestrian image data under different cameras are acquired and used as input data of a network, wherein the pedestrian image data comprises a triplet image consisting of an anchor point sample, a positive sample image and a negative sample image;
Step 2, preprocessing the pedestrian images obtained in step 1: whitening each pedestrian image sample, then normalizing it and subtracting the mean;
Step 3, designing and training the network structure: the pedestrian image data preprocessed in step 2 are input into the trigeminal network structure for optimization training; the trigeminal network structure is designed first, and then a triplet loss is adopted to realize the optimization training of the network structure, so as to acquire depth features at the pedestrian level;
The trigeminal network structure comprises three convolutional neural network structures, namely a semantic convolutional neural network (Semantic CNN, S-CNN), a color convolutional neural network (Color CNN, C-CNN) and a position convolutional neural network (Location CNN, L-CNN); the S-CNN is used for acquiring pedestrian semantic information features, including the final high-level abstract features of the deep network structure; the C-CNN is used for acquiring low-level visual features of the pedestrian, and the L-CNN is used for extracting pedestrian spatial position information;
Step 4, carrying out L2 normalization processing on the different depth-level features of the pedestrian acquired in step 3 to obtain the final hierarchical depth features describing the pedestrian;
and 5, measuring the pedestrian similarity of the pedestrian level depth characteristics obtained in the step 4 by adopting a distance measurement mode for different pedestrian samples so as to obtain a final pedestrian re-identification result.
2. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 1, wherein the pedestrian re-identification method is characterized by comprising the following steps of: in step 1, the method for acquiring pedestrian image data under different cameras specifically comprises the following steps:
Step 1.1, taking an image I of a pedestrian as the anchor sample, where the positive sample image, denoted I+, is an image of the same person as the anchor sample, and the negative sample image, denoted I-, is an image of a different person from the anchor sample;
and 1.2, for the selection of the triplet image, generating triplet pedestrian image data by adopting a data enhancement method so as to obtain pedestrian triplet image data adapting to network structure training.
3. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 1, wherein the pedestrian re-identification method is characterized by comprising the following steps of: in the step 2, preprocessing pedestrian images, wherein the specific steps comprise;
Step 2.1, a combination of whitening and dimensionality reduction is adopted so that the covariance matrix of the input data becomes the identity matrix I. Specifically, if R is any orthogonal matrix, i.e. RR^T = R^T R = I, where R is a rotation or reflection matrix, the ZCA whitening result is defined as: x_ZCAwhite = U x_PCAwhite, where U is the eigenvector matrix of the data covariance matrix, x_PCAwhite is the PCA-whitened data and x_ZCAwhite is the ZCA-whitened data, which is equivalent to converting the PCA-whitened data back to the original space; the result of ZCA whitening is as close as possible to the original input data x;
Step 2.2, after obtaining the ZCA-whitened data, the data are normalized and their mean is subtracted, so that the data better conform to the input form of the trigeminal network structure.
4. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 1, characterized in that: the network structure of the color convolutional neural network is as follows: the original pedestrian surveillance image is decomposed into RGB component images, the color features of each component are extracted through the three convolution layers (convolution layer 1, convolution layer 2 and convolution layer 3) of a network model based on CaffeNet, the color components are then merged, and the color features of the pedestrian are obtained through sub-convolution layer 1 and three sub-fully-connected layers;
The network structure of the position convolutional neural network is as follows: the four convolution layers (convolution layer 1, convolution layer 2, convolution layer 3 and convolution layer 4) extract the pedestrian convolution features, and a multi-scale spatial pyramid pooling layer structure is designed to obtain pedestrian features at different scales; sub-convolution layer 2 extracts the features related to spatial position information, which then pass through the spatial pyramid layer, with 16 × 256-dimensional, 4 × 256-dimensional and 256-dimensional pooling layer structures, and three sub-fully-connected layers; finally, cascade learning is performed on these features to form the spatial features of the pedestrian;
The network structure of the semantic convolutional neural network is as follows: on the basis of the CaffeNet network structure, the three convolution layers (convolution layer 1, convolution layer 2 and convolution layer 3) extract the convolution features of the input image; a segmentation algorithm locally partitions these convolution features, i.e. the head, trunk, legs and shoes are obtained through convolution layer 4, convolution layer 5, convolution layer 6 and convolution layer 7 respectively to obtain the regions of interest; finally, the semantic attributes of the pedestrian are obtained through a global average pooling operation and three fully connected layers;
wherein, except for convolution layer 3 and convolution layer 4, every convolution layer is followed by a pooling layer.
5. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 1, wherein the pedestrian re-identification method is characterized by comprising the following steps of: in step 3, the optimized training of the trigeminal convolutional neural network structure comprises the following specific steps:
Step 3.1, data are input into convolutional neural networks of different structures; specifically, the semantic convolutional neural network comprises 7 convolution layers, 5 pooling layers and 3 fully connected layers, and its final fully connected layer outputs the semantic information features of the pedestrian; the color convolutional neural network comprises 4 convolution layers, 3 pooling layers and 3 fully connected layers, and its last fully connected layer gives the color information of the pedestrian; the position convolutional neural network comprises 5 convolution layers, 2 pooling layers, a spatial pyramid layer and 3 fully connected layers, and outputs the spatial position information of the pedestrian;
Step 3.2, according to the network structure designed in step 3.1, the hierarchical pedestrian features are obtained: the last-layer features of each branch network are taken as pedestrian features, and the penultimate-layer features as pedestrian descriptors, so as to obtain more discriminative hierarchical pedestrian features;
Step 3.3, for the input pedestrian image triplet <I, I+, I->, the network structure is optimized according to the inherent property of pedestrian re-identification that the distance between images of the same person is smaller than the distance between images of different persons, namely D(I, I+) < D(I, I-), where D(I, I+) denotes the distance between the same person and D(I, I-) the distance between different persons; specifically, for the input pedestrian triplet data <I, I+, I->, the depth features obtained after training through the trigeminal network are expressed as <g_w(I), g_w(I+), g_w(I-)>, where g_w(I) is the feature of the anchor pedestrian, g_w(I+) is the feature of the positive-sample pedestrian, g_w(I-) is the feature of the negative-sample pedestrian, and w is the network parameter; in the specific network training process, the goal is to make
‖g_w(I) - g_w(I+)‖ < ‖g_w(I) - g_w(I-)‖  (1)
hold for all triplets; to facilitate the calculation of the error, equation (1) can be expressed as:
‖g_w(I) - g_w(I+)‖² < ‖g_w(I) - g_w(I-)‖²  (2)
and conducting derivative calculation on the transformed cost function.
6. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 5, wherein the pedestrian re-identification method is characterized by comprising the following steps of: the back propagation idea is adopted to conduct derivative calculation on the transformed cost function, and the specific steps are as follows:
Step 3.3.1, first the similarity distances between different pedestrians are calculated, and the loss function represented by equation (2) is converted into:

L = (1/N) Σ_{i=1}^{N} max{0, C + d(w, I_i)},  with  d(w, I_i) = ‖g_w(I_i) - g_w(I_i+)‖² - ‖g_w(I_i) - g_w(I_i-)‖²  (3)

where L is the loss of the three networks in the network structure, d(w, I) is the difference between the distance within the same pedestrian and the distance between different persons, N is the total number of pedestrian samples, and C is the boundary margin constraining positive and negative samples;
Step 3.3.2, in the overall network structure, convolution layer 1, convolution layer 2 and convolution layer 3 use convolution kernels of sizes 11 × 11, 5 × 5 and 3 × 3, respectively, and the forward convolution operation is expressed as:

x_k^(l) = ReLU( Σ_i x_i^(l-1) * k^(l) + b_k^(l) )  (4)

where x_i^(l) and x_i^(l-1) represent the output of the i-th neuron of the l-th layer and of the (l-1)-th layer, respectively, k^(l) is the convolution kernel between the l-th layer and the (l-1)-th layer, b_k^(l) is the bias term of the k-th feature map of the l-th layer, and ReLU (Rectified Linear Unit) is the activation function between the two convolution layers;
Step 3.3.3, the error is calculated after one forward propagation, and back propagation is carried out; the partial derivative of equation (3) is computed as:

∂L/∂W = (1/N) Σ_{j=1}^{N} 1[C + d(W, I_j) > 0] · ∂d(W, I_j)/∂W  (5)

where W represents the network parameters and I_j denotes the j-th pedestrian image;
By the definition of d(W, I_j), the gradient can be calculated as follows:

∂d(W, I_j)/∂W = 2(g_W(I_j) - g_W(I_j+))^T (∂g_W(I_j)/∂W - ∂g_W(I_j+)/∂W) - 2(g_W(I_j) - g_W(I_j-))^T (∂g_W(I_j)/∂W - ∂g_W(I_j-)/∂W)  (6)
Step 3.3.4: calculating and/> values respectively according to the calculation result derived in step 3.3.3 to obtain the final loss;
according to the partial derivative formula, substituting the partial derivative formula into a gradient descent method algorithm to minimize L, thereby solving the loss of each layer of back propagation and realizing the final network optimization process.
7. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 1, wherein the pedestrian re-identification method is characterized by comprising the following steps of: in step 4, L2 normalization processing is performed on the hierarchical depth features, which specifically comprises:
Step 4.1, depth features of different levels (color, spatial position information and high-level semantic information) are obtained from the network structure optimized in step 3, and L2 normalization preprocessing is applied to the input features, calculated as:

y = f / ‖f‖_2 = f / sqrt(f_1² + f_2² + … + f_p²)  (7)

where f = [f_1, f_2, …, f_p] is the network output feature with p dimensions;
Step 4.2, PCA processing is performed on the feature y obtained from the L2 processing in step 4.1 to obtain a highly discriminative hierarchical pedestrian descriptor.
8. The pedestrian re-identification method based on the trigeminal multi-branch network structure according to claim 1, wherein the pedestrian re-identification method is characterized by comprising the following steps of: in step 5, the specific steps of similarity measurement include;
Pedestrian similarity is judged based on the Euclidean distance, which derives from the distance between two points x_1, x_2 in an N-dimensional Euclidean space; the distance formula can be expressed as:

d(x_1, x_2) = sqrt( Σ_{k=1}^{M} (x_1k - x_2k)² )  (8)

where M is the total number of samples; x_1 and x_2 respectively represent the pedestrian features of two persons under different views, the final measurement result is obtained through Euclidean distance calculation, and the final pedestrian re-identification result is obtained by ranking the measurement results.
CN202210215993.9A 2022-03-07 2022-03-07 Pedestrian re-identification method based on trigeminal convolutional neural network Active CN114863505B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210215993.9A CN114863505B (en) 2022-03-07 2022-03-07 Pedestrian re-identification method based on trigeminal convolutional neural network


Publications (2)

Publication Number Publication Date
CN114863505A CN114863505A (en) 2022-08-05
CN114863505B true CN114863505B (en) 2024-04-16

Family

ID=82627142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210215993.9A Active CN114863505B (en) 2022-03-07 2022-03-07 Pedestrian re-identification method based on trigeminal convolutional neural network

Country Status (1)

Country Link
CN (1) CN114863505B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113516012A (en) * 2021-04-09 2021-10-19 湖北工业大学 Pedestrian re-identification method and system based on multi-level feature fusion
CN113553908A (en) * 2021-06-23 2021-10-26 中国科学院自动化研究所 Heterogeneous iris identification method based on equipment unique perception

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20180137415A1 (en) * 2016-11-11 2018-05-17 Minitab, Inc. Predictive analytic methods and systems


Non-Patent Citations (1)

Title
Pedestrian re-identification based on a multi-auxiliary-branch deep network; Xia Kaiguo; Tian Chang; Communications Technology; 2018-11-10 (11); full text *


Similar Documents

Publication Publication Date Title
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN108549873A (en) Three-dimensional face identification method and three-dimensional face recognition system
CN111898736B (en) Efficient pedestrian re-identification method based on attribute perception
US7711157B2 (en) Artificial intelligence systems for identifying objects
Oliva et al. Scene-centered description from spatial envelope properties
CN111738143B (en) Pedestrian re-identification method based on expectation maximization
CN112801015B (en) Multi-mode face recognition method based on attention mechanism
CN114419671B (en) Super-graph neural network-based pedestrian shielding re-identification method
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
Liu et al. TreePartNet: neural decomposition of point clouds for 3D tree reconstruction
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN111814845B (en) Pedestrian re-identification method based on multi-branch flow fusion model
CN112784782B (en) Three-dimensional object identification method based on multi-view double-attention network
CN113095149A (en) Full-head texture network structure based on single face image and generation method
CN116958420A (en) High-precision modeling method for three-dimensional face of digital human teacher
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN114494594A (en) Astronaut operating equipment state identification method based on deep learning
Pini et al. Learning to generate facial depth maps
CN112686202A (en) Human head identification method and system based on 3D reconstruction
CN114863505B (en) Pedestrian re-identification method based on trigeminal convolutional neural network
Yang et al. Semantic perceptive infrared and visible image fusion Transformer
CN114360058A (en) Cross-visual angle gait recognition method based on walking visual angle prediction
Liu et al. Multi-view ear shape feature extraction and reconstruction
Wu et al. DeepShapeKit: accurate 4D shape reconstruction of swimming fish
Wang et al. Scene recognition based on DNN and game theory with its applications in human-robot interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant