CN109508663A

CN109508663A - A pedestrian re-identification method based on a multi-level supervision network

Info

Publication number: CN109508663A
Application number: CN201811299473.0A
Authority: CN
Inventors: 张君鹏; 申瑞民; 姜飞
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2019-03-22
Anticipated expiration: 2038-10-31
Also published as: CN109508663B

Abstract

The present invention relates to a pedestrian re-identification method based on a multi-level supervision network. The method uses a multi-level supervision network to extract features of different semantic levels from a pedestrian image and thereby performs pedestrian re-identification. The multi-level supervision network comprises a multi-layer deep convolutional neural network serving as the backbone network and a plurality of classification modules serving as feature extraction sub-networks. The backbone network converts the pedestrian image into feature maps of different semantic levels; each classification module, through supervised learning, converts the feature map extracted by its corresponding backbone layer into a discriminative feature vector; the feature vectors of all levels are concatenated to form the final feature vector, and pedestrian re-identification is performed based on the final feature vector. Compared with the prior art, the present invention extracts features of different semantic levels from the pedestrian image, which improves the discriminability of the features, and uses a semi-detached supervised learning scheme to improve the stability of the training process and the accuracy of the network, giving advantages such as high re-identification accuracy.

Description

A pedestrian re-identification method based on a multi-level supervision network

Technical field

The present invention relates to a pedestrian re-identification method, and more particularly to a pedestrian re-identification method based on a multi-level supervision network.

Background art

Pedestrian re-identification in video is an important research topic in computer vision and artificial intelligence. The task can be summarized as follows: given one (or several) images of a pedestrian to be searched (query image), find all images of that pedestrian in an existing set of surveillance video images (gallery images). Pedestrian re-identification has great practical significance and value in fields such as intelligent security and urban safety, and has been a major research hotspot in recent years.

However, in real scenes the same pedestrian shows significant visual differences across videos, because cameras differ in shooting angle, shooting distance, and lighting conditions. In addition, pose changes caused by human motion, occlusion, and similar phenomena further increase the difficulty of the task. How to extract highly discriminative features from pedestrian images under the influence of these factors, and use them for identification, is therefore the key problem in this technical field.

Existing pedestrian re-identification techniques are commonly divided into three steps. First, a large number of pedestrian image samples with identity labels are prepared as the training set database. Next, the training set data are used to train a deep convolutional neural network of a specific structure. The structure and training method of the neural network largely determine the accuracy of the re-identification system, so this is the most important step. Finally, the trained convolutional network is used to extract features from the gallery images. When a pedestrian needs to be re-identified, the trained convolutional network extracts features from the pedestrian image to be queried (query image), and the resulting feature vector is compared one by one with the feature vectors of the gallery images by cosine similarity or Euclidean distance, and the results are sorted. The most similar gallery images are the output of the pedestrian re-identification system.

Within the above technical framework, the prior art focuses mainly on the structural design and the training scheme of the deep convolutional neural network. Such techniques can be roughly divided into two classes: (1) pedestrian re-identification based on regional features; (2) pedestrian re-identification based on metric learning.

Pedestrian re-identification schemes based on regional features generally divide an image into several horizontal regions or grid regions according to spatial position. After the region division, a deep convolutional neural network extracts features from each block separately. The feature extraction process can be summarized as follows: the original image (or block) is fed into a convolutional neural network and passes through network units such as convolutional layers, batch normalization layers, and nonlinear activation layers, yielding a feature map that contains high-level semantic features. Global average pooling is then applied to the feature map, and the result serves as the feature vector representing the image block. Finally, the feature vectors of all blocks are fused or merged to obtain the feature vector representing the pedestrian. For example, the document "Glad: global-local-alignment descriptor for pedestrian retrieval" (Wei L, Zhang S, Yao H, et al. Proceedings of the 2017 ACM on Multimedia Conference. ACM, 2017: 420-428) divides the human body into three regions (head, upper body, and lower body) according to human body key points and extracts features from each of the three regions. The document "Beyond Part Models: Person Retrieval with Refined Part Pooling" (Sun Y, Zheng L, Yang Y, et al. arXiv preprint arXiv:1711.09349, 2017) divides the pedestrian image horizontally into 6 blocks and extracts features from each of the six regions.
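The region-based extraction described above can be sketched as follows. This is a minimal illustration only, assuming a feature map stored as nested lists indexed as [row][column][channel]; the names `stripe_pool` and `concatenate` are hypothetical and do not come from the cited documents, and simple concatenation stands in for whatever fusion those methods actually use.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]


def stripe_pool(feature_map: FeatureMap, num_stripes: int) -> List[List[float]]:
    """Split a feature map into horizontal stripes and average-pool each one."""
    height = len(feature_map)
    if height % num_stripes != 0:
        raise ValueError("height must be divisible by num_stripes")
    rows_per_stripe = height // num_stripes
    channels = len(feature_map[0][0])
    pooled = []
    for s in range(num_stripes):
        rows = feature_map[s * rows_per_stripe:(s + 1) * rows_per_stripe]
        # Collect every spatial position that falls inside this stripe.
        cells = [cell for row in rows for cell in row]
        # Global average pooling restricted to the stripe, one value per channel.
        pooled.append([sum(cell[c] for cell in cells) / len(cells) for c in range(channels)])
    return pooled


def concatenate(vectors: List[List[float]]) -> List[float]:
    """Merge the per-stripe vectors into a single descriptor."""
    return [value for vector in vectors for value in vector]


# Toy 6x2 feature map with 2 channels: channel 0 holds the row index, channel 1 is constant.
toy_map = [[[float(r), 1.0] for _ in range(2)] for r in range(6)]
stripes = stripe_pool(toy_map, num_stripes=3)
descriptor = concatenate(stripes)
```

Each stripe yields one vector with as many entries as the feature map has channels, so the merged descriptor grows linearly with the number of regions.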

Pedestrian re-identification schemes based on metric learning usually train the network with a carefully designed loss function. Common metric learning loss functions include the contrastive loss, the triplet loss, and the large-margin softmax loss.

The above prior art also has the following disadvantages:

1. When extracting features, such methods use only the features of the last layer of the deep convolutional network, so the network is underutilized.

2. Such methods do not exploit the multi-layer semantic information produced by the network. Although the feature map of the last layer contains stronger semantic information, it also tends to lose image detail, which limits the discriminability of the features.

3. Pedestrian re-identification networks based on metric learning are generally more difficult to train.

Summary of the invention

The object of the present invention is to overcome the above drawbacks of the prior art by providing a pedestrian re-identification method based on a multi-level supervision network.

A first object of the present invention is to solve the problem that existing pedestrian re-identification techniques make little use of the intermediate-layer features of the convolutional network, and to improve the discriminability and robustness of the overall feature.

A second object of the present invention is to improve the stability of the network training process and to improve the accuracy of the network.

The objects of the present invention can be achieved by the following technical solution:

A pedestrian re-identification method based on a multi-level supervision network, in which pedestrian re-identification is performed by a multi-level supervision network. The multi-level supervision network comprises a multi-layer deep convolutional neural network serving as the backbone network and a plurality of classification modules serving as feature extraction sub-networks. The backbone network converts the pedestrian image into feature maps of different semantic levels; each classification module, through supervised learning, converts the feature map extracted by its corresponding backbone layer into a discriminative feature vector; the feature vectors of all levels are concatenated to form the final feature vector, and pedestrian re-identification is performed based on the final feature vector.

Further, the multi-layer deep convolutional neural network is composed of a plurality of residual convolution modules, each of which comprises several convolutional layers, batch normalization layers, and nonlinear activation layers.

Further, the inputs of the plurality of classification modules correspond respectively to the outputs of a plurality of residual convolution modules in the backbone network.

Further, the classification modules do not share parameters.

Further, each classification module comprises, in order, a convolutional layer, a batch normalization layer, a nonlinear activation layer, a global average pooling layer, a dropout layer, a dimension-reducing fully connected layer, a batch normalization layer, and a softmax layer.

Further, the training process of the multi-level supervision network specifically comprises:

1) A batch of samples is drawn from the collected pedestrian database and input into the multi-level supervision network for forward propagation;

2) The cross-entropy loss is computed from the classification result of each classification module and the sample labels;

3) Semi-detached backpropagation is performed on the multi-level supervision network based on the cross-entropy loss. During backpropagation, only the last classification module corresponding to each convolutional neural network module propagates gradients back through the entire multi-level supervision network; for the remaining classification modules, backpropagation is carried out only within the classification module itself;

4) Gradient descent is performed and the network parameters are updated according to the gradients obtained by backpropagation, until the network converges.

Further, the final feature vector is formed by concatenating the feature vectors obtained by the classification modules.

Further, pedestrian re-identification based on the final feature vector is specifically:

The cosine similarity between the final feature vector and the feature vector of each gallery image is computed one by one, and the k images with the highest similarity are taken as the re-identification query result.

Further, the sample acquisition process for the pedestrian database specifically comprises:

Pedestrian videos are captured by different cameras at different spatial positions and split into frames; each distinct pedestrian in each frame is given an independent sample label; and every image is scaled in resolution and flipped to form the pedestrian database.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention designs classification modules and thereby realizes a deep convolutional network structure that can effectively extract features of different semantic levels. At identification time the features of the multiple levels are concatenated and fused, which improves the discriminability and robustness of the overall feature, solves the problem that existing pedestrian re-identification techniques make little use of the intermediate-layer features of the convolutional network, and makes more efficient use of computing resources.

2. The present invention designs an effective training scheme for the multi-level supervision network, namely the semi-detached training scheme, which effectively improves the stability of the training process and the accuracy of the network.

3. In the query stage of pedestrian re-identification, the features of all levels obtained by the classification modules are concatenated into a higher-dimensional feature vector, which further improves the discriminability of the feature and thereby significantly improves re-identification accuracy.

4. The method of the present invention is suitable for pedestrian re-identification tasks in complex scenes.

Brief description of the drawings

Fig. 1 is a schematic diagram of the structure of the multi-level supervision network of the present invention;

Fig. 2 is a schematic diagram of the test results of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and a specific embodiment. The embodiment is implemented on the premise of the technical solution of the present invention and gives a detailed implementation and a specific operating procedure, but the protection scope of the present invention is not limited to the following embodiment.

The present invention proposes a pedestrian re-identification method based on a multi-level supervision network. On the basis of the deep residual network (ResNet), a plurality of classification modules (Classification Blocks) that do not share parameters perform supervised learning at different depths of the network, so that features of different semantic levels are extracted from the pedestrian image. The overall structure of the network is shown in Fig. 1. In the network training stage, the present invention uses a semi-detached supervised learning scheme, which improves the stability of the training process and the accuracy of the network. In the query stage of pedestrian re-identification, the features of all levels are concatenated into a higher-dimensional feature vector, which further improves the discriminability of the feature and thereby significantly improves re-identification accuracy.

The multi-level supervision network used by the present invention comprises a multi-layer deep convolutional neural network serving as the backbone network and a plurality of classification modules serving as feature extraction sub-networks. The backbone network converts the pedestrian image into feature maps of different semantic levels through a series of residual convolution modules; each classification module, through supervised learning, converts the feature map extracted by its corresponding backbone layer into a discriminative feature vector; the feature vectors of all levels are concatenated to form the final feature vector, and pedestrian re-identification is performed based on the final feature vector. Each classification module comprises, in order, a convolutional layer, a batch normalization layer, a nonlinear activation layer, a global average pooling layer, a dropout layer, a dimension-reducing fully connected layer, a batch normalization layer, and a softmax layer. The classification modules do not share parameters.

As shown in Fig. 1, in this embodiment the multi-level supervision network is implemented on the basis of the existing ResNet50 (He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778). ResNet50 is built from ResNet Blocks (residual convolution modules), which extract feature maps from the pedestrian image. The network as a whole consists of 5 stages, referred to as Block1 to Block5. Each stage contains several residual convolution modules; for example, Block4 consists of 6 residual convolution modules, Block4_1 to Block4_6, and Block5 consists of 3 residual convolution modules, Block5_1 to Block5_3. These residual convolution modules are made up of several convolutional layers, batch normalization layers, and nonlinear activation layers. Conventional pedestrian re-identification networks usually use only the feature output by the last layer of the whole network (Block5_3). In this embodiment, 9 classification modules are provided: the features of the 9 levels Block4_1 to Block4_6 and Block5_1 to Block5_3 are extracted and fed to their respective classification modules for supervised learning, which improves the discriminability and robustness of the features.

In this embodiment, the detailed feature processing carried out by the backbone network and the classification modules is as follows:

(1) First, a three-channel pedestrian image of dimension 256*128*3 enters the network and is converted by Block1 into a 128*64*64 feature map.

(2) The 128*64*64 feature map passes through Block2 and is converted into a 64*32*256 feature map.

(3) The 64*32*256 feature map passes through Block3 and is converted into a 32*16*512 feature map.

(4) The 32*16*512 feature map passes in turn through Block4_1 to Block4_6, which output six 16*8*1024 feature maps.

(5) The 16*8*1024 feature map passes in turn through Block5_1 to Block5_3, which output three 16*8*2048 feature maps. Unlike the original ResNet50, this network removes the downsampling operation of Block5.

(6) The six 16*8*1024 feature maps are fed to classification modules 1 to 6, respectively. Each classification module consists, in order, of a convolutional layer with kernel size 1*1*2048, a batch normalization layer, a nonlinear activation layer, a global average pooling layer, a dropout layer, a 512-dimensional fully connected layer, a batch normalization layer, and a softmax layer. Note that classification modules 1 to 9 do not share parameters.

(7) The 16*8*1024 feature map is converted into a 16*8*2048 feature map by the 1*1 convolutional layer, becomes a 1*1*2048 feature vector after the global average pooling layer, and is then reduced in dimension by the fully connected layer, which compresses it into a 512-dimensional feature vector.

(8) The 512-dimensional feature vector, as the feature representing the pedestrian, is fed to the softmax layer to complete supervised learning of the classification task.

(9) Classification modules 7 to 9 are similar to classification modules 1 to 6; the only difference is that the feature maps output by Block5_1 to Block5_3 are 16*8*2048 rather than 16*8*1024.
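The shape arithmetic of steps (1) to (9) can be checked with a short sketch. The per-stage strides and channel counts below are read off the dimensions stated above; the function names `backbone_shapes` and `head_shapes` are hypothetical, and the assumption that modules 7 to 9 also use a 2048-channel 1*1 convolution is inferred from step (9), not stated explicitly.

```python
from typing import List, Tuple

# (stage name, spatial stride, output channels), following steps (1)-(5).
# Block5 has stride 1 because its downsampling operation is removed.
STAGES = [
    ("Block1", 2, 64),
    ("Block2", 2, 256),
    ("Block3", 2, 512),
    ("Block4", 2, 1024),
    ("Block5", 1, 2048),
]


def backbone_shapes(height: int, width: int) -> List[Tuple[str, int, int, int]]:
    """Return (stage, height, width, channels) of the feature map after each stage."""
    shapes = []
    for name, stride, channels in STAGES:
        height, width = height // stride, width // stride
        shapes.append((name, height, width, channels))
    return shapes


def head_shapes(height: int, width: int, channels: int,
                conv_channels: int = 2048, embed_dim: int = 512):
    """Shapes inside one classification module, up to the 512-dimensional feature."""
    return [
        ("input", (height, width, channels)),
        ("1x1 conv", (height, width, conv_channels)),   # spatial size unchanged
        ("global average pool", (conv_channels,)),      # one value per channel
        ("fully connected", (embed_dim,)),              # dimension reduction
    ]


trace = backbone_shapes(256, 128)
block4_head = head_shapes(16, 8, 1024)   # classification modules 1 to 6
block5_head = head_shapes(16, 8, 2048)   # classification modules 7 to 9
```

Because Block5 keeps the 16*8 spatial size, all nine classification modules pool over the same number of spatial positions; only the input channel count differs between the two groups.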

The present invention trains the multi-level supervision network with a semi-detached training scheme, which improves the stability of the training process and the accuracy of the network. The training process specifically comprises:

Step 1: A batch of samples is drawn from the collected pedestrian database and input into the multi-level supervision network for forward propagation.

The collection and preprocessing of the pedestrian database comprise:

1) Pedestrian videos are shot by different cameras at different spatial positions to ensure the diversity and variability of the pedestrian images, forming a training set containing a large number of different pedestrian videos.

2) After the videos are collected, they are split into frames. For each frame, the pedestrians in the frame are cropped out, either by manual annotation or by automatic annotation with an algorithm, and saved as individual pictures, and each distinct pedestrian is given an independent sample label.

3) Every picture is scaled to a resolution of 256*128 to meet the input requirement of the subsequent convolutional neural network while matching the physical proportions of the human body and avoiding image distortion.

4) Each processed pedestrian picture is flipped horizontally to expand the training data, forming the final pedestrian database.
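Preprocessing step 4) can be illustrated as below. The sketch assumes that the "horizontal 180 degree" operation means a left-right mirror of each picture, and it uses a toy grayscale image stored as nested lists; `horizontal_flip` and `augment_with_flips` are hypothetical helper names.

```python
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image indexed as [row][column]


def horizontal_flip(image: Image) -> Image:
    """Mirror an image left-to-right by reversing every row."""
    return [list(reversed(row)) for row in image]


def augment_with_flips(samples: List[Tuple[Image, int]]) -> List[Tuple[Image, int]]:
    """Return the original samples followed by their mirrored copies.

    The mirrored picture keeps the identity label of the original picture.
    """
    flipped = [(horizontal_flip(image), label) for image, label in samples]
    return samples + flipped


# One 2x3 toy picture belonging to identity 0.
samples = [([[1, 2, 3], [4, 5, 6]], 0)]
database = augment_with_flips(samples)
```

Flipping doubles the number of training pictures without adding new identities, so the number of classes seen by the softmax layers is unchanged.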

Step 2: The cross-entropy loss is computed from the classification result of each classification module and the sample labels.

Step 3: Semi-detached backpropagation is performed on the multi-level supervision network based on the cross-entropy loss. During backpropagation, only the last classification module corresponding to each convolutional neural network module propagates gradients back through the entire multi-level supervision network; for the remaining classification modules, backpropagation is carried out only within the classification module itself.

In this embodiment, as shown by the dashed lines in Fig. 1, for classification modules 1, 2, 3, 4, 5, 7, and 8, backpropagation is performed only within the classification module itself, and the gradient is not passed back to the ResNet backbone; the dashed arrows indicate where the backpropagated gradient is truncated. For classification modules 6 and 9, the classification module itself and the entire ResNet backbone take part in backpropagation together, which completes the supervised learning of the entire ResNet backbone.

Step 4: Gradient descent is performed and the network parameters are updated according to the gradients obtained by backpropagation, until the network converges.
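The gradient truncation of Step 3 can be illustrated with a deliberately small toy model whose gradients are written out by hand. This is only a sketch under strong simplifying assumptions: the backbone has two scalar stages, each head is a scalar logistic classifier trained with binary cross-entropy rather than a softmax over identities, and only the last head is attached (in the embodiment, modules 6 and 9 are attached). The function `semi_detached_grads` and all parameter names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def semi_detached_grads(x, y, w1, w2, a, b, attach_a=False):
    """Gradients for a two-stage scalar backbone with one head per stage.

    Backbone: f1 = w1 * x, f2 = w2 * f1.
    Head A reads f1 (intermediate level); head B reads f2 (last level).
    Each head computes a logit (a * f1 or b * f2) and a cross-entropy loss.
    """
    f1 = w1 * x
    f2 = w2 * f1
    err_a = sigmoid(a * f1) - y  # derivative of head A's loss w.r.t. its logit
    err_b = sigmoid(b * f2) - y  # derivative of head B's loss w.r.t. its logit

    # Every head always receives a gradient for its own parameters.
    grad_a = err_a * f1
    grad_b = err_b * f2

    # Head B is attached: its gradient flows back through the whole backbone.
    d_f2 = err_b * b
    grad_w2 = d_f2 * f1
    d_f1 = d_f2 * w2
    # Head A is detached by default: its gradient is truncated at f1.
    if attach_a:
        d_f1 += err_a * a
    grad_w1 = d_f1 * x
    return {"w1": grad_w1, "w2": grad_w2, "a": grad_a, "b": grad_b}


detached = semi_detached_grads(1.0, 1.0, 0.5, -0.3, 0.8, 1.2)
attached = semi_detached_grads(1.0, 1.0, 0.5, -0.3, 0.8, 1.2, attach_a=True)
```

With `attach_a=False` the intermediate head still learns, because its own weight `a` receives a gradient, but the backbone weight `w1` is driven only by the last head. In an automatic differentiation framework the same effect is usually obtained by applying a stop-gradient (detach) operation to the feature map before it enters a detached head.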

In this embodiment, when the trained multi-level supervision network is used for the pedestrian re-identification task, a pedestrian image to be searched is given and identification proceeds as follows:

(1) The pedestrian image to be searched is scaled to a resolution of 256*128 to meet the input requirement of the trained neural network.

(2) The pedestrian image is fed into the trained multi-level supervision network for forward propagation. Each classification module outputs a 512-dimensional feature.

(3) The feature vectors output by the 9 classification modules are concatenated to form a 4608-dimensional feature vector, which represents the final feature of the pedestrian.

(4) The cosine similarity between the 4608-dimensional feature vector obtained in the previous step and the feature vector of each gallery image is computed one by one, and the gallery images are sorted by similarity. The k images with the highest similarity are the re-identification query result; k can be set as needed.
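Query steps (3) and (4) can be sketched as follows. The feature values and gallery names are made-up toy data, and `concatenate`, `cosine_similarity`, and `top_k` are hypothetical helper names; short 3-dimensional vectors stand in for the 4608-dimensional features in the ranking example.

```python
import math
from typing import Dict, List, Sequence


def concatenate(features: Sequence[Sequence[float]]) -> List[float]:
    """Join the per-module feature vectors into one final feature vector."""
    return [value for feature in features for value in feature]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_u * norm_v)


def top_k(query: Sequence[float], gallery: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the names of the k gallery images most similar to the query."""
    ranked = sorted(gallery,
                    key=lambda name: cosine_similarity(query, gallery[name]),
                    reverse=True)
    return ranked[:k]


# Step (3): 9 classification modules, each producing a 512-dimensional feature.
module_outputs = [[float(m + 1)] * 512 for m in range(9)]
final_feature = concatenate(module_outputs)

# Step (4): rank a toy gallery against a toy query.
query = [1.0, 0.0, 1.0]
gallery = {
    "img_a": [1.0, 0.1, 0.9],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [-1.0, 0.0, -1.0],
}
result = top_k(query, gallery, k=2)
```

Since cosine similarity ignores vector length, gallery features can be normalized once in advance, after which each comparison reduces to a dot product.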

Table 1 shows the accuracy of the present invention on the Duke-MTMC Reid [4] dataset. It can be seen that the method achieves higher accuracy and better re-identification performance than some existing methods. In the table, Rank@1, Rank@5, and Rank@10 denote the top-1, top-5, and top-10 accuracy, respectively, computed from the CMC curve.

Table 1: Accuracy comparison between the present invention and the prior art
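The Rank@k metric named above can be computed as in the following sketch. The rankings are hypothetical toy data, not results from the patent, and `rank_at_k` is an illustrative name. Benchmark protocols typically also exclude certain gallery images (for example, those from the same camera as the query) before ranking; that filtering is omitted here.

```python
from typing import Sequence


def rank_at_k(ranked_ids: Sequence[Sequence[int]], query_ids: Sequence[int], k: int) -> float:
    """Fraction of queries whose identity appears among the first k ranked gallery images."""
    hits = sum(1 for ranking, query_id in zip(ranked_ids, query_ids)
               if query_id in ranking[:k])
    return hits / len(query_ids)


# Hypothetical rankings: gallery identity labels sorted by decreasing similarity.
rankings = [
    [7, 3, 3, 9],  # query identity 7: first correct match at position 1
    [2, 5, 4, 5],  # query identity 5: first correct match at position 2
    [1, 1, 8, 6],  # query identity 6: first correct match at position 4
]
query_ids = [7, 5, 6]

rank1 = rank_at_k(rankings, query_ids, 1)
rank2 = rank_at_k(rankings, query_ids, 2)
rank4 = rank_at_k(rankings, query_ids, 4)
```

Evaluating `rank_at_k` for k = 1, 2, 3, ... gives the points of the CMC curve, which is non-decreasing in k.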

Fig. 2 shows some visualized results of the present invention on the Duke-MTMC Reid dataset; the results of the present invention are good. Query denotes the image to be searched, and the 10 pictures that follow are the 10 gallery images most similar to that pedestrian.

The preferred embodiment of the present invention has been described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and variations according to the concept of the present invention without creative effort. Therefore, any technical solution that a person skilled in the art can obtain under the concept of the present invention, on the basis of the prior art, through logical analysis, reasoning, or limited experimentation shall fall within the protection scope determined by the claims.

Claims

1. A pedestrian re-identification method based on a multi-level supervision network, characterized in that the method performs pedestrian re-identification by means of a multi-level supervision network, the multi-level supervision network comprising a multi-layer deep convolutional neural network serving as the backbone network and a plurality of classification modules serving as feature extraction sub-networks, wherein the backbone network converts the pedestrian image into feature maps of different semantic levels, each classification module, through supervised learning, converts the feature map extracted by its corresponding backbone layer into a discriminative feature vector, the feature vectors of all levels are concatenated to form the final feature vector, and pedestrian re-identification is performed based on the final feature vector.

2. The pedestrian re-identification method based on a multi-level supervision network according to claim 1, characterized in that the multi-layer deep convolutional neural network is composed of a plurality of residual convolution modules, each of which comprises several convolutional layers, batch normalization layers, and nonlinear activation layers.

3. The pedestrian re-identification method based on a multi-level supervision network according to claim 2, characterized in that the inputs of the plurality of classification modules correspond respectively to the outputs of a plurality of residual convolution modules in the backbone network.

4. The pedestrian re-identification method based on a multi-level supervision network according to claim 1, characterized in that the classification modules do not share parameters.

5. The pedestrian re-identification method based on a multi-level supervision network according to claim 1, characterized in that each classification module comprises, in order, a convolutional layer, a batch normalization layer, a nonlinear activation layer, a global average pooling layer, a dropout layer, a dimension-reducing fully connected layer, a batch normalization layer, and a softmax layer.

6. The pedestrian re-identification method based on a multi-level supervision network according to claim 1, characterized in that the training process of the multi-level supervision network specifically comprises:

1) A batch of samples is drawn from the collected pedestrian database and input into the multi-level supervision network for forward propagation;

2) The cross-entropy loss is computed from the classification result of each classification module and the sample labels;

3) Semi-detached backpropagation is performed on the multi-level supervision network based on the cross-entropy loss; during backpropagation, only some of the classification modules propagate gradients back through the entire backbone network, and for the remaining classification modules backpropagation is carried out only within the classification module itself;

4) Gradient descent is performed and the network parameters are updated according to the gradients obtained by backpropagation, until the network converges.

7. The pedestrian re-identification method based on a multi-level supervision network according to claim 1, characterized in that the final feature vector is formed by concatenating the feature vectors obtained by the classification modules.

8. The pedestrian re-identification method based on a multi-level supervision network according to claim 1, characterized in that pedestrian re-identification based on the final feature vector is specifically:

The cosine similarity between the final feature vector and the feature vector of each gallery image is computed one by one, and the k images with the highest similarity are taken as the re-identification query result.

9. The pedestrian re-identification method based on a multi-level supervision network according to claim 6, characterized in that the sample acquisition process for the pedestrian database specifically comprises:

Pedestrian videos are captured by different cameras at different spatial positions and split into frames; each distinct pedestrian in each frame is given an independent sample label; and every image is scaled in resolution and flipped horizontally to form the pedestrian database.