CN115063831A - High-performance pedestrian retrieval and re-identification method and device

Info

Publication number: CN115063831A
Application number: CN202210409679.4A
Authority: CN (China)
Prior art keywords: pedestrian, identification, network, model, module
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 华璟 (Hua Jing), 吴绍鑫 (Wu Shaoxin), 孙杰 (Sun Jie)
Current and original assignee: Zhejiang Gongshang University
Application filed by Zhejiang Gongshang University on 2022-04-19; priority to CN202210409679.4A
Published as CN115063831A on 2022-09-16
Classifications

    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means (G PHYSICS > G06 COMPUTING > G06N computing arrangements based on specific computational models > G06N3/00 based on biological models > G06N3/02 neural networks)
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections (G PHYSICS > G06 COMPUTING > G06N > G06N3/00 > G06N3/02 > G06N3/08 learning methods)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-performance pedestrian retrieval and re-identification method and device. The method comprises the following steps: pedestrian data are acquired under single-view and multi-view real monitoring scenes respectively and annotated, where a pedestrian detection data set is constructed from the single-view pedestrian data together with the pedestrian portion of the COCO data set, and a pedestrian re-identification data set is constructed from the multi-view pedestrian data; a pedestrian detection model is trained on the pedestrian detection data set using a YOLOv5 detection algorithm improved with a Ghost lightweight module; a pedestrian re-identification model is trained on the pedestrian re-identification data set; and a pedestrian search system is built. Through the co-optimization of deep model compression and algorithm computing power, the invention optimizes deep learning efficiency in a top-down manner from algorithm to hardware, realizing a low-cost, high-performance pedestrian re-identification system.

Description

High-performance pedestrian retrieval and re-identification method and device
Technical Field
The invention belongs to the technical field of pedestrian detection and pedestrian re-identification, and particularly relates to a high-performance pedestrian retrieval and re-identification method and device.
Background
Object detection aims to find objects of specific categories in a given image or video and to predict each target's class label and coordinates. Algorithms based on deep convolutional neural networks have become the mainstream in object detection and can currently be divided into Two-stage and One-stage algorithms. R-CNN is a Two-stage algorithm: candidate boxes are generated first, then classified and refined. One-stage algorithms such as YOLO and SSD need no pre-generated candidate boxes and perform regression and classification directly at each position of the image. Two-stage algorithms are more accurate but slower, and their successive improvements have mainly targeted speed; One-stage algorithms are fast, and their successive improvements have mainly targeted accuracy.
Pedestrian re-identification is a core technology of long-term, cross-domain multi-target tracking; its main objective is to re-identify the same pedestrian across cameras. An existing video analysis system is a collection of algorithms for various tasks and places extremely high demands on the computing chip. Pedestrian search involves finding (object detection) and matching (pedestrian re-identification). Existing pedestrian re-identification models all operate on pedestrian images obtained by preprocessing surveillance video, which requires a large amount of preparation work; for real scenes, a single pedestrian re-identification model cannot meet application requirements.
Different from the single scene of an academic data set, a pedestrian search algorithm feeds the result pictures of the pedestrian detection algorithm into the pedestrian re-identification module, so the quality of the pedestrian detection model is a key factor in correctly identifying pedestrians. Large video monitoring systems often require intelligent device platforms with high-performance, low-power hardware. Deep learning methods are currently the best algorithms for the pedestrian re-identification task, but complex deep models, despite their good detection and re-identification capability, are difficult to deploy on devices with limited hardware resources and tight power budgets while preserving accuracy and real-time performance.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a high-performance pedestrian retrieval and re-identification method. The main difficulties during deployment are model size, runtime memory footprint and computational efficiency, so the core of the method is to insert a Ghost lightweight module into the detection model and to adopt channel-level sparse pruning, reducing the size of the original search network model while retaining comparable accuracy. The trained models are quantized on the compute-rich Sophon SC5 cloud AI computing accelerator card and deployed on the hardware in a lightweight, fast manner, meeting the requirement of large-scale video monitoring systems for high-performance, low-power intelligent device platforms: the models run on devices with limited hardware resources and tight power budgets while accuracy and real-time performance are preserved.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present specification, there is provided a high-performance pedestrian retrieval and re-identification method, comprising the steps of:
S1, acquiring single-view pedestrian data and multi-view pedestrian data under real monitoring scenes respectively, wherein a pedestrian detection data set is constructed from the single-view pedestrian data together with part of the pedestrian data in the COCO data set, and a pedestrian re-identification data set is constructed from the multi-view pedestrian data;
S2, training a pedestrian detection model based on a YOLOv5 pedestrian detection algorithm improved with the Ghost lightweight model, using the pedestrian detection data set from S1;
S3, training a channel-level sparsely pruned pedestrian re-identification network using the pedestrian re-identification data set from S1 to obtain a pedestrian re-identification model;
S4, using the pedestrian detection model trained in S2 and the pedestrian re-identification model from S3, performing quantized deployment on the compute-rich Sophon SC5 cloud AI computing accelerator card and building a pedestrian search system.
Further, the pedestrian detection model in step S2 comprises four modules, namely an input module, a backbone network module, a neck network module and an output module; its input is a picture from the pedestrian detection data set.
The picture first enters the backbone network module through the input module, where pedestrian feature maps are extracted. The feature maps are sent to the neck network module, which enhances the model's detection of pedestrian features at different scales. The enhanced feature maps are then sent to the output module, which makes predictions on them and generates the predicted bounding boxes and categories.
Further, the backbone network module comprises three sub-modules: a Focus sub-module, a CBL sub-module and a GhostCSP sub-module;
the Focus sub-module slices the input picture, downsampling it by taking every other pixel; the CBL sub-module performs convolution on the input; the GhostCSP sub-module is generated by substituting Ghost networks into the CSP structure: a Ghost network with stride 1 replaces the residual components in the CSP structure, and a Ghost network with stride 2 replaces the convolutional layer in the CSP structure to realize downsampling.
Further, the neck network module performs multiple rounds of feature extraction on the pedestrian feature maps extracted by the backbone network module, generating pedestrian feature maps at three scales (1/8, 1/16 and 1/32 of the input); loss values are computed on these feature maps, and the pedestrian detection model is trained and updated according to the loss values to obtain the trained, enhanced pedestrian detection model.
Further, the pedestrian re-identification network in step S3 comprises a ResNet50 network and a BNNeck module; its input is a picture from the pedestrian re-identification data set;
the input picture is randomly cropped to different sizes and aspect ratios and rescaled to a uniform size; random erasing then occludes part of the picture with a rectangle filled with random values, yielding an enhanced image;
the enhanced image is input into the ResNet50 network, which is pre-trained on the ImageNet data set; pedestrian image features are extracted and globally pooled to obtain the global pedestrian feature F_global;
the BNNeck module separates the pedestrian re-identification losses into two different feature spaces for optimization, completing one pass of learning.
Further, the loss function Loss of the pedestrian re-identification network in step S3 is:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid x_i) + \max\left(d_p - d_n + \alpha,\ 0\right) + \beta\sum_{i=1}^{N}\frac{1}{2}\left\|f_i - c_{y_i}\right\|_2^2$$

wherein: N is the number of samples, x_i is an input image and y_i its class label; p(y_i|x_i) denotes the softmax-predicted probability that x_i is recognized as y_i; d_p is the distance between a same-class image and the input image, and d_n the distance between a different-class image and the input image; α and β are hyper-parameters balancing the losses, and max(·) takes the maximum; f_i denotes the feature preceding the fully-connected layer, c_{y_i} the feature center of the y_i-th class, and ‖·‖_2 the L2 norm.
Further, the ResNet50 network is processed with a channel-level sparse pruning method: a scaling factor α is introduced for each channel and learned jointly with the connectivity during normal network training; sparsity regularization is imposed on the scaling factors during training so that channel importance is identified automatically, and the channels whose trained scaling factors are low are finally pruned.
Further, the objective function of pruning the pedestrian re-identification model is:

$$\mathrm{Loss} = \sum_{(x,y)} l\left(f(x, A),\ y\right) + \beta\sum_{\alpha \in \Gamma} \lvert \alpha \rvert$$

where (x, y) are the training input and target; the first term as a whole is the original loss function of the un-pruned network; the second term is the penalty on the scaling factors; A denotes the trainable parameters of the network, α a scaling factor, Γ the set of all scaling factors, β the balance factor between the two terms, and |·| the L1 norm.
Further, step S4 specifically comprises:
S41, selecting 3,000 pictures from the pedestrian detection training set constructed in step S1 and 1,500 pictures from the pedestrian re-identification training set constructed in step S1, and converting each selection into an LMDB data set for subsequent calibration and quantization;
S42, converting the pedestrian detection model and the pedestrian re-identification model trained in S2 and S3 into fp32umodel files and corresponding prototxt files using the BMNNSDK2 SDK, where fp32umodel is a format proprietary to the Bitmain platform;
S43, converting the fp32umodel from step S42 into the Bitmain-proprietary intermediate model int8umodel using the calibration_use_pb quantization tool, with the LMDB data set from S41 as quantization calibration data, where int8umodel is the int8-format network coefficient file produced by quantization;
S44, checking the network error of the int8umodel converted in S43 using the calibration visual analysis tool, with the mean absolute percentage error and a cosine function defined as the error evaluation criteria:
$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\mathrm{Forecast}_i - \mathrm{Actual}_i}{\mathrm{Actual}_i}\right|$$

$$\cos(\theta) = \frac{\sum_{i=1}^{n}\mathrm{Actual}_i \cdot \mathrm{Forecast}_i}{\sqrt{\sum_{i=1}^{n}\mathrm{Actual}_i^2}\,\sqrt{\sum_{i=1}^{n}\mathrm{Forecast}_i^2}}$$

where Actual_i denotes a true value, Forecast_i a predicted value, and n the number of samples;
S45, after the error evaluation criteria confirm that the quantization precision is normal, using the BMNETU tool provided by the BMNNSDK2 SDK with the int8umodel from step S43 as input to compile the files required by BMRuntime, comparing the per-layer NPU results with the CPU calculation results, and obtaining the int8bmodel models for pedestrian detection and pedestrian re-identification;
S46, inputting multiple surveillance video streams based on the int8bmodel quantized in step S45, decoding each video frame on the designated chip, and running the int8bmodel quantized in step S45 for pedestrian detection on every frame of every stream to obtain pedestrian bounding boxes and confidences;
S47, filtering the pedestrian bounding boxes and confidences with the DIoU-NMS method, a pedestrian whose final confidence is below the confidence threshold being suppressed, to obtain the filtered video-surveillance pedestrian images, with the formula:
$$N_i = \begin{cases} N_i, & \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(M, B_i) < \varepsilon \\ 0, & \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(M, B_i) \ge \varepsilon \end{cases}$$

where ε is the NMS threshold, N_i the classification confidence, M the highest-confidence detection box, B_i a bounding box, and R_DIoU the normalized center distance between the two bounding boxes;
S48, cropping each pedestrian screened in step S47 from its video frame as a pedestrian-library picture; when the configured batch size is reached, sending each batch of the pedestrian library to be identified into the re-identification int8bmodel quantized in step S45 to extract pedestrian picture features, obtaining candidate-set features, and inputting the pedestrian image to be queried to obtain query-set features;
S49, performing Euclidean-distance feature calculation between the candidate-set features and the query-set features from S48 to obtain pedestrian similarity values, and judging whether a similarity value is greater than the preset pedestrian threshold to obtain the re-identification result for the given target pedestrian.
According to a second aspect of the present specification, there is provided a high-performance pedestrian retrieval and re-identification apparatus comprising a memory and one or more processors, wherein the memory stores executable code, and the processors, when executing the executable code, implement the high-performance pedestrian retrieval and re-identification method according to the first aspect.
The beneficial effects of the invention are: an improved YOLOv5 pedestrian detection network is constructed in which Ghost modules replace the original CSP structure, reducing network computation and improving detection efficiency. DIoU-NMS is introduced at the inference stage, reducing missed detections and improving detection precision. A residual network ResNet50 is constructed to extract global features from pedestrian images and is trained with a combination of triplet loss, center loss and identity loss, effectively reducing overfitting and improving the generalization ability of the model. Channel-level sparse pruning is applied to the convolutional layers of the pedestrian re-identification network, reducing model size and parameter count as well as runtime memory footprint and latency. A pedestrian search system is built, and model quantization on the Sophon SC5 cloud AI computing accelerator card further shrinks the model and accelerates inference. Using the SC5's dedicated interfaces, deep learning efficiency is optimized top-down from algorithm to hardware, and the deployed model retrieves specific pedestrians in surveillance video quickly and accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic diagram of yolov5-ghost network framework in an embodiment of the invention;
FIG. 2 is a schematic diagram of the Ghost bottleneck framework in an embodiment of the invention;
FIG. 3 is a schematic diagram of a pedestrian re-identification network framework in an embodiment of the present invention;
FIG. 4 is a flow chart of the implementation of model quantization in an embodiment of the present invention;
FIG. 5 is a flow chart of a pedestrian search system implementation in one embodiment of the invention;
fig. 6 is a block diagram of a high-performance pedestrian searching and re-identifying apparatus according to an embodiment of the present invention.
Detailed Description
For better understanding of the technical solutions of the present application, the following detailed descriptions of the embodiments of the present application are provided with reference to the accompanying drawings.
It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The invention provides a high-performance pedestrian retrieval and re-identification method, which comprises the following steps:
S1, acquiring single-view pedestrian data and multi-view pedestrian data under real monitoring scenes respectively, wherein a pedestrian detection data set is constructed from the single-view pedestrian data together with part of the pedestrian data in the COCO data set, and a pedestrian re-identification data set is constructed from the multi-view pedestrian data;
In one embodiment, 64,115 pedestrian images labeled with the person category are extracted from the COCO data set;
video surveillance images are collected by several fixed cameras at different angles, and 7,537 images are annotated with pedestrian labels and coordinates; the collected pedestrian images are combined with the MS COCO data and split at a 4:1 ratio to construct the pedestrian detection data set.
At least 50,000 images with resolution no lower than 64×128 are acquired; they must be captured by at least two cameras and must contain pedestrians. The pedestrian re-identification data set is likewise constructed with a 4:1 split.
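For concreteness, a minimal Python sketch of the 4:1 split used for both data sets follows; the directory path, file extension and random seed are illustrative assumptions, not details given in the patent.

```python
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=0.8, seed=0):
    """Shuffle the annotated images and split them 4:1 into train/test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    cut = int(len(images) * train_ratio)
    return images[:cut], images[cut:]

# Hypothetical directory layout; any consistent split keeps the 4:1 ratio.
train_imgs, test_imgs = split_dataset("data/pedestrian_detection")
```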
S2, training a pedestrian detection model based on a YOLOv5 pedestrian detection algorithm improved with the Ghost lightweight model, using the pedestrian detection data set from S1;
As shown in Fig. 1, in one embodiment the pedestrian detection model in step S2 consists of four modules, namely an input module, a backbone network module, a neck network module and an output module; its input is a picture from the pedestrian detection data set.
The picture first enters the backbone network module through the input module, where pedestrian feature maps are extracted. The feature maps are sent to the neck network module, which enhances the model's detection of pedestrian features at different scales. The enhanced feature maps are then sent to the output module, which makes predictions on them and generates the predicted bounding boxes and categories.
In one embodiment, the backbone network module includes three sub-modules: a Focus sub-module, a CBL sub-module and a GhostCSP sub-module.
Within the backbone, the Focus sub-module copies and slices the input picture, downsampling it by taking every other pixel, which reduces network computation and speeds up extraction of candidate-region features. The CBL sub-module applies convolution and normalization to the sliced image for feature extraction. During extraction, the GhostCSP sub-module streamlines gradient information, accelerating network computation and reducing the computational load. Finally, an SPP module converts inputs of different sizes into outputs of the same size, solving the problem of non-uniform input image sizes. The GhostCSP sub-module is generated by substituting Ghost networks into the CSP structure: a Ghost network with stride 1 replaces the residual components, and a Ghost network with stride 2 replaces the convolutional layer to provide downsampling.
The GhostCSP module comprises CBL processing, Ghost networks with strides 1 and 2, batch normalization and rectified-linear-unit processing.
Fig. 2 shows the Ghost network used in this embodiment, which has two bottleneck structures. For stride 1, the bottleneck consists of two Ghost modules: the first performs expansion, increasing the number of channels of the input feature map; the second reduces the channel count so the feature map matches the shortcut path of the network, and the two are connected through the shortcut for information transfer. Using a ReLU activation after the first Ghost module and only normalization after the second lets the model reduce parameters and computation while refining the feature map. For stride 2, the two Ghost modules are connected through a depthwise convolution with stride 2, and a downsampling layer in the shortcut path halves the feature map size, further reducing computation.
The Ghost network first convolves the input feature map of size c×h×w with groups of k×k convolution kernels to produce an intrinsic feature map with m channels and size h′×w′, then applies linear transformations Φ to further generate similar "ghost" feature maps, and finally combines the two groups to obtain all feature information:

$$Y' = X * f' + b$$
$$y_{ij} = \Phi_{i,j}(y'_i), \quad 1 \le i \le m,\ 1 \le j \le s$$

where Y′ ∈ R^{h′×w′×m} is the output intrinsic feature map, X ∈ R^{h×w×c} the input feature map, and f′ ∈ R^{c×k×k×m} the convolution kernels; y′_i (1 ≤ i ≤ m) are the m feature maps of Y′, and each undergoes lightweight linear operations Φ_{i,j} (1 ≤ j ≤ s) to obtain s similar "ghost" feature maps.
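The intrinsic-plus-ghost construction above can be sketched in PyTorch as follows; the 1×1 primary convolution, channel ratio s = 2 and 3×3 depthwise kernel for Φ follow the publicly known GhostNet design and are assumptions of this illustration (c_out is assumed even).

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Ghost module: a primary convolution produces the intrinsic feature
    maps Y', then a cheap depthwise convolution (the linear transform Phi)
    generates the 'ghost' maps; the two are concatenated."""
    def __init__(self, c_in, c_out, ratio=2, dw_kernel=3, relu=True):
        super().__init__()
        init_ch = c_out // ratio                 # m intrinsic maps
        ghost_ch = c_out - init_ch               # ghost maps from Phi
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, init_ch, 1, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )
        self.cheap = nn.Sequential(              # Phi: depthwise convolution
            nn.Conv2d(init_ch, ghost_ch, dw_kernel, padding=dw_kernel // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(ghost_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )

    def forward(self, x):
        y = self.primary(x)                      # Y' = X * f' + b
        return torch.cat([y, self.cheap(y)], dim=1)
```

Concatenating the primary output Y′ with its cheap transforms yields the full set of feature maps at roughly half the multiply-accumulate cost of an ordinary convolution with the same output width.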
The neck network module adopts FPN and PAN structures: by upsampling and fusing features from different layers, it exploits the high resolution of low-level features and the semantic information of high-level features to output feature maps at three detection scales.
Loss values are computed from the three feature maps, and the pedestrian detection model is trained and updated according to the loss values.
The loss function comprises bounding-box regression loss, class prediction loss and confidence loss, which are summed with specific weights to give the total detection loss.
The bounding-box regression loss is represented as follows:

$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\mathrm{Distance\_2}^2}{\mathrm{Distance\_C}^2} + \alpha v$$

where IoU is the intersection-over-union of the predicted and ground-truth boxes, Distance_C is the diagonal distance of the smallest enclosing box, Distance_2 is the Euclidean distance between the center points of the two boxes, and v is a parameter measuring aspect-ratio consistency (with α its weight).
The class prediction loss and confidence loss use binary cross-entropy, expressed as follows:

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[\hat{y}_i\log y_i + (1-\hat{y}_i)\log(1-y_i)\right]$$

where n denotes the total number of categories, y_i is the probability of the current class obtained after the activation function, and ŷ_i is the true value (0 or 1) for the current class.
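As an illustration of the bounding-box regression term, the following PyTorch sketch computes the CIoU loss from the formula above (the class and confidence terms are standard binary cross-entropy and are omitted); the (x1, y1, x2, y2) box layout and the weight α = v / (1 − IoU + v) follow the published CIoU definition and are assumptions here.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss for boxes in (x1, y1, x2, y2) format:
    1 - IoU + Distance_2^2 / Distance_C^2 + alpha * v."""
    # IoU of predicted and ground-truth boxes
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Distance_2^2: squared distance between box centers
    d2 = ((pred[:, :2] + pred[:, 2:]) / 2 - (target[:, :2] + target[:, 2:]) / 2).pow(2).sum(dim=1)
    # Distance_C^2: squared diagonal of the smallest enclosing box
    enc_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    enc_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = enc_w.pow(2) + enc_h.pow(2) + eps

    # v measures aspect-ratio consistency; alpha is its trade-off weight
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))
    ).pow(2)
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + d2 / c2 + alpha * v).mean()
```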
In one embodiment, the neck network module performs multiple rounds of feature extraction on the pedestrian feature maps extracted by the backbone network module, generating pedestrian feature maps at three scales (1/8, 1/16 and 1/32 of the input); loss values are computed from these feature maps, and the pedestrian detection model is trained and updated according to the loss values to obtain the trained, enhanced pedestrian detection model.
S3, training a channel-level sparsely pruned pedestrian re-identification network using the pedestrian re-identification data set from S1 to obtain a pedestrian re-identification model;
In one embodiment, the pedestrian re-identification network in step S3 comprises a ResNet50 network and a BNNeck module; its input is a picture from the pedestrian re-identification data set;
the input picture is randomly cropped to different sizes and aspect ratios and rescaled to a uniform size; random erasing then occludes part of the picture with a rectangle filled with random values, yielding an enhanced image;
the enhanced image is input into the ResNet50 network, which is pre-trained on the ImageNet data set; pedestrian image features are extracted and globally pooled to obtain the global pedestrian feature F_global;
the BNNeck module separates the pedestrian re-identification losses into two different feature spaces for optimization, completing one pass of learning.
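A sketch of this augmentation pipeline using torchvision transforms is shown below; the 256×128 target size matches the re-identification input used later for quantization, while the crop and erasing ranges are illustrative assumptions.

```python
import torchvision.transforms as T

# Random crop over sizes/aspect ratios, rescale to one fixed size, then
# Random Erasing with a random-valued rectangle, as described above.
reid_train_transform = T.Compose([
    T.RandomResizedCrop(size=(256, 128), scale=(0.64, 1.0), ratio=(0.4, 0.6)),
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.33), value="random"),
])
```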
As shown in Fig. 3, the 50-layer residual network ResNet50 constructed in this step can be divided into seven parts. The first part contains no residual module and mainly applies convolution, regularization and max pooling to the input. The second through fifth parts each contain residual modules with three convolutions apiece; after the convolutions of these five parts, a pooling layer converts the output into feature vectors, and finally a classifier computes and outputs class probabilities.
The sixth part normalizes the obtained feature vectors through a BN layer, balancing the scale of each feature dimension and loosening the constraint that the ID loss places on the features.
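This BN-layer arrangement is the BNNeck: the pre-BN feature drives the metric losses while the normalized feature drives the ID classifier. A minimal sketch follows, assuming the bias-free choices of the publicly known bag-of-tricks design rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class BNNeck(nn.Module):
    """BNNeck: the pre-BN feature f_t feeds the triplet/center losses in
    Euclidean space, while the BN-normalized feature f_i feeds the ID
    (softmax) classifier, so the two losses optimize separate spaces."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm1d(feat_dim)
        self.bn.bias.requires_grad_(False)       # assumed: no learned shift
        self.classifier = nn.Linear(feat_dim, num_classes, bias=False)

    def forward(self, f_global):
        f_t = f_global                # used by triplet + center losses
        f_i = self.bn(f_global)       # used by the ID loss
        return f_t, self.classifier(f_i)
```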
In one embodiment, the ResNet50 network is processed with a channel-level sparse pruning method: a scaling factor α is introduced for each channel and learned jointly with the connectivity during normal network training; sparsity regularization is imposed on the scaling factors during training so that channel importance is identified automatically, and the channels whose trained scaling factors are low are finally pruned.
For example, channel-level sparse pruning is applied to some of the batch-normalization layers in the pedestrian re-identification network. BN performs the following transformation:

$$\hat{n} = \frac{n_{\mathrm{in}} - \mu_B}{\sqrt{\sigma_B^2 + \delta}}, \qquad n_{\mathrm{out}} = \alpha\hat{n} + \beta$$

where μ_B and σ_B² denote the mean and variance of the feature maps over the mini-batch, n_in and n_out are the input and output of the batch-normalization layer, α is the scaling factor, and δ and β are adjustment hyper-parameters. Connectivity is learned through normal network training; the scaling factors are sparsity-regularized during training so that channel importance is identified automatically, a global threshold is set over the network's layers, and channels with low scaling factors are pruned.
In one embodiment, the objective function of the pedestrian re-identification model pruning is:

$$\mathrm{Loss} = \sum_{(x,y)} l\left(f(x, A),\ y\right) + \beta\sum_{\alpha \in \Gamma} \lvert \alpha \rvert$$

where (x, y) are the training input and target; the first term as a whole is the original loss function of the un-pruned network; the second term is the penalty on the scaling factors; A denotes the trainable parameters of the network, α a scaling factor, Γ the set of all scaling factors, β the balance factor between the two terms, and |·| the L1 norm.
In one embodiment, the loss function of the pedestrian re-identification network is trained using the triplet loss, the center loss and the identity loss, which are summed with specific weights to give the total re-identification loss. The Loss of the pedestrian re-identification network in step S3 is:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid x_i) + \max\left(d_p - d_n + \alpha,\ 0\right) + \beta\sum_{i=1}^{N}\frac{1}{2}\left\|f_i - c_{y_i}\right\|_2^2$$

wherein: N is the number of samples, x_i is an input image and y_i its class label; p(y_i|x_i) denotes the softmax-predicted probability that x_i is recognized as y_i; d_p is the distance between a same-class image and the input image, and d_n the distance between a different-class image and the input image; α and β are hyper-parameters balancing the losses, and max(·) takes the maximum; f_i denotes the feature preceding the fully-connected layer, c_{y_i} the feature center of the y_i-th class, and ‖·‖_2 the L2 norm.
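The following PyTorch sketch assembles the three terms of the formula; batch-hard mining for d_p and d_n and the weight values are implementation assumptions not fixed by the patent.

```python
import torch
import torch.nn.functional as F

def reid_loss(logits, f_t, labels, centers, alpha=0.3, beta=5e-4):
    """ID loss + triplet loss + center loss, as in the formula above.
    `centers` is a (num_classes, feat_dim) table of class feature centers."""
    id_loss = F.cross_entropy(logits, labels)        # -1/N sum log p(y_i|x_i)

    # Batch-hard triplet: hardest positive / hardest negative per anchor
    dist = torch.cdist(f_t, f_t)                     # pairwise Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    d_p = dist.masked_fill(~same, 0).max(dim=1).values
    d_n = dist.masked_fill(same, float("inf")).min(dim=1).values
    triplet = F.relu(d_p - d_n + alpha).mean()       # max(d_p - d_n + alpha, 0)

    center = 0.5 * (f_t - centers[labels]).pow(2).sum(dim=1).mean()
    return id_loss + triplet + beta * center
```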
S4, using the pedestrian detection model trained in S2 and the pedestrian re-identification model from S3, performing quantized deployment on the compute-rich Sophon SC5 cloud AI computing accelerator card and building a pedestrian search system.
Fig. 4 is a flowchart of implementing model quantization according to the present invention, and step S4 specifically includes:
S41, selecting 3,000 pictures from the pedestrian detection training set constructed in step S1 and 1,500 pictures from the pedestrian re-identification training set constructed in step S1, and converting each selection into an LMDB data set using the convert_image interface in the BMNNSDK tool, with the pedestrian detection pictures sized 640×640 and the pedestrian re-identification pictures sized 256×128; the shuffle parameter is set so that picture and label order are randomly shuffled for subsequent calibration and quantization;
S42, converting the pedestrian detection model and the pedestrian re-identification model trained in S2 and S3 into fp32umodel files and corresponding prototxt files using the BMNNSDK2 SDK, where fp32umodel is a format proprietary to the Bitmain platform; for the prototxt generated in S42, preprocessing parameters corresponding to the image preprocessing of S2 and S3 are added to the data layer, ensuring that the data fed to the network is consistent with the original framework;
S43, converting the fp32umodel from step S42 into the Bitmain-proprietary intermediate model int8umodel using the calibration_use_pb quantization tool, with the LMDB data set from S41 as quantization calibration data and the prototxt file from S42 as the network definition, where int8umodel is the int8-format network coefficient file produced by quantization;
S44, checking the network error of the int8umodel converted in S43 using the calibration visual analysis tool, with the mean absolute percentage error and a cosine function defined as the error evaluation criteria:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\mathrm{Forecast}_i - \mathrm{Actual}_i}{\mathrm{Actual}_i}\right|$$

$$\cos(\theta) = \frac{\sum_{i=1}^{n}\mathrm{Actual}_i \cdot \mathrm{Forecast}_i}{\sqrt{\sum_{i=1}^{n}\mathrm{Actual}_i^2}\,\sqrt{\sum_{i=1}^{n}\mathrm{Forecast}_i^2}}$$

where Actual_i denotes a true value, Forecast_i a predicted value, and n the number of samples;
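The two criteria are plain formulas and can be checked independently of the SDK; a small NumPy sketch, assuming the fp32 and int8 layer outputs are compared as flattened arrays:

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error between reference and quantized outputs."""
    a, f = actual.ravel(), forecast.ravel()
    return 100.0 / a.size * np.abs((f - a) / a).sum()

def cosine_sim(actual, forecast):
    """Cosine similarity between the flattened output tensors."""
    a, f = actual.ravel(), forecast.ravel()
    return float(a @ f / (np.linalg.norm(a) * np.linalg.norm(f)))
```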
S45, after the error evaluation criteria confirm that the quantization precision is normal, using the BMNETU tool provided by the BMNNSDK2 SDK (a BMNETU interface in Quantization-Tool, the network model quantization tool developed by Bitmain) with the int8umodel from step S43 as input to compile the files required by BMRuntime; during compilation the result of each NPU model layer is compared with the CPU calculation result, yielding the int8bmodel models for pedestrian detection and pedestrian re-identification;
if the errors are within the specified range, the final model can be generated. If not, check which layers show large quantization errors before and after, and set the fpfwd_outputs parameter around those layers so that they keep floating-point computation and are not quantized, leaving the network's quantization accuracy unaffected;
S46, inputting multiple surveillance video streams based on the int8bmodel quantized in step S45, decoding each video frame on the designated chip, and running the int8bmodel quantized in step S45 for pedestrian detection on every frame of every stream to obtain pedestrian bounding boxes and confidences;
S47, filtering the pedestrian bounding boxes and confidences with the DIoU-NMS method, a pedestrian whose final confidence is below the confidence threshold being suppressed, to obtain the filtered video-surveillance pedestrian images, with the formula:
$$N_i = \begin{cases} N_i, & \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(M, B_i) < \varepsilon \\ 0, & \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(M, B_i) \ge \varepsilon \end{cases}$$

where ε is the NMS threshold, N_i the classification confidence, M the highest-confidence detection box, B_i a bounding box, and R_DIoU the normalized center distance between the two bounding boxes;
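A reference sketch of DIoU-NMS following the formula above; boxes are assumed in (x1, y1, x2, y2) format, and suppressed boxes are dropped outright rather than carried with a zero score.

```python
import torch
from torchvision.ops import box_iou

def diou_nms(boxes, scores, eps_thresh=0.5):
    """DIoU-NMS: a box is suppressed only when its IoU with the current
    best box, minus R_DIoU (normalized squared center distance over the
    enclosing-box diagonal), reaches the threshold epsilon."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        m = order[0]
        keep.append(int(m))
        if order.numel() == 1:
            break
        rest = order[1:]
        iou = box_iou(boxes[m].unsqueeze(0), boxes[rest]).squeeze(0)
        cm = (boxes[m, :2] + boxes[m, 2:]) / 2
        cr = (boxes[rest, :2] + boxes[rest, 2:]) / 2
        center_d2 = (cm - cr).pow(2).sum(dim=1)
        enc_tl = torch.min(boxes[m, :2], boxes[rest, :2])
        enc_br = torch.max(boxes[m, 2:], boxes[rest, 2:])
        diag2 = (enc_br - enc_tl).pow(2).sum(dim=1) + 1e-7
        order = rest[(iou - center_d2 / diag2) < eps_thresh]  # N_i kept as-is
    return keep
```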
S48, cropping each pedestrian screened in step S47 from its video frame as a pedestrian-library picture; when the configured batch size is reached, sending each batch of the pedestrian library to be identified into the re-identification int8bmodel quantized in step S45 to extract pedestrian picture features, obtaining candidate-set features, and inputting the pedestrian image to be queried to obtain query-set features. For example, Fig. 5 is a flowchart of the pedestrian search method in an embodiment of the invention: the pedestrian picture to be retrieved and the per-pedestrian video streams are input, and the quantized pedestrian re-identification model extracts features from the picture to be retrieved, giving the query-set features.
S49, performing Euclidean-distance feature calculation between the candidate-set features and the query-set features from S48 to obtain pedestrian similarity values, and judging whether a similarity value is greater than the preset pedestrian threshold to obtain the re-identification result for the given target pedestrian.
Specifically, the quantized pedestrian detection model detects pedestrians in each video stream; if a detection's score exceeds the set threshold, it is judged to be a pedestrian, cropped from the corresponding video frame and stored in the pedestrian library.
When a configured batch count has accumulated, the pedestrian re-identification model extracts features from that batch of pedestrian-library pictures to obtain the candidate-set features.
The similarity between query-set and candidate-set features is judged by cosine similarity, expressed as follows:

$$\cos(\theta) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where x_i and y_i are the feature vectors of the two pictures and n is the vector dimension.
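A small NumPy sketch of this ranking step; the 0.7 threshold stands in for the preset pedestrian threshold and is an illustrative assumption.

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats, threshold=0.7):
    """Cosine similarity between one query feature and the candidate-set
    features; returns indices sorted best-first, the sorted similarities,
    and which of them exceed the preset pedestrian threshold."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q
    order = np.argsort(-sims)
    return order, sims[order], sims[order] > threshold
```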
Judge whether the computed pedestrian similarity exceeds the set pedestrian threshold; if so, the pedestrian to be identified is the specific pedestrian in the pedestrian library, and the frame number of the corresponding video stream is stored. If not, compute the similarity for the next pedestrian-library picture and repeat.
Corresponding to the embodiment of the high-performance pedestrian retrieval and re-identification method, the invention also provides an embodiment of the high-performance pedestrian retrieval and re-identification device.
Referring to fig. 6, the high-performance pedestrian retrieval and re-identification apparatus provided in the embodiment of the present invention includes a memory and one or more processors, where the memory stores executable codes, and when the processors execute the executable codes, the processor is configured to implement the high-performance pedestrian retrieval and re-identification method in the foregoing embodiment.
The embodiment of the high-performance pedestrian retrieval and re-identification apparatus can be applied to any device with data processing capability, such as a computer. The apparatus embodiments may be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, the apparatus is formed, as a logical device, by the processor of the device reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, Fig. 6 shows the hardware structure of a device with data processing capability on which the apparatus resides; besides the processor, memory, network interface and non-volatile storage shown in Fig. 6, the device may generally include other hardware according to its actual functions, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, and when the program is executed by a processor, the method for pedestrian search and re-identification in the above-mentioned embodiment is implemented.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the data processing capability-enabled devices of any of the previous embodiments. The computer readable storage medium may also be any external storage device of a device having data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any data processing capable device. The computer-readable storage medium is used for storing a computer program and other programs and data required by any data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (10)

1. A high-performance pedestrian retrieval and re-identification method is characterized by comprising the following steps:
S1, acquiring single-view pedestrian data and multi-view pedestrian data under real monitoring scenes respectively, wherein a pedestrian detection data set is constructed from the single-view pedestrian data together with part of the pedestrian data in the COCO data set, and a pedestrian re-identification data set is constructed from the multi-view pedestrian data;
S2, training a pedestrian detection model based on a YOLOv5 pedestrian detection algorithm improved with the Ghost lightweight model, using the pedestrian detection data set from S1;
S3, training a channel-level sparsely pruned pedestrian re-identification network using the pedestrian re-identification data set from S1 to obtain a pedestrian re-identification model;
S4, using the pedestrian detection model trained in S2 and the pedestrian re-identification model from S3, performing quantized deployment on the compute-rich Sophon SC5 cloud AI computing accelerator card and building a pedestrian search system.
2. The method according to claim 1, wherein the pedestrian detection model in step S2 comprises four modules, namely an input module, a backbone network module, a neck network module and an output module, its input being a picture from the pedestrian detection data set;
the picture first enters the backbone network module through the input module, where pedestrian feature maps are extracted; the feature maps are sent to the neck network module, which enhances the model's detection of pedestrian features at different scales; the enhanced feature maps are sent to the output module, which makes predictions on them and generates the predicted bounding boxes and categories.
3. The high-performance pedestrian retrieval and re-identification method according to claim 2, wherein the backbone network module comprises three sub-modules, namely a Focus sub-module, a CBL sub-module and a GhostCSP sub-module;
the Focus sub-module slices the input picture, downsampling it by taking every other pixel; the CBL sub-module performs convolution on the input; the GhostCSP sub-module is generated by substituting Ghost networks into the CSP structure: a Ghost network with stride 1 replaces the residual components in the CSP structure, and a Ghost network with stride 2 replaces the convolutional layer in the CSP structure to realize downsampling.
4. The high-performance pedestrian retrieval and re-identification method according to claim 3, wherein the neck network module performs multiple rounds of feature extraction on the pedestrian feature maps extracted by the backbone network module, generating pedestrian feature maps at three scales (1/8, 1/16 and 1/32 of the input); loss values are computed on these feature maps, and the pedestrian detection model is trained and updated according to the loss values to obtain the trained, enhanced pedestrian detection model.
5. The method according to claim 1, wherein the pedestrian re-identification network in step S3 comprises a ResNet50 network and a BNNeck module, its input being a picture from the pedestrian re-identification data set;
the input picture is randomly cropped to different sizes and aspect ratios and rescaled to a uniform size; random erasing then occludes part of the picture with a rectangle filled with random values, yielding an enhanced image;
the enhanced image is input into the ResNet50 network, which is pre-trained on the ImageNet data set; pedestrian image features are extracted and globally pooled to obtain the global pedestrian feature F_global;
the BNNeck module separates the pedestrian re-identification losses into two different feature spaces for optimization, completing one pass of learning.
6. The method according to claim 5, wherein the loss function Loss of the pedestrian re-identification network in step S3 is:

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\log p(y_i \mid x_i) + \max\left(d_p - d_n + \alpha,\ 0\right) + \beta\sum_{i=1}^{N}\frac{1}{2}\left\|f_i - c_{y_i}\right\|_2^2$$

wherein: N is the number of samples, x_i is an input image and y_i its class label; p(y_i|x_i) denotes the softmax-predicted probability that x_i is recognized as y_i; d_p is the distance between a same-class image and the input image, and d_n the distance between a different-class image and the input image; α and β are hyper-parameters balancing the losses, and max(·) takes the maximum; f_i denotes the feature preceding the fully-connected layer, c_{y_i} the feature center of the y_i-th class, and ‖·‖_2 the L2 norm.
7. The high-performance pedestrian retrieval and re-identification method according to claim 5, wherein the ResNet50 network is processed by a channel-level sparse pruning method, a scaling factor α is introduced into each channel, connectivity is learned through normal network training, the scaling factors are sparsely regularized in the training process, the importance of the channels is automatically identified, and finally the channels with lower scaling factors obtained through training are pruned.
8. The method for high-performance pedestrian retrieval and re-identification according to claim 1, wherein the objective function of the pedestrian re-identification model pruning is:

$$\mathrm{Loss} = \sum_{(x,y)} l\left(f(x, A),\ y\right) + \beta\sum_{\alpha \in \Gamma} \lvert \alpha \rvert$$

where (x, y) are the training input and target; the first term as a whole is the original loss function of the un-pruned network; the second term is the penalty on the scaling factors; A denotes the trainable parameters of the network, α a scaling factor, Γ the set of all scaling factors, β the balance factor between the two terms, and |·| the L1 norm.
9. The method for retrieving and re-identifying pedestrians according to claim 1, wherein step S4 specifically comprises:
S41, selecting 3,000 pictures from the pedestrian detection training set constructed in step S1 and 1,500 pictures from the pedestrian re-identification training set constructed in step S1, and converting each selection into an LMDB data set for subsequent calibration and quantization;
S42, converting the pedestrian detection model and the pedestrian re-identification model trained in S2 and S3 into fp32umodel files and corresponding prototxt files using the BMNNSDK2 SDK, where fp32umodel is a format proprietary to the Bitmain platform;
S43, converting the fp32umodel from step S42 into the Bitmain-proprietary intermediate model int8umodel using the calibration_use_pb quantization tool, with the LMDB data set from S41 as quantization calibration data, where int8umodel is the int8-format network coefficient file produced by quantization;
S44, checking the network error of the int8umodel converted in S43 using the calibration visual analysis tool, with the mean absolute percentage error and a cosine function defined as the error evaluation criteria:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\mathrm{Forecast}_i - \mathrm{Actual}_i}{\mathrm{Actual}_i}\right|$$

$$\cos(\theta) = \frac{\sum_{i=1}^{n}\mathrm{Actual}_i \cdot \mathrm{Forecast}_i}{\sqrt{\sum_{i=1}^{n}\mathrm{Actual}_i^2}\,\sqrt{\sum_{i=1}^{n}\mathrm{Forecast}_i^2}}$$

where Actual_i denotes a true value, Forecast_i a predicted value, and n the number of samples;
S45, after the error evaluation criteria confirm that the quantization precision is normal, using the BMNETU tool provided by the BMNNSDK2 SDK with the int8umodel from step S43 as input to compile the files required by BMRuntime, comparing the per-layer NPU results with the CPU calculation results, and obtaining the int8bmodel models for pedestrian detection and pedestrian re-identification;
S46, inputting multiple surveillance video streams based on the int8bmodel quantized in step S45, decoding each video frame on the designated chip, and running the int8bmodel quantized in step S45 for pedestrian detection on every frame of every stream to obtain pedestrian bounding boxes and confidences;
S47, filtering the pedestrian bounding boxes and confidences with the DIoU-NMS method, a pedestrian whose final confidence is below the confidence threshold being suppressed, to obtain the filtered video-surveillance pedestrian images, with the formula:
$$N_i = \begin{cases} N_i, & \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(M, B_i) < \varepsilon \\ 0, & \mathrm{IoU}(M, B_i) - R_{\mathrm{DIoU}}(M, B_i) \ge \varepsilon \end{cases}$$

where ε is the NMS threshold, N_i the classification confidence, M the highest-confidence detection box, B_i a bounding box, and R_DIoU the normalized center distance between the two bounding boxes;
S48, cropping each pedestrian screened in step S47 from its video frame as a pedestrian-library picture; when the configured batch size is reached, sending each batch of the pedestrian library to be identified into the re-identification int8bmodel quantized in step S45 to extract pedestrian picture features, obtaining candidate-set features, and inputting the pedestrian image to be queried to obtain query-set features;
S49, performing Euclidean-distance feature calculation between the candidate-set features and the query-set features from S48 to obtain pedestrian similarity values, and judging whether a similarity value is greater than the preset pedestrian threshold to obtain the re-identification result for the given target pedestrian.
10. A high performance pedestrian retrieval and re-identification apparatus comprising a memory and one or more processors, the memory having stored therein executable code, wherein the processors, when executing the executable code, are configured to implement the high performance pedestrian retrieval and re-identification method according to any one of claims 1 to 9.
CN202210409679.4A 2022-04-19 2022-04-19 High-performance pedestrian retrieval and re-identification method and device Pending CN115063831A (en)

Priority Applications (1)

Application Number: CN202210409679.4A | Priority Date: 2022-04-19 | Filing Date: 2022-04-19 | Title: High-performance pedestrian retrieval and re-identification method and device

Publications (1)

Publication Number: CN115063831A | Publication Date: 2022-09-16

Family: ID=83196926

Family Applications (1)

Application Number: CN202210409679.4A | Priority Date: 2022-04-19 | Filing Date: 2022-04-19

Country Status (1)

Country: CN | Link: CN115063831A (en)

Cited By (1)

Publication number: CN116883916A * | Priority date: 2023-09-08 | Publication date: 2023-10-13 | Assignee: 深圳市国硕宏电子有限公司 | Title: Conference abnormal behavior detection method and system based on deep learning

* Cited by examiner, † Cited by third party


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination