CN110659576A - Pedestrian searching method and device based on joint judgment and generation learning


Info

Publication number
CN110659576A
Authority
CN
China
Prior art keywords
pedestrian
video
identified
image
feature vector
Prior art date
Legal status
Pending
Application number
CN201910783692.4A
Other languages
Chinese (zh)
Inventor
张斯尧
谢喜林
王思远
黄晋
蒋杰
张诚
文戎
Current Assignee
Shenzhen Jiu Ling Software Engineering Co Ltd
Original Assignee
Shenzhen Jiu Ling Software Engineering Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Jiu Ling Software Engineering Co Ltd filed Critical Shenzhen Jiu Ling Software Engineering Co Ltd
Priority to CN201910783692.4A
Publication of CN110659576A

Classifications

    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The embodiment of the invention provides a pedestrian searching method and device based on joint judgment and generation learning, wherein the method comprises the steps of obtaining a pedestrian re-identification system model; segmenting the pedestrian video to be identified based on the key frame of the pedestrian video to be identified, and extracting a pedestrian feature vector from the segmented pedestrian video to be identified through the pedestrian re-identification system model; and calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be recognized, and obtaining a retrieval image of the target pedestrian from the pedestrian video to be recognized according to the similarity. By the embodiment of the invention, the efficiency, accuracy and real-time performance of pedestrian video search can be improved.

Description

Pedestrian searching method and device based on joint judgment and generation learning
Technical Field
The invention belongs to the technical field of computer vision and intelligent traffic, and particularly relates to a pedestrian searching method, a pedestrian searching device, terminal equipment and a computer readable medium based on joint judgment and generation learning.
Background
With the continuous development of artificial intelligence, computer vision and hardware technology, video image processing technology has attracted more and more attention from researchers. However, current methods for dynamically identifying and monitoring pedestrians directly from video are not mature: on the one hand, every frame in a video is dynamically refreshed and camera capture rates keep increasing; on the other hand, existing video pedestrian search methods are inefficient and inaccurate. It is therefore difficult to search a large amount of video for a target pedestrian accurately and in real time.
Disclosure of Invention
In view of this, embodiments of the present invention provide a pedestrian search method, apparatus, terminal device and computer readable medium based on joint judgment and generation learning, which can improve efficiency, accuracy and real-time performance of video pedestrian search.
A first aspect of an embodiment of the present invention provides a pedestrian search method based on joint judgment and generative learning, including:
acquiring a pedestrian re-identification system model;
segmenting the pedestrian video to be identified based on the key frame of the pedestrian video to be identified, and extracting a pedestrian feature vector from the segmented pedestrian video to be identified through the pedestrian re-identification system model;
and calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be recognized, and obtaining a retrieval image of the target pedestrian from the pedestrian video to be recognized according to the similarity.
A second aspect of an embodiment of the present invention provides a pedestrian search apparatus based on joint determination and generation learning, including:
the acquisition module is used for acquiring a pedestrian re-identification system model;
the extraction module is used for carrying out segmentation processing on the pedestrian video to be recognized based on the key frame of the pedestrian video to be recognized and extracting a pedestrian feature vector from the segmented pedestrian video to be recognized through the pedestrian re-recognition system model;
and the retrieval module is used for calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be identified and obtaining the retrieval image of the target pedestrian from the pedestrian video to be identified according to the similarity.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the above pedestrian search method based on joint judgment and generation learning when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable medium storing a computer program which, when executed by a processor, implements the steps of the above-mentioned pedestrian search method based on joint judgment and generation learning.
In the pedestrian searching method based on joint judgment and generation learning provided by the embodiment of the invention, a pedestrian re-identification system model can be obtained, the pedestrian video to be identified is segmented based on the key frame of the pedestrian video to be identified, the pedestrian feature vector is extracted from the segmented pedestrian video to be identified through the pedestrian re-identification system model, the similarity between the feature vector of a target pedestrian and the pedestrian feature vector in the pedestrian video to be identified is calculated, and the retrieval image of the target pedestrian is obtained from the pedestrian video to be identified according to the similarity, so that the efficiency and the accuracy of video pedestrian searching can be improved, and the requirement of real-time searching can be met.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a flow chart of a pedestrian search method based on joint judgment and generative learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of reconstruction of a given pedestrian image in a pedestrian search method based on joint judgment and generation learning according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of cross-synthesis of given pedestrian images with different identities in a pedestrian search method based on joint judgment and generation learning according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating Map and Reduce processing performed on a pedestrian video in a pedestrian search method based on joint judgment and generation learning according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a pedestrian searching apparatus based on joint judgment and generation learning according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a detailed structure of the acquisition module in FIG. 5;
FIG. 7 is a schematic diagram of a refined structure of the extraction module in FIG. 5;
FIG. 8 is a schematic diagram of a detailed structure of the retrieval module in FIG. 5;
fig. 9 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a flowchart of a pedestrian searching method based on joint judgment and generation learning according to an embodiment of the present invention. As shown in fig. 1, the pedestrian search method based on joint determination and generation learning of the present embodiment includes the steps of:
s101: and acquiring a pedestrian re-identification system model.
In the embodiment of the invention, a generative learning convolutional neural network module can be constructed based on given pedestrian images; the generative learning convolutional neural network module comprises a self-generation network and a cross generation network, wherein the self-generation network is used for reconstructing a given pedestrian image to generate a reconstructed image, and the cross generation network is used for synthesizing pedestrian images with different identity characteristics to generate a synthesized image. Then, a joint discrimination module is constructed on top of the generative learning convolutional neural network module, wherein the joint discrimination module is used for learning the preliminary features of the reconstructed image and the synthesized image and for mining their fine-grained features. Finally, the loss function of the overall network formed by the generative learning convolutional neural network module and the joint discrimination module is determined from the loss functions of the two modules, and the pedestrian re-identification system model is determined according to this overall network loss function; the pedestrian re-identification system model is used for re-identifying pedestrians based on pedestrian images.
Specifically, a big data platform based on Spark (Apache Spark, a parallel computing architecture from the Apache Software Foundation built on HDFS) and the MLbase machine learning library (a part of the Spark ecosystem dedicated to machine learning) can be built. Over the past decade, extensible distributed programming frameworks have emerged to manage big data. The first such programming model was MapReduce and its open-source implementation Apache Hadoop (a distributed system infrastructure developed by the Apache Foundation, Hadoop for short). In recent years, a new distributed framework, Apache Spark, has emerged as a platform for fast and versatile large-scale data processing. The Spark platform is based on in-memory computation and is naturally suited to big data processing and analysis. Spark retains the advantages of Hadoop MapReduce, but unlike MapReduce, the intermediate output of a Job can be kept in memory without reading and writing HDFS (the Hadoop Distributed File System), so Spark is better suited to MapReduce-style algorithms that require iteration, such as data mining and machine learning. Since the video data is stored in the HDFS file system, Spark accesses the data source via TCP sockets and performs intelligent video analysis using the MapReduce distributed computing model. Within the Spark component stack, MLlib is Spark's implementation library of commonly used machine learning algorithms. MLlib currently supports four common machine learning problems: binary classification, regression, clustering and collaborative filtering, and also includes a basic gradient descent optimization algorithm at the bottom layer. A machine learning algorithm comprises two parts, training and prediction: a model is first trained and then used to predict unknown samples. MLbase automatically optimizes for distributed execution and performs algorithm selection according to MLbase best practices and cost-based models. The embodiment of the invention uses MLbase as a tool to handle feature detection and training for vehicles, faces, pedestrians, abandoned objects and the like in videos. After the Spark big data platform based on the MLbase machine learning library has been built from the corresponding components, video images are fed into the platform and the subsequent algorithm operations are carried out. In the subsequent construction of the pedestrian re-identification system model, a generative learning convolutional neural network module is first built from pedestrian images obtained from the Spark big data platform. Typically, a set of pedestrian images can first be defined as
$\{x_i\}_{i=1}^{N}$ with identity labels $\{y_i\}_{i=1}^{N}$, where N is the number of images, $y_i \in [1, K]$, and K is the number of identity categories in the dataset. Given two real pedestrian images $x_i$ and $x_j$ in the training set, the generative learning convolutional neural network module provided by the embodiment of the invention can generate a new pedestrian image by decoupling and recombining the latent codes of the given images. The module consists of a surface (appearance) encoder $E_a: x_i \to a_i$, a structure encoder $E_s: x_j \to s_j$, and a decoder $G: (a_i, s_j) \to x_j^i$. In order to make the generated image more controllable and to better match the data distribution of the real dataset, the algorithm provided by the embodiment of the invention strengthens two components contained in the generative learning convolutional neural network module: 1. the self-generation network; 2. the cross generation network.
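As a concrete illustration of this decoupled encoder/decoder layout, a minimal PyTorch sketch is given below. The module names, layer shapes and code dimensions are illustrative assumptions for exposition, not the network definition of the patent.

```python
import torch
import torch.nn as nn

class AppearanceEncoder(nn.Module):          # E_a: x_i -> a_i (surface code)
    def __init__(self, code_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim))
    def forward(self, x):
        return self.net(x)                   # one vector per image

class StructureEncoder(nn.Module):           # E_s: x_j -> s_j (structure code)
    def __init__(self, code_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, code_ch, 4, stride=2, padding=1))
    def forward(self, x):
        return self.net(x)                   # spatial map keeps pose/shape layout

class Decoder(nn.Module):                    # G: (a, s) -> generated image
    def __init__(self, code_dim=128, code_ch=16):
        super().__init__()
        self.fuse = nn.Conv2d(code_ch + code_dim, 64, 3, padding=1)
        self.up = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh())
    def forward(self, a, s):
        # broadcast the appearance vector over the spatial structure map
        a_map = a[:, :, None, None].expand(-1, -1, s.size(2), s.size(3))
        return self.up(self.fuse(torch.cat([s, a_map], dim=1)))
```

Swapping which image supplies `a` and which supplies `s` is what distinguishes the self-generation and cross generation paths described next.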
With respect to the self-generation network: given any pedestrian image $x_i$, the generative learning convolutional neural network module first learns how to reconstruct the image from itself. The general approach is shown in Fig. 2. Given two pedestrian images $x_i$ and $x_t$ with the same identity (i.e., the same person), their structure information codes correspond to the same identity (i.e., $y_i = y_t$). For pedestrian image $x_i$, a reconstructed image $x_i^i = G(a_i, s_i)$ can be generated from its own surface information code $a_i$ and structure information code $s_i$; for pedestrian image $x_t$, a reconstructed image $x_i^t = G(a_t, s_i)$ can be generated from its surface information code $a_t$ combined with the structure information code $s_i$ of $x_i$. This kind of simple self-reconstruction task provides an important regularization for the entire generative learning. The image reconstruction loss function for reconstructing a pedestrian image in the present invention is:

$$L_{recon}^{img} = \mathbb{E}\big[\lVert x_i - G(a_i, s_i)\rVert_1\big] \quad (1)$$
where $G(a_i, s_i)$ denotes the image reconstructed from pedestrian image $x_i$, $a_i$ denotes the surface information code of $x_i$, $s_i$ denotes the structure information code of $x_i$, and $\mathbb{E}$ is the expectation operator (the same below). Meanwhile, in order to strengthen the distinction between the surface information codes of different images, the embodiment of the invention constructs an identity loss function in the self-generation network to distinguish the identity characteristics of different images, with the formula:

$$L_{id}^{self} = \mathbb{E}\big[-\log p(y_i \mid x_i)\big] \quad (2)$$

where $p(y_i \mid x_i)$ is the predicted probability, based on the surface information code of the image, that image $x_i$ belongs to identity class $y_i$.
Regarding the cross-identity generation network: unlike the self-generation network, the cross-identity generation network performs generative learning across different identity characteristics of images. More specifically, the cross generation network synthesizes pedestrian images with different identity characteristics to generate a synthesized image. The present invention uses a latent code reconstruction technique based on the surface information code and the structure information code to control image generation. As shown in Fig. 3, given two pedestrian images $x_i$ and $x_j$ with different identities (i.e., not the same person, so $y_i \neq y_j$), the two images have different structure information codes; further, since the clothes, trousers, shoes, and so on of the pedestrians in the two images differ, their surface information codes also differ. From the surface information code $a_i$ of image $x_i$ and the structure information code $s_j$ of image $x_j$, a generated image $x_j^i = G(a_i, s_j)$ can be synthesized; likewise, from the surface information code $a_j$ of image $x_j$ and the structure information code $s_i$ of image $x_i$, a generated image $x_i^j = G(a_j, s_i)$ can be synthesized. The generated image $x_j^i = G(a_i, s_j)$ must preserve the information of the surface information code $a_i$ of $x_i$ and of the structure information code $s_j$ of $x_j$; the two latent codes should therefore be recoverable from the generated image. The specific formulas are:

$$L_{recon}^{code1} = \mathbb{E}\big[\lVert a_i - E_a(G(a_i, s_j))\rVert_1\big] \quad (3)$$
$$L_{recon}^{code2} = \mathbb{E}\big[\lVert s_j - E_s(G(a_i, s_j))\rVert_1\big] \quad (4)$$

where $L_{recon}^{code1}$ is the surface information code reconstruction loss used to recover the surface information code from the generated pedestrian image, and $L_{recon}^{code2}$ is the structure information code reconstruction loss used to recover the structure information code from the generated pedestrian image.
Similarly, the loss function for distinguishing the identity features of different images in the cross-identity generation network is:

$$L_{id}^{cross} = \mathbb{E}\big[-\log p(y_i \mid x_j^i)\big]$$

where $p(y_i \mid x_j^i)$ is the predicted probability that the synthesized image $x_j^i$ carries the identity label $y_i$ of the real pedestrian image $x_i$. In addition, the embodiment of the present invention provides an adversarial loss function that matches the generated images to the real data distribution:

$$L_{adv} = \mathbb{E}\big[\log D(x_i) + \log\big(1 - D(G(a_j, s_j))\big)\big] \quad (5)$$
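To make the interaction of losses (1)-(5) concrete, a hedged PyTorch sketch follows, under the assumptions of the earlier architecture sketch; `id_classifier` and `discriminator` are hypothetical auxiliary networks, and the exact loss forms used by the patent may differ in detail.

```python
import torch
import torch.nn.functional as F

def generative_losses(x_i, x_j, y_i, E_a, E_s, G, id_classifier, discriminator):
    a_i, s_i = E_a(x_i), E_s(x_i)
    a_j, s_j = E_a(x_j), E_s(x_j)

    # (1) image reconstruction: rebuild x_i from its own codes
    L_img = F.l1_loss(G(a_i, s_i), x_i)

    # (2) identity loss on the real image x_i
    L_id_self = F.cross_entropy(id_classifier(x_i), y_i)

    # cross-identity synthesis: appearance of x_i on the structure of x_j
    x_gen = G(a_i, s_j)

    # (3)/(4) latent code reconstruction on the generated image;
    # the original codes are treated as fixed targets
    L_code = (F.l1_loss(E_a(x_gen), a_i.detach())
              + F.l1_loss(E_s(x_gen), s_j.detach()))

    # (5) adversarial term, written as the discriminator-side loss;
    # D is assumed to output a probability in (0, 1)
    L_adv = -(torch.log(discriminator(x_i))
              + torch.log(1.0 - discriminator(G(a_j, s_j)))).mean()
    return L_img, L_id_self, L_code, L_adv
```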
based on the loss function and the picture processing method, a learning convolutional neural network module can be constructed and generated by combining the existing residual error network 50(Res-net50) model.
Further, regarding the construction of the joint discrimination module: the joint discrimination module can be divided into two main parts, primary feature learning and fine-grained feature mining. Prior experience shows that the online generated images (for example, the reconstructed images and synthesized images produced by the generative learning convolutional neural network module) can be better utilized through these two parts. Because the two parts typically focus on different aspects of the generated image, the algorithm provided by the embodiment of the invention branches two lightweight heads off the top of the appearance encoder, one for each part of the feature learning. Regarding the learning of the primary features (which may also be referred to as preliminary features), the images generated in S101 can be regarded as similar to training samples in an existing model. However, the variation of the pedestrian images across categories and across ID combinations leads the embodiment of the invention to adopt a teacher-student style supervision method with dynamic soft labels. The teacher model in this method is simply a baseline CNN (convolutional neural network) trained with an identification loss on the original training set; as this is the same as the prior art, it is not described again here. In order to train the discrimination model with the primary features, the algorithm provided in the embodiment of the present invention minimizes the Kullback-Leibler divergence (KL divergence for short) between the probability distribution p predicted by the joint discrimination module and the probability distribution q predicted by the teacher model, using the following loss function:

$$L_{prim} = \mathbb{E}\left[-\sum_{k=1}^{K} q(k \mid x_j^i)\,\log\frac{p(k \mid x_j^i)}{q(k \mid x_j^i)}\right] \quad (6)$$

where K is the number of identity classes, $p(k \mid x_j^i)$ is the probability predicted by the joint discrimination module that the synthesized image $x_j^i$ belongs to identity class k, and $q(k \mid x_j^i)$ is the corresponding probability predicted by the teacher model. In other words, this loss function is used to learn the preliminary features of the reconstructed image and the synthesized image. Compared with fixed feature labels, the dynamic labels used by the method are better suited to the joint discrimination model and can enhance the reliability of the primary-feature discrimination model.
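The KL term in eq. (6) can be written, for instance, as the following minimal PyTorch sketch; the variable names are assumptions, with `student_logits` coming from the joint discrimination module and `teacher_logits` from the fixed baseline CNN teacher.

```python
import torch.nn.functional as F

def primary_feature_loss(student_logits, teacher_logits):
    p_log = F.log_softmax(student_logits, dim=1)   # log p(k | x_j^i)
    q = F.softmax(teacher_logits, dim=1).detach()  # q(k | x_j^i), teacher frozen
    # KL(q || p) = sum_k q * (log q - log p); batchmean averages over samples
    return F.kl_div(p_log, q, reduction='batchmean')
```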
Regarding the mining of fine-grained features: unlike primary feature learning, which acts directly on the acquired generated data, fine-grained feature mining is mainly reflected in shifting the points of interest on general pedestrian images. Fine-grained feature mining trains on the pedestrian images in the training library or on the images generated in S101, and forces the joint discrimination module to learn fine-grained ID-related attributes (such as hair, hat, bag, body shape, etc.) that are independent of clothing. In this part's fine-grained feature discrimination model, the algorithm provided by the embodiment of the invention treats an image generated by combining one structure information code with different appearance information codes as belonging to the same class as the real image that provided the structure code. To realize this function, the fine-grained feature discrimination model in the joint discrimination module is obtained by training with the following loss function, which strengthens this specific classification:

$$L_{fine} = \mathbb{E}\big[-\log p(y_j \mid x_j^i)\big] \quad (7)$$
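In code, eq. (7) reduces to a cross-entropy on the generated image against the structure provider's label; a tiny hedged sketch (the names are assumptions):

```python
import torch.nn.functional as F

def fine_grained_loss(logits_on_generated, y_structure):
    # eq. (7): the synthesized image x_j^i inherits the identity label y_j
    # of the image x_j that supplied its structure code
    return F.cross_entropy(logits_on_generated, y_structure)
```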
such a loss function may impose additional identity supervision on the identification discrimination module in a multitasking manner. Compared with the existing method for mining the fine-grained feature samples, the algorithm does not need to explicitly search the hard training samples with the fine granularity, and the identification module of the algorithm is focused on the fine identity feature attributes of the pedestrian images through the fine-grained feature mining of the method.
It can be generally considered that a high quality composite image can be considered "portal" in nature (as opposed to "outliers") because the images generated by the generative learning model preserve and recombine visual content from the real data. Through the two characteristic learning tasks, the combined judgment module of the method can enable the integrally built model network to specifically use the generated data according to the operation on the appearance information code and the structure information code.
The method of the invention does not use single supervision like all previous methods, but processes the generated image from two different angles through the learning of primary features (also called main features) and the mining of fine-grained features, wherein the former focuses on the costume external information with invariable structural information, and the latter focuses on the structural clue with invariable apparent information.
Further, regarding the determination of the pedestrian re-identification system model: the embodiment of the present invention trains the surface encoder, the structure encoder, the decoder and the joint discrimination module together to optimize an overall objective. That is, based on the image reconstruction loss functions, the identity loss function of the self-generation network, the surface information code reconstruction loss function, the structure information code reconstruction loss function, the identity loss function of the cross generation network, and the adversarial loss function, together with the losses of the joint discrimination module, the overall network loss function of the generative learning convolutional neural network module and the joint discrimination module can be constructed as:

$$L_{total} = \lambda_{img}\big(L_{recon}^{img1} + L_{recon}^{img2}\big) + L_{recon}^{code1} + L_{recon}^{code2} + L_{id}^{self} + \lambda_{id}\,L_{id}^{cross} + L_{adv} + \lambda_{prim}\,L_{prim} + \lambda_{fine}\,L_{fine} \quad (8)$$

where $L_{recon}^{img1}$ and $L_{recon}^{img2}$ are the image reconstruction losses for the two reconstructed pedestrian images, and $L_{recon}^{code1}$ and $L_{recon}^{code2}$ are the latent information code reconstruction losses in cross-identity generative learning. $\lambda_{img}$, $\lambda_{id}$, $\lambda_{prim}$ and $\lambda_{fine}$ are weights controlling the importance of the corresponding loss terms. In the actual image-to-image conversion process, a large weight $\lambda_{img} = 5$ is generally used for the image reconstruction loss. Since the cross-ID generated images are of low quality at the beginning of training, the identity loss $L_{id}^{cross}$ may make training unstable, so a smaller weight $\lambda_{id} = 0.5$ is set. Meanwhile, before the generation quality stabilizes, the method does not involve the discriminative feature learning losses $L_{prim}$ and $L_{fine}$. After the overall model function is determined, the whole network of the generative learning convolutional neural network module and the joint discrimination module can be trained with the overall network loss function to obtain and output the pedestrian re-identification system model. The pedestrian re-identification system model can subsequently be used for pedestrian re-identification or large-scale retrieval based on pedestrian images.
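As an illustration of how eq. (8) and the training schedule fit together, here is a hedged sketch; the values of $\lambda_{prim}$ and $\lambda_{fine}$ and the `warmup_done` gate are assumptions beyond what the text states.

```python
def total_loss(L_img1, L_img2, L_code1, L_code2, L_id_self, L_id_cross,
               L_adv, L_prim, L_fine, warmup_done,
               lambda_img=5.0, lambda_id=0.5, lambda_prim=1.0, lambda_fine=1.0):
    # eq. (8): weighted sum of generative and discriminative losses
    L = (lambda_img * (L_img1 + L_img2) + L_code1 + L_code2
         + L_id_self + lambda_id * L_id_cross + L_adv)
    if warmup_done:  # L_prim / L_fine only after generation quality stabilizes
        L = L + lambda_prim * L_prim + lambda_fine * L_fine
    return L
```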
S102: and carrying out segmentation processing on the pedestrian video to be recognized based on the key frame of the pedestrian video to be recognized, and extracting pedestrian feature vectors from the segmented pedestrian video to be recognized through the pedestrian re-recognition system model.
In the embodiment of the present invention, a real-time pedestrian video stream or a video file may be transmitted to the Spark big data platform; the video image data is divided into video segments by a Map (mapping) step, pedestrian feature extraction is then performed, and the pedestrian images and pedestrian information (including the pedestrian feature vectors and the like) obtained from the extraction results are automatically aggregated and stored by a Reduce step, as specifically shown in Fig. 4.
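A hedged PySpark sketch of this Map/Reduce flow follows. `split_video` stands for the keyframe-based segmentation described next, and `extract_pedestrians` for the re-identification step; both are illustrative placeholders, not a published API.

```python
from pyspark import SparkContext

def split_video(path):
    # placeholder: return keyframe-aligned segments as (path, start_s, end_s)
    return [(path, 0, 60), (path, 60, 120)]

def extract_pedestrians(segment):
    # placeholder: run the re-identification model on one segment and
    # yield (pedestrian_id, feature_vector) pairs
    return []

sc = SparkContext(appName="pedestrian-search")
segments = split_video("hdfs:///videos/cam01.mp4")
features = (sc.parallelize(segments)
              .flatMap(extract_pedestrians)   # Map: per-segment features
              .groupByKey()                   # Reduce: aggregate per pedestrian
              .mapValues(list)
              .collect())
```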
Performing video segmentation based on video keyframes allows the video to be better parallelized. The segmentation process is roughly divided into the following three steps: 1. distinguish I-frame data and P-frame data according to the differences between frame data in the pedestrian video to be identified, and extract the keyframe information of the pedestrian video to be identified; 2. based on the keyframe information, take the keyframes of the pedestrian video to be identified as segmentation points of the video, where a keyframe contains the pedestrian to be identified and a preset marker (such as a building, a shop sign, a guideboard, a license plate, etc.); 3. according to a moving object detection algorithm, take the time points at which a moving object appears or disappears as the start and end points of the segmented videos, and obtain the file position of each segmented video from the appearance or disappearance of the moving object. The moving object detection algorithm is the same as in the prior art and is therefore not described again here. It can be seen that video segmentation rests mainly on the following 3 constraints: the time points at which a moving target appears or disappears in the video serve as the start and end points of a segmented video; a segmentation point must be a video keyframe, since only a video file cut at keyframes yields complete video images; and a video clip must be no shorter than 30 seconds and no longer than 6 minutes. Finally, the segmented video images can be output. In addition, considering the requirements of target detection in practical applications, the embodiment of the present invention may set the size of the search area so that the aspect ratio of the picture is variable while the overall picture size is unchanged. This not only helps to meet the processing requirements of the video images but also greatly reduces the amount of computation. For an original input picture, the region proposal network (RPN) will produce about twenty thousand search boxes. In practical applications, search boxes extending beyond the picture boundary can be eliminated; meanwhile, search boxes that overlap and cover the same target can be processed with Non-Maximum Suppression (NMS) to remove overlapping search boxes. This strategy can significantly improve the search efficiency for candidate target boxes. Finally, the pedestrian feature vectors are extracted from the segmented pedestrian video to be identified through the pedestrian re-identification system model obtained in S101. It should be noted that the more training images are input in S101, the more accurate the model and the wider its coverage. Through pedestrian target training on a huge number of pedestrian image learning samples together with extensive on-site system tuning and testing, features such as the appearance contour, relative positions, and the colors and textures of parts such as clothes, face, upper body and lower body can be collected and described, forming a large amount of auxiliary classification information, which together with results such as the pedestrian's age and gender finally yields a comprehensive confidence score.
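The three segmentation constraints can be sketched as the following minimal Python routine. The per-frame records and field names (`key`, `motion`) are illustrative assumptions, and segments shorter than 30 seconds are simply dropped here as a simplification.

```python
def segment_boundaries(frames, fps=25):
    """frames: list of dicts like {'key': bool, 'motion': bool}, one per frame."""
    min_len, max_len = 30 * fps, 6 * 60 * fps   # 30 seconds to 6 minutes
    cuts, start = [], None
    for idx, f in enumerate(frames):
        if start is None:
            if f['key'] and f['motion']:        # moving target appears at a keyframe
                start = idx
        elif f['key'] and (not f['motion'] or idx - start >= max_len):
            if idx - start >= min_len:          # close the segment at a keyframe
                cuts.append((start, idx))
            start = idx if f['motion'] else None
    return cuts
```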
S103: and calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be recognized, and obtaining a retrieval image of the target pedestrian from the pedestrian video to be recognized according to the similarity.
In the embodiment of the invention, the similarity between the feature vector of the target pedestrian and a pedestrian feature vector in the pedestrian video to be identified can be calculated according to the following formula:

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\lVert\vec{x}\rVert\,\lVert\vec{y}\rVert}$$

where $\vec{x}$ is the feature vector of the target pedestrian, $\vec{y}$ is the pedestrian feature vector in the pedestrian video to be identified, $\lVert\vec{x}\rVert$ and $\lVert\vec{y}\rVert$ are the norms of $\vec{x}$ and $\vec{y}$, and $\theta$ is the angle between $\vec{x}$ and $\vec{y}$; the smaller the calculated angle, the higher the similarity. A retrieval image of the target pedestrian is then obtained from the pedestrian video to be identified according to the similarity. The feature vector of the pedestrian in a retrieval image of the target pedestrian is the same as or similar to the feature vector of the target pedestrian. Specifically, the pedestrian images and pedestrian information that are identical or most similar to the target pedestrian can be obtained by ranking the similarities. Suppose a pedestrian A (the target pedestrian) is to be found: with this method, retrieval images of pedestrian A can be obtained from the pedestrian videos (pedestrian videos to be identified) captured by a large number of cameras; since these retrieval images also contain pedestrian A, the specific trajectory of pedestrian A can be determined from the retrieval images and the pedestrian information in them.
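A small NumPy sketch of this retrieval step follows: cosine similarity between the target feature vector and every pedestrian feature extracted from the video, ranked best-first. The variable names are illustrative.

```python
import numpy as np

def search_target(target_vec, gallery_vecs, top_k=10):
    x = target_vec / np.linalg.norm(target_vec)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = g @ x                       # cos(theta) for every gallery vector
    order = np.argsort(-sims)[:top_k]  # smaller angle => larger cosine => more similar
    return order, sims[order]
```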
With the pedestrian search method based on joint judgment and generation learning provided in Fig. 1, a pedestrian re-identification system model can be obtained; the pedestrian video to be identified is segmented based on its keyframes; pedestrian feature vectors are extracted from the segmented video through the pedestrian re-identification system model; the similarity between the feature vector of a target pedestrian and the pedestrian feature vectors in the video is calculated; and retrieval images of the target pedestrian are obtained from the video according to the similarity. The efficiency and accuracy of pedestrian search over massive video can thus be improved, and the requirement of real-time search can be met. Based on the structure of the Spark big data platform and intelligent big data analysis, the embodiment of the invention successively realizes a deep-learning-based intelligent video keyframe segmentation algorithm and a video-based pedestrian search technique; it offers high system reliability, good recognition performance, good robustness and simple computation steps, maintains high efficiency, and meets real-time requirements.
Referring to fig. 5, fig. 5 is a block diagram illustrating a pedestrian searching apparatus based on joint judgment and generation learning according to an embodiment of the present invention. As shown in fig. 5, the pedestrian search apparatus 50 based on joint determination and generation learning of the present embodiment includes an acquisition module 501, an extraction module 502, and a retrieval module 503. The obtaining module 501, the extracting module 502 and the retrieving module 503 are respectively configured to perform the specific methods in S101, S102 and S103 in fig. 1, and details can be referred to in the related introduction of fig. 1 and are only briefly described here:
the obtaining module 501 is configured to obtain a pedestrian re-identification system model.
The extraction module 502 is configured to perform segmentation processing on the to-be-identified pedestrian video based on the keyframe of the to-be-identified pedestrian video, and extract a pedestrian feature vector from the segmented to-be-identified pedestrian video through the pedestrian re-identification system model.
The retrieval module 503 is configured to calculate a similarity between a feature vector of a target pedestrian and the pedestrian feature vector in the to-be-identified pedestrian video, and obtain a retrieval image of the target pedestrian from the to-be-identified pedestrian video according to the similarity.
Further, as can be seen in fig. 6, the obtaining module 501 may specifically include a learning network building unit 5011, a discrimination building unit 5012, and a re-recognition model determining unit 5013:
a learning network construction unit 5011 for constructing and generating a learning convolutional neural network module based on a given pedestrian image; the learning convolutional neural network generation module comprises a self-generation network and a cross generation network; the self-generation network is used for reconstructing a given pedestrian image to generate a reconstructed image, and the cross generation network is used for synthesizing pedestrian images with different identity characteristics to generate a synthesized image.
The judgment building unit 5012 is used for building a joint judgment module based on the generated learning convolutional neural network module; the joint discrimination module is used for learning the preliminary features of the reconstructed image and the synthesized image and mining the fine-grained features of the reconstructed image and the synthesized image.
A re-recognition model determining unit 5013, configured to determine a loss function of an overall network of the generated learning convolutional neural network module and the joint discrimination module based on the loss function of the generated learning convolutional neural network module and the loss function of the joint discrimination module, and determine a pedestrian re-recognition system model according to the overall network loss function; the pedestrian re-identification system model is used for re-identifying pedestrians on the basis of pedestrian images.
Further, referring to fig. 7, the extraction module 502 may specifically include a distinguishing unit 5021, a first dividing unit 5022, a second dividing unit 5023, and an extraction unit 5024:
the distinguishing unit 5021 is configured to distinguish I-frame data and P-frame data according to differences of different frame data in the pedestrian video to be identified, and extract key frame information of the pedestrian video to be identified.
A first dividing unit 5022, configured to use the keyframe of the to-be-identified pedestrian video as a segmentation point of the to-be-identified pedestrian video based on the keyframe information; the keyframes include pedestrians to be identified and preset markers (e.g., buildings, store signs, road boards, license plates, etc.).
And the second dividing unit 5023 is configured to use a time point when a moving object appears or disappears as a start-stop time point of the segmented video of the pedestrian video to be identified according to a moving object detection algorithm.
And the extraction unit 5024 is used for extracting the pedestrian feature vector from the segmented pedestrian video to be identified through the pedestrian re-identification system model.
Further, referring to fig. 8, the retrieving module 503 may specifically include a calculating unit 5031 and a searching unit 5032:
a calculating unit 5031, configured to calculate a similarity between a feature vector of a target pedestrian and the pedestrian feature vector in the to-be-identified pedestrian video
Figure BDA0002177347200000102
Wherein the content of the first and second substances,
Figure BDA0002177347200000103
is the feature vector of the target pedestrian,
Figure BDA0002177347200000104
for the pedestrian feature vector in the pedestrian video to be identified, | x | is
Figure BDA0002177347200000105
Is | y |Norm of theta is
Figure BDA0002177347200000107
And
Figure BDA0002177347200000108
the included angle therebetween.
A searching unit 5032, configured to obtain, according to the similarity, a retrieval image of the target pedestrian from the pedestrian video to be recognized.
The pedestrian search device based on joint judgment and generation learning shown in Fig. 5 can acquire a pedestrian re-identification system model, segment the pedestrian video to be identified based on its keyframes, extract pedestrian feature vectors from the segmented video through the pedestrian re-identification system model, calculate the similarity between the feature vector of a target pedestrian and the pedestrian feature vectors in the video, and obtain retrieval images of the target pedestrian according to the similarity, so that the efficiency and accuracy of pedestrian search over massive video can be improved and the requirement of real-time search can be met. Based on the structure of the Spark big data platform and intelligent big data analysis, the embodiment of the invention successively realizes a deep-learning-based intelligent video keyframe segmentation algorithm and a video-based pedestrian search technique; it offers high system reliability, good recognition performance, good robustness and simple computation steps, maintains high efficiency, and meets real-time requirements.
Fig. 9 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 9, the terminal device 9 of this embodiment includes: a processor 90, a memory 91, and a computer program 92 stored in the memory 91 and executable on the processor 90, such as a program performing pedestrian search based on joint judgment and generation learning. When executing the computer program 92, the processor 90 implements the steps in the above method embodiments, e.g., S101 to S103 shown in fig. 1; alternatively, the processor 90 implements the functions of the modules/units in the above device embodiments, e.g., the functions of modules 501 to 503 shown in fig. 5.
Illustratively, the computer program 92 may be partitioned into one or more modules/units that are stored in the memory 91 and executed by the processor 90 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 92 in the terminal device 9. For example, the computer program 92 may be divided into an acquisition module 501, an extraction module 502 and a retrieval module 503. (modules in the virtual device), the specific functions of each module are as follows:
the obtaining module 501 is configured to obtain a pedestrian re-identification system model.
The extraction module 502 is configured to perform segmentation processing on the to-be-identified pedestrian video based on the keyframe of the to-be-identified pedestrian video, and extract a pedestrian feature vector from the segmented to-be-identified pedestrian video through the pedestrian re-identification system model.
The retrieval module 503 is configured to calculate a similarity between a feature vector of a target pedestrian and the pedestrian feature vector in the to-be-identified pedestrian video, and obtain a retrieval image of the target pedestrian from the to-be-identified pedestrian video according to the similarity.
The terminal device 9 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. Terminal device 9 may include, but is not limited to, a processor 90, a memory 91. Those skilled in the art will appreciate that fig. 9 is only an example of a terminal device 9, and does not constitute a limitation to the terminal device 9, and may include more or less components than those shown, or combine some components, or different components, for example, the terminal device may also include an input-output device, a network access device, a bus, etc.
The Processor 90 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 91 may be an internal storage unit of the terminal device 9, such as a hard disk or a memory of the terminal device 9. The memory 91 may also be an external storage device of the terminal device 9, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) and the like provided on the terminal device 9. Further, the memory 91 may also include both an internal storage unit of the terminal device 9 and an external storage device. The memory 91 is used for storing the computer program and other programs and data required by the terminal device 9. The memory 91 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method embodiments may be implemented. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A pedestrian searching method based on joint judgment and generation learning is characterized by comprising the following steps:
acquiring a pedestrian re-identification system model;
segmenting the pedestrian video to be identified based on the key frame of the pedestrian video to be identified, and extracting a pedestrian feature vector from the segmented pedestrian video to be identified through the pedestrian re-identification system model;
and calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be recognized, and obtaining a retrieval image of the target pedestrian from the pedestrian video to be recognized according to the similarity.
2. The pedestrian search method based on joint judgment and generation learning according to claim 1, wherein the obtaining of the pedestrian re-identification system model comprises:
building and generating a learning convolution neural network module based on a given pedestrian image; the learning convolutional neural network generation module comprises a self-generation network and a cross generation network; the self-generation network is used for reconstructing a given pedestrian image to generate a reconstructed image, and the cross generation network is used for synthesizing pedestrian images with different identity characteristics to generate a synthesized image;
building a combined judgment module according to the generated learning convolutional neural network module; the joint discrimination module is used for learning the preliminary features of the reconstructed image and the synthesized image and mining the fine-grained features of the reconstructed image and the synthesized image;
determining a loss function of an overall network of the generated learning convolutional neural network module and the combined discrimination module based on the loss function of the generated learning convolutional neural network module and the loss function of the combined discrimination module, and determining a pedestrian re-identification system model according to the loss function of the overall network; the pedestrian re-identification system model is used for re-identifying pedestrians on the basis of pedestrian images.
3. The pedestrian search method based on joint judgment and generation learning according to claim 1, wherein the step of segmenting the pedestrian video to be recognized based on the keyframe of the pedestrian video to be recognized and extracting the pedestrian feature vector from the segmented pedestrian video to be recognized through the pedestrian re-recognition system model comprises the steps of:
distinguishing I frame data and P frame data according to different frame data in the pedestrian video to be identified, and taking out key frame information of the pedestrian video to be identified;
based on the key frame information, taking the key frame of the pedestrian video to be identified as a segmentation point of the pedestrian video to be identified; the key frame comprises a pedestrian to be identified and a preset marker;
according to a moving target detection algorithm, taking the time point when a moving target appears or disappears as the starting and stopping time point of the segmented video of the pedestrian video to be identified;
and extracting pedestrian feature vectors from the segmented pedestrian video to be recognized through the pedestrian re-recognition system model.
4. The pedestrian search method based on joint judgment and generation learning according to claim 1, wherein the calculating a similarity between a feature vector of a target pedestrian and the pedestrian feature vector in the pedestrian video to be recognized and obtaining a retrieval image of the target pedestrian from the pedestrian video to be recognized according to the similarity comprises:
calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be identified as

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$$

wherein $\vec{x}$ is the feature vector of the target pedestrian, $\vec{y}$ is the pedestrian feature vector in the pedestrian video to be identified, $\|\vec{x}\|$ is the norm of $\vec{x}$, $\|\vec{y}\|$ is the norm of $\vec{y}$, and $\theta$ is the included angle between $\vec{x}$ and $\vec{y}$; the value of $\cos\theta$ is taken as the similarity;
and obtaining a retrieval image of the target pedestrian from the pedestrian video to be identified according to the similarity.
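Claim 4's similarity is the standard cosine similarity and can be computed directly from the formula above; a minimal NumPy sketch (with made-up vectors) follows.

```python
# Minimal sketch of the cosine similarity in claim 4. The vectors below are
# made up; real feature vectors would come from the re-identification model.
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """cos(theta) = (x . y) / (||x|| * ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

query = np.array([0.2, 0.8, 0.1])        # feature vector of the target pedestrian
candidate = np.array([0.25, 0.7, 0.05])  # pedestrian feature vector from the video
print(cosine_similarity(query, candidate))  # values near 1.0 suggest a match
```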
5. A pedestrian search device based on joint judgment and generation learning, characterized by comprising:
the acquisition module is used for acquiring a pedestrian re-identification system model;
the extraction module is used for segmenting the pedestrian video to be identified based on the key frames of the pedestrian video to be identified and extracting a pedestrian feature vector from the segmented pedestrian video to be identified through the pedestrian re-identification system model;
and the retrieval module is used for calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be identified and obtaining the retrieval image of the target pedestrian from the pedestrian video to be identified according to the similarity.
6. The pedestrian search device based on joint judgment and generation learning according to claim 5, wherein the acquisition module comprises:
the learning network building unit is used for building a generative-learning convolutional neural network module based on a given pedestrian image, the module comprising a self-generation network and a cross-generation network; the self-generation network is used for reconstructing the given pedestrian image to generate a reconstructed image, and the cross-generation network is used for synthesizing pedestrian images with different identity characteristics to generate a synthesized image;
the discrimination building unit is used for building a joint discrimination module based on the generative-learning convolutional neural network module; the joint discrimination module is used for learning the preliminary features of the reconstructed image and the synthesized image and for mining their fine-grained features;
the re-identification model determining unit is used for determining a loss function of the overall network formed by the generative-learning convolutional neural network module and the joint discrimination module based on the loss function of the generative-learning convolutional neural network module and the loss function of the joint discrimination module, and for determining the pedestrian re-identification system model according to the loss function of the overall network; the pedestrian re-identification system model is used for re-identifying pedestrians on the basis of pedestrian images.
7. The pedestrian search apparatus based on joint judgment and generation learning according to claim 5, wherein the extraction module includes:
the distinguishing unit is used for distinguishing I-frame data from P-frame data among the frame data of the pedestrian video to be identified and extracting the key frame information of the pedestrian video to be identified;
the first dividing unit is used for taking the key frame of the pedestrian video to be identified as the segmentation point of the pedestrian video to be identified based on the key frame information; the key frame comprises a pedestrian to be identified and a preset marker;
the second dividing unit is used for taking the time points at which a moving target appears or disappears as the start and stop time points of the segmented sub-videos of the pedestrian video to be identified according to a moving target detection algorithm;
and the extraction unit is used for extracting the pedestrian feature vector from the segmented pedestrian video to be identified through the pedestrian re-identification system model.
8. The pedestrian search apparatus based on joint judgment and generation learning according to claim 5, wherein the retrieval module includes:
a calculation unit for calculating the similarity between the feature vector of the target pedestrian and the pedestrian feature vector in the pedestrian video to be identified as

$$\cos\theta = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\|\,\|\vec{y}\|}$$

wherein $\vec{x}$ is the feature vector of the target pedestrian, $\vec{y}$ is the pedestrian feature vector in the pedestrian video to be identified, $\|\vec{x}\|$ is the norm of $\vec{x}$, $\|\vec{y}\|$ is the norm of $\vec{y}$, and $\theta$ is the included angle between $\vec{x}$ and $\vec{y}$; the value of $\cos\theta$ is taken as the similarity;
and the searching unit is used for obtaining the retrieval image of the target pedestrian from the pedestrian video to be identified according to the similarity.
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 4 when executing the computer program.
10. A computer-readable medium storing a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 4.
CN201910783692.4A 2019-08-23 2019-08-23 Pedestrian searching method and device based on joint judgment and generation learning Pending CN110659576A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910783692.4A CN110659576A (en) 2019-08-23 2019-08-23 Pedestrian searching method and device based on joint judgment and generation learning

Publications (1)

Publication Number Publication Date
CN110659576A (en) 2020-01-07

Family

ID=69037783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910783692.4A Pending CN110659576A (en) 2019-08-23 2019-08-23 Pedestrian searching method and device based on joint judgment and generation learning

Country Status (1)

Country Link
CN (1) CN110659576A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106682108A (en) * 2016-12-06 2017-05-17 浙江大学 Video retrieval method based on multi-modal convolutional neural network
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN109101913A (en) * 2018-08-01 2018-12-28 北京飞搜科技有限公司 Pedestrian re-identification method and device
CN110110755A (en) * 2019-04-04 2019-08-09 长沙千视通智能科技有限公司 Pedestrian re-identification detection algorithm and device based on PTGAN region gap and multiple branches
CN110110601A (en) * 2019-04-04 2019-08-09 深圳久凌软件技术有限公司 Video pedestrian re-identification algorithm and device based on multi-space attention model
CN110096982A (en) * 2019-04-22 2019-08-06 长沙千视通智能科技有限公司 Video vehicle big data search method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHEDONG ZHENG et al.: "Joint Discriminative and Generative Learning for Person Re-Identification", IEEE *
吕云翔 et al.: "Design and Implementation of a Machine-Learning-Based Pedestrian Detection and Tracking System for Surveillance Video", Industry and Information Technology Education *
齐美彬 et al.: "Person Re-identification Based on Multi-feature Fusion and Independent Metric Learning", Journal of Image and Graphics *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257692A (en) * 2020-12-22 2021-01-22 湖北亿咖通科技有限公司 Pedestrian target detection method, electronic device and storage medium
CN112257692B (en) * 2020-12-22 2021-03-12 湖北亿咖通科技有限公司 Pedestrian target detection method, electronic device and storage medium
CN113837022A (en) * 2021-09-02 2021-12-24 北京新橙智慧科技发展有限公司 Method for rapidly searching video pedestrian

Similar Documents

Publication Publication Date Title
CN109919031B (en) Human behavior recognition method based on deep neural network
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN106909938B (en) Visual angle independence behavior identification method based on deep learning network
CN112419368A (en) Method, device and equipment for tracking track of moving target and storage medium
Hashmi et al. An exploratory analysis on visual counterfeits using conv-lstm hybrid architecture
CN110688897A (en) Pedestrian re-identification method and device based on joint judgment and generation learning
CN112801236B (en) Image recognition model migration method, device, equipment and storage medium
CN114758362B (en) Clothing changing pedestrian re-identification method based on semantic perception attention and visual shielding
CN108960412A (en) Image-recognizing method, device and computer readable storage medium
Fan Research and realization of video target detection system based on deep learning
CN110991278A (en) Human body action recognition method and device in video of computer vision system
CN113221770A (en) Cross-domain pedestrian re-identification method and system based on multi-feature hybrid learning
Hao et al. Recognition of basketball players’ action detection based on visual image and Harris corner extraction algorithm
CN110659576A (en) Pedestrian searching method and device based on joint judgment and generation learning
Varshney et al. Deep convolutional neural model for human activities recognition in a sequence of video by combining multiple CNN streams
Yu et al. Frequency feature pyramid network with global-local consistency loss for crowd-and-vehicle counting in congested scenes
CN110688512A (en) Pedestrian image search algorithm based on PTGAN region gap and deep neural network
Zhang et al. Human motion tracking and 3D motion track detection technology based on visual information features and machine learning
Shen et al. MCCG: A ConvNeXt-based Multiple-Classifier Method for Cross-view Geo-localization
Yuan et al. A systematic survey on human behavior recognition methods
CN113705301A (en) Image processing method and device
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.
Gori et al. Semantic video labeling by developmental visual agents
Guo et al. An adaptive kernelized correlation filters with multiple features in the tracking application
Yan et al. Dance Action Recognition Model Using Deep Learning Network in Streaming Media Environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200107)