CN112381024B - Multi-modal fused unsupervised pedestrian re-identification re-ranking method

Info

Publication number: CN112381024B (granted); other version CN112381024A
Application number: CN202011313048.XA
Authority: CN (China)
Prior art keywords: pedestrian, information, time, camera, wifi
Original language: Chinese (zh)
Inventors: 吕建明, 林少川, 梁天保, 胡超杰, 莫晚成
Current and original assignee: South China University of Technology (SCUT)
Application filed by South China University of Technology (SCUT), with priority to CN202011313048.XA; filing date 2020-11-20
Publication of CN112381024A: 2021-02-19; grant publication of CN112381024B: 2023-06-23
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F18/22 Pattern recognition: matching criteria, e.g. proximity measures
    • G06F18/2415 Pattern recognition: classification based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • Y02T10/40 Engine management systems (Y-section climate-change tag)

Abstract

The invention discloses a multi-modal fused unsupervised pedestrian re-identification re-ranking method, comprising the following steps: collecting multi-modal information of pedestrians while they walk; extracting pedestrian features with a convolutional neural network model and computing visual similarity; constructing an image spatio-temporal distribution from the image spatio-temporal information; constructing a WiFi spatio-temporal distribution from the WiFi information; and fusing the visual similarity, the image spatio-temporal distribution and the WiFi spatio-temporal distribution to re-rank the pedestrian re-identification results. By combining multi-modal information for a second-stage re-ranking, the method effectively reduces the search space and overcomes the sensitivity of traditional appearance-based pedestrian re-identification to the monitoring environment.

Description

Multi-modal fused unsupervised pedestrian re-identification re-ranking method
Technical Field
The invention belongs to the field of multi-modal intelligent security, and particularly relates to a multi-modal fused unsupervised pedestrian re-identification re-ranking method.
Background
Pedestrian re-identification, also called person re-identification, aims to quickly and effectively retrieve a target person from massive surveillance video. It supports tracking target persons, confirming identities, locating missing persons and similar tasks, and plays an important role in safe-city applications. Pedestrian re-identification has attracted extensive research because of its great application value and its challenging problems such as viewing angle, illumination, occlusion and face blurring.
Mainstream pedestrian re-identification methods train models on labeled datasets, but labeling data consumes considerable manpower and money and is difficult to obtain at scale. Existing unsupervised pedestrian re-identification research is mainly based on the appearance features of pedestrian images, and researchers have developed many approaches around feature extraction and similarity metrics. The former focuses on designing a robust and reliable representation of pedestrian images, one that distinguishes different pedestrians while remaining insensitive to illumination and viewing-angle changes; the latter focuses on learning a distance function that fits the distribution of pedestrian image features, so that images of the same pedestrian have small feature distances while images of different pedestrians have large ones. However, applying these methods to real monitoring services still faces significant challenges. The pictures in the re-identification problem come from different cameras, and the appearance of the same pedestrian changes to some extent with each camera's angle, illumination and other environmental conditions; conversely, because of variations in pedestrian pose and camera angle, different pedestrians may appear more similar than the same person seen by different cameras.
To minimize the interference of uncontrollable environmental factors, existing pedestrian re-identification systems have to present a group of candidate images from each monitoring device for a person to select from, and then refine the result through interactive relevance feedback. This not only increases the manual screening workload and reduces the degree of automation of video analysis; because differences in viewing angle and illumination can greatly change a pedestrian's appearance, the higher-ranked results returned are also not necessarily the more reliable ones.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a multi-modal fused unsupervised pedestrian re-identification re-ranking method.
The aim of the invention is achieved by the following technical scheme:
A multi-modal fused unsupervised pedestrian re-identification re-ranking method, comprising the following steps:
S1, collecting multi-modal information of pedestrians while they walk, including pedestrian images, the ID of the camera that captured each image, the time at which each image was captured, and the WiFi information captured as the pedestrian passes each camera;
S2, extracting pedestrian features from the pedestrian images obtained in step S1 with a convolutional neural network model, computing the visual similarity between pedestrian images, and ranking them to obtain the original pedestrian re-identification ranking;
S3, performing statistics on the camera IDs and capture times of the pedestrian images obtained in step S1 to construct the image spatio-temporal distribution;
S4, performing statistics on the WiFi information captured as pedestrians pass each camera in step S1 to construct the WiFi spatio-temporal distribution;
S5, fusing the visual similarity from step S2, the image spatio-temporal distribution from step S3 and the WiFi spatio-temporal distribution from step S4, and re-ranking the original pedestrian re-identification ranking from step S2 a second time.
Further, the process of step S1 is as follows:
S11, obtaining pedestrian images from surveillance video captured by the cameras of a specified road section: splitting the video into individual frames, then running an SSD or Faster R-CNN pedestrian detector on the frames;
S12, while collecting each pedestrian image, recording the time displayed in the surveillance video as the time information of the image under that camera, and recording the camera device ID as the spatial information of the image;
S13, during pedestrian movement, using WiFi collectors placed near the cameras to capture the WiFi signals emitted by mobile terminals as pedestrians pass each camera, the WiFi information including the unique MAC address of the mobile terminal, the time the WiFi information was captured, and the ID of the camera device at which it was captured;
S14, dividing the pedestrian images collected in step S11 into a query set and a candidate set, randomly or proportionally.
Further, in step S2, pedestrian features are extracted with a ResNet-50 convolutional neural network model, the visual similarity between pedestrian images is computed with cosine similarity, and the obtained similarities are sorted in descending order to obtain the original pedestrian re-identification ranking, wherein the ResNet-50 network is connected sequentially from input layer to output layer as: a 7×7 convolution layer, a batch normalization layer, a ReLU activation, a 3×3 max pooling layer, four stages of residual bottleneck blocks containing 3, 4, 6 and 3 blocks respectively, a global average pooling layer, and a fully connected output layer.
Further, the process of step S3 is as follows:
S31, assembling the visual similarities between pedestrian images obtained in step S2 into a visual similarity matrix of size Q×P, where Q is the number of pedestrian images in the query set and P is the number of pedestrian images in the candidate set;
S32, sorting each row of the visual similarity matrix in descending order and keeping the K most similar images per row to obtain a Q×K screening matrix;
S33, treating the images in each row of the screening matrix as the same person, then computing the image spatio-temporal distribution.
Further, the process of step S33 is as follows:
S331, calculating the bucketed time difference of the same person migrating across cameras, using the time and spatial information of the pedestrian images obtained in step S1:

$$\Delta_{ij} = \left\lfloor \frac{t_j - t_i}{t_{interval}} \right\rfloor$$

where $t_i$ and $t_j$ denote the times at which the two pedestrian images appear under cameras $c_i$ and $c_j$ respectively, and $t_{interval}$ is the width of the buckets into which the cross-camera migration time differences are divided;

S332, performing frequency statistics on the calculated values:

$$\hat{p}(l) = \frac{n_l}{\sum_{k} n_k}$$

where $l$ is any one bucketed time difference, $n_l$ is the count of occurrences of bucket $l$, and $\sum_k n_k$ is the sum of the counts over all buckets;

S333, plotting connected points on two-dimensional rectangular coordinates with the bucketed time difference $l$ on the horizontal axis and the frequency $\hat{p}(l)$ on the vertical axis, obtaining the image spatio-temporal distribution.
Further, the process of step S4 is as follows:
S41, calculating the bucketed time difference of the same MAC information migrating across camera devices, using the unique MAC address of the mobile terminal, the capture time of the WiFi information, and the ID of the capturing camera device obtained in step S13:

$$\Delta'_{ij} = \left\lfloor \frac{t_j - t_i}{t'_{interval}} \right\rfloor$$

where $t_i$ and $t_j$ denote the times at which the two WiFi records appear under cameras $c_i$ and $c_j$ respectively, and $t'_{interval}$ is the width of the buckets into which the cross-camera migration time differences are divided;

S42, performing frequency statistics on the calculated values:

$$\hat{p}'(l') = \frac{n'_{l'}}{\sum_{k'} n'_{k'}}$$

where $l'$ is any one bucketed time difference, $n'_{l'}$ is the count of occurrences of bucket $l'$, and $\sum_{k'} n'_{k'}$ is the sum of the counts over all buckets;

S43, plotting connected points on two-dimensional rectangular coordinates with the bucketed time difference $l'$ on the horizontal axis and the frequency $\hat{p}'(l')$ on the vertical axis, obtaining the WiFi spatio-temporal distribution.
Further, the process of step S5 is as follows:
S51, calculating the visual probability from the visual similarity obtained in step S2:

$$Pr_{visual} = \frac{1}{1 + \delta e^{-\beta\, s(v_m, v_n)}}$$

where $s(v_m, v_n)$ denotes the visual similarity of two images $v_m$ and $v_n$ obtained under cameras $c_i$ and $c_j$ respectively, and $\delta$ and $\beta$ are hyperparameters;

S52, calculating the image spatio-temporal probability from the image spatio-temporal distribution obtained in step S3:

$$Pr_{st} = \frac{1}{1 + \varepsilon e^{-\alpha\, \hat{p}(k)}}$$

where $\hat{p}(k)$ denotes the frequency with which pedestrians migrate between cameras $c_i$ and $c_j$ with bucketed time difference $k$, and $\varepsilon$ and $\alpha$ are hyperparameters;

S53, calculating the WiFi spatio-temporal probability from the WiFi spatio-temporal distribution obtained in step S4:

$$Pr_{wifi} = \frac{1}{1 + \mu e^{-\gamma\, \hat{p}'(k')}}$$

where $\hat{p}'(k')$ denotes the frequency with which MAC information migrates between cameras $c'_i$ and $c'_j$ with bucketed time difference $k'$, and $\mu$ and $\gamma$ are hyperparameters;

S54, fusing the visual probability, the image spatio-temporal probability and the WiFi spatio-temporal probability:

$$Pr_{fuse} = Pr_{visual} \cdot Pr_{st} \cdot Pr_{wifi}$$

S55, re-ranking the original pedestrian re-identification ranking obtained in step S2 with $Pr_{fuse}$.
Compared with the prior art, the invention has the following advantages and effects:
the thought of merging multiple modes to conduct rearrangement is an effective measure for reducing search space, has good popularization value, and has reference function on the detection, tracking and retrieval of suspected targets in massive monitoring video big data. Compared with the traditional pedestrian re-identification method based on the visual characteristics of the appearance of the human body, the method of the invention has the following advantages and positive effects:
(1) According to the method, the pedestrian migration in the monitoring equipment is skillfully utilized, the WiFi information sent by the mobile terminal equipment is migrated, the image space-time probability and the WiFi space-time probability are introduced to jointly measure the matching probability of various pedestrians on the basis of the original visual probability, and the reliability of the pedestrian re-identification result is remarkably improved;
(2) The method introduces the image and WiFi space-time information, and the space-time information is not influenced by the shooting environments such as illumination, visual angles and the like, so that the defect that the traditional pedestrian re-identification method based on visual characteristics is sensitive to the shooting environments is effectively overcome.
Drawings
FIG. 1 is a flow chart of the multi-modal fused unsupervised pedestrian re-identification re-ranking method of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
In practice, if the pedestrian images recognized by the monitoring devices along a travel path are viewed as a whole, there should be strong spatio-temporal dependencies among them. For example, the same pedestrian cannot appear at the same moment in monitoring devices at different physical positions; the time difference between a pedestrian's appearances in different devices must bear a reasonable relationship to the distance between the devices and a common-sense walking speed; and the pedestrian should not appear at a later device on the path earlier than at a preceding one. However, building image-related spatio-temporal dependencies requires knowing in advance which pedestrian images belong to the same cross-camera migration, so dependencies built under unsupervised conditions carry considerable noise.
Meanwhile, pedestrian-related WiFi information also carries strong spatio-temporal dependency information. Moreover, WiFi information has a natural advantage in the unsupervised setting: since each mobile terminal has a unique MAC address, which WiFi records have migrated across cameras can be known in advance. However, WiFi collection also captures much information that does not belong to pedestrians, so the WiFi spatio-temporal dependency contains noise as well.
The method therefore fuses the visual information of the pedestrian images, the image-related spatio-temporal dependency, and the pedestrian-related WiFi spatio-temporal dependency, and performs a second-stage re-ranking of the pedestrian re-identification results. This reduces the search space, overcomes the sensitivity to the imaging environment of methods that rely on visual features alone, and dampens the noise in both the image and the WiFi spatio-temporal dependencies.
Based on the above ideas, this embodiment discloses a multi-modal fused unsupervised pedestrian re-identification re-ranking method comprising the following steps:
S1, collecting multi-modal information of pedestrians while they walk, including pedestrian images, the ID of the camera that captured each image, the time at which each image was captured, and the WiFi information captured as the pedestrian passes each camera;
S2, extracting pedestrian features from the pedestrian images obtained in step S1 with a convolutional neural network model, computing the visual similarity between pedestrian images, and ranking them to obtain the original pedestrian re-identification ranking;
S3, performing statistics on the camera IDs and capture times of the pedestrian images obtained in step S1 to construct the image spatio-temporal distribution;
S4, performing statistics on the WiFi information captured as pedestrians pass each camera in step S1 to construct the WiFi spatio-temporal distribution;
S5, fusing the visual similarity from step S2, the image spatio-temporal distribution from step S3 and the WiFi spatio-temporal distribution from step S4, and performing a second-stage re-ranking of the original pedestrian re-identification ranking from step S2.
In this embodiment, the specific implementation process of the foregoing step S1 is as follows:
s11, firstly, acquiring a pedestrian image from a monitoring video acquired by a certain road section crossing the camera equipment.
For example, the method for acquiring the pedestrian image from the monitoring video may be that firstly, the monitoring video acquired by the camera device is divided into video frames of one frame and one frame, and then pedestrian detection is performed on the video frames through a pedestrian detection algorithm. The pedestrian detection algorithm can adopt an SSD algorithm or a Faster RCNN algorithm, and can achieve the purpose of acquiring a pedestrian image in a frame of video frame. The embodiment of the invention does not limit the pedestrian detection algorithm, and the person skilled in the art can select the pedestrian detection algorithm according to actual conditions.
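As an illustration only (not part of the patent's disclosure), the following sketch splits a video into frames with OpenCV and detects persons with an off-the-shelf torchvision Faster R-CNN; the function name, score threshold and frame stride are assumptions chosen for the example.

```python
# Illustrative sketch: frame splitting plus pedestrian detection with a
# pretrained Faster R-CNN (COCO class 1 is "person").
import cv2
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect_pedestrians(video_path, camera_id, score_thresh=0.8, frame_stride=25):
    """Yield (camera_id, time_sec, cropped pedestrian image) for each detection."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unavailable
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_stride == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
            with torch.no_grad():
                out = model([tensor])[0]
            for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
                if label.item() == 1 and score.item() >= score_thresh:
                    x1, y1, x2, y2 = box.int().tolist()
                    yield camera_id, idx / fps, frame[y1:y2, x1:x2]
        idx += 1
    cap.release()
```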
S12, while collecting each pedestrian image, the time displayed in the surveillance video is recorded as the time information of the image under that camera, and the camera device ID is recorded as the spatial information of the image. This embodiment refers to the time and spatial information of a pedestrian image together as its spatio-temporal information.
S13, during pedestrian movement, WiFi collectors placed near the cameras capture the WiFi signals emitted by mobile terminals as pedestrians pass each camera. The WiFi information includes the unique MAC address of the mobile terminal; the capture time serves as the time information of the WiFi record near that camera, and the ID of the capturing camera device serves as its spatial information. In this embodiment, the time and spatial information of a WiFi record are collectively called its spatio-temporal information.
Illustratively, the WiFi collector in this embodiment was developed on a HiSilicon Kirin 970 (HiKey 970) development board; a person skilled in the art can develop a WiFi collector independently according to the actual situation, for instance along the lines of the sketch below.
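For illustration, a minimal collector could be sketched with scapy as below, assuming a Linux WiFi adapter in monitor mode; the interface name, camera ID and the use of probe-request frames are assumptions of this sketch, not details from the patent.

```python
# Illustrative sketch: logging (MAC, capture time, camera ID) triples from
# nearby mobile terminals by sniffing 802.11 probe requests.
import time
from scapy.all import sniff, Dot11

CAMERA_ID = "cam_03"  # assumed ID of the camera this collector sits next to

def handle(pkt):
    # Management frames (type 0) of subtype 4 are probe requests from phones
    if pkt.haslayer(Dot11) and pkt.type == 0 and pkt.subtype == 4:
        mac = pkt.addr2                 # sender MAC of the mobile terminal
        record = (mac, time.time(), CAMERA_ID)
        print(record)                   # in practice: append to a database

sniff(iface="wlan0mon", prn=handle, store=False)
```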
S14, the pedestrian images collected in S11 are divided into a query set and a candidate set.
For example, the split may be random: 30% of the pedestrian images may be used as the query set and the remaining images as the candidate set. The way the query and candidate sets are obtained is not restricted and can be decided as needed.
In this embodiment, the specific implementation process of the foregoing step S2 is as follows:
Pedestrian features are extracted from the pedestrian images obtained in step S1 with a convolutional neural network model, and the visual similarity between pedestrian images is computed with cosine similarity. The obtained similarities are sorted in descending order to obtain the original pedestrian re-identification ranking. However, relying solely on visual features in this way is sensitive to the imaging environment; the subsequent steps therefore re-rank this original ranking.
Illustratively, this embodiment uses a ResNet-50 model as the convolutional neural network. Its structure, connected sequentially from input layer to output layer, is: a 7×7 convolution layer, a batch normalization layer, a ReLU activation, a 3×3 max pooling layer, four stages of residual bottleneck blocks containing 3, 4, 6 and 3 blocks respectively, a global average pooling layer, and a fully connected output layer. A feature-extraction sketch follows.
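A minimal sketch of this step, assuming an ImageNet-pretrained ResNet-50 used as a fixed feature extractor (the patent does not specify its training procedure here); the resize resolution and batching are assumptions of the example.

```python
# Illustrative sketch: 2048-d ResNet-50 features and a cosine-similarity matrix.
import torch
import torchvision
from torchvision import transforms

backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone.fc = torch.nn.Identity()    # keep the pooled 2048-d feature vector
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((256, 128)),   # a common pedestrian crop resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(pil_images):
    batch = torch.stack([preprocess(im) for im in pil_images])
    feats = backbone(batch)
    return torch.nn.functional.normalize(feats, dim=1)   # unit-length features

def visual_similarity(query_images, candidate_images):
    q = extract_features(query_images)
    p = extract_features(candidate_images)
    return q @ p.T   # Q x P matrix of cosine similarities
```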
In this embodiment, the specific implementation process of the foregoing step S3 is as follows:
s31, splicing the visual similarity between the pedestrian images obtained in the step S2 into a visual similarity matrix, wherein the matrix size is Q multiplied by P. Wherein Q is the number of pedestrian images in the query set, and P is the number of pedestrian images in the candidate set.
S32, sorting the visual similarity matrix obtained in S31 in a descending order according to rows, and further selecting K images with the largest similarity for each row to finally obtain a Q multiplied by K screening matrix.
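A small sketch of the screening step; K is a tunable parameter whose value the embodiment leaves open.

```python
# Illustrative sketch: keep the K most visually similar candidates per query row.
import numpy as np

def screening_matrix(sim, k=10):
    """sim: Q x P visual-similarity matrix; returns the Q x K matrix of
    candidate indices with the largest similarity in each query row."""
    order = np.argsort(-sim, axis=1)   # descending similarity within each row
    return order[:, :k]
```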
S33, each row of the screening matrix is provisionally treated as the same person, and the image spatio-temporal distribution is then computed.
In this embodiment, the bucketed time difference of the same person migrating across cameras is calculated from the spatio-temporal information of the pedestrian images obtained in step S1:

$$\Delta_{ij} = \left\lfloor \frac{t_j - t_i}{t_{interval}} \right\rfloor$$

where $t_i$ and $t_j$ denote the times at which the two pedestrian images appear under cameras $c_i$ and $c_j$ respectively, and $t_{interval}$ is the bucket width for the cross-camera migration time differences. The purpose of bucketing the migration time differences is to make the final statistics smoother. Then frequency statistics are computed on the calculated values:

$$\hat{p}(l) = \frac{n_l}{\sum_{k} n_k}$$

where $l$ is any one bucketed time difference, $n_l$ is the count of occurrences of bucket $l$, and $\sum_k n_k$ is the sum of the counts over all buckets. Finally, connected points are plotted on two-dimensional rectangular coordinates with the bucketed time difference $l$ on the horizontal axis and the frequency $\hat{p}(l)$ on the vertical axis, giving the image spatio-temporal distribution. A counting sketch follows.
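A sketch of the bucketing and frequency statistics, assuming records of the form (row identity, camera ID, time in seconds) produced from the screening matrix; the 30-second bucket width is an assumed example value.

```python
# Illustrative sketch: per camera pair, count bucketed cross-camera time
# differences of the same identity and normalize the counts into frequencies.
from collections import Counter, defaultdict

def spatiotemporal_distribution(records, t_interval=30.0):
    """records: iterable of (identity, camera_id, time_sec) observations.
    Returns {(cam_i, cam_j): {bucket: frequency}} built from forward-in-time
    cross-camera pairs of the same identity."""
    by_id = defaultdict(list)
    for ident, cam, t in records:
        by_id[ident].append((cam, t))
    counts = defaultdict(Counter)
    for obs in by_id.values():
        for ci, ti in obs:
            for cj, tj in obs:
                if ci != cj and tj >= ti:
                    bucket = int((tj - ti) // t_interval)  # bucketed time difference
                    counts[(ci, cj)][bucket] += 1
    dist = {}
    for pair, ctr in counts.items():
        total = sum(ctr.values())
        dist[pair] = {b: n / total for b, n in ctr.items()}  # normalized frequency
    return dist
```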
In this embodiment, the specific implementation process of the foregoing step S4 is as follows:
according to the embodiment of the invention, the MAC information and the space-time information of the WiFi information obtained in the step S1 are utilized to calculate the barrel time difference of the same MAC information in the migration of the cross-camera equipment:
Figure BDA0002790425460000129
wherein t is i And t j Respectively denoted by c i And c j Time, t ', of each of two pieces of WiFi information appearing under camera' interval Representing the buckets of the time differences migrated across the cameras. The purpose of binning the migration time differences is to make the final statistics smoother. Then, the calculated value is subjected to frequency statistics:
Figure BDA0002790425460000131
wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0002790425460000132
is->
Figure BDA0002790425460000133
The sum of the frequency, l' is any one of the sub-bucket time differences, < >>
Figure BDA0002790425460000134
Statistics of occurrence frequency for any one sub-bucket time difference and +.>
Figure BDA0002790425460000135
Is the sum of all frequency statistics. Next, on a two-dimensional rectangular coordinate axis, to
Figure BDA0002790425460000136
Is a horizontal axis, in->
Figure BDA0002790425460000137
And (5) drawing a point connecting line for the longitudinal axis to obtain WiFi space-time distribution.
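The WiFi statistics can reuse the same sketch, with the MAC address playing the role of the person identity (spatiotemporal_distribution is the function sketched in step S3 above); the sample records below are invented for illustration.

```python
# Illustrative sketch: WiFi spatio-temporal distribution from captured records.
# Invented sample records: (MAC address, capture time in seconds, camera ID).
captured_wifi = [
    ("aa:bb:cc:dd:ee:ff", 0.0, "cam_01"),
    ("aa:bb:cc:dd:ee:ff", 95.0, "cam_02"),
    ("11:22:33:44:55:66", 10.0, "cam_01"),
]
# Reorder to (identity, camera_id, time_sec) and reuse the step-S3 statistic.
wifi_records = [(mac, cam, t) for (mac, t, cam) in captured_wifi]
wifi_dist = spatiotemporal_distribution(wifi_records, t_interval=30.0)
```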
In this embodiment, the specific implementation process of the foregoing step S5 is as follows:
S51, the visual probability is calculated from the visual similarity obtained in step S2:

$$Pr_{visual} = \frac{1}{1 + \delta e^{-\beta\, s(v_m, v_n)}}$$

where $s(v_m, v_n)$ denotes the visual similarity of two images $v_m$ and $v_n$ obtained under cameras $c_i$ and $c_j$ respectively, and $\delta$ and $\beta$ are hyperparameters. In this embodiment, $\delta$ and $\beta$ are set to 5 and 1 respectively; the invention does not fix $\delta$ and $\beta$, which can be set according to the actual situation.

S52, the image spatio-temporal probability is calculated from the image spatio-temporal distribution obtained in step S3:

$$Pr_{st} = \frac{1}{1 + \varepsilon e^{-\alpha\, \hat{p}(k)}}$$

where $\hat{p}(k)$ denotes the frequency with which pedestrians migrate between cameras $c_i$ and $c_j$ with bucketed time difference $k$, and $\varepsilon$ and $\alpha$ are hyperparameters. In this embodiment, both $\varepsilon$ and $\alpha$ are set to 10; the invention does not fix $\varepsilon$ and $\alpha$, which can be set according to the actual situation.

S53, the WiFi spatio-temporal probability is calculated from the WiFi spatio-temporal distribution obtained in step S4:

$$Pr_{wifi} = \frac{1}{1 + \mu e^{-\gamma\, \hat{p}'(k')}}$$

where $\hat{p}'(k')$ denotes the frequency with which MAC information migrates between cameras $c'_i$ and $c'_j$ with bucketed time difference $k'$, and $\mu$ and $\gamma$ are hyperparameters. In this embodiment, $\mu$ and $\gamma$ are set to 1 and 10 respectively; the invention does not fix $\mu$ and $\gamma$, which can be set according to the actual situation.

S54, the visual probability, the image spatio-temporal probability and the WiFi spatio-temporal probability are fused:

$$Pr_{fuse} = Pr_{visual} \cdot Pr_{st} \cdot Pr_{wifi}$$

S55, the original pedestrian re-identification ranking obtained in step S2 is re-ranked with $Pr_{fuse}$. Fusing the multi-modal information for this second-stage re-ranking is an effective measure for reducing the search space: it not only overcomes the sensitivity to the imaging environment of relying on visual features alone, but also reduces the noise influence of the pedestrian-related and WiFi spatio-temporal dependencies. A fusion sketch follows.
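A sketch of the fusion and re-ranking, assuming the logistic smoothing form reconstructed above and the embodiment's hyperparameter values (δ=5, β=1, ε=α=10, μ=1, γ=10); the function names are assumptions of the example.

```python
# Illustrative sketch of steps S51-S55: logistic smoothing of each modality's
# score, product fusion, and re-ranking by the fused probability.
import math

def smooth(x, lam, gam):
    """Logistic smoothing 1 / (1 + lam * exp(-gam * x))."""
    return 1.0 / (1.0 + lam * math.exp(-gam * x))

def fused_probability(s_visual, p_image, p_wifi,
                      delta=5.0, beta=1.0, eps=10.0, alpha=10.0,
                      mu=1.0, gamma=10.0):
    """s_visual: cosine similarity of the image pair; p_image / p_wifi: bucket
    frequencies looked up in the image / WiFi spatio-temporal distributions
    for the corresponding camera pair and bucketed time difference."""
    return (smooth(s_visual, delta, beta)
            * smooth(p_image, eps, alpha)
            * smooth(p_wifi, mu, gamma))

def rerank(candidates, scores):
    """Sort one query's candidates by fused probability, highest first."""
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order]
```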
Example 2
This embodiment also provides a computer storage medium storing computer-executable instructions that can perform the multi-modal fused unsupervised pedestrian re-identification re-ranking method of Example 1. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the storage medium may also comprise a combination of the above kinds of memories.
The above examples are apparently given by way of illustration only and do not limit the embodiments. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art; it is neither necessary nor possible to enumerate all embodiments here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (5)

1. A multi-modal fused unsupervised pedestrian re-identification re-ranking method, characterized by comprising the following steps:
S1, collecting multi-modal information of pedestrians while they walk, including pedestrian images, the ID of the camera that captured each image, the time at which each image was captured, and the WiFi information captured as the pedestrian passes each camera;
S2, extracting pedestrian features from the pedestrian images obtained in step S1 with a convolutional neural network model, computing the visual similarity between pedestrian images, and ranking them to obtain the original pedestrian re-identification ranking;
S3, performing statistics on the camera IDs and capture times of the pedestrian images obtained in step S1 to construct the image spatio-temporal distribution;
S4, performing statistics on the WiFi information captured as pedestrians pass each camera in step S1 to construct the WiFi spatio-temporal distribution; the process of step S4 is as follows:
S41, calculating the bucketed time difference of the same MAC information migrating across camera devices, using the unique MAC address of the mobile terminal, the capture time of the WiFi information, and the ID of the capturing camera device obtained in step S13:

$$\Delta'_{ij} = \left\lfloor \frac{t_j - t_i}{t'_{interval}} \right\rfloor$$

where $t_i$ and $t_j$ denote the times at which the two WiFi records appear under cameras $c_i$ and $c_j$ respectively, and $t'_{interval}$ is the width of the buckets into which the cross-camera migration time differences are divided;

S42, performing frequency statistics on the calculated values:

$$\hat{p}'(l') = \frac{n'_{l'}}{\sum_{k'} n'_{k'}}$$

where $l'$ is any one bucketed time difference, $n'_{l'}$ is the count of occurrences of bucket $l'$, and $\sum_{k'} n'_{k'}$ is the sum of the counts over all buckets;

S43, plotting connected points on two-dimensional rectangular coordinates with the bucketed time difference $l'$ on the horizontal axis and the frequency $\hat{p}'(l')$ on the vertical axis, obtaining the WiFi spatio-temporal distribution;
S5, fusing the visual similarity obtained in step S2, the image spatio-temporal distribution obtained in step S3 and the WiFi spatio-temporal distribution obtained in step S4, and re-ranking the original pedestrian re-identification ranking obtained in step S2; the process of step S5 is as follows:
S51, calculating the visual probability from the visual similarity obtained in step S2:

$$Pr_{visual} = \frac{1}{1 + \delta e^{-\beta\, s(v_m, v_n)}}$$

where $s(v_m, v_n)$ denotes the visual similarity of two images $v_m$ and $v_n$ obtained under cameras $c_i$ and $c_j$ respectively, and $\delta$ and $\beta$ are hyperparameters;

S52, calculating the image spatio-temporal probability from the image spatio-temporal distribution obtained in step S3:

$$Pr_{st} = \frac{1}{1 + \varepsilon e^{-\alpha\, \hat{p}(k)}}$$

where $\hat{p}(k)$ denotes the frequency with which pedestrians migrate between cameras $c_i$ and $c_j$ with bucketed time difference $k$, and $\varepsilon$ and $\alpha$ are hyperparameters;

S53, calculating the WiFi spatio-temporal probability from the WiFi spatio-temporal distribution obtained in step S4:

$$Pr_{wifi} = \frac{1}{1 + \mu e^{-\gamma\, \hat{p}'(k')}}$$

where $\hat{p}'(k')$ denotes the frequency with which MAC information migrates between cameras $c'_i$ and $c'_j$ with bucketed time difference $k'$, and $\mu$ and $\gamma$ are hyperparameters;

S54, fusing the visual probability, the image spatio-temporal probability and the WiFi spatio-temporal probability:

$$Pr_{fuse} = Pr_{visual} \cdot Pr_{st} \cdot Pr_{wifi}$$

S55, re-ranking the original pedestrian re-identification ranking obtained in step S2 with $Pr_{fuse}$.
2. The multi-modal fused unsupervised pedestrian re-identification re-ranking method according to claim 1, wherein the process of step S1 is as follows:
S11, obtaining pedestrian images from surveillance video captured by the cameras of a specified road section: splitting the video into individual frames, then running an SSD or Faster R-CNN pedestrian detector on the frames;
S12, while collecting each pedestrian image, recording the time displayed in the surveillance video as the time information of the image under that camera, and recording the camera device ID as the spatial information of the image;
S13, during pedestrian movement, using WiFi collectors placed near the cameras to capture the WiFi signals emitted by mobile terminals as pedestrians pass each camera, the WiFi information including the unique MAC address of the mobile terminal, the time the WiFi information was captured, and the ID of the camera device at which it was captured;
S14, dividing the pedestrian images collected in step S11 into a query set and a candidate set, randomly or proportionally.
3. The multi-modal fused unsupervised pedestrian re-identification re-ranking method according to claim 1, wherein in step S2, pedestrian features are extracted with a ResNet-50 convolutional neural network model, the visual similarity between pedestrian images is computed with cosine similarity, and the obtained similarities are sorted in descending order to obtain the original pedestrian re-identification ranking, wherein the ResNet-50 network is connected sequentially from input layer to output layer as: a 7×7 convolution layer, a batch normalization layer, a ReLU activation, a 3×3 max pooling layer, four stages of residual bottleneck blocks containing 3, 4, 6 and 3 blocks respectively, a global average pooling layer, and a fully connected output layer.
4. The multi-modal fused unsupervised pedestrian re-identification re-ranking method according to claim 1, wherein the process of step S3 is as follows:
S31, assembling the visual similarities between pedestrian images obtained in step S2 into a visual similarity matrix of size Q×P, where Q is the number of pedestrian images in the query set and P is the number of pedestrian images in the candidate set;
S32, sorting each row of the visual similarity matrix in descending order and keeping the K most similar images per row to obtain a Q×K screening matrix;
S33, treating the images in each row of the screening matrix as the same person, then computing the image spatio-temporal distribution.
5. The multi-modal fused unsupervised pedestrian re-identification re-ranking method according to claim 4, wherein the process of step S33 is as follows:
S331, calculating the bucketed time difference of the same person migrating across cameras, using the time and spatial information of the pedestrian images obtained in step S1:

$$\Delta_{ij} = \left\lfloor \frac{t_j - t_i}{t_{interval}} \right\rfloor$$

where $t_i$ and $t_j$ denote the times at which the two pedestrian images appear under cameras $c_i$ and $c_j$ respectively, and $t_{interval}$ is the width of the buckets into which the cross-camera migration time differences are divided;

S332, performing frequency statistics on the calculated values:

$$\hat{p}(l) = \frac{n_l}{\sum_{k} n_k}$$

where $l$ is any one bucketed time difference, $n_l$ is the count of occurrences of bucket $l$, and $\sum_k n_k$ is the sum of the counts over all buckets;

S333, plotting connected points on two-dimensional rectangular coordinates with the bucketed time difference $l$ on the horizontal axis and the frequency $\hat{p}(l)$ on the vertical axis, obtaining the image spatio-temporal distribution.
CN202011313048.XA 2020-11-20 2020-11-20 Multi-modal fused unsupervised pedestrian re-identification re-ranking method Active CN112381024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011313048.XA CN112381024B (en) 2020-11-20 2020-11-20 Multi-modal fused unsupervised pedestrian re-identification re-ranking method

Publications (2)

Publication Number Publication Date
CN112381024A CN112381024A (en) 2021-02-19
CN112381024B (en) 2023-06-23

Family

ID=74584553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011313048.XA Active CN112381024B (en) 2020-11-20 2020-11-20 Multi-modal fused unsupervised pedestrian re-identification re-ranking method

Country Status (1)

Country Link
CN (1) CN112381024B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111778B (en) * 2021-04-12 2022-11-15 内蒙古大学 Large-scale crowd analysis method with video and wireless integration

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977917A * 2019-04-09 2019-07-05 中通服公众信息产业股份有限公司 Unsupervised transfer learning pedestrian re-identification method and system
CN110263697A * 2019-06-17 2019-09-20 哈尔滨工业大学(深圳) Unsupervised-learning-based pedestrian re-identification method, device and medium
CN111444758A * 2019-12-26 2020-07-24 珠海大横琴科技发展有限公司 Pedestrian re-identification method and device based on spatio-temporal information
CN111160297A * 2019-12-31 2020-05-15 武汉大学 Pedestrian re-identification method and device based on a residual attention mechanism spatio-temporal joint model
CN111178284A * 2019-12-31 2020-05-19 珠海大横琴科技发展有限公司 Pedestrian re-identification method and system based on a spatio-temporal joint model of map data

Also Published As

Publication number Publication date
CN112381024A (en) 2021-02-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant