CN112906557A - Multi-granularity characteristic aggregation target re-identification method and system under multiple visual angles - Google Patents


Publication number
CN112906557A
Authority
CN
China
Prior art keywords
target
granularity
feature
hypergraph
target object
Prior art date
Legal status
Granted
Application number
CN202110183597.8A
Other languages
Chinese (zh)
Other versions
CN112906557B (en
Inventor
彭德光
Current Assignee
Chongqing Megalight Technology Co ltd
Original Assignee
Chongqing Megalight Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chongqing Megalight Technology Co ltd filed Critical Chongqing Megalight Technology Co ltd
Priority to CN202110183597.8A priority Critical patent/CN112906557B/en
Publication of CN112906557A publication Critical patent/CN112906557A/en
Application granted granted Critical
Publication of CN112906557B publication Critical patent/CN112906557B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a system for multi-granularity feature aggregation target re-identification under multiple viewing angles, comprising the following steps: constructing a multi-view neural network, and acquiring target features of a target object from multiple views through the multi-view neural network; constructing a multi-granularity hypergraph based on the target features of each target object within a set time period; inputting a target image to be queried, and acquiring the neighbor feature set of the target image to be queried from the multi-granularity hypergraph; and comparing the similarity of the neighbor feature set of the target image to be queried with the neighbor feature set of each target object in the multi-granularity hypergraph to obtain the target object re-identification result. The invention can effectively improve re-identification accuracy.

Description

Multi-granularity characteristic aggregation target re-identification method and system under multiple visual angles
Technical Field
The invention relates to the field of target re-identification, and in particular to a method and a system for multi-granularity feature aggregation target re-identification under multiple viewing angles.
Background
Pedestrian re-identification based on video sequences has been widely studied because rich temporal information can be used to resolve visual ambiguity. The classical approach to video pedestrian re-identification is to project a video sequence into a high-dimensional feature space with a deep learning method and then rank identity matches by computing distances between samples. Two main strategies are used: aggregating frame-level temporal features with a recurrent neural network to represent video pedestrian features, and extracting dynamic temporal information from video frames with an optical flow field to learn temporal features. The prior art has the following disadvantages: 1. Video learning based on recurrent neural networks cannot learn the most discriminative features, and training such models on long video segments is complex and time-consuming. 2. Methods that extract temporal features by exploring the optical flow structure are prone to optical-flow estimation errors when adjacent frames of a video clip are misaligned. To solve these problems, the present invention provides a video pedestrian re-identification method based on multi-granularity feature aggregation under multiple viewing angles, which simultaneously captures multi-granularity spatial and temporal information of a video sequence, and retains and enhances diverse discriminative feature representations at different spatial granularities by adopting a simple and efficient hypergraph construction.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a method and a system for multi-granularity feature aggregation target re-identification under multiple viewing angles, mainly solving the problems of long training time and low accuracy in existing methods.
In order to achieve the above and other objects, the present invention adopts the following technical solutions.
A multi-granularity feature aggregation target re-identification method under multiple visual angles comprises the following steps:
constructing a multi-view neural network, and acquiring target characteristics of a target object from multiple views through the multi-view neural network;
constructing a multi-granularity hypergraph based on the target characteristics of each target object in a set time period;
inputting a target image to be queried, and acquiring the neighbor feature set of the target image to be queried from the multi-granularity hypergraph;
and carrying out similarity comparison on the adjacent feature set of the target image to be inquired and the adjacent feature set of each target object in the multi-granularity hypergraph to obtain a target object re-identification result.
Optionally, the multi-view neural network includes a convolutional neural network and a classification output layer, and the image is subjected to feature extraction by the convolutional neural network and then input to the classification output layer to obtain target feature outputs of different views.
Optionally, the multi-view neural network is pre-trained by inputting a set containing pre-labeled images with different views into the multi-view neural network, constructing a loss function through cross entropy, and updating network parameters by adopting back propagation.
Optionally, the loss function is expressed as:

L = -∑_{i=1}^{N} y_i log(ŷ_i)

where y_i is the label corresponding to the viewing angle, ŷ_i is the classification prediction result, and N is the number of viewing angles.
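As a minimal sketch (not the patent's exact implementation), the cross-entropy view-classification loss can be computed as follows; the choice of three viewing angles and a one-hot label is an illustrative assumption:

```python
import numpy as np

def view_cross_entropy(y_true, y_pred):
    """Cross-entropy over N view classes: L = -sum_i y_i * log(y_hat_i)."""
    y_pred = np.clip(y_pred, 1e-12, 1.0)  # guard against log(0)
    return -float(np.sum(y_true * np.log(y_pred)))

# Illustrative: one-hot label for the first of N = 3 viewing angles,
# compared against a softmax output of the classification layer.
y = np.array([1.0, 0.0, 0.0])
p = np.array([0.7, 0.2, 0.1])
loss = view_cross_entropy(y, p)  # equals -log(0.7)
```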
Optionally, the target object comprises a pedestrian or a vehicle.
Optionally, acquiring the neighbor feature set of the target image to be queried from the multi-granularity hypergraph includes:
calculating the Euclidean distances between the target features in the multi-granularity hypergraph, and acquiring the first K target features closest to the feature corresponding to the target image to be queried;
acquiring the neighbor set of each of the K target features, and selecting those neighbor sets that contain the feature corresponding to the target image to be queried to form the neighbor feature set of the target image to be queried.
Optionally, comparing the similarity between the neighbor feature set of the target image to be queried and the neighbor feature set of each target object in the multi-granularity hypergraph to obtain a target object re-identification result includes:
measuring the similarity between the neighbor feature sets through the Jaccard distance, and selecting the target object corresponding to the neighbor feature set whose similarity reaches a set threshold as the re-identification output.
Optionally, the similarity calculation is expressed as:

d_J(I_i, I_j) = 1 - |R(I_i, k) ∩ R(I_j, k)| / |R(I_i, k) ∪ R(I_j, k)|

where I_i and I_j denote two frame images and R(I_i, k) denotes the neighbor feature set of image I_i.
A multi-granularity feature aggregation target re-identification system under multiple views comprises:
the network construction module is used for constructing a multi-view neural network and acquiring target characteristics of a target object from multiple views through the multi-view neural network;
the hypergraph construction module is used for constructing a multi-granularity hypergraph based on the target characteristics of each target object in a set time period;
the feature set acquisition module is used for inputting a target image to be queried and acquiring the neighbor feature set of the target image to be queried from the multi-granularity hypergraph;
and the identification module is used for comparing the similarity of the adjacent feature set of the target image to be inquired with the adjacent feature set of each target object in the multi-granularity hypergraph to obtain a target object re-identification result.
As described above, the method and system for re-identifying the multi-granularity feature aggregation target under multiple viewing angles of the present invention have the following advantages.
Viewing-angle information is added, alleviating problems such as occlusion and viewing-angle differences; the neighbor feature set enhances re-identification accuracy.
Drawings
Fig. 1 is a flowchart of a method for re-identifying a multi-granularity feature aggregation target under multiple viewing angles in an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
Referring to fig. 1, the present invention provides a method for re-identifying a multi-granularity feature aggregation target under multiple viewing angles, which includes steps S01-S04.
In step S01, a multi-view neural network is constructed, and target features of a target object from multiple views are acquired through the multi-view neural network:
in one embodiment, the target object may include a pedestrian, a vehicle, etc., a video image including the target object is captured in advance, and the video sequence is acquired as an input of the multi-view neural network.
In one embodiment, the multi-view neural network comprises a convolutional neural network and a classification output layer, and after the image is subjected to feature extraction through the convolutional neural network, the image is input into the classification output layer to obtain target feature outputs of different views.
In one embodiment, a set containing pre-labeled images of different visual angles is input into a multi-visual-angle neural network, a loss function is constructed through cross entropy, network parameters are updated through back propagation, and the multi-visual-angle neural network is pre-trained.
Specifically, a ternary (three-way) view-classification output layer is added after a conventional CNN. A labeled image x_i is used as input, and its corresponding view label y_i serves as the supervisory signal. The prediction result ŷ_i is supervised with cross-entropy, and the cross-entropy loss function can be expressed as:

L = -∑_{i=1}^{N} y_i log(ŷ_i)

The update calculation of the loss function is completed using a forward-backward algorithm.
Extracting video frame features:

For a video sequence I = {I_1, I_2, ..., I_T}, feature extraction is performed on each image with the constructed multi-view neural network, which can be expressed as:

F_i = CNN(I_i), i = 1, ..., T,

where F_i is a three-dimensional tensor of dimensions C × H × W; C denotes the channel size, and H and W denote the height and width of the feature map, respectively.
In step S02, a multi-granularity hypergraph is constructed based on the target features of the target objects within a set time period:

The image features extracted in step S01 are divided into p ∈ {1, 2, 4, 8} horizontal stripes, and the divided feature maps are average-pooled to construct part-level feature vectors. For each granularity, the entire sequence generates N_p = T × p part-level features, denoted {f_i^p}. The first granularity of a video sequence consists of a single global feature vector, while the other granularities consist of part-level feature vectors.
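The horizontal partitioning step can be sketched as follows; this is a minimal illustration on NumPy arrays, not the patent's actual tensor pipeline:

```python
import numpy as np

def part_features(F, p):
    """Split a C x H x W feature map into p horizontal stripes and
    average-pool each stripe into one C-dimensional part-level vector."""
    stripes = np.array_split(F, p, axis=1)          # split along the height axis
    return [s.mean(axis=(1, 2)) for s in stripes]   # p vectors of length C

# One frame's feature map with C = 256 channels, height 16, width 8.
F = np.random.rand(256, 16, 8)
multi_granularity = {p: part_features(F, p) for p in (1, 2, 4, 8)}
# Granularity p yields p part-level vectors per frame, so a T-frame
# sequence produces N_p = T * p features; p = 1 is the global feature.
```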
First, let v_i ∈ V_p, i ∈ {1, 2, ..., N_p} denote the candidate nodes prepared for constructing the hypergraph, and define a set of hyperedges E_p to capture temporal information and model short-term to long-term correlations in the hypergraph. Specifically, for any candidate node v_i, the K most similar neighboring nodes within a time window T_t are selected, and a hyperedge e ∈ E_p is used to connect these K + 1 nodes.
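The hyperedge construction just described can be sketched as below; selecting neighbours by Euclidean distance over all nodes is an assumption of this sketch (the patent restricts the selection to a time window, which is omitted here for brevity):

```python
import numpy as np

def build_hyperedges(X, K):
    """For each node feature X[i], form a hyperedge joining the node
    with its K most similar (nearest) neighbours: K + 1 nodes per edge."""
    edges = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)  # Euclidean distances to all nodes
        d[i] = np.inf                         # exclude the node itself
        nbrs = np.argsort(d)[:K]
        edges.append(sorted([i, *nbrs.tolist()]))
    return edges

# Four part-level features; with K = 1 each hyperedge has 2 nodes.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
edges = build_hyperedges(X, K=1)  # [[0, 1], [0, 1], [2, 3], [2, 3]]
```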
updating characteristics of the hypergraph;
for a certain node v of the hypergraphiDefinition of
Figure BDA0002942141000000053
And representing all the super edges related to the point, wherein the point related to one super edge has strong relevance, so that the super edge is defined by adopting the aggregation operation as follows:
Figure BDA0002942141000000054
wherein the content of the first and second substances,
Figure BDA0002942141000000055
denotes vjNode characteristics at the layer. Calculating the similarity of the association relationship between the node features and the association features of the super edges
Figure BDA0002942141000000056
Figure BDA0002942141000000057
Wherein the content of the first and second substances,
Figure BDA0002942141000000058
representing the similarity between features. In addition, SoftMax normalized similarity weight is adopted, and super-side information is aggregated to obtain the similarity weight through respective calculation
Figure BDA0002942141000000059
Figure BDA00029421410000000510
Figure BDA00029421410000000511
After the aggregated super-edge information is obtained, the node characteristics can be associated through a full connection layer:
Figure BDA00029421410000000512
wherein WlRepresenting the weight matrix and sigma the excitation equation. Therefore, repeating the update mechanism more than L times can calculate a series of output node characteristics
Figure BDA00029421410000000513
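One node-update step (aggregate incident hyperedges, weight them by similarity to the node, then apply a fully connected layer) might be sketched as follows; the mean hyperedge aggregation, dot-product similarity, and feature concatenation are assumptions of this sketch, since the patent names the operations without reproducing exact formulas here:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def update_node(x_i, incident_edges, W, sigma=np.tanh):
    """One hypergraph update for node feature x_i (layer l -> l + 1):
    each incident hyperedge is aggregated as the mean of its node
    features, edges are weighted by softmax of dot-product similarity
    with x_i, and [x_i ; message] passes through a fully connected layer."""
    edge_feats = np.array([np.mean(e, axis=0) for e in incident_edges])
    weights = softmax(edge_feats @ x_i)               # normalized similarities
    message = weights @ edge_feats                    # aggregated hyperedge info
    return sigma(W @ np.concatenate([x_i, message]))  # next-layer node feature

# Two incident hyperedges over 2-D node features; W maps 4 -> 2 dims.
x = np.array([1.0, 0.0])
edges = [np.array([[1.0, 0.0], [0.9, 0.1]]), np.array([[0.0, 1.0], [0.1, 0.9]])]
rng = np.random.default_rng(0)
x_next = update_node(x, edges, rng.standard_normal((2, 4)))
```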
Hypergraph feature aggregation based on an attention mechanism:

After the final updated node features of each hypergraph are obtained, it is taken into account that, within one hypergraph, different nodes have different importance: occluded parts or background regions, for example, are given lower importance, which improves the discriminability of the features. Therefore, an attention-based importance computation is designed for the nodes of each hypergraph:

a_i = exp(W_u x_i^L) / ∑_j exp(W_u x_j^L)

where W_u denotes a weight matrix. The hypergraph feature can then be computed as an attention-weighted aggregation of the node features:

h = ∑_i a_i · x_i^L
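The attention-based readout can be sketched as follows; using a single weight vector w_u in place of the patent's weight matrix W_u is an assumption of this sketch:

```python
import numpy as np

def hypergraph_feature(node_feats, w_u):
    """Attention-based readout: score each final-layer node feature,
    softmax-normalize the scores, and return the weighted sum."""
    X = np.asarray(node_feats)
    scores = X @ w_u                        # per-node importance score
    a = np.exp(scores - scores.max())
    a /= a.sum()                            # attention weights sum to 1
    return a @ X                            # graph-level feature

nodes = [[1.0, 0.0], [0.0, 1.0]]
h = hypergraph_feature(nodes, np.array([0.0, 0.0]))  # equal weights -> mean
```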
minimizing retention loss aggregation multi-granularity hypergraphs based on mutual information;
to optimize the framework, cross-entropy loss and triplet state loss are employed to co-supervise the training process:
Figure BDA0002942141000000064
Figure BDA0002942141000000065
wherein y isiRepresentation feature
Figure BDA0002942141000000066
The labels of (1), N and C respectively represent the size of the mini-batch and the class number of the training set,
Figure BDA0002942141000000067
respectively representing a query sample and a positive sample and a negative sample thereof when the partition granularity is p. After training the model based on the two loss terms, each hypergraph will output a distinct graph-level feature.
In order to obtain the characteristics of the fused multi-granularity hypergraph information, mutual information minimization loss is adopted, mutual information among different hypergraph characteristics is reduced, and further the uncertainty of final video representation is increased by combining all the characteristics. Thus, for hypergraph features of different granularity p, a mutual information minimization loss is defined:
Figure BDA0002942141000000068
kappa is used to measure the mutual information established by the characteristics of different hypergraphs. And finally combining the loss functions of all parts as the formula (13), and adopting a forward-backward algorithm to complete the updating calculation of the loss functions.
Lall=Lxent+Ltri+LMI
In step S03, a target image to be queried is input, and the set of neighboring features of the target image to be queried is obtained from the multi-granularity hypergraph:
in one embodiment, the Euclidean distance between the target features in the multi-granularity hypergraph is calculated, and the first K target features with the closest feature distance corresponding to the target image to be inquired are obtained;
and acquiring an adjacent set of each target feature in the K target features, and selecting the adjacent sets containing the corresponding features of the target image to be inquired from the adjacent sets to form an adjacent feature set of the target image to be inquired.
Specifically, the Euclidean distance d_m(F'_i, F'_j) between the hypergraph features obtained in step S02 is calculated, and the neighbor set N(probe, k) corresponding to the k smallest distances to the query image probe is computed. This set contains both positive and negative samples and is defined as:

N(probe, k) = {t_1, t_2, ..., t_k}

where t_1, t_2, ..., t_k are the samples with the 1st, 2nd, ..., k-th smallest Euclidean distance to probe. Meanwhile, each t_i in the neighbor set N has its own neighbor set N'; if probe is included in N', probe and t_i are mutually adjacent, and otherwise they are not. The k-reciprocal neighbor set R of probe can therefore be obtained, in which all elements are target objects mutually adjacent to probe:

R(probe, k) = {t_i ∈ N(probe, k) ∩ (probe ∈ N(t_i, k))}
This set can be regarded as the k-reciprocal neighbor features of probe and, compared with the hypergraph feature alone, is better suited to similarity measurement between pedestrians.
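A minimal sketch of the k-reciprocal neighbor set R(probe, k), assuming a precomputed pairwise Euclidean distance matrix:

```python
import numpy as np

def knn(dists, i, k):
    """Indices of the k samples nearest to sample i (excluding i itself)."""
    order = [j for j in np.argsort(dists[i]) if j != i]
    return set(order[:k])

def k_reciprocal(dists, probe, k):
    """R(probe, k): neighbours t of probe such that probe is, in turn,
    among the k nearest neighbours of t (mutual adjacency)."""
    return {t for t in knn(dists, probe, k) if probe in knn(dists, t, k)}

# Toy symmetric pairwise distance matrix for 4 samples.
D = np.array([[0.0, 1.0, 4.0, 9.0],
              [1.0, 0.0, 2.0, 9.0],
              [4.0, 2.0, 0.0, 1.0],
              [9.0, 9.0, 1.0, 0.0]])
R = k_reciprocal(D, probe=0, k=2)
# Sample 2 is near probe 0, but probe 0 is not among sample 2's two
# nearest neighbours, so only sample 1 is a k-reciprocal neighbour.
```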
In step S04, the similarity between the neighboring feature set of the target image to be queried and the neighboring feature set of each target object in the multi-granularity hypergraph is compared to obtain a target object re-identification result:
in an embodiment, similarity between the neighboring feature sets is measured through a Jaccard distance, and a target object corresponding to the neighboring feature set with the similarity reaching a set threshold is selected as a re-recognition output, specifically:
for describing any two images I in detail from the perspective of collectioni,IjThe difference between the nearest neighbor sets defines the Jaccard distance between the two neighbor sets
Figure BDA0002942141000000074
And measuring the similarity between the target objects according to the distance, and re-identifying the query target object.
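The Jaccard-distance comparison over k-reciprocal neighbor sets can be sketched as follows (the empty-set convention is an assumption of this sketch):

```python
def jaccard_distance(R_i, R_j):
    """d_J = 1 - |R_i ∩ R_j| / |R_i ∪ R_j| between two k-reciprocal
    neighbour sets; 0 means identical sets, 1 means disjoint sets."""
    union = R_i | R_j
    if not union:            # both sets empty: treat as identical (assumption)
        return 0.0
    return 1.0 - len(R_i & R_j) / len(union)

d = jaccard_distance({1, 2, 3}, {2, 3, 4})  # 1 - 2/4 = 0.5
```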
The embodiment also provides a system for re-identifying a multi-granularity feature aggregation target under multiple views, which is used for executing the method for re-identifying the multi-granularity feature aggregation target under multiple views in the method embodiment. Since the technical principle of the system embodiment is similar to that of the method embodiment, repeated description of the same technical details is omitted.
In one embodiment, a multi-granularity feature aggregation target re-identification system under multiple views comprises: the system comprises a network construction module, a hypergraph construction module, a feature set acquisition module and an identification module, wherein the network construction module is used for assisting in executing the step S01 in the embodiment of the method; the hypergraph construction module is used to assist in performing step S02 in the foregoing method embodiment; the feature set acquisition module is used for assisting in executing step S03 in the foregoing method embodiment; the identification module is used to assist in performing step S04 in the aforementioned method embodiments.
In summary, the multi-granularity feature aggregation target re-identification method and system under multiple viewing angles adopt ternary view classification so that pedestrian features carry viewing-angle information in subsequent processing, alleviating problems such as occlusion and viewing-angle differences; the hypergraph neural network structure can simultaneously extract the spatial features and temporal dependencies of video frames, and the mutual-information minimization loss preserves and enhances the diversity of the hypergraphs corresponding to different spatial granularities; the k-reciprocal encoding method improves pedestrian re-identification accuracy and compensates for hypergraph learning's excessive focus on local information. The invention thus effectively overcomes various defects in the prior art and has high value for industrial utilization.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (9)

1. A multi-granularity feature aggregation target re-identification method under multiple visual angles is characterized by comprising the following steps:
constructing a multi-view neural network, and acquiring target characteristics of a target object from multiple views through the multi-view neural network;
constructing a multi-granularity hypergraph based on the target characteristics of each target object in a set time period;
inputting a target graph to be queried, and acquiring a neighboring feature set of the target image to be queried from the multi-granularity hypergraph;
and carrying out similarity comparison on the adjacent feature set of the target image to be inquired and the adjacent feature set of each target object in the multi-granularity hypergraph to obtain a target object re-identification result.
2. The method according to claim 1, wherein the multi-view neural network comprises a convolutional neural network and a classification output layer, and after feature extraction is performed on the image by the convolutional neural network, the image is input to the classification output layer to obtain target feature outputs of different views.
3. The method for re-identifying the multi-granularity feature aggregation target under the multi-view angle according to claim 2, wherein different pre-labeled view angle image sets are input into the multi-view angle neural network, a loss function is constructed through cross entropy, network parameters are updated through back propagation, and the multi-view angle neural network is pre-trained.
4. The method for multi-granularity feature aggregation target re-identification under multiple views according to claim 3, wherein the loss function is expressed as:
L = -∑_{i=1}^{N} y_i log(ŷ_i)

wherein y_i is the label corresponding to the viewing angle, ŷ_i is the classification prediction result, and N is the number of viewing angles.
5. The method according to claim 1, wherein the target object comprises a pedestrian or a vehicle.
6. The method for re-identifying the multi-granularity feature aggregation target under the multi-view angle according to claim 1, wherein the step of obtaining the neighboring feature set of the target image to be queried from the multi-granularity hypergraph comprises the following steps:
calculating Euclidean distances among target features in the multi-granularity hypergraph, and acquiring the first K target features with the closest feature distances corresponding to the target image to be inquired;
and acquiring an adjacent set of each target feature in the K target features, and selecting the adjacent sets containing the corresponding features of the target image to be inquired from the adjacent sets to form an adjacent feature set of the target image to be inquired.
7. The method according to claim 1, wherein the comparing the similarity between the neighboring feature set of the target image to be queried and the neighboring feature set of each target object in the multi-granularity hypergraph to obtain a target object re-recognition result comprises:
and measuring the similarity among the adjacent feature sets through the Jaccard distance, and selecting a target object corresponding to the adjacent feature set with the similarity reaching a set threshold value as re-recognition output.
8. The method for re-identifying the multi-granularity feature aggregation target under the multi-view angle according to claim 7, wherein the similarity calculation mode is expressed as:
d_J(I_i, I_j) = 1 - |R(I_i, k) ∩ R(I_j, k)| / |R(I_i, k) ∪ R(I_j, k)|

wherein I_i and I_j respectively denote two frame images, and R(I_i, k) denotes the neighbor feature set of image I_i.
9. A multi-granularity feature aggregation target re-identification system under multiple views, comprising:
a network construction module for constructing a multi-view neural network and obtaining target features of a target object from multiple views through the multi-view neural network;
a hypergraph construction module for constructing a multi-granularity hypergraph based on the target features of each target object within a set time period;
a feature set acquisition module for receiving a target image to be queried and obtaining the neighboring feature set of the target image to be queried from the multi-granularity hypergraph;
an identification module for comparing the similarity between the neighboring feature set of the target image to be queried and the neighboring feature set of each target object in the multi-granularity hypergraph to obtain a target object re-identification result.
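The four modules of claim 9 compose into a single query pipeline. A skeletal sketch of that composition, with every name hypothetical and the network/hypergraph stages abstracted into one injected feature-set extractor:

```python
from typing import Callable, Dict, List, Set

def run_reid(extract: Callable[[object], Set[int]],
             gallery: Dict[str, object],
             query: object,
             threshold: float) -> List[str]:
    """Illustrative glue for claim 9: `extract` stands in for the
    multi-view network plus hypergraph neighbor-set lookup, and the
    Jaccard comparison plays the identification module."""
    def jaccard(a: Set[int], b: Set[int]) -> float:
        return len(a & b) / len(a | b) if (a or b) else 0.0

    query_set = extract(query)
    return [gid for gid, img in gallery.items()
            if jaccard(query_set, extract(img)) >= threshold]
```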
CN202110183597.8A 2021-02-08 2021-02-08 Multi-granularity feature aggregation target re-identification method and system under multi-view angle Active CN112906557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110183597.8A CN112906557B (en) 2021-02-08 2021-02-08 Multi-granularity feature aggregation target re-identification method and system under multi-view angle

Publications (2)

Publication Number Publication Date
CN112906557A true CN112906557A (en) 2021-06-04
CN112906557B CN112906557B (en) 2023-07-14

Family

ID=76123514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110183597.8A Active CN112906557B (en) 2021-02-08 2021-02-08 Multi-granularity feature aggregation target re-identification method and system under multi-view angle

Country Status (1)

Country Link
CN (1) CN112906557B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7428998B2 (en) * 2003-11-13 2008-09-30 Metrologic Instruments, Inc. Automatic hand-supportable image-based bar code symbol reader having image-processing based bar code reading subsystem employing simple decode image processing operations applied in an outwardly-directed manner referenced from the center of a captured narrow-area digital image of an object bearing a 1D bar code symbol
US8024193B2 (en) * 2006-10-10 2011-09-20 Apple Inc. Methods and apparatus related to pruning for concatenative text-to-speech synthesis
CN102663374A (en) * 2012-04-28 2012-09-12 北京工业大学 Multi-class Bagging gait recognition method based on multi-characteristic attribute
CN103959308A (en) * 2011-08-31 2014-07-30 Metaio有限公司 Method of matching image features with reference features
CN104061907A (en) * 2014-07-16 2014-09-24 中南大学 Viewing-angle greatly-variable gait recognition method based on gait three-dimensional contour matching synthesis
CN104281572A (en) * 2013-07-01 2015-01-14 中国科学院计算技术研究所 Target matching method and system based on mutual information
CN106096532A (en) * 2016-06-03 2016-11-09 山东大学 A kind of based on tensor simultaneous discriminant analysis across visual angle gait recognition method
CN106780551A (en) * 2016-11-18 2017-05-31 湖南拓视觉信息技术有限公司 A kind of Three-Dimensional Moving Targets detection method and system
CN109543602A (en) * 2018-11-21 2019-03-29 太原理工大学 A kind of recognition methods again of the pedestrian based on multi-view image feature decomposition
CN110738146A (en) * 2019-09-27 2020-01-31 华中科技大学 target re-recognition neural network and construction method and application thereof
CN111814584A (en) * 2020-06-18 2020-10-23 北京交通大学 Vehicle weight identification method under multi-view-angle environment based on multi-center measurement loss
CN112132014A (en) * 2020-09-22 2020-12-25 德州学院 Target re-identification method and system based on non-supervised pyramid similarity learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JING Chenyong et al.: "Research on Human Action Recognition Algorithm Based on Hybrid Collaborative Training", Computer Science, pages 275 - 278 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114299128A (en) * 2021-12-30 2022-04-08 咪咕视讯科技有限公司 Multi-view positioning detection method and device
CN114419349A (en) * 2022-03-30 2022-04-29 中国科学技术大学 Image matching method and device
CN114419349B (en) * 2022-03-30 2022-07-15 中国科学技术大学 Image matching method and device

Also Published As

Publication number Publication date
CN112906557B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN109948425B (en) Pedestrian searching method and device for structure-aware self-attention and online instance aggregation matching
CN110414368B (en) Unsupervised pedestrian re-identification method based on knowledge distillation
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
CN109344285B (en) Monitoring-oriented video map construction and mining method and equipment
CN105809672B (en) A kind of image multiple target collaboration dividing method constrained based on super-pixel and structuring
CN112507901B (en) Unsupervised pedestrian re-identification method based on pseudo tag self-correction
CN109165540B (en) Pedestrian searching method and device based on prior candidate box selection strategy
CN111259786A (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN109871875B (en) Building change detection method based on deep learning
CN105184260B (en) A kind of image characteristic extracting method and pedestrian detection method and device
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113313123B (en) Glance path prediction method based on semantic inference
CN112633382A (en) Mutual-neighbor-based few-sample image classification method and system
CN114399644A (en) Target detection method and device based on small sample
CN109598220A (en) A kind of demographic method based on the polynary multiple dimensioned convolution of input
CN109919112B (en) Method for detecting distribution and counting of flowing crowds in complex scene
CN112906557A (en) Multi-granularity characteristic aggregation target re-identification method and system under multiple visual angles
CN112270286A (en) Shadow interference resistant monochrome video target tracking method
CN115690549A (en) Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
CN105631858B (en) Image object method of counting based on sample block
Abdullah et al. Vehicle counting using deep learning models: a comparative study
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN111611919B (en) Road scene layout analysis method based on structured learning
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 400000 6-1, 6-2, 6-3, 6-4, building 7, No. 50, Shuangxing Avenue, Biquan street, Bishan District, Chongqing

Applicant after: CHONGQING ZHAOGUANG TECHNOLOGY CO.,LTD.

Address before: 400000 2-2-1, 109 Fengtian Avenue, tianxingqiao, Shapingba District, Chongqing

Applicant before: CHONGQING ZHAOGUANG TECHNOLOGY CO.,LTD.

GR01 Patent grant