CN110766093A - Video target re-identification method based on multi-frame feature fusion - Google Patents

Video target re-identification method based on multi-frame feature fusion

Info

Publication number
CN110766093A
CN110766093A (application number CN201911055853.4A)
Authority
CN
China
Prior art keywords
target
feature
orientation
fusion
identifying
Prior art date
Legal status
Pending
Application number
CN201911055853.4A
Other languages
Chinese (zh)
Inventor
李冠华
徐晓刚
管慧艳
刘静
Current Assignee
Smart Vision Hangzhou Technology Development Co Ltd
Original Assignee
Smart Vision Hangzhou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Smart Vision Hangzhou Technology Development Co., Ltd.
Priority to CN201911055853.4A
Publication of CN110766093A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target re-identification method based on multi-frame feature fusion, which comprises the following steps: acquiring multi-frame continuous images of the same target; classifying the images according to the orientation of the target; extracting target features from all the images; performing feature fusion and pooling on the images with the same orientation to obtain fused features; identifying the orientation of a target to be identified and extracting its feature to be identified; taking the product of the similarity between the feature to be identified and the fused feature of the corresponding orientation and the weight factor of that orientation as the final similarity; if the maximum value of the final similarity is larger than a given threshold, recognition succeeds and the target corresponding to that maximum is output as the re-identification result, otherwise recognition fails. The invention correlates the target along the time axis and solves the problem of matching the target across different orientations.

Description

Video target re-identification method based on multi-frame feature fusion
Technical Field
The invention relates to the technical field of image recognition, and in particular to a video target re-identification method based on multi-frame feature fusion.
Background
Searching for a specific pedestrian in ordinary surveillance video is a problem that urgently needs to be solved, in particular for locating a suspected target during case investigation. Because pedestrian targets in video are generally small, the resolution of the pedestrian image region is low and identity cannot be confirmed by face recognition methods. Pedestrian re-identification methods based on human-body appearance features have therefore been widely studied, but most current methods focus on extracting features from single images and have the following shortcomings:
1. Video is temporally continuous, and picture-based feature extraction ignores the continuous features along the time axis, so feature extraction is not accurate enough;
2. The same pedestrian can appear in different orientations in the video, and these different orientations strongly affect the final recognition.
Disclosure of Invention
The invention aims to provide a video target re-identification method based on multi-frame feature fusion, which correlates a target along the time axis and solves the problem of matching the target across different orientations.
In order to achieve the purpose, the invention provides the following technical scheme:
a video target re-identification method based on multi-frame feature fusion is characterized by comprising the following steps:
S1, acquiring multi-frame continuous images of the same target;
S2, classifying the images according to the orientation of the target;
S3, extracting target features of all the images;
S4, performing feature fusion and pooling on the images in the same orientation to obtain fusion features;
S5, identifying the orientation of the target to be identified, and extracting the feature to be identified according to S3 and S4;
S6, taking the product of the similarity between the feature to be identified and the fused feature of the corresponding orientation and the weight factor of that orientation as the final similarity;
S7, if the maximum value of the final similarity is larger than a given threshold, recognition succeeds and the target corresponding to the maximum value of the final similarity is output as the re-identification result; otherwise, recognition fails.
Further, the classification of the orientation in S2 employs a deep neural network model.
Further, S2 includes training the deep neural network model, with images whose orientations have been manually labeled used as training samples.
Further, the extraction of the target feature in S3 adopts a CNN network.
Further, the S4 feature fusion uses an RNN network, and uses a linear combination of the target feature input at the current time and the feature vector of the RNN network at the previous time as an output, specifically:
o^(t) = W_i f^(t) + W_s r^(t-1)
r^(t) = Tanh(o^(t))
where o^(t) is the output of the RNN network at the current time t; W_i and W_s are weight coefficients; f^(t) is the target feature input at the current time t; r^(t-1) is the feature vector of the RNN at the previous time t-1; Tanh(·) is the activation function.
Further, the pooling is an average pooling:
V_y = (1/T) · Σ_{t=1}^{T} o^(t)
where V_y is the fused feature and T is the duration (the number of fused frames).
Further, the calculation of the final similarity in S6 is specifically as follows:
S_o = w · S(V_x, V_y)
where V_x is the feature to be identified; V_y is a fused feature; S(·) is a similarity calculation function; w is the weight factor of the orientation, w ∈ W, W = {w_s, w_d, w_n}, where w_s is the weight factor when V_x and V_y have the same orientation, w_d when they have opposite orientations, and w_n when they have adjacent orientations; S_o is the final similarity.
Further, the values of the weight factors are w_s ∈ [0.8, 0.9], w_d ∈ [0.4, 0.5], w_n ∈ [0.55, 0.65].
Further, the given threshold is 0.6.
Compared with the prior art, the invention has the following beneficial effects: the human body is divided into four orientations and corresponding weight factors are set for the different orientations, which solves the problem of matching the target across different orientations; in addition, when multi-frame features are fused, the features are first grouped by orientation and temporal fusion is performed only on features with the same orientation, so that the target is correlated along the time axis.
Drawings
FIG. 1 is an overall process flow diagram of the present invention.
Fig. 2 is a diagram of an RNN network model.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a video object re-identification method based on multi-frame feature fusion, which includes the following steps:
S1, acquiring multi-frame continuous images of the same target;
S2, classifying the images according to the orientation of the target to obtain corresponding orientation information;
specifically, the present invention divides the pose of the target into four orientations, front, back, right and left, respectively. The classification algorithm adopts a deep neural network model trained in advance. And in the training of the deep neural network model, the artificially marked image is used as a sample to train the deep neural network model. The multiple frames of consecutive images in S1 are thus classified into multiple classes, one for each class, according to the orientation of the object.
S3, target feature extraction is performed on all the images, preferably using a CNN network.
Let i^(t) denote the image at time t. The image i^(t) is input into the CNN network C, which computes and outputs the target feature, expressed as f^(t) = C(i^(t)).
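A minimal sketch of such a feature extractor C(·), again assuming PyTorch and a ResNet-18 backbone (neither is prescribed by the patent), simply drops the classification head and returns the pooled backbone output as f^(t):
# Illustrative sketch of the CNN feature extractor C(.), so that f^(t) = C(i^(t)).
# Using ResNet-18 without its final classification layer is an assumption made for illustration.
import torch
import torch.nn as nn
from torchvision import models

class TargetFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18()
        # Keep all layers up to and including the global average pooling.
        self.body = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        """image: (3, H, W) tensor; returns a 512-dimensional feature vector f^(t)."""
        f = self.body(image.unsqueeze(0))   # shape (1, 512, 1, 1)
        return f.flatten()                  # shape (512,)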
S4, feature fusion and pooling are performed on the images with the same orientation to obtain fused features.
Specifically, the feature fusion adopts an RNN network and only images in the same orientation are fused, so that feature instability caused by different orientations is avoided. As shown in fig. 2, a linear combination of the target feature input at the current time and the feature vector of the RNN network at the previous time is used as an output, and specifically:
o^(t) = W_i f^(t) + W_s r^(t-1)
r^(t) = Tanh(o^(t))
where o^(t) is the output of the RNN network at the current time t; W_i and W_s are weight coefficients; f^(t) is the target feature input at the current time t; r^(t-1) is the feature vector of the RNN at the previous time t-1; Tanh(·) is the activation function.
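For illustration, this recurrence maps directly onto a small recurrent cell; the feature dimension (512, matching the extractor sketch above) and the absence of bias terms are assumptions:
# Illustrative sketch of the fusion recurrence:
#   o^(t) = W_i f^(t) + W_s r^(t-1)
#   r^(t) = Tanh(o^(t))
import torch
import torch.nn as nn

class FusionRNNCell(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.W_i = nn.Linear(dim, dim, bias=False)  # weight applied to the current feature f^(t)
        self.W_s = nn.Linear(dim, dim, bias=False)  # weight applied to the previous state r^(t-1)

    def forward(self, f_t: torch.Tensor, r_prev: torch.Tensor):
        o_t = self.W_i(f_t) + self.W_s(r_prev)      # o^(t)
        r_t = torch.tanh(o_t)                       # r^(t)
        return o_t, r_t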
After RNN network fusion, pooling operations, specifically average pooling, need to be performed:
V_y = (1/T) · Σ_{t=1}^{T} o^(t)
where V_y is the resulting fused feature and T is the duration (the number of fused frames).
Thus, each orientation in the image set classified in S2 has a corresponding fusion feature V_y, and together these form the fused-feature set.
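Combining the recurrence with the average pooling just described, a sketch of building the fused-feature set could look as follows; it reuses FusionRNNCell from the previous sketch, and pooling the per-frame outputs o^(t) with a zero-initialized r^(0) is an assumption:
# Illustrative sketch: one fused feature V_y per orientation, obtained by running the
# recurrence over the T frames of that orientation and average-pooling the outputs o^(t).
from typing import Dict, List
import torch

def fuse_orientation_features(frame_features: List[torch.Tensor],
                              cell: "FusionRNNCell") -> torch.Tensor:
    """frame_features: the per-frame features f^(1)..f^(T) of a single orientation."""
    r_prev = torch.zeros_like(frame_features[0])    # r^(0) initialized to zero (assumption)
    outputs = []
    for f_t in frame_features:
        o_t, r_prev = cell(f_t, r_prev)
        outputs.append(o_t)
    return torch.stack(outputs).mean(dim=0)         # V_y = (1/T) * sum over t of o^(t)

def build_fused_feature_set(features_by_orientation: Dict[str, List[torch.Tensor]],
                            cell: "FusionRNNCell") -> Dict[str, torch.Tensor]:
    """Maps each orientation present in the sequence to its fused feature V_y."""
    return {orientation: fuse_orientation_features(feats, cell)
            for orientation, feats in features_by_orientation.items()}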
S5, identifying the orientation of the target to be identified, and extracting the characteristic to be identified according to S3 and S4;
specifically, for a given target to be recognized, the orientation recognition is performed according to the deep neural network model trained in S2. Following the steps of S3 and S4,inputting the target to be identified into the CNN network to obtain the characteristic V to be identifiedx
S6, taking the product of the similarity of the feature to be identified and the fused feature corresponding to the orientation and the weight factor of the orientation as the final similarity;
specifically, the final similarity is calculated as follows:
S_o = w · S(V_x, V_y)
where V_x is the feature to be identified; V_y is a fused feature; S(·) is a similarity calculation function, specifically the cosine distance; w is the weight factor of the orientation, w ∈ W, W = {w_s, w_d, w_n}, where w_s is the weight factor when V_x and V_y have the same orientation, w_d when they have opposite orientations, and w_n when they have adjacent orientations; S_o is the final similarity. Preferably, the weight factors take values in w_s ∈ [0.8, 0.9], w_d ∈ [0.4, 0.5], w_n ∈ [0.55, 0.65]; in particular, w_s = 0.85, w_d = 0.45, w_n = 0.6.
It is worth mentioning that the mutual relationship of the orientations is as follows: taking the front face as an example, the opposite direction is the back face, and the directions adjacent to it are the left side and the right side.
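As an illustrative sketch of S6, the orientation relations and the weighted similarity can be coded as below; cosine similarity follows the description above, and the single values w_s = 0.85, w_d = 0.45, w_n = 0.6 are the particular values mentioned earlier:
# Illustrative sketch of the orientation-weighted similarity S_o = w * S(V_x, V_y).
import torch
import torch.nn.functional as F

OPPOSITE = {"front": "back", "back": "front", "left": "right", "right": "left"}
W_SAME, W_OPPOSITE, W_NEIGHBOR = 0.85, 0.45, 0.60   # w_s, w_d, w_n (particular values above)

def orientation_weight(ori_x: str, ori_y: str) -> float:
    if ori_x == ori_y:
        return W_SAME          # same orientation
    if OPPOSITE[ori_x] == ori_y:
        return W_OPPOSITE      # opposite orientations
    return W_NEIGHBOR          # adjacent orientations

def final_similarity(v_x: torch.Tensor, ori_x: str,
                     v_y: torch.Tensor, ori_y: str) -> float:
    s = F.cosine_similarity(v_x, v_y, dim=0)        # S(V_x, V_y), cosine similarity
    return orientation_weight(ori_x, ori_y) * float(s)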
S7, if the maximum value of the final similarity is greater than the given threshold of 0.6, recognition succeeds and the target corresponding to that maximum is output as the re-identification result; otherwise, recognition fails.
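Putting S5 to S7 together, a sketch of the final decision is given below; the gallery structure (mapping each candidate target to its fused features per orientation) is an assumption made for illustration, and final_similarity is reused from the previous sketch:
# Illustrative sketch of S7: choose the candidate with the highest final similarity and
# accept it only if that similarity exceeds the given threshold of 0.6.
from typing import Dict, Optional
import torch

THRESHOLD = 0.6

def re_identify(v_x: torch.Tensor, ori_x: str,
                gallery: Dict[str, Dict[str, torch.Tensor]]) -> Optional[str]:
    """gallery maps a candidate target id to its fused features {orientation: V_y}."""
    best_id, best_score = None, float("-inf")
    for target_id, fused_features in gallery.items():
        for ori_y, v_y in fused_features.items():
            s_o = final_similarity(v_x, ori_x, v_y, ori_y)
            if s_o > best_score:
                best_id, best_score = target_id, s_o
    return best_id if best_score > THRESHOLD else None   # None means recognition failed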
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (9)

1. A video target re-identification method based on multi-frame feature fusion is characterized by comprising the following steps:
S1, acquiring multi-frame continuous images of the same target;
S2, classifying the images according to the orientation of the target;
S3, extracting target features of all the images;
S4, performing feature fusion and pooling on the images in the same orientation to obtain fusion features;
S5, identifying the orientation of the target to be identified, and extracting the feature to be identified according to S3 and S4;
S6, taking the product of the similarity between the feature to be identified and the fused feature of the corresponding orientation and the weight factor of that orientation as the final similarity;
S7, if the maximum value of the final similarity is larger than a given threshold, recognition succeeds and the target corresponding to the maximum value of the final similarity is output as the re-identification result; otherwise, recognition fails.
2. The method for re-identifying the video target based on the multi-frame feature fusion of claim 1, wherein the classification of the orientation in S2 adopts a deep neural network model.
3. The method for re-identifying the video target based on the multi-frame feature fusion of claim 2, wherein S2 further includes training the deep neural network model, and the deep neural network model is trained using images with manually labeled orientations as samples.
4. The method for re-identifying the video target based on the multi-frame feature fusion as claimed in claim 1, wherein the extraction of the target feature in S3 adopts a CNN network.
5. The method according to claim 1, wherein the S4 feature fusion uses RNN network, and uses a linear combination of the target feature input at the current time and the feature vector of the RNN network at the previous time as an output, specifically:
o^(t) = W_i f^(t) + W_s r^(t-1)
r^(t) = Tanh(o^(t))
where o^(t) is the output of the RNN network at the current time t; W_i and W_s are weight coefficients; f^(t) is the target feature input at the current time t; r^(t-1) is the feature vector of the RNN at the previous time t-1; Tanh(·) is the activation function.
6. The method for re-identifying the video target based on the multi-frame feature fusion as claimed in claim 5, wherein the pooling is an average pooling:
V_y = (1/T) · Σ_{t=1}^{T} o^(t)
where V_y is the fused feature and T is the duration (the number of fused frames).
7. The method for re-identifying the video target based on the multi-frame feature fusion as claimed in claim 1, wherein the final similarity in S6 is calculated as follows:
S_o = w · S(V_x, V_y)
where V_x is the feature to be identified; V_y is a fused feature; S(·) is a similarity calculation function; w is the weight factor of the orientation, w ∈ W, W = {w_s, w_d, w_n}, where w_s is the weight factor when V_x and V_y have the same orientation, w_d when they have opposite orientations, and w_n when they have adjacent orientations; S_o is the final similarity.
8. The method for re-identifying the video target based on the multi-frame feature fusion of claim 7, wherein the values of the weight factors are w_s ∈ [0.8, 0.9], w_d ∈ [0.4, 0.5], w_n ∈ [0.55, 0.65].
9. The method according to claim 1, wherein the given threshold is 0.6.
CN201911055853.4A 2019-10-31 2019-10-31 Video target re-identification method based on multi-frame feature fusion Pending CN110766093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911055853.4A CN110766093A (en) 2019-10-31 2019-10-31 Video target re-identification method based on multi-frame feature fusion


Publications (1)

Publication Number Publication Date
CN110766093A true CN110766093A (en) 2020-02-07

Family

ID=69335446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911055853.4A Pending CN110766093A (en) 2019-10-31 2019-10-31 Video target re-identification method based on multi-frame feature fusion

Country Status (1)

Country Link
CN (1) CN110766093A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709449A (en) * 2016-12-22 2017-05-24 深圳市深网视界科技有限公司 Pedestrian re-recognition method and system based on deep learning and reinforcement learning
CN107767416A (en) * 2017-09-05 2018-03-06 华南理工大学 The recognition methods of pedestrian's direction in a kind of low-resolution image
CN109784130A (en) * 2017-11-15 2019-05-21 株式会社日立制作所 Pedestrian recognition methods and its device and equipment again
CN109145777A (en) * 2018-08-01 2019-01-04 北京旷视科技有限公司 Vehicle recognition methods, apparatus and system again

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HAO LIU et al.: "Video-Based Person Re-Identification With Accumulative Motion Context", IEEE Transactions on Circuits and Systems for Video Technology *
LUO Hao et al.: "Research progress of person re-identification based on deep learning" (in Chinese), Acta Automatica Sinica *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909210A (en) * 2020-02-18 2020-03-24 北京海天瑞声科技股份有限公司 Video screening method and device and storage medium
CN111444817A (en) * 2020-03-24 2020-07-24 咪咕文化科技有限公司 Person image identification method and device, electronic equipment and storage medium
CN111444817B (en) * 2020-03-24 2023-07-07 咪咕文化科技有限公司 Character image recognition method and device, electronic equipment and storage medium


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200207)