CN113947782A

CN113947782A - Pedestrian target alignment method based on attention mechanism

Info

Publication number: CN113947782A
Application number: CN202111197529.3A
Authority: CN
Inventors: 郑丽颖; 郑薪竹; 张钰渤
Original assignee: Harbin Engineering University
Current assignee: Harbin Engineering University
Priority date: 2021-10-14
Filing date: 2021-10-14
Publication date: 2022-01-18

Abstract

The invention provides a pedestrian target alignment method based on an attention mechanism. It addresses target loss and occlusion, and improves alignment precision and performance through process optimization and local feature extraction. The method achieves high alignment precision, makes full use of the global structural information and the local original information of pedestrian image features, handles pedestrian target occlusion effectively, and mitigates the impact of local feature misalignment on algorithm performance.

Description

Pedestrian target alignment method based on attention mechanism

Technical Field

The invention relates to the technical field of pedestrian target alignment, in particular to a pedestrian target alignment method based on an attention mechanism.

Background

Pedestrian target alignment is a technique that identifies an initial target region in a sequence of pedestrian images and then retrieves that target in subsequent images captured across cameras and scenes. It is an important complement to face recognition and is widely applied in military and civilian fields such as pedestrian detection, video surveillance, action recognition, skeleton keypoint detection, and pose recognition.

In recent years, pedestrian target alignment, especially against complex backgrounds, has become a research focus, and a variety of techniques and methods have been proposed. Among them, attention-based methods have gained favor in academia and industry because of their strong performance and high running speed. Target identification based on saliency learning adds the saliency features of the target pedestrian to patch matching, so that the algorithm can find distinctive and reliable patch-matching features. Other work combines attention with a convolutional neural network and jointly learns hard region-level and soft pixel-level attention to align targets when images are misaligned. Spatiotemporal attention algorithms extract useful region information from each image frame with multiple spatial attention models and integrate the outputs through a temporal attention model, so that useful region information is drawn from all frames and robustness improves. However, the performance of existing alignment techniques remains unsatisfactory because of target occlusion, visual disparity, changes in lighting conditions, and similar factors.

Disclosure of Invention

The invention aims to provide a pedestrian target alignment method based on an attention mechanism, which aims to solve the problems of target loss and occlusion.

The purpose of the invention is realized as follows:

1. An attention mechanism-based pedestrian target alignment method comprises the following steps:

Step 1: compute the feature map of the image with a residual network based on an attention mechanism;

Step 1.1: extract coarse features of the input image I with a CNN layer containing the initial convolution block;

Step 1.2: obtain the target intermediate feature tensor $X \in \mathbb{R}^{C \times H \times W}$ from the first residual block of the residual network, where $\mathbb{R}$ is the set of real numbers and W, H, and C are the width, height, and number of channels of the feature map, respectively;

Step 1.3: construct a spatial relation-aware attention block, taking the C-dimensional feature vector at each spatial position as a feature node and using these nodes to learn a spatial attention map of size H × W:

Step 1.3.1: scan the spatial positions and represent the N feature nodes as $x_i \in \mathbb{R}^C$, where $i = 1, \ldots, N$;

Step 1.3.2: compute the pairwise relation $r_{i,j}$ between node i and node j:

where ReLU(·) denotes the rectified linear unit and $s_1$ is a predefined positive integer that controls the dimension reduction ratio. The pairwise relation from node j to node i, $r_{j,i}$, is computed in the same way. The pair $(r_{i,j}, r_{j,i})$ describes the bidirectional relation between $x_i$ and $x_j$, and the affinity matrix $R_s \in \mathbb{R}^{N \times N}$ represents the pairwise relations between all nodes;

Step 1.3.3: compute the relation vector of the i-th feature node, $r_i = [R_s(i,:), R_s(:,i)] \in \mathbb{R}^{2N}$, where $i = 1, 2, \ldots, N$;

Step 1.3.4: compute the spatial relation-aware feature $y_i$:

$y_i = [\mathrm{pool}_C(\mathrm{ReLU}(W_\psi r_i)), \mathrm{ReLU}(W_\varphi r_i)]$

where $\mathrm{pool}_C(\cdot)$ denotes global average pooling along the channel dimension. In this way, the global structural information and the local original information related to the feature are both fully used;

Step 1.3.5: compute the spatial attention value $a_i$ of the i-th feature node:

$a_i = \mathrm{Sigmoid}(W_2\,\mathrm{ReLU}(W_1 y_i))$

where the weights $W_1$ and $W_2$ are implemented by 1 × 1 convolutions with batch normalization, and Sigmoid(·) denotes the sigmoid activation function;

Step 1.3.6: compute the spatial attention values of all feature nodes according to step 1.3.5 to obtain the spatial attention matrix $A = [a_1, a_2, \ldots, a_N]$, and update the target intermediate feature tensor with $X = X \times A$.
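The data flow of the spatial block (steps 1.3.1 to 1.3.6) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented implementation: the formula for the pairwise relation is not reproduced in the text, so the sketch assumes a dot product of two ReLU embeddings; the matrices `w_theta` and `w_phi` and all function names are hypothetical; and the attention values are taken as given rather than learned.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def affinity_matrix(nodes: List[Vector], w_theta: Matrix, w_phi: Matrix) -> Matrix:
    """ASSUMED relation: R_s[i][j] = <ReLU(w_theta x_i), ReLU(w_phi x_j)>."""
    theta = [relu(matvec(w_theta, x)) for x in nodes]
    phi = [relu(matvec(w_phi, x)) for x in nodes]
    return [[sum(a * b for a, b in zip(t, p)) for p in phi] for t in theta]


def relation_vector(r_s: Matrix, i: int) -> Vector:
    """r_i = [R_s(i, :), R_s(:, i)]: row i followed by column i (length 2N)."""
    return list(r_s[i]) + [row[i] for row in r_s]


def apply_spatial_attention(x: List[Matrix], a: Vector) -> List[Matrix]:
    """X = X × A for a C×H×W tensor; a holds H*W values in row-major order."""
    width = len(x[0][0])
    return [[[v * a[r * width + c] for c, v in enumerate(row)]
             for r, row in enumerate(channel)]
            for channel in x]
```

Because the affinity matrix is generally not symmetric, row i and column i carry different information, which is why the relation vector stacks both.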

Step 1.4: construct a channel relation-aware attention block, taking the feature map of dimension d = H × W at each channel as a feature node and using these nodes to learn a C-dimensional channel attention vector;

Step 1.4.1: scan the channel positions and represent the feature nodes as $z_i \in \mathbb{R}^d$, where $i = 1, \ldots, C$;

Step 1.4.2: compute the channel attention matrix B following steps 1.3.2 to 1.3.6, and update the target intermediate feature tensor with $X = X \times B$.
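The channel block differs from the spatial block mainly in how the nodes are formed and how the attention is applied. The following sketch shows only those two operations for a C×H×W tensor stored as nested lists; the function names are hypothetical, and the attention values in `b` are assumed to have been computed already.

```python
from typing import List

Matrix = List[List[float]]


def channel_nodes(x: List[Matrix]) -> List[List[float]]:
    """Flatten each of the C channel maps (H×W) into a node z_i of length d = H*W."""
    return [[v for row in channel for v in row] for channel in x]


def apply_channel_attention(x: List[Matrix], b: List[float]) -> List[Matrix]:
    """X = X × B: scale the whole i-th channel map by the attention value b[i]."""
    return [[[v * b_i for v in row] for row in channel]
            for channel, b_i in zip(x, b)]
```

With the nodes in this form, the relation and attention computations of the spatial block can be reused with C nodes of dimension d in place of N nodes of dimension C.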

Step 1.5: repeat steps 1.2 to 1.4 until all four residual blocks of the residual network have been processed, obtaining a feature map of size H × W × C.

Step 2: compute the feature maps of the reference pedestrian image and of each test pedestrian image according to step 1. Let $j = 1, 2, \ldots, M$ index the current test pedestrian image, where M is the total number of images in the test pedestrian image set.

Step 3: compute the global features. Global features are extracted by applying global pooling directly to the feature map; the global features of the reference pedestrian image and of the j-th test pedestrian image are denoted $Q$ and $P_j$, respectively.

Step 4: compute the local features. Global pooling in the horizontal direction is applied to extract one local feature per row. The local features of the reference pedestrian image and of the j-th test pedestrian image are each expressed as a set of H row features, written $q = \{q_1, q_2, \ldots, q_H\}$ for the reference image, where H is the number of local features.
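The two pooling operations in steps 3 and 4 can be illustrated on a C×H×W feature map stored as nested lists. The text does not say whether the pooling is average or max pooling, so this sketch assumes average pooling; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def global_feature(x: List[Matrix]) -> List[float]:
    """Average each channel over all H×W positions, giving one C-dimensional vector."""
    return [sum(v for row in channel for v in row) / (len(channel) * len(channel[0]))
            for channel in x]


def local_features(x: List[Matrix]) -> List[List[float]]:
    """Average each row over its W positions, giving H vectors of length C."""
    height = len(x[0])
    return [[sum(channel[r]) / len(channel[r]) for channel in x]
            for r in range(height)]
```

Each local feature summarizes one horizontal stripe of the pedestrian image, so the H local features run from the top of the image to the bottom.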

Step 5: compute the global distance $d(Q, P_j)$ between the reference image and the test image:

where K is the dimension of the feature vectors.

Step 6: compute the local distance between the reference image and the test image;

Step 7: compute the final distance between the reference image and the test image:

$d_{\mathrm{final}}(j) = d(Q, P_j) + S_{H,H}$

Step 8: repeat steps 3 to 7 until all images in the test image set have been traversed.

Step 9: complete image alignment using the minimum distance method.
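Steps 7 to 9 amount to summing the two distances for every test image and keeping the image with the smallest total. The sketch below assumes that the global distances $d(Q, P_j)$ and the local distances $S_{H,H}$ have already been computed for all M test images; the function name `best_match` is hypothetical, and the returned index is 0-based.

```python
from typing import List, Tuple


def best_match(global_dists: List[float], local_dists: List[float]) -> Tuple[int, float]:
    """Return (j, d_final(j)) for the test image with the smallest final distance.

    d_final(j) = global_dists[j] + local_dists[j], with j counted from 0.
    """
    if not global_dists or len(global_dists) != len(local_dists):
        raise ValueError("need one global and one local distance per test image")
    finals = [g + s for g, s in zip(global_dists, local_dists)]
    j = min(range(len(finals)), key=finals.__getitem__)
    return j, finals[j]
```

For example, `best_match([1.0, 0.5, 0.75], [0.5, 0.25, 0.125])` selects the second test image (index 1), whose final distance is 0.75.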

The invention also includes such features:

2. Step 6 specifically includes:

Step 6.1: compute the distance between two local features, $q_m$ of the reference image and the n-th local feature of the test image:

where $m, n \in \{1, 2, 3, \ldots, H\}$, e is Euler's number, and $\|\cdot\|_2$ is the $\ell_2$ norm;

Step 6.2: compute the shortest local-feature distance:

where $S_{m,n}$ is the total distance of the shortest path from (1, 1) to (m, n), and the local distance between the two pedestrian images is $S_{H,H}$.
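The shortest-path computation in step 6 can be sketched with dynamic programming. Two points are assumptions, because the formulas are not reproduced in the text: the element-wise distance is taken to be the normalized form $(e^{t} - 1)/(e^{t} + 1)$ with $t = \|q_m - p_n\|_2$, which is consistent with the mention of Euler's number and the $\ell_2$ norm, and the path is assumed to move only one step down or one step right at a time. The name `p` for the test image's local features and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def element_distance(q_m: Vector, p_n: Vector) -> float:
    """ASSUMED form (e^t - 1) / (e^t + 1), t = ||q_m - p_n||_2, mapped into [0, 1).

    tanh(t / 2) is the same function and does not overflow for large t.
    """
    t = math.sqrt(sum((a - b) ** 2 for a, b in zip(q_m, p_n)))
    return math.tanh(t / 2.0)


def shortest_path_distance(d: List[List[float]]) -> float:
    """Total cost of the cheapest path from the top-left to the bottom-right
    cell of d, moving only down or right (ASSUMED path rule)."""
    rows, cols = len(d), len(d[0])
    s = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            if m == 0 and n == 0:
                best = 0.0
            elif m == 0:
                best = s[m][n - 1]
            elif n == 0:
                best = s[m - 1][n]
            else:
                best = min(s[m - 1][n], s[m][n - 1])
            s[m][n] = best + d[m][n]
    return s[rows - 1][cols - 1]


def local_distance(q: List[Vector], p: List[Vector]) -> float:
    """S_{H,H} for two sets of H local features."""
    d = [[element_distance(q_m, p_n) for p_n in p] for q_m in q]
    return shortest_path_distance(d)
```

Allowing the path to leave the diagonal lets a stripe in one image match a higher or lower stripe in the other, which is how this kind of distance tolerates vertical misalignment.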

Compared with the prior art, the invention has the beneficial effects that:

(1) the alignment precision is high;

(2) the global structure information and the local original information of the pedestrian image features are fully utilized;

(3) the pedestrian target occlusion problem is handled effectively;

(4) the influence of the local feature misalignment problem on the performance of the algorithm is relieved.

Drawings

Fig. 1 is a structural diagram of the attention-based residual network of the present invention.

Fig. 2 is a flow chart of the pedestrian target alignment technique of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

Let I be the original input image containing only 1 pedestrian. The invention provides a pedestrian target alignment technology based on an attention mechanism, which comprises the following specific implementation steps:

Step 1: compute the feature map of the image using the attention-based residual network shown in Fig. 1:

Step 1.2: obtain the target intermediate feature tensor $X \in \mathbb{R}^{C \times H \times W}$ from the first residual block of the residual network, where $\mathbb{R}$ is the set of real numbers and W, H, and C are the width, height, and number of channels of the feature map, respectively.

Step 1.3.1: scan the spatial positions and represent the N feature nodes as $x_i \in \mathbb{R}^C$, where $i = 1, \ldots, N$.

Step 1.3.2: compute the pairwise relation $r_{i,j}$ between node i and node j according to equation (1):

where ReLU(·) denotes the rectified linear unit and $s_1$ is a predefined positive integer that controls the dimension reduction ratio. The pairwise relation from node j to node i, $r_{j,i}$, is computed in the same way. The pair $(r_{i,j}, r_{j,i})$ describes the bidirectional relation between $x_i$ and $x_j$, and the affinity matrix $R_s \in \mathbb{R}^{N \times N}$ represents the pairwise relations between all nodes.

Step 1.3.3: compute the relation vector of the i-th feature node, $r_i = [R_s(i,:), R_s(:,i)] \in \mathbb{R}^{2N}$, where $i = 1, 2, \ldots, N$.

Step 1.3.4: compute the spatial relation-aware feature $y_i$ according to equation (2):

$y_i = [\mathrm{pool}_C(\mathrm{ReLU}(W_\psi r_i)), \mathrm{ReLU}(W_\varphi r_i)]$ (2)

where $\mathrm{pool}_C(\cdot)$ denotes global average pooling along the channel dimension. This makes full use of the global structural information and the local original information associated with the feature.

Step 1.3.5: compute the spatial attention value $a_i$ of the i-th feature node according to equation (3):

$a_i = \mathrm{Sigmoid}(W_2\,\mathrm{ReLU}(W_1 y_i))$ (3)

where the weights $W_1$ and $W_2$ are implemented by 1 × 1 convolutions with batch normalization, and Sigmoid(·) denotes the sigmoid activation function.
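Equations (2) and (3) can be traced for a single node with small hand-written matrices. The sketch below is illustrative only: a 1 × 1 convolution acts on each node as a linear map, so plain matrices stand in for the convolutions; batch normalization is omitted; `w2` is assumed to be a single row so that the attention value is a scalar; and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def matvec(w: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def relation_feature(r_i: Vector, w_psi: Matrix, w_phi: Matrix) -> Vector:
    """Equation (2): y_i = [pool_C(ReLU(W_psi r_i)), ReLU(W_phi r_i)]."""
    embedded = relu(matvec(w_psi, r_i))
    pooled = sum(embedded) / len(embedded)  # average over the channel dimension
    return [pooled] + relu(matvec(w_phi, r_i))


def attention_value(y_i: Vector, w1: Matrix, w2: Vector) -> float:
    """Equation (3): a_i = Sigmoid(W_2 ReLU(W_1 y_i)), batch normalization omitted."""
    hidden = relu(matvec(w1, y_i))
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)))
```

The sigmoid keeps every attention value strictly between 0 and 1, so multiplying the feature tensor by the attention map rescales features without changing their sign.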

Step 1.4: construct a channel relation-aware attention block, taking the feature map of dimension d = H × W at each channel as a feature node and using these nodes to learn a C-dimensional channel attention vector:

Step 1.4.1: scan the channel positions and represent the feature nodes as $z_i \in \mathbb{R}^d$, where $i = 1, \ldots, C$.

Step 1.5: repeat steps 1.2 to 1.4 until all four residual blocks of the residual network have been processed, obtaining a feature map of size H × W × C.

Where H represents the number of local features.

Step 5: compute the global distance between the reference image and the test image according to equation (5):

where K represents the dimension of the vector.
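Equation (5) is not reproduced in the text. As a hedged illustration only, the sketch below assumes that the global distance is the Euclidean distance over the K components of the two global feature vectors; the function name `global_distance` is hypothetical.

```python
import math
from typing import List


def global_distance(q: List[float], p_j: List[float]) -> float:
    """ASSUMED form of d(Q, P_j): Euclidean distance over K components."""
    if len(q) != len(p_j):
        raise ValueError("both global features must have the same dimension K")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(q, p_j)))
```

Any other metric defined over K-dimensional vectors could be substituted here without changing the later steps, since they only consume the resulting scalar.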

Step 6: compute the local distance between the reference image and the test image:

Step 6.1: compute the distance between two local features, $q_m$ of the reference image and the n-th local feature of the test image:

where $m, n \in \{1, 2, 3, \ldots, H\}$, e is Euler's number, and $\|\cdot\|_2$ is the $\ell_2$ norm.

Step 6.2: compute the shortest local-feature distance according to equation (6).

Step 7: compute the final distance between the reference image and the test image using equation (7):

$d_{\mathrm{final}}(j) = d(Q, P_j) + S_{H,H}$ (7)

Step 9: complete image alignment using the minimum distance method.

Claims

1. A pedestrian target alignment method based on an attention mechanism, characterized by comprising the following steps:

Step 1.3.1: scan the spatial positions and represent the N feature nodes as $x_i \in \mathbb{R}^C$, where $i = 1, \ldots, N$;

where ReLU(·) denotes the rectified linear unit,

Step 1.3.4: compute the spatial relation-aware feature $y_i$:

$y_i = [\mathrm{pool}_C(\mathrm{ReLU}(W_\psi r_i)), \mathrm{ReLU}(W_\varphi r_i)]$

$a_i = \mathrm{Sigmoid}(W_2\,\mathrm{ReLU}(W_1 y_i))$

where the weights $W_1$ and $W_2$ are implemented by 1 × 1 convolutions with batch normalization, and Sigmoid(·) denotes the sigmoid activation function;

Step 1.3.6: compute the spatial attention values of all feature nodes according to step 1.3.5 to obtain the spatial attention matrix $A = [a_1, a_2, \ldots, a_N]$, and update the target intermediate feature tensor with $X = X \times A$.

Where H represents the number of local features.

where K represents the dimension of the vector.

$d_{\mathrm{final}}(j) = d(Q, P_j) + S_{H,H}$

Step 9: complete image alignment using the minimum distance method.

2. The pedestrian target alignment method based on an attention mechanism as claimed in claim 1, wherein step 6 specifically includes:

Step 6.1: compute the distance between two local features, $q_m$ of the reference image and the n-th local feature of the test image:

Step 6.2: compute the shortest local-feature distance: