CN113255598A - Pedestrian re-identification method based on Transformer - Google Patents

Pedestrian re-identification method based on Transformer

Info

Publication number
CN113255598A
CN113255598A (application number CN202110723088.XA; granted publication CN113255598B)
Authority
CN
China
Prior art keywords
pedestrian
image
human body
model
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110723088.XA
Other languages
Chinese (zh)
Other versions
CN113255598B (en)
Inventor
王乾宇
周金明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Inspector Intelligent Technology Co Ltd
Original Assignee
Nanjing Inspector Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Inspector Intelligent Technology Co Ltd filed Critical Nanjing Inspector Intelligent Technology Co Ltd
Priority to CN202110723088.XA priority Critical patent/CN113255598B/en
Publication of CN113255598A publication Critical patent/CN113255598A/en
Application granted granted Critical
Publication of CN113255598B publication Critical patent/CN113255598B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/103Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a Transformer-based pedestrian re-identification method, which comprises the following steps. Step 1, constructing a bottom library, where the bottom library stores pictures containing the whole body of each target person: a YOLOv3 model detects the pedestrian frame R of a pedestrian in a picture, the image corresponding to R is cropped from the original picture, an HRNet model detects the human body key points, image blocks centered on the key points are cropped, and the input vectors Zi of all key points are fed into a Transformer network to obtain the pedestrian's features; the features and the corresponding target-person information are stored in the bottom library. Step 2, extracting the features of the pedestrian to be detected by the method of step 1. Step 3, performing pedestrian re-identification matching. Because the images fed to the Transformer for feature extraction are cropped around human body key points, the accuracy of pedestrian re-identification is improved.

Description

Pedestrian re-identification method based on Transformer
Technical Field
The invention relates to the field of image recognition, in particular to pedestrian re-identification, and specifically to a Transformer-based pedestrian re-identification method.
Background
Pedestrian re-identification is a computer-vision technique for judging whether two pedestrian images belong to the same person, and is widely used to search for specific people in surveillance scenes. Most existing re-identification methods use a convolutional neural network to extract pedestrian features, but features extracted by a convolutional neural network lose detailed information through convolution and downsampling, which imposes certain limitations.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a pedestrian re-identification method based on a Transformer, which can improve the accuracy of pedestrian re-identification. The technical scheme is as follows:
step 1, constructing a bottom library;
the bottom library stores pictures containing the whole body of each target person;
detecting the pedestrian frame R = (x, y, w, h) of a pedestrian in the picture using a first model, where (x, y) is the upper-left corner point of the pedestrian frame, w is the width of the pedestrian frame, and h is the height of the pedestrian frame;
intercepting pedestrian frame in original pictureRDetecting a human body key point P = (P) by using a second model according to the corresponding image1,p2,…,pk) K is the number of key points of the human body;
for each human body key point piAdding the coordinates of the upper left corner of the pedestrian frame R to restore the position of the original image to obtain the coordinates Q of the key points of the human body in the original imageiUsing the coordinates Q of each original image human body key pointiTaking the key point as the center, intercepting the image block corresponding to the key point, expanding outwards by taking the key point as the center in an intercepting mode to obtain a rectangular or circular image block, flattening the intercepted image block into an image vector Ii
adding a position code in front of each image vector, where the position code comprises the coordinates of the human body key point and a fixed random code, i.e. Ji = (xpi, ypi, ri), where (xpi, ypi) are the coordinates Qi of the key point pi and ri is a random code satisfying the following conditions: (1) ri is an n-dimensional random vector drawn from an n-dimensional Gaussian distribution; (2) the modular length (norm) of ri is fixed; (3) the random codes corresponding to any two key points are different;
finally, connecting the position code and the image vector to obtain an input vector Zi=concat(Ji,Ii);
feeding the input vectors Zi of all key points into a Transformer network to obtain the pedestrian feature f, where f = T(Z1, Z2, …, Zk); finally obtaining the features F = (f1, f2, …, fm) of all target persons, and storing the features F and the corresponding target-person information in the bottom library.
Step 2, extracting the features of the pedestrian to be detected;
acquiring a pedestrian image to be detected;
detecting the positions and sizes of all pedestrians in the pedestrian image using the first model, obtaining pedestrian frames Ri = (xi, yi, wi, hi), where (xi, yi) is the upper-left corner point of the pedestrian frame, wi is its width and hi is its height; the aspect ratio and resolution of the pedestrian image are adjusted to meet the same requirements as the first model in step 1.
For each detected pedestrian frame, cropping the image corresponding to the frame Ri from the original pedestrian image and detecting the human body key points Pi = (pi,1, pi,2, …, pi,k) using the second model, where k is the number of human body key points and j ∈ {1, 2, …, k}; the resolution of the cropped image is adjusted to meet the same requirement as the second model in step 1.
For each human body key point pi,j, adding the upper-left-corner coordinates of the pedestrian frame Ri to restore its position in the original pedestrian image, obtaining the coordinates Qi,j of the key point in the original image; taking each Qi,j as the center, cropping the image block corresponding to the key point by expanding outward from the key point, obtaining a rectangular or circular image block, and flattening the cropped image block into an image vector Ii,j; the cropping is performed in the same way as in step 1.
Adding a position code in front of each image vector, where the position code comprises the coordinates of the human body key point and a fixed random code, i.e. Ji,j = (xpi,j, ypi,j, ri,j), where (xpi,j, ypi,j) are the coordinates Qi,j of the key point pi,j and ri,j is a random code satisfying the following conditions: (1) ri,j is an n-dimensional random vector drawn from an n-dimensional Gaussian distribution; (2) the modular length (norm) of ri,j is fixed; (3) the random codes corresponding to any two key points are different.
Finally, connecting the position code and the image vector to obtain an input vector Zi,j=concat(Ji,j,Ii,j);
Feeding the input vectors Zi,1, Zi,2, …, Zi,k of all key points into the Transformer network to obtain the pedestrian feature gi = T(Zi,1, Zi,2, …, Zi,k).
Step 3, carrying out pedestrian re-identification matching;
comparing the pedestrian feature gi with the features in the bottom library to determine whether the detected pedestrian is a target person.
Preferably, the first model in step 1 is a YOLOv3 model.
Preferably, the HRNet model is used as the second model in step 1.
Preferably, when the first model is used to detect the pedestrian frame in step 1, if the aspect ratio or the resolution of the picture does not meet the requirement of the first model, black borders are added to the picture, so that the aspect ratio of the picture is adjusted to the aspect ratio corresponding to the first model, and the picture is scaled to the resolution required by the first model.
Preferably, when the second model is used to detect the key points of the human body, if the resolution of the captured image does not meet the requirement of the second model, the image is scaled to the resolution required by the second model.
Preferably, the human key points correspond to the right ankle, the right knee, the right hip, the left knee, the left ankle, the right wrist, the right elbow, the right shoulder, the left elbow, the left wrist, the neck, the head center and the body center of the human body.
Preferably, the cropping in step 1 expands outward from the key point as the center, specifically: taking the key point as the center, cropping a rectangle whose side length is 1/3 of the width of the pedestrian frame, or a circle whose radius is 1/6 of the width of the pedestrian frame.
Preferably, step 3 specifically comprises: calculating the similarity s between the pedestrian feature gi and each feature in the bottom library; when the maximum similarity is greater than the similarity threshold, the matching succeeds, i.e. the detected pedestrian is a target person in the bottom library; otherwise, the matching fails.
Preferably, in step 3 the similarity s between the pedestrian feature gi and each feature in the bottom library is calculated as follows: for each feature fa in the bottom library, a ∈ {1, 2, …, m}, calculate the similarity s between gi and fa:
(the formula for the similarity s is given as an image in the original document)
the maximum similarity and the corresponding target person information t are as follows:
smax = max{s1, s2, …, sm}, with t the target-person information corresponding to the feature achieving this maximum (formula given as an image in the original document).
setting a similarity threshold sthreshold: if smax > sthreshold, the matching succeeds and the target-person information corresponding to t is returned; otherwise, the matching fails.
Compared with the prior art, this technical scheme has the following beneficial effects: by cropping the image around the human body key points and feeding the crops to the Transformer for feature extraction, each key-point region sent into the network is kept complete, so the network can better extract the pedestrian's features and the accuracy of pedestrian re-identification improves. Meanwhile, because the images sent to the network are cropped around key points, the influence of the background is also removed.
Detailed Description
To clarify the technical solution and working principle of the present invention, the embodiments of the present disclosure are described in further detail below. All of the optional technical solutions above may be combined arbitrarily to form optional embodiments of the present disclosure, which are not repeated here.
The terms "step 1", "step 2", "step 3", and the like in the description and claims of this application are used to distinguish similar elements and do not necessarily describe a particular sequential or chronological order. It should be understood that steps so labeled may be interchanged under appropriate circumstances, so that the embodiments described herein can be practiced in orders other than the one described.
The embodiment of the disclosure provides a pedestrian re-identification method based on a Transformer, which mainly comprises the following steps:
step 1, constructing a bottom library;
the bottom library stores pictures containing the whole body of each target person;
detecting the pedestrian frame R = (x, y, w, h) of a pedestrian in the picture using a first model (a YOLOv3 model), where (x, y) is the upper-left corner point of the pedestrian frame, w is the width of the pedestrian frame, and h is the height of the pedestrian frame.
Preferably, when the first model (YOLOv3) is used to detect the pedestrian frame, if the aspect ratio or resolution of the picture does not meet the model's requirements, black borders are added so that the aspect ratio matches the one expected by the model (for example 16:9), and the picture is then scaled to the required resolution (for example 512 × 288).
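The padding-then-scaling step above can be sketched as follows. This is a minimal illustration of the geometry only, under the example values from the text (16:9, 512 × 288); the function name and return layout are the author's own, not part of the patent.

```python
# Sketch of the letterboxing described above: pad a picture with black
# borders up to the detector's aspect ratio, then scale uniformly to its
# input resolution. Returns the padded size, the padding offsets, and the
# scale factor; the actual pixel copy/resize is left to an image library.

def letterbox_geometry(src_w, src_h, dst_w=512, dst_h=288):
    """Return (padded_w, padded_h, pad_x, pad_y, scale) for letterboxing."""
    dst_ratio = dst_w / dst_h
    src_ratio = src_w / src_h
    if src_ratio < dst_ratio:
        # picture is too tall: pad left/right until the ratio matches
        padded_w, padded_h = round(src_h * dst_ratio), src_h
    else:
        # picture is too wide (or already matches): pad top/bottom
        padded_w, padded_h = src_w, round(src_w / dst_ratio)
    pad_x = (padded_w - src_w) // 2
    pad_y = (padded_h - src_h) // 2
    scale = dst_w / padded_w  # uniform scale down to detector resolution
    return padded_w, padded_h, pad_x, pad_y, scale
```

For a portrait 1080 × 1920 frame this pads the sides to 3413 × 1920 before scaling, while a 1920 × 1080 frame needs no padding at all.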
Cropping the image corresponding to the pedestrian frame R from the original picture, and detecting the human body key points P = (p1, p2, …, pk) using a second model (an HRNet model), where k is the number of human body key points.
Preferably, when the second model (HRNet) is used to detect the human body key points, if the resolution of the cropped image does not meet the model's requirement, the image is scaled to the required resolution (for example 128 × 256).
Preferably, the human key points correspond to the right ankle, the right knee, the right hip, the left knee, the left ankle, the right wrist, the right elbow, the right shoulder, the left elbow, the left wrist, the neck, the head center and the body center of the human body.
For each human body key point pi, adding the upper-left-corner coordinates of the pedestrian frame R to restore its position in the original picture, obtaining the coordinates Qi of the key point in the original picture; taking each Qi as the center, cropping the image block corresponding to the key point by expanding outward from the key point, obtaining a rectangular or circular image block, and flattening the cropped image block into an image vector Ii.
Preferably, the cropping expands outward from the key point as the center, specifically: taking the key point as the center, cropping a rectangle whose side length is 1/3 of the width of the pedestrian frame, or a circle whose radius is 1/6 of the width of the pedestrian frame.
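The rectangular variant of this crop-and-flatten step can be sketched as below. The image is modeled as a nested list of pixel values purely for illustration, and zero-padding at the borders is an assumption the patent does not specify; the helper names are the author's own.

```python
# Sketch of the key-point patch extraction described above: around each
# key point, crop a square whose side is one third of the pedestrian-frame
# width, then flatten the patch into the image vector Ii.

def crop_patch(image, cx, cy, box_w):
    """Crop a square patch of side box_w // 3 centered on (cx, cy)."""
    side = max(1, box_w // 3)
    half = side // 2
    h, w = len(image), len(image[0])
    patch = []
    for y in range(cy - half, cy - half + side):
        row = []
        for x in range(cx - half, cx - half + side):
            # zero-pad when the patch sticks out of the image (assumption)
            if 0 <= y < h and 0 <= x < w:
                row.append(image[y][x])
            else:
                row.append(0)
        patch.append(row)
    return patch

def flatten_patch(patch):
    """Flatten a 2-D patch into a 1-D image vector."""
    return [v for row in patch for v in row]
```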
Existing methods directly divide the pedestrian image into fixed-size blocks along a grid, which can split a pedestrian's characteristic parts, for example cutting the head image into a left half and a right half. In contrast, the image blocks corresponding to the key points cover the human joints, the head, and the body: the joint regions reflect the posture of the human body and, to a certain extent, carry clothing-related information; the head region contains face-related information; and the body region helps identify clothing. Meanwhile, because the images sent to the network are cropped around key points, the influence of the background is also removed.
Adding a position code in front of each image vector, where the position code comprises the coordinates of the human body key point and a fixed random code, i.e. Ji = (xpi, ypi, ri), where (xpi, ypi) are the coordinates Qi of the key point pi and ri is a random code satisfying the following conditions: (1) ri is an n-dimensional random vector drawn from an n-dimensional Gaussian distribution (for example n = 8); (2) the modular length (norm) of ri is fixed (for example ||ri|| = 1); (3) the random codes corresponding to any two key points are different.
Finally, concatenating the position code and the image vector gives the input vector Zi = concat(Ji, Ii). Adding the key-point positions to the position code expresses the relative spatial positions of the key points and helps the model learn human posture information; adding a fixed random code distinguishes different key points semantically. Using random vectors of fixed norm expresses that, for pedestrian re-identification, the key points are unordered and of equal importance.
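The construction of these fixed random codes and of Zi can be sketched as follows, using the example values n = 8 and ||ri|| = 1 from the text. The fixed seed, function names, and the choice of normalizing Gaussian draws to unit length are the author's illustrative assumptions.

```python
# Sketch of the fixed random position codes described above: one
# n-dimensional Gaussian vector per key point, normalized to unit norm so
# every code has the same modular length, then prepended together with the
# key-point coordinates to the flattened image vector.
import math
import random

def make_random_codes(num_keypoints, n=8, seed=0):
    """Generate one fixed, unit-norm random code ri per key point."""
    rng = random.Random(seed)  # fixed seed: codes stay constant across runs
    codes = []
    for _ in range(num_keypoints):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        codes.append([x / norm for x in v])  # fixed modular length ||ri|| = 1
    return codes

def make_input_vector(keypoint_xy, code, image_vector):
    """Zi = concat(Ji, Ii) with Ji = (x, y, ri)."""
    x, y = keypoint_xy
    return [x, y] + code + image_vector
```

With 13 key points (the preferred set listed above), this yields 13 distinct unit-norm codes, and each Zi has length 2 + n + len(Ii).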
Feeding the input vectors Zi of all key points into a Transformer network gives the pedestrian feature f, where f = T(Z1, Z2, …, Zk); finally the features F = (f1, f2, …, fm) of all target persons are obtained, and the features F and the corresponding target-person information are stored in the bottom library. The Transformer is used for feature extraction because its encoder strengthens the extraction of local features while its decoder learns the association between local and global features, so detailed information is retained as much as possible and recognition accuracy improves.
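The core operation by which the Transformer T mixes the key-point tokens can be illustrated with one scaled-dot-product self-attention layer in plain Python. This is only a sketch of the mechanism, not the patent's network: the real model has learned projections, multiple heads, and many layers, and all dimensions and names here are illustrative.

```python
# Minimal self-attention over the key-point tokens Z1..Zk, followed by
# average pooling into a single pedestrian feature vector.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """tokens: list of k equal-length vectors (the Zi). Returns mixed tokens."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # attention weights of this token against every token (incl. itself)
        scores = [sum(qa * ka for qa, ka in zip(q, kvec)) / math.sqrt(d)
                  for kvec in tokens]
        w = softmax(scores)
        # attended representation = weighted sum over all tokens
        out.append([sum(wi * t[j] for wi, t in zip(w, tokens))
                    for j in range(d)])
    return out

def pool_feature(tokens):
    """Average the attended tokens into one pedestrian feature vector."""
    mixed = self_attention(tokens)
    k, d = len(mixed), len(mixed[0])
    return [sum(t[j] for t in mixed) / k for j in range(d)]
```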
Step 2, extracting the pedestrian features to be detected,
acquiring a pedestrian image to be detected;
the first model (YOLOv 3 model) is used for detecting the positions and sizes of all pedestrians in the pedestrian image to obtain a plurality of pedestrian frames Ri=(xi,yi,wi,hi) Whereinxi,yi As a point at the upper left corner of the pedestrian frame, wiWidth of pedestrian frame, hiIs the height of the pedestrian frame; the aspect ratio or resolution of the pedestrian image is adjusted to be the same as the requirement of the first model (YOLOv 3 model) in step 1.
For each detected pedestrian frame, the image corresponding to the frame Ri is cropped from the original pedestrian image and the human body key points Pi = (pi,1, pi,2, …, pi,k) are detected using the second model (HRNet), where k is the number of human body key points and j ∈ {1, 2, …, k}; the resolution of the cropped image is adjusted to meet the same requirement as the second model in step 1.
For each human body key point pi,j, the upper-left-corner coordinates of the pedestrian frame Ri are added to restore its position in the original pedestrian image, giving the coordinates Qi,j of the key point in the original image; taking each Qi,j as the center, the image block corresponding to the key point is cropped by expanding outward from the key point, giving a rectangular or circular image block, which is flattened into an image vector Ii,j; the cropping is performed in the same way as in step 1.
A position code is added in front of each image vector; it comprises the coordinates of the human body key point and a fixed random code, i.e. Ji,j = (xpi,j, ypi,j, ri,j), where (xpi,j, ypi,j) are the coordinates Qi,j of the key point pi,j and ri,j is a random code satisfying the following conditions: (1) ri,j is an n-dimensional random vector drawn from an n-dimensional Gaussian distribution (for example n = 8); (2) the modular length (norm) of ri,j is fixed (for example ||ri,j|| = 1); (3) the random codes corresponding to any two key points are different.
Finally, connecting the position code and the image vector to obtain an input vector Zi,j=concat(Ji,j,Ii,j);
The input vectors Zi,1, Zi,2, …, Zi,k of all key points are fed into the Transformer network to obtain the pedestrian feature gi = T(Zi,1, Zi,2, …, Zi,k).
Step 3, carrying out pedestrian re-identification matching;
the pedestrian feature gi is compared with the features in the bottom library to determine whether the detected pedestrian is a target person: the similarity s between gi and each feature in the bottom library is calculated, and when the maximum similarity is greater than the similarity threshold the matching succeeds, i.e. the detected pedestrian is a target person in the bottom library; otherwise, the matching fails.
Preferably, in step 3 the similarity s between the pedestrian feature gi and each feature in the bottom library is calculated as follows: for each feature fa in the bottom library, a ∈ {1, 2, …, m}, calculate the similarity s between gi and fa:
(the formula for the similarity s is given as an image in the original document)
the maximum similarity and the corresponding target person information t are as follows:
smax = max{s1, s2, …, sm}, with t the target-person information corresponding to the feature achieving this maximum (formula given as an image in the original document).
A similarity threshold sthreshold is set: if smax > sthreshold, the matching succeeds and the target-person information corresponding to t is returned; otherwise, the matching fails.
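The matching step can be sketched as below. The patent shows the similarity formula only as an image, so cosine similarity is assumed here as a common choice, and the threshold value 0.6 is purely illustrative; the function names are the author's own.

```python
# Sketch of the step-3 matching: compute the similarity between the query
# feature gi and every bottom-library feature, take the maximum, and compare
# it against the threshold.
import math

def cosine_similarity(g, f):
    """Assumed similarity measure (the original gives the formula as an image)."""
    dot = sum(a * b for a, b in zip(g, f))
    ng = math.sqrt(sum(a * a for a in g))
    nf = math.sqrt(sum(b * b for b in f))
    return dot / (ng * nf)

def match(gi, gallery, threshold=0.6):
    """gallery: list of (feature fa, person info). Returns info or None."""
    s_max, t = -1.0, None
    for fa, info in gallery:
        s = cosine_similarity(gi, fa)
        if s > s_max:
            s_max, t = s, info
    if s_max > threshold:
        return t   # matching succeeded: the pedestrian is target person t
    return None    # matching failed
```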
The invention has been described above by way of example. Obviously, the specific implementation of the invention is not limited to the manner described above: various insubstantial modifications made using the method concepts and technical solutions of the invention, or direct applications of these concepts and solutions to other occasions without improvement, all fall within the protection scope of the invention.

Claims (9)

1. A pedestrian re-identification method based on a Transformer is characterized by comprising the following steps:
step 1, constructing a bottom library;
the bottom library stores pictures containing the whole body of each target person;
detecting a pedestrian frame R = (x, y, w, h) of a pedestrian in the picture by using a first model, wherein (x, y) is a point at the upper left corner of the pedestrian frame, w is the width of the pedestrian frame, and h is the height of the pedestrian frame;
intercepting an image corresponding to the pedestrian frame R in the original image, and detecting the human body key points P = (p1, p2, …, pk) using a second model, where k is the number of human body key points;
for each human body key point pi, adding the upper-left-corner coordinates of the pedestrian frame R to restore its position in the original image, obtaining the coordinates Qi of the key point in the original image; taking each Qi as the center, intercepting the image block corresponding to the key point by expanding outward from the key point, obtaining a rectangular or circular image block, and flattening the intercepted image block into an image vector Ii;
adding a position code in front of each image vector, where the position code comprises the coordinates of the human body key point and a fixed random code, i.e. Ji = (xpi, ypi, ri), where (xpi, ypi) are the coordinates Qi of the key point pi and ri is a random code satisfying the following conditions: (1) ri is an n-dimensional random vector drawn from an n-dimensional Gaussian distribution; (2) the modular length (norm) of ri is fixed; (3) the random codes corresponding to any two key points are different;
finally, connecting the position code and the image vector to obtain an input vector Zi=concat(Ji,Ii);
feeding the input vectors Zi of all key points into a Transformer network to obtain the pedestrian feature f, where f = T(Z1, Z2, …, Zk); finally obtaining the features F = (f1, f2, …, fm) of all target persons, and storing the features F and the corresponding target-person information in the bottom library;
step 2, extracting the pedestrian features to be detected,
acquiring a pedestrian image to be detected;
detecting the positions and sizes of all pedestrians in the pedestrian image using the first model, obtaining pedestrian frames Ri = (xi, yi, wi, hi), where (xi, yi) is the upper-left corner point of the pedestrian frame, wi is its width and hi is its height; adjusting the aspect ratio and resolution of the pedestrian image to meet the same requirements as the first model in step 1;
for each detected pedestrian frame, intercepting the image corresponding to the frame Ri from the original pedestrian image and detecting the human body key points Pi = (pi,1, pi,2, …, pi,k) using the second model, where k is the number of human body key points and j ∈ {1, 2, …, k}; adjusting the resolution of the intercepted image to meet the same requirement as the second model in step 1;
for each human body key point pi,j, adding the upper-left-corner coordinates of the pedestrian frame Ri to restore its position in the original pedestrian image, obtaining the coordinates Qi,j of the key point in the original image; taking each Qi,j as the center, intercepting the image block corresponding to the key point by expanding outward from the key point, obtaining a rectangular or circular image block, and flattening the intercepted image block into an image vector Ii,j; the specific interception is the same as that employed in step 1;
adding a position code in front of each image vector, where the position code comprises the coordinates of the human body key point and a fixed random code, i.e. Ji,j = (xpi,j, ypi,j, ri,j), where (xpi,j, ypi,j) are the coordinates Qi,j of the key point pi,j and ri,j is a random code satisfying the following conditions: (1) ri,j is an n-dimensional random vector drawn from an n-dimensional Gaussian distribution; (2) the modular length (norm) of ri,j is fixed; (3) the random codes corresponding to any two key points are different;
finally, connecting the position code and the image vector to obtain an input vector Zi,j=concat(Ji,j, Ii,j);
feeding the input vectors Zi,1, Zi,2, …, Zi,k of all key points into the Transformer network to obtain the pedestrian feature gi = T(Zi,1, Zi,2, …, Zi,k);
Step 3, carrying out pedestrian re-identification matching;
comparing the pedestrian feature gi with the features in the bottom library to determine whether the detected pedestrian is a target person.
2. The method for pedestrian re-identification based on Transformer as claimed in claim 1, wherein the first model in step 1 is a YOLOv3 model.
3. The method for pedestrian re-identification based on Transformer as claimed in claim 1, wherein the HRNet model is selected as the second model in step 1.
4. The method as claimed in claim 1, wherein when the first model is used to detect the pedestrian frame in step 1, if the aspect ratio or resolution of the picture does not meet the requirement of the first model, black borders are added to the picture, so that the aspect ratio of the picture is adjusted to the aspect ratio corresponding to the first model, and the picture is scaled to the resolution required by the first model.
5. The method as claimed in claim 1, wherein in the step 1, when the second model is used to detect the key points of the human body, if the resolution of the captured image does not meet the requirement of the second model, the image is scaled to the resolution required by the second model.
6. The method for re-identifying pedestrians based on Transformer according to claim 1, wherein the intercepting in step 1 expands outward from the key point as the center, specifically: taking the key point as the center, intercepting a rectangle whose side length is 1/3 of the width of the pedestrian frame, or a circle whose radius is 1/6 of the width of the pedestrian frame.
7. The Transformer-based pedestrian re-identification method according to any one of claims 1-6, wherein the human body key points correspond to the right ankle, right knee, right hip, left knee, left ankle, right wrist, right elbow, right shoulder, left elbow, left wrist, neck, head center, and body center of the human body.
8. The method for pedestrian re-identification based on Transformer according to claim 7, wherein step 3 specifically comprises: calculating the similarity s between the pedestrian feature g_i and each gallery feature in the gallery; when the maximum similarity is greater than the similarity threshold, the matching succeeds, i.e., the detected pedestrian is a target person in the gallery; otherwise, the matching fails.
9. The Transformer-based pedestrian re-identification method according to claim 8, wherein calculating the similarity s between the pedestrian feature g_i and each gallery feature in step 3 specifically comprises: for each gallery feature f_a, a ∈ {1, 2, …, m}, calculating the similarity s_a between g_i and f_a;
[similarity formula given in the original only as an equation image]
the maximum similarity and the corresponding target person information t are:
s_max = max{ s_a : a ∈ {1, 2, …, m} },  t = argmax_{a ∈ {1, 2, …, m}} s_a
setting a similarity threshold s_threshold; if s_max > s_threshold, the matching succeeds and the target person information corresponding to t is returned; otherwise, the matching fails.
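Claims 8-9 can be sketched end to end as follows; since the similarity formula itself appears only as an image in the source, cosine similarity is assumed here purely for illustration, as are the helper names:

```python
import math

def cosine_similarity(u, v):
    """Assumed similarity measure (the claim's actual formula is not
    reproduced in the source text)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def match(g, gallery, threshold):
    """Compute s_a for every gallery feature f_a, take s_max and the
    arg-max identity t, and accept only when s_max > threshold."""
    best_id, s_max = None, float("-inf")
    for person_id, f in gallery.items():
        s = cosine_similarity(g, f)
        if s > s_max:
            best_id, s_max = person_id, s
    if s_max > threshold:
        return best_id, s_max   # match succeeded: return target person t
    return None, s_max          # match failed

gallery = {"person_A": [1.0, 0.0], "person_B": [0.0, 1.0]}
result = match([1.0, 0.0], gallery, threshold=0.9)
```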
CN202110723088.XA 2021-06-29 2021-06-29 Pedestrian re-identification method based on Transformer Active CN113255598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110723088.XA CN113255598B (en) 2021-06-29 2021-06-29 Pedestrian re-identification method based on Transformer

Publications (2)

Publication Number Publication Date
CN113255598A true CN113255598A (en) 2021-08-13
CN113255598B CN113255598B (en) 2021-09-28

Family

ID=77190012

Country Status (1)

Country Link
CN (1) CN113255598B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688271A (en) * 2021-10-25 2021-11-23 浙江大华技术股份有限公司 Archive searching method and related device for target object
CN114091548A (en) * 2021-09-23 2022-02-25 昆明理工大学 Vehicle cross-domain re-identification method based on key point and graph matching
CN118015662A (en) * 2024-04-09 2024-05-10 沈阳二一三电子科技有限公司 Transformer multi-head self-attention mechanism-based pedestrian re-recognition method crossing cameras

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107832672A (en) * 2017-10-12 2018-03-23 Beihang University A pedestrian re-identification method using pose information to design multiple loss functions
CN110334675A (en) * 2019-07-11 2019-10-15 Shandong University A pedestrian re-identification method based on skeleton key point segmentation and column convolution
US20200134321A1 (en) * 2018-02-12 2020-04-30 Beijing Sensetime Technology Development Co., Ltd. Pedestrian re-identification methods and apparatuses, electronic devices, and storage media

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAO Haiyu et al.: "Spindle Net: Person Re-Identification With Human Body Region Guided Feature Decomposition and Fusion", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition *
CHEN Shoubing et al.: "Person re-identification based on Siamese network and re-ranking", Journal of Computer Applications *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: A pedestrian recognition method based on transformer
Effective date of registration: 20220705
Granted publication date: 20210928
Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch
Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.
Registration number: Y2022980009897
PC01 Cancellation of the registration of the contract for pledge of patent right
Date of cancellation: 20230720
Granted publication date: 20210928
Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch
Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.
Registration number: Y2022980009897
PE01 Entry into force of the registration of the contract for pledge of patent right
Denomination of invention: A Transformer based pedestrian re recognition method
Effective date of registration: 20230803
Granted publication date: 20210928
Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch
Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.
Registration number: Y2023980050832
PC01 Cancellation of the registration of the contract for pledge of patent right
Granted publication date: 20210928
Pledgee: China Construction Bank Corporation Nanjing Jianye sub branch
Pledgor: Nanjing inspector Intelligent Technology Co.,Ltd.
Registration number: Y2023980050832