Pedestrian re-identification method based on Transformer
Technical Field
The invention relates to the field of image recognition technology, in particular to pedestrian re-identification, and specifically to a pedestrian re-identification method based on a Transformer.
Background
Pedestrian re-identification is a computer vision technology for judging whether two pedestrian images belong to the same person, and is widely used to search for specific people in surveillance scenes. At present, most existing re-identification techniques use a convolutional neural network to extract pedestrian features, but the features extracted by a convolutional neural network are affected by convolution and down-sampling and lose detailed information, which imposes certain limitations.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a pedestrian re-identification method based on a Transformer, which can improve the accuracy of pedestrian re-identification. The technical scheme is as follows:
step 1, constructing a bottom library;
the bottom library is built from pictures, each containing the whole body of a target person;
detecting the pedestrian frame R = (x, y, w, h) of the pedestrian in a picture using a first model, where (x, y) is the upper-left corner point of the pedestrian frame, w is the width of the pedestrian frame, and h is the height of the pedestrian frame;
cropping the image corresponding to the pedestrian frame R from the original picture and detecting the human body key points P = (p_1, p_2, …, p_k) in it using a second model, where k is the number of human body key points;
for each human body key point p_i, adding the coordinates of the upper-left corner of the pedestrian frame R to restore its position in the original picture, obtaining the coordinates Q_i of the key point in the original picture; using each coordinate Q_i as the center, cropping the image block corresponding to that key point, the cropping expanding outwards from the key point to give a rectangular or circular image block, and flattening the cropped image block into an image vector I_i, i ∈ {1, 2, …, k};
adding a position code in front of each image vector, the position code comprising the coordinates of the human body key point and a fixed random code, i.e. J_i = (x_pi, y_pi, r_i), where (x_pi, y_pi) are the coordinates Q_i of the human body key point p_i and r_i is a random code satisfying: (1) r_i is an n-dimensional random vector obeying an n-dimensional Gaussian distribution; (2) the modulus (norm) of r_i is fixed; (3) the random codes corresponding to any two key points are different;
finally, concatenating the position code and the image vector to obtain the input vector Z_i = concat(J_i, I_i);
feeding the input vectors Z_i of all key points into a Transformer network to obtain the pedestrian feature f, where f = T(Z_1, Z_2, …, Z_k); finally obtaining the features F = (f_1, f_2, …, f_m) of all target persons, and storing the features F together with the corresponding target person information into the bottom library, where m is the number of target persons, T denotes the Transformer network, and T(Z_1, Z_2, …, Z_k) denotes feeding the input vectors Z_i of all key points into the Transformer network to obtain the pedestrian feature f.
Step 2, extracting the features of the pedestrian to be detected;
acquiring a pedestrian image to be detected;
detecting the positions and sizes of all pedestrians in the pedestrian image using the first model to obtain a number of pedestrian frames R_j = (x_j, y_j, w_j, h_j), where (x_j, y_j) is the upper-left corner point of the pedestrian frame, w_j is the width of the pedestrian frame, and h_j is the height of the pedestrian frame; the aspect ratio or resolution of the pedestrian image is adjusted to meet the same requirement as the first model in step 1.
For each detected pedestrian frame, cropping the image corresponding to the pedestrian frame R from the original pedestrian image and detecting the human body key points P_j = (p_j,1, p_j,2, …, p_j,k) using the second model, where k is the number of human body key points; the resolution of the cropped image is adjusted to meet the same requirement as the second model in step 1.
For each human body key point p_j,i, adding the coordinates of the upper-left corner of the pedestrian frame R to restore its position in the original pedestrian image, obtaining the coordinates Q_j,i of the key point in the original pedestrian image; using each coordinate Q_j,i as the center, cropping the image block corresponding to that key point, the cropping expanding outwards from the key point to give a rectangular or circular image block, and flattening the cropped image block into an image vector V_j,i; the specific cropping is the same as that employed in step 1.
Adding a position code in front of each image vector, the position code comprising the coordinates of the human body key point and a fixed random code, i.e. J_j,i = (x_pj,i, y_pj,i, r_j,i), where (x_pj,i, y_pj,i) are the coordinates Q_j,i of the human body key point p_j,i and r_j,i is a random code satisfying: (1) r_j,i is an n-dimensional random vector obeying an n-dimensional Gaussian distribution; (2) the modulus (norm) of r_j,i is fixed; (3) the random codes corresponding to any two key points are different.
Finally, concatenating the position code and the image vector to obtain the input vector Z_j,i = concat(J_j,i, V_j,i);
Feeding the input vectors Z_j,1, Z_j,2, …, Z_j,k of all key points into the Transformer network to finally obtain the pedestrian feature g_j, g_j = T(Z_j,1, Z_j,2, …, Z_j,k), where T denotes the Transformer network and T(Z_j,1, Z_j,2, …, Z_j,k) denotes feeding the input vectors Z_j,i of all key points into the Transformer network to obtain the pedestrian feature g_j.
Step 3, carrying out pedestrian re-identification matching;
comparing the pedestrian feature g_j with the features in the bottom library to determine whether the detected pedestrian is a target person.
Preferably, the first model in step 1 is a YOLOv3 model.
Preferably, the HRNet model is used as the second model in step 1.
Preferably, when the first model is used to detect the pedestrian frame in step 1, if the aspect ratio or the resolution of the picture does not meet the requirement of the first model, black borders are added to the picture, so that the aspect ratio of the picture is adjusted to the aspect ratio corresponding to the first model, and the picture is scaled to the resolution required by the first model.
Preferably, when the second model is used to detect the key points of the human body, if the resolution of the captured image does not meet the requirement of the second model, the image is scaled to the resolution required by the second model.
Preferably, the human key points correspond to the right ankle, the right knee, the right hip, the left knee, the left ankle, the right wrist, the right elbow, the right shoulder, the left elbow, the left wrist, the neck, the head center and the body center of the human body.
Preferably, the cropping manner in step 1 expands outwards with the key point as the center, specifically: taking the key point as the center, a rectangle whose side length is 1/3 of the pedestrian frame width is cropped, or a circle whose radius is 1/6 of the pedestrian frame width is cropped.
Preferably, step 3 specifically comprises: calculating the similarity s between the pedestrian feature g_j and each bottom library feature in the bottom library; when the maximum similarity is greater than the similarity threshold, the matching is successful, i.e. the detected pedestrian belongs to a target person in the bottom library; otherwise the matching fails.
Preferably, in step 3, calculating the similarity s between the pedestrian feature g_j and each bottom library feature specifically comprises: for each bottom library feature f_a, a ∈ {1, 2, …, m}, calculating the similarity s_a between the pedestrian feature g_j and f_a.
The maximum similarity and the corresponding target person information t are then: s_max = max{s_a : a ∈ {1, 2, …, m}}, with t being the target person information corresponding to the bottom library feature f_a that attains s_max.
A similarity threshold s_threshold is set; if s_max > s_threshold, the matching is successful and the target person information t is returned, otherwise the matching fails.
Compared with the prior art, the technical scheme has the following beneficial effects: image blocks centered on human body key points are cropped and used as Transformer inputs for feature extraction, which ensures that each key point region fed into the network is complete, so the network can better extract pedestrian features and the accuracy of pedestrian re-identification is improved. Meanwhile, because the images fed into the network are cropped around the key points, the influence of the background is also removed.
Detailed Description
In order to clarify the technical solution and the working principle of the present invention, the embodiments of the present disclosure will be described in further detail below. All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
The terms "step 1," "step 2," "step 3," and the like in the description and claims of this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that the embodiments of the application described herein may be practiced in sequences other than those described herein.
The embodiment of the disclosure provides a pedestrian re-identification method based on a Transformer, which mainly comprises the following steps:
step 1, constructing a bottom library;
the bottom library is built from pictures, each containing the whole body of a target person;
detecting the pedestrian frame R = (x, y, w, h) of the pedestrian in a picture using a first model (YOLOv3 model), where (x, y) is the upper-left corner point of the pedestrian frame, w is the width of the pedestrian frame, and h is the height of the pedestrian frame.
Preferably, when the first model (YOLOv3 model) is used to detect the pedestrian frame, if the aspect ratio or resolution of the picture does not meet the requirements of the first model (YOLOv3 model), black borders are added to the picture so that its aspect ratio matches the aspect ratio corresponding to the first model (for example, 16:9), and the picture is then scaled to the resolution required by the first model (for example, 512 × 288).
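As an illustration only, the following Python sketch shows one way to perform this padding-and-scaling step with OpenCV; the 16:9 ratio and 512 × 288 resolution are just the example values mentioned above, not fixed requirements of the first model.

import cv2

def letterbox(img, target_w=512, target_h=288):
    # Pad with black borders to the target aspect ratio, then scale to the
    # detector's input resolution (example values from the embodiment above).
    h, w = img.shape[:2]
    target_ratio = target_w / target_h
    if w / h < target_ratio:                       # picture too narrow: pad left/right
        pad = int(round(h * target_ratio)) - w
        img = cv2.copyMakeBorder(img, 0, 0, pad // 2, pad - pad // 2,
                                 cv2.BORDER_CONSTANT, value=(0, 0, 0))
    elif w / h > target_ratio:                     # picture too wide: pad top/bottom
        pad = int(round(w / target_ratio)) - h
        img = cv2.copyMakeBorder(img, pad // 2, pad - pad // 2, 0, 0,
                                 cv2.BORDER_CONSTANT, value=(0, 0, 0))
    return cv2.resize(img, (target_w, target_h))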
Cropping the image corresponding to the pedestrian frame R from the original picture and detecting the human body key points P = (p_1, p_2, …, p_k) in it using a second model (HRNet model), where k is the number of human body key points.
Preferably, when the second model (HRNet model) is used to detect the human body key points, if the resolution of the cut-out image does not meet the requirement of the second model (HRNet model), the image is scaled to the resolution required by the second model (HRNet model) (for example, the image is scaled to 128 × 256 resolution).
Preferably, the human key points correspond to the right ankle, the right knee, the right hip, the left knee, the left ankle, the right wrist, the right elbow, the right shoulder, the left elbow, the left wrist, the neck, the head center and the body center of the human body.
For each human body key point p_i, adding the coordinates of the upper-left corner of the pedestrian frame R to restore its position in the original picture, obtaining the coordinates Q_i of the key point in the original picture; using each coordinate Q_i as the center, cropping the image block corresponding to that key point, the cropping expanding outwards from the key point to give a rectangular or circular image block, and flattening the cropped image block into an image vector I_i.
Preferably, the cropping manner expands outwards with the key point as the center, specifically: taking the key point as the center, a rectangle whose side length is 1/3 of the pedestrian frame width is cropped, or a circle whose radius is 1/6 of the pedestrian frame width is cropped.
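A minimal Python sketch of the rectangular variant of this cropping step is given below; the fixed patch resolution patch_size is an assumption introduced here so that all image vectors I_i have equal length, and is not specified in the embodiment.

import numpy as np
import cv2

def keypoint_patch_vector(original_img, box, keypoint, patch_size=32):
    # box = (x, y, w, h) is the pedestrian frame R; keypoint = (px, py) is a key
    # point detected inside the cropped frame. Restore it to original-image
    # coordinates Q_i, crop a square block of side w/3 centred on it, and flatten.
    x, y, w, h = box
    qx, qy = int(round(keypoint[0] + x)), int(round(keypoint[1] + y))
    half = max(int(round(w / 3)) // 2, 1)
    x0, x1 = max(qx - half, 0), min(qx + half, original_img.shape[1])
    y0, y1 = max(qy - half, 0), min(qy + half, original_img.shape[0])
    block = cv2.resize(original_img[y0:y1, x0:x1], (patch_size, patch_size))
    return block.astype(np.float32).reshape(-1)    # flattened image vector I_i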
Existing methods directly divide the pedestrian image into fixed-size image blocks along a grid, which can split pedestrian features apart, for example cutting a head image into a left half and a right half. The images corresponding to the key points, by contrast, cover the human joints, head and body: the joint regions reflect the posture of the human body and, to a certain extent, carry clothing-related information; the head region contains face-related information; and the body region helps identify clothing-related information. Meanwhile, because the images fed into the network are cropped around the key points, the influence of the background is also removed.
Adding a position code in front of each image vector, the position code comprising the coordinates of the human body key point and a fixed random code, i.e. J_i = (x_pi, y_pi, r_i), where (x_pi, y_pi) are the coordinates Q_i of the human body key point p_i and r_i is a random code satisfying: (1) r_i is an n-dimensional random vector obeying an n-dimensional Gaussian distribution (for example, n = 8); (2) the modulus (norm) of r_i is fixed (for example, ||r_i|| = 1); (3) the random codes corresponding to any two key points are different.
Finally, concatenating the position code and the image vector gives the input vector Z_i = concat(J_i, I_i). Adding the positions of the human body key points to the position code of the input vector expresses the relative spatial positions of the key points and helps the model learn human posture information; adding a fixed random code to the position code distinguishes the different human body key points semantically, and using random vectors with a fixed modulus expresses that, for pedestrian re-identification, the human body key points are unordered and of equivalent importance.
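The following Python sketch illustrates, under the example values above (n = 8, unit modulus), one way to generate such fixed random codes and assemble the input vector Z_i; the fixed random seed is an assumption used here only to keep the codes identical between step 1 and step 2.

import numpy as np

def make_random_codes(k, n=8, seed=0):
    # One fixed random code r_i per key point: n-dimensional Gaussian samples
    # rescaled to unit norm; the fixed seed keeps the codes constant, and
    # distinct Gaussian draws give different codes for different key points.
    rng = np.random.default_rng(seed)
    codes = rng.normal(size=(k, n))
    return codes / np.linalg.norm(codes, axis=1, keepdims=True)

def build_token(coord_q, image_vector, random_code):
    # J_i = (x_pi, y_pi, r_i), then Z_i = concat(J_i, I_i).
    j = np.concatenate([np.asarray(coord_q, dtype=np.float32),
                        random_code.astype(np.float32)])
    return np.concatenate([j, image_vector.astype(np.float32)])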
Feeding the input vectors Z_i of all key points into the Transformer network gives the pedestrian feature f, where f = T(Z_1, Z_2, …, Z_k); finally the features F = (f_1, f_2, …, f_m) of all target persons are obtained, and the features F together with the corresponding target person information are stored in the bottom library. A Transformer is adopted to extract the features: the encoder of the Transformer model strengthens the extraction of local features, the decoder of the Transformer model learns the association between local features and global features, detailed information is retained as much as possible, and the accuracy of identification is improved.
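As a simplified illustration only, the PyTorch sketch below shows a feature extractor T that maps the k input vectors to one pedestrian feature. It uses an encoder-only Transformer with mean pooling; the embodiment's actual architecture (including its decoder) and the layer sizes are not specified in the text, so all dimensions here are assumptions.

import torch
import torch.nn as nn

class KeypointTransformer(nn.Module):
    # T(Z_1, ..., Z_k) -> pedestrian feature f (simplified, encoder-only sketch).
    def __init__(self, token_dim, d_model=256, nhead=8, num_layers=4, feat_dim=256):
        super().__init__()
        self.proj = nn.Linear(token_dim, d_model)          # project Z_i to model width
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, feat_dim)

    def forward(self, tokens):                              # tokens: (batch, k, token_dim)
        x = self.encoder(self.proj(tokens))
        return self.head(x.mean(dim=1))                     # pooled pedestrian feature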
Step 2, extracting the features of the pedestrian to be detected;
acquiring a pedestrian image to be detected;
the first model (YOLOv 3 model) is used for detecting the positions and sizes of all pedestrians in the pedestrian image to obtain a plurality of pedestrian frames Rj=(xj,yj,wj,hj) Wherein(xj,yj )As a point at the upper left corner of the pedestrian frame, wjWidth of pedestrian frame, hjIs the height of the pedestrian frame; the aspect ratio or resolution of the pedestrian image is adjusted to be the same as the requirement of the first model (YOLOv 3 model) in step 1.
For each detected pedestrian frame, intercepting the pedestrian frame from the original pedestrian imageRDetecting key point P of human body by using second model (HRNet model) corresponding to the imagej=(pj,1,pj,2,…,pj,k) K is the number of key points of the human body; the resolution of the cut-out image is adjusted to the same level as the second model (HRNet model) in step 1.
For each human body key point pj,iAdding the coordinates of the upper left corner of the pedestrian frame R to restore the position of the original pedestrian image to obtain the coordinates Q of the key points of the human body in the original pedestrian imagej,iWith each human body key point coordinate Qj,iTaking the key point as the center, intercepting the image block corresponding to the key point, and outwards expanding the intercepted image block by taking the key point as the center to obtain a rectangular or circular imageFlattening the clipped image block into an image vector Vj,i(ii) a The specific interception is the same as the interception employed in step 1.
Adding position codes in front of the image vectors, wherein the position codes comprise coordinates of key points of a human body and fixed random codes, namely Ij=(xpj,i, ypj,i,rj,i) Wherein (x)pj,i, ypj,i) Is a human body key point pj,iCoordinate Q ofj,i,rj,iFor random encoding, the random encoding satisfies the following conditions: (1) r isj,iTo obey n-dimensional random vectors of n-dimensional Gaussian distribution (such as taking n to 8), (2) rj,iFixed die length (e.g. | r)j,iI | = 1), (3) any two key points, which have different corresponding random codes.
Finally, connecting the position code and the image vector to obtain an input vector Zj,i=concat(Ij,i,Vj,i);
The input vectors Z_j,1, Z_j,2, …, Z_j,k of all key points are fed into the Transformer network to finally obtain the pedestrian feature g_j, g_j = T(Z_j,1, Z_j,2, …, Z_j,k).
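Purely as illustrative glue code reusing the sketches above, the following shows how step 2 could produce a feature g_j for each detected pedestrian; detect_boxes and detect_keypoints are hypothetical stand-ins for the first (YOLOv3) and second (HRNet) models, not real library calls, and k = 13 assumes the key point list given earlier.

import numpy as np
import torch

def query_features(img, model, detect_boxes, detect_keypoints, k=13):
    codes = make_random_codes(k)                   # same fixed codes as in step 1
    feats = []
    for box in detect_boxes(img):                  # pedestrian frames R_j
        x, y, _, _ = box
        tokens = []
        for i, (px, py) in enumerate(detect_keypoints(img, box)):    # key points p_j,i
            vec = keypoint_patch_vector(img, box, (px, py))           # image vector V_j,i
            tokens.append(build_token((px + x, py + y), vec, codes[i]))  # Z_j,i
        with torch.no_grad():
            z = torch.tensor(np.stack(tokens), dtype=torch.float32).unsqueeze(0)
            feats.append(model(z).squeeze(0))      # pedestrian feature g_j
    return feats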
The pedestrian feature g_j is compared with the features in the bottom library to determine whether the pedestrian is a target person.
Step 3, carrying out pedestrian re-identification matching;
The similarity s between the pedestrian feature g_j and each bottom library feature in the bottom library is calculated; when the maximum similarity is greater than the similarity threshold, the matching is successful, i.e. the detected pedestrian belongs to a target person in the bottom library; otherwise the matching fails.
Preferably, in step 3, calculating the similarity s between the pedestrian feature g_j and each bottom library feature specifically comprises: for each bottom library feature f_a, a ∈ {1, 2, …, m}, calculating the similarity s_a between the pedestrian feature g_j and f_a.
The maximum similarity and the corresponding target person information t are then: s_max = max{s_a : a ∈ {1, 2, …, m}}, with t being the target person information corresponding to the bottom library feature f_a that attains s_max.
A similarity threshold s_threshold is set; if s_max > s_threshold, the matching is successful and the target person information t is returned, otherwise the matching fails.
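A minimal Python sketch of this matching step is given below; the text does not specify the similarity measure or the threshold value, so cosine similarity and the value 0.7 are assumptions used only for illustration.

import torch
import torch.nn.functional as F

def match(g_j, gallery_feats, gallery_info, s_threshold=0.7):
    # s_a for every bottom library feature f_a, then s_max and its index.
    sims = torch.stack([F.cosine_similarity(g_j, f_a, dim=0) for f_a in gallery_feats])
    s_max, a = sims.max(dim=0)
    if s_max.item() > s_threshold:
        return gallery_info[a.item()]              # target person information t
    return None                                    # matching failed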
The invention has been described above by way of example. Obviously, the specific implementation of the invention is not limited to the above manner; various insubstantial modifications made using the method concepts and technical solutions of the invention, or direct applications of the concepts and solutions of the invention to other occasions without improvement, all fall within the protection scope of the invention.