CN116311379A - Pedestrian re-recognition method and device based on Transformer network model and computer equipment - Google Patents
- Publication number
- CN116311379A (application CN202310351000.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- pedestrian
- feature matrix
- horizontal
- vertical
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06N3/04—Neural network architecture, e.g. interconnection topology
- G06N3/08—Neural network learning methods
- G06V10/764—Image or video recognition or understanding using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- Y02T10/40—Engine management systems
Abstract
The invention belongs to the field of computer vision and relates to a pedestrian re-identification method and device based on a Transformer network model, and computer equipment. The method comprises: obtaining a target pedestrian image, preprocessing it, and generating a standard pedestrian image; dividing the standard pedestrian image into a plurality of overlapping square sub-images using a sliding window; performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix; and inputting the target pedestrian image into a pre-trained improved Transformer network model to predict the recognition result for the target pedestrian image. By using a sliding window to divide the input image into a plurality of overlapping square patches, the invention highlights the features of pedestrians at the boundary between an occluding object and the pedestrian; by using the improved Transformer network structure, it strengthens the association of pedestrian features in all directions and improves pedestrian re-identification accuracy.
Description
Technical Field
The invention belongs to the field of computer vision, and relates to a pedestrian re-identification method and device based on a Transformer network model and computer equipment.
Background
Pedestrian Re-IDentification (Re-ID) can be regarded as a sub-problem of image retrieval that focuses on pedestrian images. In this task, a pedestrian image to be detected is first given; the pedestrian must then be detected again across multiple scenes, captured by multiple cameras, that contain that pedestrian.
In a real scene, pedestrian images come from different cameras in different scenes. They are inevitably affected by shooting angle and unpredictable object occlusion, so images of the same pedestrian can differ greatly. It is therefore necessary to design a pedestrian re-identification method that can effectively extract image features under complex scene conditions.
The pedestrian re-identification methods that currently perform best are based on feature fusion. Among them, methods that extract local features, such as PCB and Pyramid, are of critical importance. PCB was the first method to extract local features by equally dividing pedestrian features in the horizontal direction: it divides the input image into several horizontal strips, trains a convolutional neural network on each strip to obtain local features, and then trains the local features with multiple classifiers that do not share weights. Pyramid builds on PCB by considering different division granularities, better fusing global and local features.
However, the above methods share a problem: object occlusion causes local feature extraction to fail on part of the pedestrian image, so the features of a complete image and an occluded image of the same pedestrian differ too much, leading to detection failure.
Disclosure of Invention
Based on the problems in the prior art, the invention provides a pedestrian re-identification method and device based on a Transformer network model, and computer equipment, which can solve the problem of pedestrian re-identification under object occlusion.
In a first aspect, the invention provides a pedestrian re-identification method based on a Transformer network model, the method comprising:
acquiring a target pedestrian image, preprocessing the target pedestrian image, and generating a standard pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and inputting the vertical feature matrix and the horizontal feature matrix into a pre-trained Transformer network model, and predicting the recognition result of the target pedestrian image.
In a second aspect, the invention further provides a pedestrian re-identification device based on a Transformer network model, the device comprising:
the image acquisition module is used for acquiring a target pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
the image preprocessing module is used for preprocessing the target pedestrian image to generate a standard pedestrian image;
the image segmentation module is used for dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
the image mapping module is used for carrying out horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and the image recognition module, used for inputting the horizontal feature matrix and the vertical feature matrix into a pre-trained improved Transformer network model, and predicting the recognition result of the target pedestrian image.
In a third aspect, the invention further provides a computer device comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the pedestrian re-identification method based on a Transformer network model according to the first aspect of the invention.
The invention has the beneficial effects that:
aiming at the shielding problem, the invention provides a pedestrian re-identification method, a pedestrian re-identification device and computer equipment based on a Transformer network model. The invention divides the pedestrian image into a plurality of sub-image inputs, thereby reducing the influence of the shielding object on the pedestrian characteristics. And the input image is divided into a plurality of square sub-images with overlapped parts by utilizing the sliding window, so that the characteristics of pedestrians at the boundary edge of the shielding object and the pedestrians are highlighted. The invention also constructs an improved transducer network, and a dual multi-scale transducer structure based on a multi-branch multi-head attention module is constructed in the first layer of the network for acquiring the pedestrian characteristics of the image in the horizontal direction and the vertical direction, so that the characteristic representation of pedestrians is enriched, and the pedestrian re-recognition accuracy is improved.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the invention is described in detail below with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a pedestrian re-recognition method based on a Transformer network model;
FIG. 2 is a schematic diagram of the network structure of the pedestrian re-recognition method based on a Transformer network model according to the present invention;
fig. 3 is a schematic diagram of a multi-branch multi-head attention module in a pedestrian re-recognition method based on a Transformer network model.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention; all other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
Other advantages and effects of the invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments with reference to specific examples. The invention may also be practiced or carried out in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the invention. The illustrations provided in the following embodiments merely illustrate the basic idea of the invention, and the embodiments and their features may be combined with each other where no conflict arises.
The drawings are for illustrative purposes only, are schematic rather than physical, and are not intended to limit the invention; to better illustrate the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and do not represent the size of the actual product; those skilled in the art will appreciate that certain well-known structures and their descriptions may be omitted from the drawings.
The application provides a pedestrian re-identification method based on a Transformer network model, which can extract image features, match images, and acquire and process related data based on artificial-intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. Artificial-intelligence infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems and mechatronics. AI software technology mainly covers computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
FIG. 1 is a schematic flow chart of a pedestrian re-recognition method based on a Transformer network model; as shown in fig. 1, the method comprises the steps of:
101. acquiring a target pedestrian image, preprocessing the target pedestrian image, and generating a standard pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
in the embodiment of the invention, the pedestrian image to be subjected to pedestrian re-recognition can be a training set image in the existing pedestrian image data set, can be a verification set image in the existing pedestrian image data set, and can also be a test set image of the existing pedestrian image data set; and more particularly, the video image from the video monitoring collection, wherein the video image comprises a pedestrian image. In the embodiment of the invention, the standard pedestrian image is a pedestrian image with the same preset size, and the standard pedestrian image with the size of 256×128 can be obtained after preprocessing the pedestrian image in the data set by taking the mark-1501 as an example of the data set.
In some embodiments, after the normalization of the pedestrian images in the dataset, 33217 images of 1501 pedestrians can be divided into a training set and a test set according to a ratio of 6:4, wherein the training set comprises 19926 pedestrian images and the test set comprises 13291 pedestrian images.
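As a quick arithmetic check (illustrative only, not part of the patent), the quoted counts are consistent with a 6:4 split:

```python
# Check that 19926 training and 13291 test images form a 6:4 split
# of the 33217 images quoted in the text.
def split_ratio(train, test):
    total = train + test
    return total, round(train / total, 2), round(test / total, 2)

total, train_frac, test_frac = split_ratio(19926, 13291)
```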
Fig. 2 is a schematic diagram of the network structure of the pedestrian re-recognition method based on a Transformer network model according to the present invention. As shown in fig. 2, after the standard pedestrian image is obtained, a sliding window is used to divide it into blocks, horizontal linear projection and vertical linear projection are applied, the results are input into the improved Transformer network model, and the corresponding recognition result is finally obtained. The network structure is described in detail below in connection with steps 102 to 104.
102. Dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
In some embodiments of the invention, the standard pedestrian image may first be segmented into a plurality of non-overlapping grouped sub-images, and a sliding window then divides each grouped sub-image into a plurality of overlapping square sub-images.
Each standard pedestrian image is divided into N square sub-images with overlapping portions, where N is calculated as follows:

N = K × ((H/K1 − P)/S + 1) × ((W/K2 − P)/S + 1)

where H is the height of the standard pedestrian image, W is its width, S is the sliding-window stride, P is the height and width of the square sub-images, and K = K1×K2 is the total number of grouped sub-images; K1 is the number of vertically divided sub-images and K2 is the number of horizontally divided sub-images.

For convenience of description, this embodiment divides the standard pedestrian image into upper and lower grouped sub-images and uses a sliding window to divide them into N overlapping square sub-images; in this case K = 2, K1 = 2, K2 = 1 in the formula above.
In the embodiment of the present invention, taking a 256×128 standard pedestrian image as an example, as shown in fig. 2, the standard pedestrian image may be divided into upper and lower grouped sub-images a and b, each of size 128×128. A sliding window divides each grouped sub-image into N overlapping square sub-images. In this example, the square patch size is 16×16 and the stride is 14, so the pedestrian image is divided into 162 square patches of 16×16.
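The patch count in this example can be verified with a short sketch (hypothetical helper implementing the sliding-window count formula above):

```python
# Count the overlapping square patches produced by a sliding window:
# split an h x w image into k1 x k2 grouped sub-images, then slide a
# p x p window with stride s over each sub-image.
def num_patches(h, w, k1, k2, p, s):
    sub_h, sub_w = h // k1, w // k2
    per_sub = ((sub_h - p) // s + 1) * ((sub_w - p) // s + 1)
    return k1 * k2 * per_sub

# 256x128 image, two vertical groups, 16x16 window, stride 14:
n = num_patches(256, 128, k1=2, k2=1, p=16, s=14)
```

With these parameters each 128×128 grouped sub-image yields a 9×9 grid of windows, i.e. 81 patches, for 162 in total.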
It can be understood that, by dividing the pedestrian image into a plurality of sub-image inputs, the occluding object and the pedestrian will with high probability fall into different sub-image blocks; when the network model extracts features from the sub-images separately, it can extract pedestrian features and occluder features separately, reducing the influence of the occluder on the pedestrian features.
The improved model of the invention additionally extracts features from the pedestrian parts at these edges, highlighting the features of pedestrians at the boundary between the occluding object and the pedestrian.

103. Performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
In the embodiment of the invention, each square sub-image obtained in step 102 is linearly projected in a horizontal unfolding mode and the results are concatenated in the vertical direction to obtain the horizontal feature matrix; each square sub-image is linearly projected in a vertical unfolding mode and the results are concatenated in the horizontal direction to obtain the vertical feature matrix. The vertical feature matrix Z_h has dimension 3P²×N and the horizontal feature matrix Z_w has dimension N×3P²:

Z_h = [F_h(x_1); F_h(x_2); …; F_h(x_N)]
Z_w = [F_w(x_1); F_w(x_2); …; F_w(x_N)]

where {x_i | i = 1, 2, …, N} denotes the N different square sub-images, F_w denotes horizontal unfolding and F_h denotes vertical unfolding.
Again taking the 256×128 standard pedestrian image of step 102 as an example, the input image is divided into 162 square patches of 16×16; linear projection in the horizontal unfolding mode followed by vertical concatenation gives a horizontal feature matrix Z_w of size 162×768, and linear projection in the vertical unfolding mode followed by horizontal concatenation gives a vertical feature matrix Z_h of size 768×162.
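The projection shapes can be checked with a small sketch (hypothetical helper; three colour channels are assumed, consistent with the 3P² flattened length in the formulas above):

```python
# Shapes of the horizontal/vertical feature matrices for N patches of
# size p x p with 3 colour channels (flattened length 3*p*p).
def feature_matrix_shapes(n, p):
    d = 3 * p * p            # flattened patch length, 3P^2
    z_w = (n, d)             # horizontal: one row per patch
    z_h = (d, n)             # vertical: one column per patch
    return z_w, z_h

z_w, z_h = feature_matrix_shapes(162, 16)
```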
104. Inputting the vertical feature matrix and the horizontal feature matrix into the pre-trained Transformer network model, and predicting the recognition result of the target pedestrian image.
In the embodiment of the invention, the pre-trained model is an improved Transformer network model. As shown in fig. 2, it comprises a first-layer dual multi-scale Transformer structure, L−1 Origin Transformer layers, and a final fully connected layer. The dual multi-scale Transformer structure computes two groups of global and local feature matrices at different scales and fuses them; the L−1 Origin Transformer layers process the fused features and extract pedestrian features; and the fully connected layer processes the pedestrian features and predicts the recognition result of the target pedestrian image. In this example L = 10, i.e. there are 10 Transformer layers in total. Of course, L can be set according to actual conditions, provided at least L ≥ 2.
The dual multi-scale Transformer structure comprises two multi-scale Transformer structures, named the horizontal multi-scale Transformer structure and the vertical multi-scale Transformer structure; the former operates on the horizontal feature matrix and the latter on the vertical feature matrix. Each multi-scale Transformer structure contains a Multi-Branch Multi-Head Attention module (MBMHA) with a top branch, a middle branch and a bottom branch: the top branch applies no processing to the input feature matrix and directly extracts global pedestrian features, the middle branch halves the input feature matrix to extract local pedestrian features, and the bottom branch trisects the input feature matrix to extract local pedestrian features. The global and local features of the horizontal multi-scale Transformer structure are fused and concatenated to output a horizontal fusion feature matrix; the global and local features of the vertical multi-scale Transformer structure are fused and concatenated to output a vertical fusion feature matrix. The horizontal fusion feature matrix is projected with a horizontal weight matrix to obtain the feature matrix of the horizontal multi-scale Transformer structure, and the vertical fusion feature matrix is projected with a vertical weight matrix to obtain the feature matrix of the vertical multi-scale Transformer structure. The horizontal and vertical feature matrices are then feature-fused by weight-matrix projection to obtain the feature matrix of the first-layer dual multi-scale Transformer structure.
It can be understood that, in the embodiment of the present invention, corresponding position information and grouping information G (of dimension N×2) are also added to the horizontal and vertical feature matrices respectively, forming a horizontal embedding feature matrix E_w and a vertical embedding feature matrix E_h, where E_h has dimension (3P²+2)×N and E_w has dimension N×(3P²+2). E_h and E_w can be expressed as:

E_h = Concat(Z_h, G^T)
E_w = Concat(Z_w, G)
Again taking the 256×128 standard pedestrian image of step 102 as an example: the horizontal feature matrix Z_w has size 162×768 and the vertical feature matrix Z_h has size 768×162; since the position and grouping information G has dimension N×2 with N = 162, two feature dimensions are added, so the horizontal embedding feature matrix has size 162×770 and the vertical embedding feature matrix has size 770×162.
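A small sketch (hypothetical helper) confirms the embedding dimensions after appending the N×2 matrix G:

```python
# Embedding-matrix shapes after concatenating the N x 2 position/grouping
# matrix G: each patch vector gains two extra feature dimensions.
def embedding_shapes(n, p, extra=2):
    d = 3 * p * p + extra    # 3P^2 + 2
    return (n, d), (d, n)    # (E_w shape, E_h shape)

e_w_shape, e_h_shape = embedding_shapes(162, 16)
```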
The horizontal embedding feature matrix E_w and the vertical embedding feature matrix E_h are respectively input into the first-layer dual multi-scale Transformer structure. As shown in fig. 3, pedestrian features are extracted at multiple scales through three different branches: Up-Layer, Mid-Layer and Down-Layer.
For the Up-Layer, the feature matrices E_h and E_w are not segmented, and global features are extracted; the Mid-Layer and Down-Layer horizontally halve and trisect the embedding feature matrices, respectively, to extract local features.
For the Mid-Layer, E_w is split into two feature matrices of dimension (N/2)×(3P²+2), and E_h is split into two feature matrices of dimension (3P²+2)×(N/2).
For the Down-Layer, E_w is split into three feature matrices of dimension (N/3)×(3P²+2), and E_h is split into three feature matrices of dimension (3P²+2)×(N/3). The three branches then each perform the multi-head attention operation.
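The three-branch split can be sketched as simple patch-count bookkeeping (a hedged illustration; the patent does not specify how a non-divisible N would be handled, so any remainder is assigned to the last part here):

```python
# Patch counts assigned to each branch of the MBMHA module: the top
# branch keeps all N patches, the middle halves them, the bottom trisects.
def branch_splits(n):
    return {
        "up":   [n],                                  # global features
        "mid":  [n // 2, n - n // 2],                 # two local halves
        "down": [n // 3, n // 3, n - 2 * (n // 3)],   # three local parts
    }

splits = branch_splits(162)
```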
When calculating multi-branch multi-head self-attention, each feature matrix is first projected into Q, K and V using the weight matrices W^Q, W^K and W^V; the attention matrix of a single head is computed, the attention matrices of the M heads are concatenated, and the result is projected through the weight matrix W^O to obtain the horizontal and vertical output feature matrices on one branch.
The horizontal and vertical multi-branch multi-head self-attention outputs can be denoted MBMHA_w^v and MBMHA_h^v respectively, where v = 1, 2, 3 indexes the three branches and each branch contains M heads.
Here the vertical weight matrices W_h^Q, W_h^K and W_h^V all have dimension N×N, and the horizontal weight matrices W_w^Q, W_w^K and W_w^V have dimension (3P²+2)×(3P²+2). Q_h, K_h and V_h have dimension (3P²+2)×N, and Q_w, K_w and V_w have dimension N×(3P²+2). The vertical output projection W_h^O has dimension (3P²+2)×(M(3P²+2)), and the horizontal output projection W_w^O has dimension (M(3P²+2))×(3P²+2).
For the Mid-Layer, the two halves of E_h have dimension (3P²+2)×(N/2) and the two halves of E_w have dimension (N/2)×(3P²+2); correspondingly, the vertical weight matrices W^Q, W^K and W^V have dimension (N/2)×(N/2) and the horizontal ones have dimension (3P²+2)×(3P²+2). The dimensions of W^O remain the same as in the Up-Layer.
For the Down-Layer, the three parts of E_h have dimension (3P²+2)×(N/3) and the three parts of E_w have dimension (N/3)×(3P²+2); correspondingly, the vertical weight matrices W^Q, W^K and W^V have dimension (N/3)×(N/3) and the horizontal ones have dimension (3P²+2)×(3P²+2). The dimensions of W^O remain the same as in the Up-Layer.
In summary, MBMHA_w and MBMHA_h are the two final output feature matrices: MBMHA_w comprises three feature matrices of dimension N×(3P²+2), and MBMHA_h comprises three feature matrices of dimension (3P²+2)×N.
In this example, for the Up-Layer, the horizontal embedding feature matrix input to each head has dimension 162×770 and the vertical embedding feature matrix has dimension 770×162. The horizontal embedding feature matrix of dimension 162×770 is mapped to Q_up-w, K_up-w and V_up-w, and the vertical embedding feature matrix of dimension 770×162 is mapped to Q_up-h, K_up-h and V_up-h, where W_up-w^Q, W_up-w^K and W_up-w^V have dimension 770×770 and W_up-h^Q, W_up-h^K and W_up-h^V have dimension 162×162. The formulas are as follows:

Q_up-w = E_w·W_up-w^Q, K_up-w = E_w·W_up-w^K, V_up-w = E_w·W_up-w^V
Q_up-h = E_h·W_up-h^Q, K_up-h = E_h·W_up-h^K, V_up-h = E_h·W_up-h^V

The calculated Q_up-w, K_up-w and V_up-w have dimension 162×770, and Q_up-h, K_up-h and V_up-h have dimension 770×162. Next, the self-attention matrix Attention(Q, K, V) of a single head is calculated as:

Attention(Q, K, V) = softmax(Q·K^T/√d)·V

where d is a constant, taken as 64 in this example.
The horizontal Attention(Q, K, V)_up-w has dimension 162×770 and the vertical Attention(Q, K, V)_up-h has dimension 770×162, and the M heads are concatenated (here M = 12, since 12 × 770 = 9240). Concatenating the horizontal attention matrices gives a 162×9240 feature matrix, which is multiplied by the 9240×770 matrix W_up-w^O to yield a 162×770 matrix. Concatenating the vertical attention matrices gives a 9240×162 feature matrix, which is left-multiplied by the 770×9240 matrix W_up-h^O to yield a 770×162 matrix. This gives the horizontal and vertical feature matrices output by the Up-Layer.
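The single-head computation Attention(Q, K, V) = softmax(Q·K^T/√d)·V described above can be illustrated on tiny matrices (a self-contained pure-Python sketch, not the patent's implementation; the real model applies it to the 162×770 and 770×162 matrices discussed here):

```python
import math

# Scaled dot-product attention for one head on plain nested lists.
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def attention(q, k, v, d):
    scores = matmul(q, [list(c) for c in zip(*k)])            # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                                 # weighted sum of V rows

# Two 2-dim queries/keys; each output row is a convex combination of V's rows.
out = attention([[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 2.0], [3.0, 4.0]], d=64)
```

The output keeps Q's row count and V's column count, which is why the horizontal and vertical branches preserve the 162×770 and 770×162 shapes.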
The features of the Mid-Layer and Down-Layer branches in the dual structure are extracted in the same way, giving two groups of feature matrices: one group of three horizontal feature matrices of dimension 162×770 and one group of three vertical feature matrices of dimension 770×162, each group containing one global feature and two local features.
It can be understood that the dual Transformer network structure extracts features from the pedestrian image in the vertical and horizontal directions respectively, strengthening the connection of each pixel in both directions and improving re-identification accuracy; the multi-branch multi-head attention module extracts pedestrian features at different scales, and fusing these features greatly improves the model's performance. In the embodiment of the invention, the calculation above yields two groups of global and local feature matrices at different scales, which are feature-fused separately to obtain two fusion feature matrices: the global and local feature matrices of the vertical feature matrix are concatenated and fused to output the vertical fusion feature matrix, and the global and local feature matrices of the horizontal feature matrix are concatenated and fused to output the horizontal fusion feature matrix. The horizontal fusion feature matrix is projected with a horizontal weight matrix to obtain the output feature matrix of the horizontal multi-scale Transformer structure, and the vertical fusion matrix is projected with a vertical weight matrix to obtain the output feature matrix of the vertical multi-scale Transformer structure.
Taking the matrix dimensions of the above embodiment as an example, a 162×2310 horizontal fusion feature matrix and a 2310×162 vertical fusion feature matrix are obtained after concatenation. The 162×2310 horizontal fusion feature matrix is multiplied by a 2310×770 weight matrix to obtain the final output matrix of the horizontal multi-scale Transformer structure, of dimension 162×770. A 770×2310 weight matrix is multiplied by the 2310×162 vertical fusion feature matrix to obtain the final output matrix of the vertical multi-scale Transformer structure, of dimension 770×162. Finally, the two output matrices of the dual Transformer structure are fused to obtain the output matrix of the dual multi-scale Transformer layer: the 162×770 horizontal matrix is multiplied by the 770×162 vertical matrix to obtain a 162×162 feature matrix, which is then multiplied by a 162×770 weight matrix W to obtain the final output matrix of dimension 162×770.
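The fusion step's shape bookkeeping can be checked with a small sketch (hypothetical helper, shapes only; the weight matrices themselves are learned parameters):

```python
# Shape bookkeeping for the final fusion of the dual structure
# (worked example: N = 162 patches, embedding width d = 770).
def fuse_shapes(n, d):
    horiz_out = (n, d)                    # horizontal branch output
    vert_out = (d, n)                     # vertical branch output
    fused = (horiz_out[0], vert_out[1])   # (n, d) @ (d, n) -> n x n
    final = (fused[0], d)                 # projected by an n x d weight matrix W
    return fused, final

fused, final = fuse_shapes(162, 770)
```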
In the embodiment of the invention, after processing by the dual multi-scale Transformer structure, pedestrian features are extracted through the L-1 layer Origin Transformer structure, and the full connection layer is then used to predict the pedestrian features to obtain the recognition result of the target pedestrian image; the structure of the L-1 layer Origin Transformer and the full connection layer are similar to those of a conventional Transformer network model and are not described in detail herein.
In the embodiment of the invention, the Transformer network model can be trained in advance on training set images; after the training of the Transformer network model is completed, the model can be used for recognition prediction of target pedestrian images.
In the embodiment of the invention, training the Transformer network model means training the whole process and the model with a loss function to obtain the trained model parameters. Finally, the pedestrian image to be identified is input into the network model to obtain the identification result.
In an embodiment of the present invention, the pre-training process of the improved Transformer network model includes:
acquiring a training pedestrian image, preprocessing the training pedestrian image, and generating a standard pedestrian image; the training pedestrian image is used for training the improved Transformer network model;
dividing a standard pedestrian image corresponding to the training pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image corresponding to the training pedestrian image to obtain a vertical feature matrix and a horizontal feature matrix;
inputting the horizontal feature matrix and the vertical feature matrix corresponding to the training pedestrian image into the improved Transformer network model, and training the parameters of the network model by using a loss function;
and when the improved Transformer network model reaches the preset number of iterations or converges, the training of the improved Transformer network model is completed.
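The sliding-window division and dual linear projection steps above can be sketched as follows. The image size, patch size P, and stride S here are illustrative assumptions, not values fixed by the patent:

```python
import numpy as np

def split_overlapping_patches(img, P=16, S=8):
    """Divide an H x W image into P x P square patches with stride S (overlapping when S < P)."""
    H, W = img.shape[:2]
    patches = []
    for top in range(0, H - P + 1, S):
        for left in range(0, W - P + 1, S):
            patches.append(img[top:top + P, left:left + P])
    return np.stack(patches)  # shape: (N, P, P)

# Toy grayscale "standard pedestrian image" (values are random placeholders).
img = np.random.rand(64, 32)
patches = split_overlapping_patches(img, P=16, S=8)
N, P = patches.shape[0], patches.shape[1]

# Horizontal expansion: flatten each patch row by row, stack the N vectors vertically -> N x P^2.
horizontal = patches.reshape(N, P * P)
# Vertical expansion: flatten each patch column by column, stack the N vectors horizontally -> P^2 x N.
vertical = patches.transpose(0, 2, 1).reshape(N, P * P).T

print(horizontal.shape, vertical.shape)
```

With a 64×32 image, P=16, and S=8 this yields N=21 overlapping patches, a 21×256 horizontal feature matrix, and a 256×21 vertical feature matrix, matching the N×P² and P²×N shapes described by the method.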
The pedestrian features obtained from the training set images after steps 101-104 can be compared with the real pedestrian features of the training set images for training; that is, the whole process and the model are trained and learned over multiple iterations by using a loss function, which in this embodiment comprises an identity loss and a triplet loss.
The identity loss L_id is formulated as follows:

L_id = -Σ_{i=1}^{c} y_i · log(p_i)

where y denotes the label of the image, p_i denotes the predicted probability of the i-th identity, and c denotes the number of identities.
The triplet loss L_t is formulated as follows:

L_t = [d(x_a, x_p) - d(x_a, x_n) + m]_+

where x_a is the anchor (reference) sample, x_p is a positive sample, x_n is a negative sample, d is a distance function, m is the margin of the triplet loss, and [·]_+ is equivalent to max(·, 0).
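A minimal sketch of the two losses, assuming a one-hot label for the identity loss and Euclidean distance for d (the patent does not fix the distance function):

```python
import numpy as np

def identity_loss(probs, label):
    """Cross-entropy identity loss: -sum_i y_i * log(p_i) with a one-hot label y."""
    return -np.log(probs[label] + 1e-12)

def triplet_loss(x_a, x_p, x_n, m=0.3):
    """L_t = [d(x_a, x_p) - d(x_a, x_n) + m]_+ with Euclidean distance d."""
    d = lambda u, v: np.linalg.norm(u - v)
    return max(d(x_a, x_p) - d(x_a, x_n) + m, 0.0)

# Toy check: a confident correct prediction gives a small identity loss,
# and a well-separated triplet gives zero triplet loss (hinge inactive).
probs = np.array([0.9, 0.05, 0.05])
print(identity_loss(probs, 0))              # about 0.105

x_a = np.array([0.0, 0.0])
x_p = np.array([0.1, 0.0])                  # close positive
x_n = np.array([5.0, 0.0])                  # far negative
print(triplet_loss(x_a, x_p, x_n, m=0.3))   # 0.0
```

The margin m = 0.3 is only a common default; the patent does not state the value used.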
Combining the global features with the local features, the overall loss function L is the sum of the identity loss and the triplet loss, formulated as follows:

L = L_id + L_t
training for a number of rounds until the loss value is substantially unchanged. In this example, 64 rounds of training were performed, with an initial learning rate (learning rate) set to 0.000125 and an optimization strategy of optimal_method: sgd (random gradient descent). Learning rate hot start rounds epochs 4, decay rate weight_deca 1e -7 . The first 16 training rounds are performed with hot start of learning rate, the learning rate reaches stable after 16 rounds, the later 48 rounds are performed with relatively stable training on the model, the loss value basically remains unchanged after 60 rounds, the model is converged, and finally parameters of the network model are obtained.
In this example, the relevant parameters of the Transformer network model trained on the training set images can be stored offline, and the network model can then be used to identify target pedestrian images; the test results of the network model of the invention are shown in Table 1.
TABLE 1 identification test results of Transformer network model
When querying a pedestrian image, the query image only needs to be fed into the network to extract its features; the distances between the query image and all images in the test set can then be quickly calculated and sorted in ascending order, and the sorting result is taken as the output of the pedestrian re-identification model.
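The query step can be sketched as follows, assuming Euclidean distance between feature vectors (the patent does not fix the distance measure):

```python
import numpy as np

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by ascending distance to the query feature."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    return np.argsort(dists)

# Toy gallery of 4 feature vectors; index 2 is closest to the query.
gallery = np.array([[3.0, 0.0], [0.0, 2.0], [0.1, 0.1], [1.0, 1.0]])
query = np.array([0.0, 0.0])
print(rank_gallery(query, gallery))  # closest gallery image first
```

The first index in the ranking is the model's top match for the query pedestrian.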
The embodiment of the invention also provides a pedestrian re-identification device based on the Transformer network model, which comprises:
the image acquisition module is used for acquiring a target pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
the image preprocessing module is used for preprocessing the target pedestrian image to generate a standard pedestrian image;
the image segmentation module is used for dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
the image mapping module is used for carrying out horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and the image recognition module is used for inputting the horizontal feature matrix and the vertical feature matrix into a pre-trained improved Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
The embodiment of the invention also provides computer equipment, which comprises a processor and a memory, wherein the processor and the memory are connected with each other, the memory is used for storing a computer program, the computer program comprises program instructions, and the processor is used for calling the program instructions to execute the pedestrian re-identification method based on the Transformer network model.
The embodiment of the invention provides a processing device, which comprises a processor and a storage device; a processor adapted to execute each program; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement a pedestrian re-recognition method based on a Transformer network model as described above.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, etc.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (10)
1. A pedestrian re-recognition method based on a Transformer network model, the method comprising:
acquiring a target pedestrian image, preprocessing the target pedestrian image, and generating a standard pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a vertical feature matrix and a horizontal feature matrix;
and inputting the vertical feature matrix and the horizontal feature matrix into a pre-trained improved Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
2. The pedestrian re-recognition method based on the Transformer network model according to claim 1, wherein the sub-image dividing process of the standard pedestrian image comprises: dividing the standard pedestrian image into a plurality of non-overlapping grouped sub-images; and dividing each grouped sub-image into a plurality of square sub-images having overlapping portions by using a sliding window, respectively.
3. The pedestrian re-recognition method based on a Transformer network model according to claim 2, wherein each standard pedestrian image is divided into N square sub-images with overlapping portions, wherein the calculation formula of N is as follows:
wherein H is the height of a standard pedestrian image, W is the width of a standard pedestrian image, S is the step length, P is the height and width of square sub-images, K represents the total number of divisions of grouped sub-images, and K=K1×K2; k1 represents the number of vertically divided sub-images, and K2 represents the number of horizontally divided sub-images.
4. The pedestrian re-recognition method based on a Transformer network model according to claim 1, wherein obtaining the horizontal feature matrix and the vertical feature matrix from the square sub-images through horizontal linear projection and vertical linear projection includes: linearly projecting each square sub-image in a horizontal expansion manner to obtain the horizontal expansion vector of each sub-image, and splicing the N horizontal expansion vectors in the vertical direction to obtain a horizontal feature matrix of N×P²; linearly projecting each square sub-image in a vertical expansion manner to obtain the vertical expansion vector of each sub-image, and splicing the N vertical expansion vectors in the horizontal direction to obtain a vertical feature matrix of P²×N, where P is the height and width of the square sub-image.
5. The pedestrian re-recognition method based on a Transformer network model of claim 1, wherein the improved Transformer network model comprises a dual multi-scale Transformer structure, an L-1 layer Origin Transformer structure and a full connection layer; the dual multi-scale Transformer structure is used for calculating global feature matrixes and local feature matrixes of different scales based on the horizontal feature matrix and the vertical feature matrix, and performing feature fusion on the global feature matrixes and the local feature matrixes; the L-1 layer Origin Transformer structure is used for processing the fused feature matrix and extracting pedestrian features; and the full connection layer is used for processing the pedestrian features and predicting to obtain the recognition result of the target pedestrian image.
6. The pedestrian re-recognition method based on a Transformer network model according to claim 5, wherein the dual multi-scale Transformer structure comprises two multi-scale Transformer structures, namely a horizontal multi-scale Transformer structure and a vertical multi-scale Transformer structure; the horizontal multi-scale Transformer structure calculates the horizontal feature matrix, and the vertical multi-scale Transformer structure calculates the vertical feature matrix; each multi-scale Transformer structure comprises a multi-branch multi-head attention module based on a top-layer branch, a middle-layer branch and a bottom-layer branch, wherein the top-layer branch does not process the input feature matrix and directly extracts global features of pedestrians, the middle-layer branch halves the input feature matrix and then extracts local features of pedestrians, and the bottom-layer branch trisects the input feature matrix and then extracts local features of pedestrians; the global features of the horizontal multi-scale Transformer structure and the local features of the two corresponding branches are fused and spliced, and a horizontal fusion feature matrix is output; the global features of the vertical multi-scale Transformer structure and the local features of the two corresponding branches are fused and spliced, and a vertical fusion feature matrix is output; the horizontal fusion feature matrix is projected by a horizontal weight matrix to obtain the feature matrix of the horizontal multi-scale Transformer structure; the vertical fusion feature matrix is projected by a vertical weight matrix to obtain the feature matrix of the vertical multi-scale Transformer structure; and feature fusion is then performed on the horizontal feature matrix and the vertical feature matrix in a weight-matrix projection mode to obtain the feature matrix of the first-layer dual multi-scale Transformer structure.
7. The pedestrian re-recognition method based on a Transformer network model of claim 6, wherein position and grouping information is added to the horizontal feature matrix before the horizontal feature matrix is calculated, and position and grouping information is added to the vertical feature matrix before the vertical feature matrix is calculated.
8. The pedestrian re-recognition method based on a Transformer network model according to claim 1, wherein the pre-training process of the improved Transformer network model comprises:
acquiring a training pedestrian image, preprocessing the training pedestrian image, and generating a standard pedestrian image; the training pedestrian image is used for training the improved Transformer network model;
dividing a standard pedestrian image corresponding to the training pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image corresponding to the training pedestrian image to obtain a vertical feature matrix and a horizontal feature matrix;
inputting the horizontal feature matrix and the vertical feature matrix corresponding to the training pedestrian image into the improved Transformer network model, and training the parameters of the network model by using a loss function;
and when the improved Transformer network model reaches the preset number of iterations or converges, the training of the improved Transformer network model is completed.
9. A pedestrian re-recognition device based on a Transformer network model, the device comprising:
the image acquisition module is used for acquiring a target pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
the image preprocessing module is used for preprocessing the target pedestrian image to generate a standard pedestrian image;
the image segmentation module is used for dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
the image mapping module is used for carrying out horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and the image recognition module is used for inputting the horizontal feature matrix and the vertical feature matrix into a pre-trained improved Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
10. A computer device, characterized in that it comprises a processor and a memory, which are connected to each other, wherein the memory is adapted to store a computer program, which computer program comprises program instructions, which processor is adapted to invoke the program instructions to perform a pedestrian re-recognition method based on a Transformer network model according to any of the claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310351000.5A CN116311379A (en) | 2023-04-04 | 2023-04-04 | Pedestrian re-recognition method and device based on Transformer network model and computer equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310351000.5A CN116311379A (en) | 2023-04-04 | 2023-04-04 | Pedestrian re-recognition method and device based on Transformer network model and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116311379A true CN116311379A (en) | 2023-06-23 |
Family
ID=86779833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310351000.5A Pending CN116311379A (en) | 2023-04-04 | 2023-04-04 | Pedestrian re-recognition method and device based on Transformer network model and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116311379A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP2023003026A (en) | Method for identifying rural village area classified garbage based on deep learning | |
CN111259850A (en) | Pedestrian re-identification method integrating random batch mask and multi-scale representation learning | |
CN110210551A (en) | A kind of visual target tracking method based on adaptive main body sensitivity | |
CN111046821B (en) | Video behavior recognition method and system and electronic equipment | |
CN105160310A (en) | 3D (three-dimensional) convolutional neural network based human body behavior recognition method | |
CN111507370A (en) | Method and device for obtaining sample image of inspection label in automatic labeling image | |
CN111368690A (en) | Deep learning-based video image ship detection method and system under influence of sea waves | |
Cepni et al. | Vehicle detection using different deep learning algorithms from image sequence | |
CN111476806A (en) | Image processing method, image processing device, computer equipment and storage medium | |
CN114092487A (en) | Target fruit instance segmentation method and system | |
CN114463759A (en) | Lightweight character detection method and device based on anchor-frame-free algorithm | |
CN111582091A (en) | Pedestrian identification method based on multi-branch convolutional neural network | |
CN111582154A (en) | Pedestrian re-identification method based on multitask skeleton posture division component | |
CN114764870A (en) | Object positioning model processing method, object positioning device and computer equipment | |
CN114332942A (en) | Night infrared pedestrian detection method and system based on improved YOLOv3 | |
CN114283326A (en) | Underwater target re-identification method combining local perception and high-order feature reconstruction | |
CN113569672A (en) | Lightweight target detection and fault identification method, device and system | |
CN117437691A (en) | Real-time multi-person abnormal behavior identification method and system based on lightweight network | |
CN114494893B (en) | Remote sensing image feature extraction method based on semantic reuse context feature pyramid | |
CN116311379A (en) | Pedestrian re-recognition method and device based on Transformer network model and computer equipment | |
CN111160219B (en) | Object integrity evaluation method and device, electronic equipment and storage medium | |
CN115131503A (en) | Health monitoring method and system for iris three-dimensional recognition | |
CN114820723A (en) | Online multi-target tracking method based on joint detection and association | |
CN114332473A (en) | Object detection method, object detection device, computer equipment, storage medium and program product | |
CN115240121B (en) | Joint modeling method and device for enhancing local features of pedestrians |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||