CN116311379A - Pedestrian re-recognition method and device based on Transformer network model and computer equipment - Google Patents

Pedestrian re-recognition method and device based on Transformer network model and computer equipment

Info

Publication number
CN116311379A
CN116311379A
Authority
CN
China
Prior art keywords
image
pedestrian
feature matrix
horizontal
vertical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310351000.5A
Other languages
Chinese (zh)
Inventor
刘歆
赵义铭
钱鹰
陈奉
曾奎
孟雅朋
姜美兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202310351000.5A priority Critical patent/CN116311379A/en
Publication of CN116311379A publication Critical patent/CN116311379A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention belongs to the field of computer vision and relates to a pedestrian re-identification method and device based on a Transformer network model, and computer equipment. The method comprises: acquiring a target pedestrian image, preprocessing it, and generating a standard pedestrian image; dividing the standard pedestrian image into a plurality of square sub-images with overlapping portions using a sliding window; performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix; and inputting the vertical feature matrix and the horizontal feature matrix into a pre-trained improved Transformer network model to predict the recognition result of the target pedestrian image. The invention divides the input image into a plurality of overlapping square patches using a sliding window, so that the features of pedestrians at the boundary between an occluding object and the pedestrian are highlighted; the improved Transformer network structure strengthens the association of pedestrian features in all directions and improves pedestrian re-recognition accuracy.

Description

Pedestrian re-recognition method and device based on Transformer network model and computer equipment
Technical Field
The invention belongs to the field of computer vision, and relates to a pedestrian re-identification method and device based on a Transformer network model and computer equipment.
Background
Pedestrian Re-IDentification (Re-ID) can be regarded as a sub-problem of image retrieval focused on pedestrian images. In this task, a pedestrian image to be detected is first given, and the same pedestrian must then be detected again in images captured by a plurality of cameras across a plurality of scenes.
In a real scene, pedestrian images come from different cameras in different scenes, and are inevitably affected by shooting angles and unpredictable object occlusion, so images of the same pedestrian can differ greatly. It is therefore necessary to design a pedestrian re-recognition method that can effectively extract image features under complex scene conditions.
The pedestrian re-identification methods with the best current results are based on feature fusion. Among these, methods that extract local features, such as PCB and Pyramid, are of critical importance. PCB was the first method to extract local features by equally dividing pedestrian features in the horizontal direction: the input image is divided into several horizontal stripes, each stripe is passed through a convolutional neural network to obtain local features, and the local features are trained through multiple classifiers with unshared weights. Pyramid builds on PCB by considering different division granularities, fusing the global features and local features better.
However, the above method has the following problems:
occlusion by objects causes local feature extraction to fail on part of the pedestrian image, so the feature difference between a complete image and an occluded image of the same pedestrian becomes too large, resulting in detection failure.
Disclosure of Invention
Based on the problems in the prior art, the invention provides a pedestrian re-recognition method and device based on a Transformer network model, and computer equipment, which can solve the problem of pedestrian re-recognition under object occlusion.
In a first aspect of the present invention, the present invention provides a pedestrian re-recognition method based on a Transformer network model, the method comprising:
acquiring a target pedestrian image, preprocessing the target pedestrian image, and generating a standard pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and inputting the vertical feature matrix and the horizontal feature matrix into a pre-trained Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
In a second aspect of the present invention, the present invention further provides a pedestrian re-recognition device based on a Transformer network model, the device comprising:
the image acquisition module is used for acquiring a target pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
the image preprocessing module is used for preprocessing the target pedestrian image to generate a standard pedestrian image;
the image segmentation module is used for dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
the image mapping module is used for carrying out horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and the image recognition module is used for inputting the horizontal feature matrix and the vertical feature matrix into a pre-trained improved Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
In a third aspect of the present invention, the present invention further provides a computer device comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the pedestrian re-recognition method based on a Transformer network model according to the first aspect of the present invention.
The invention has the beneficial effects that:
aiming at the shielding problem, the invention provides a pedestrian re-identification method, a pedestrian re-identification device and computer equipment based on a Transformer network model. The invention divides the pedestrian image into a plurality of sub-image inputs, thereby reducing the influence of the shielding object on the pedestrian characteristics. And the input image is divided into a plurality of square sub-images with overlapped parts by utilizing the sliding window, so that the characteristics of pedestrians at the boundary edge of the shielding object and the pedestrians are highlighted. The invention also constructs an improved transducer network, and a dual multi-scale transducer structure based on a multi-branch multi-head attention module is constructed in the first layer of the network for acquiring the pedestrian characteristics of the image in the horizontal direction and the vertical direction, so that the characteristic representation of pedestrians is enriched, and the pedestrian re-recognition accuracy is improved.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below in preferred embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a pedestrian re-recognition method based on a Transformer network model;
FIG. 2 is a schematic diagram of a network structure of the pedestrian re-recognition method based on a Transformer network model according to the present invention;
fig. 3 is a schematic diagram of a multi-branch multi-head attention module in a pedestrian re-recognition method based on a Transformer network model.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details in this specification may be modified or varied based on different viewpoints and applications without departing from the spirit of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention schematically, and the following embodiments and the features in the embodiments may be combined with each other in the absence of conflict.
The drawings are for illustrative purposes only; they are schematic rather than physical and are not intended to limit the invention. For better illustration of the embodiments, certain elements of the drawings may be omitted, enlarged or reduced, and they do not represent the size of the actual product. It will be appreciated by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The application provides a pedestrian re-recognition method based on a Transformer network model, which can extract image features, match images, and acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. Artificial intelligence infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
FIG. 1 is a schematic flow chart of a pedestrian re-recognition method based on a Transformer network model; as shown in fig. 1, the method comprises the steps of:
101. acquiring a target pedestrian image, preprocessing the target pedestrian image, and generating a standard pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
in the embodiment of the invention, the pedestrian image to be subjected to pedestrian re-recognition may be a training-set, validation-set or test-set image from an existing pedestrian image dataset, or, more practically, a video frame containing a pedestrian collected by video surveillance. A standard pedestrian image is a pedestrian image of a uniform preset size; taking Market-1501 as an example dataset, standard pedestrian images of size 256×128 are obtained after preprocessing the pedestrian images in the dataset.
In some embodiments, after normalization of the pedestrian images in the dataset, the 33217 images of 1501 pedestrians can be divided into a training set and a test set at a ratio of 6:4, the training set containing 19926 pedestrian images and the test set 13291 pedestrian images.
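As a concrete illustration of this split, the following is a minimal sketch in Python; the file names, random seed, and a flat ratio split are assumptions, not taken from the patent.

```python
import random

def split_dataset(image_paths, train_ratio=0.6, seed=42):
    """Shuffle and split a list of pedestrian image paths 6:4."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]

# A flat 6:4 split of 33217 images gives 19930/13287; the patent's
# 19926/13291 counts suggest the split is actually done per identity.
train, test = split_dataset([f"img_{i}.jpg" for i in range(33217)])
print(len(train), len(test))  # 19930 13287
```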
Fig. 2 is a schematic diagram of the network structure of the pedestrian re-recognition method based on a Transformer network model according to the present invention. As shown in fig. 2, after a standard pedestrian image is obtained, a sliding window is used to partition it into patches, which are fed into the improved Transformer network model through a horizontal linear projection and a vertical linear projection respectively, finally producing the corresponding recognition result. The network structure is described in detail below in connection with steps 102 to 104.
102. Dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
in some embodiments of the invention, the standard pedestrian image may first be segmented into a plurality of non-overlapping group sub-images, and each group sub-image is then divided into a plurality of square sub-images with overlapping portions using a sliding window.
Each standard pedestrian image is divided into N square sub-images with overlapping portions, where N is calculated as follows:

N = K1 × K2 × (⌊(H/K1 − P)/S⌋ + 1) × (⌊(W/K2 − P)/S⌋ + 1)

where H is the height of the standard pedestrian image, W is the width of the standard pedestrian image, S is the step length of the sliding window, P is the height and width of the square sub-images, K denotes the total number of group sub-images, and K = K1 × K2; K1 denotes the number of vertically divided sub-images and K2 denotes the number of horizontally divided sub-images.

For convenience of description, this embodiment divides the standard pedestrian image into upper and lower group sub-images and then divides each of them into square sub-images with overlapping portions using a sliding window, so that N is calculated as:

N = 2 × (⌊(H/2 − P)/S⌋ + 1) × (⌊(W − P)/S⌋ + 1)

where K = 2, K1 = 2 and K2 = 1.
In the embodiment of the present invention, taking a 256×128 standard pedestrian image as an example, as shown in fig. 2, the standard pedestrian image may be divided into upper and lower group sub-images a and b, each of size 128×128. Each group sub-image is divided into N square sub-images with overlapping portions using a sliding window. In this example each square patch is 16×16 and the step size is 14, so the pedestrian image is divided into 162 square patches of 16×16.
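A minimal sketch of this division, assuming PyTorch; F.unfold reproduces the sliding-window count derived above.

```python
import torch
import torch.nn.functional as F

img = torch.randn(3, 256, 128)                   # standard pedestrian image, C x H x W
top, bottom = img[:, :128, :], img[:, 128:, :]   # two 128 x 128 group sub-images

def sliding_patches(sub_img, patch=16, stride=14):
    """Split one C x H x W sub-image into overlapping patch x patch squares."""
    # unfold returns one column per patch: (C * patch * patch, num_patches)
    cols = F.unfold(sub_img.unsqueeze(0), kernel_size=patch, stride=stride)
    return cols.squeeze(0).T                     # (num_patches, C * patch * patch)

patches = torch.cat([sliding_patches(top), sliding_patches(bottom)], dim=0)
print(patches.shape)                             # torch.Size([162, 768]) -> N = 162
```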
It can be understood that the invention divides the pedestrian image into a plurality of sub-image inputs; an occluding object and a pedestrian will, with high probability, fall into different sub-image blocks, and when the sub-images' features are extracted separately, the network model can extract the pedestrian features and the occluder features separately, reducing the influence of the occluding object on the pedestrian features.
The improved model of the invention additionally performs feature extraction on the pedestrian parts at patch edges, thereby highlighting the features of pedestrians at the boundary between an occluding object and the pedestrian.

103. Performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
in the embodiment of the invention, each square sub-image obtained in step 102 is linearly projected in a horizontal expansion mode and the results are stitched in the vertical direction to obtain the horizontal feature matrix; each square sub-image is linearly projected in a vertical expansion mode and the results are stitched in the horizontal direction to obtain the vertical feature matrix. The horizontal feature matrix Z_w has dimension N×3P² and the vertical feature matrix Z_h has dimension 3P²×N:

Z_w = [F_w(x_1); F_w(x_2); …; F_w(x_N)]

Z_h = [F_h(x_1), F_h(x_2), …, F_h(x_N)]

where {x_i | i = 1, 2, …, N} denotes the N different square sub-images, F_w denotes horizontal expansion and F_h denotes vertical expansion.

Taking the 256×128 standard pedestrian image of step 102 as an example again, the input image is divided into 162 square patches of 16×16; linear projection in the horizontal expansion mode followed by stitching in the vertical direction yields a horizontal feature matrix Z_w of size 162×768, and linear projection in the vertical expansion mode followed by stitching in the horizontal direction yields a vertical feature matrix Z_h of size 768×162.
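As a concrete illustration, the following minimal sketch (assuming PyTorch; the exact pixel scan orders of F_w and F_h are assumptions) builds Z_w and Z_h with the dimensions above.

```python
import torch

def F_w(patch):    # horizontal expansion: scan rows left to right (assumed order)
    return patch.permute(1, 2, 0).reshape(-1)   # (H, W, C) -> 3P^2 vector

def F_h(patch):    # vertical expansion: scan columns top to bottom (assumed order)
    return patch.permute(2, 1, 0).reshape(-1)   # (W, H, C) -> 3P^2 vector

patches = [torch.randn(3, 16, 16) for _ in range(162)]       # 162 square patches
Z_w = torch.stack([F_w(x) for x in patches], dim=0)           # 162 x 768
Z_h = torch.stack([F_h(x) for x in patches], dim=1)           # 768 x 162
print(Z_w.shape, Z_h.shape)
```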
104. And inputting the vertical feature matrix and the horizontal feature matrix into a pre-trained Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
In the embodiment of the invention, the pre-trained model is an improved Transformer network model which, as shown in fig. 2, comprises a first-layer dual multi-scale Transformer structure, L−1 Origin Transformer layers and a final fully-connected layer. The dual multi-scale Transformer structure is used for calculating two groups of global and local feature matrices of different scales and performing feature fusion on the global feature matrices and the local feature matrices; the L−1 Origin Transformer layers are used for processing the fused features and extracting pedestrian features; and the fully-connected layer is used for processing the pedestrian features and predicting the recognition result of the target pedestrian image. In this example L = 10, i.e., there are 10 Transformer layers in total. Of course, the specific value of L can be set according to actual conditions, provided L ≥ 2.
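As a structural sketch only, assuming PyTorch: the dual first layer is abbreviated to an arbitrary callable, and nhead=10 and the identity count 751 are illustrative choices (nn.TransformerEncoderLayer requires d_model=770 to be divisible by the head count), not values from the patent.

```python
import torch
import torch.nn as nn

class ImprovedReIDNet(nn.Module):
    """Skeleton: dual multi-scale first layer + (L-1) Origin Transformer layers + FC."""
    def __init__(self, dual_layer, dim=770, L=10, num_ids=751):
        super().__init__()
        self.dual_layer = dual_layer          # first-layer dual multi-scale structure
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=10, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, num_layers=L - 1)
        self.fc = nn.Linear(dim, num_ids)     # final fully-connected classifier

    def forward(self, E_w, E_h):
        x = self.dual_layer(E_w, E_h)         # fused 162 x 770 feature matrix
        x = self.backbone(x.unsqueeze(0))     # (1, 162, 770) through L-1 layers
        return self.fc(x.mean(dim=1))         # pooled feature -> identity logits

# Toy usage with a stand-in dual layer:
net = ImprovedReIDNet(dual_layer=lambda E_w, E_h: torch.randn(162, 770))
print(net(torch.randn(162, 770), torch.randn(770, 162)).shape)  # (1, num_ids)
```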
The dual multi-scale Transformer structure comprises two multi-scale Transformer structures, named the horizontal multi-scale Transformer structure and the vertical multi-scale Transformer structure; the horizontal structure computes on the horizontal feature matrix and the vertical structure computes on the vertical feature matrix. Each multi-scale Transformer structure comprises a Multi-Branch Multi-Head Attention module (MBMHA) based on a top branch, a middle branch and a bottom branch: the top branch directly extracts global pedestrian features without any processing of the input feature matrix, the middle branch halves the input feature matrix before extracting local pedestrian features, and the bottom branch trisects the input feature matrix before extracting local pedestrian features. The global and local features of the horizontal multi-scale Transformer structure are fused and stitched to output a horizontal fusion feature matrix; the global and local features of the vertical multi-scale Transformer structure are fused and stitched to output a vertical fusion feature matrix. The horizontal fusion feature matrix is projected with a horizontal weight matrix to obtain the feature matrix of the horizontal multi-scale Transformer structure, and the vertical fusion feature matrix is projected with a vertical weight matrix to obtain the feature matrix of the vertical multi-scale Transformer structure. The horizontal and vertical feature matrices are then feature-fused by weight-matrix projection to obtain the feature matrix of the first-layer dual multi-scale Transformer structure.
It can be understood that, in the embodiment of the present invention, corresponding position information and grouping information G are also added to the horizontal feature matrix and the vertical feature matrix respectively, where G has dimension N×2, forming a horizontal embedding feature matrix E_w and a vertical embedding feature matrix E_h; E_h has dimension (3P²+2)×N and E_w has dimension N×(3P²+2), and they can be expressed as:

E_h = Concat(Z_h, Gᵀ)

E_w = Concat(Z_w, G)

Taking the 256×128 standard pedestrian image of step 102 as an example again, the horizontal feature matrix Z_w has size 162×768 and the vertical feature matrix Z_h has size 768×162; since the position and grouping information G has dimension N×2 with N = 162, two feature dimensions are appended to each matrix, so the horizontal embedding matrix has size 162×770 and the vertical embedding matrix has size 770×162.
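A sketch of this concatenation, assuming PyTorch; the exact encoding of G is an assumption (here it simply stores a group index and a normalized patch position).

```python
import torch

N = 162
Z_w = torch.randn(N, 768)
Z_h = torch.randn(768, N)

group = torch.arange(N).ge(81).float()   # 0 for the upper sub-image, 1 for the lower
pos = torch.arange(N).float() / N        # normalized patch position (assumption)
G = torch.stack([group, pos], dim=1)     # N x 2 position/grouping information

E_w = torch.cat([Z_w, G], dim=1)         # 162 x 770
E_h = torch.cat([Z_h, G.T], dim=0)       # 770 x 162
print(E_w.shape, E_h.shape)
```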
The horizontal embedding matrix E_w and the vertical embedding matrix E_h are respectively input into the first-layer dual multi-scale Transformer structure. As shown in fig. 3, pedestrian features are extracted at multiple scales through three branches: Up-Layer, Mid-Layer and Down-Layer.

The Up-Layer does not divide the E_h and E_w embedding feature matrices and extracts global features, while the Mid-Layer and Down-Layer respectively halve and trisect the embedding feature matrices horizontally to extract local features.
For the Mid-Layer, E_w is divided into two feature matrices E_w^1 and E_w^2, each of dimension (N/2)×(3P²+2), and E_h is divided into two feature matrices E_h^1 and E_h^2, each of dimension (3P²+2)×(N/2).

For the Down-Layer, E_w is divided into three feature matrices E_w^1, E_w^2 and E_w^3, each of dimension (N/3)×(3P²+2), and E_h is divided into three feature matrices E_h^1, E_h^2 and E_h^3, each of dimension (3P²+2)×(N/3).
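A sketch of these splits, assuming PyTorch; torch.chunk along the patch axis is an assumed implementation choice.

```python
import torch

E_w = torch.randn(162, 770)         # horizontal embedding matrix
E_h = torch.randn(770, 162)         # vertical embedding matrix

up_w = (E_w,)                       # Up-Layer: no split, global features
mid_w = torch.chunk(E_w, 2, dim=0)  # Mid-Layer: two 81 x 770 halves
down_w = torch.chunk(E_w, 3, dim=0) # Down-Layer: three 54 x 770 thirds

mid_h = torch.chunk(E_h, 2, dim=1)  # two 770 x 81 halves
down_h = torch.chunk(E_h, 3, dim=1) # three 770 x 54 thirds
print([m.shape for m in down_w])
```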
The three branches then each perform the multi-head attention operation.
In calculating multi-branch multi-head self-attention, each feature matrix is first projected into Q, K and V using weight matrices W^Q, W^K and W^V; the attention matrix head_m of a single head is then calculated, the attention matrices of the M heads are concatenated, and the result is projected through a weight matrix W^O to obtain the horizontal and vertical output feature matrices of one branch.

The horizontal and vertical multi-branch multi-head self-attention values can be expressed as MBMHA_w^v and MBMHA_h^v, where v = 1, 2, 3 indexes the three branches and each branch contains M heads.

For the Up-Layer, MBMHA_up-w and MBMHA_up-h are calculated as follows:

Q_up-w = E_w W^Q_up-w, K_up-w = E_w W^K_up-w, V_up-w = E_w W^V_up-w

Q_up-h = E_h W^Q_up-h, K_up-h = E_h W^K_up-h, V_up-h = E_h W^V_up-h

head_m = Attention(Q, K, V)

MBMHA_up-w = Concat(head_1, …, head_M) W^O_up-w

MBMHA_up-h = W^O_up-h Concat(head_1, …, head_M)

where W^Q_up-h, W^K_up-h and W^V_up-h all denote weight matrices of dimension N×N; W^Q_up-w, W^K_up-w and W^V_up-w denote weight matrices of dimension (3P²+2)×(3P²+2); Q_up-h, K_up-h and V_up-h have dimension (3P²+2)×N; Q_up-w, K_up-w and V_up-w have dimension N×(3P²+2); W^O_up-w denotes a weight matrix of dimension (M(3P²+2))×(3P²+2); and W^O_up-h denotes a weight matrix of dimension (3P²+2)×(M(3P²+2)).
For the Mid-Layer, MBMHA_mid-w and MBMHA_mid-h are calculated as follows:

Q_mid-w^i = E_w^i W^Q_mid-w, K_mid-w^i = E_w^i W^K_mid-w, V_mid-w^i = E_w^i W^V_mid-w, i = 1, 2

Q_mid-h^i = E_h^i W^Q_mid-h, K_mid-h^i = E_h^i W^K_mid-h, V_mid-h^i = E_h^i W^V_mid-h, i = 1, 2

head_m^i = Attention(Q^i, K^i, V^i)

MBMHA_mid-w = Concat(head_1^1, …, head_M^1, head_1^2, …, head_M^2) W^O_mid-w

MBMHA_mid-h = W^O_mid-h Concat(head_1^1, …, head_M^1, head_1^2, …, head_M^2)

where E_h^1 and E_h^2 are the matrices obtained by horizontally halving the E_h feature matrix, each of dimension (3P²+2)×(N/2); E_w^1 and E_w^2 are the matrices obtained by horizontally halving the E_w feature matrix, each of dimension (N/2)×(3P²+2); W^Q_mid-h, W^K_mid-h and W^V_mid-h denote weight matrices of dimension (N/2)×(N/2); W^Q_mid-w, W^K_mid-w and W^V_mid-w denote weight matrices of dimension (3P²+2)×(3P²+2); and the dimensions of W^O_mid-w and W^O_mid-h remain the same as in the Up-Layer.
For the Down-Layer, MBMHA_down-w and MBMHA_down-h are calculated as follows:

Q_down-w^i = E_w^i W^Q_down-w, K_down-w^i = E_w^i W^K_down-w, V_down-w^i = E_w^i W^V_down-w, i = 1, 2, 3

Q_down-h^i = E_h^i W^Q_down-h, K_down-h^i = E_h^i W^K_down-h, V_down-h^i = E_h^i W^V_down-h, i = 1, 2, 3

head_m^i = Attention(Q^i, K^i, V^i)

MBMHA_down-w = Concat(head_1^1, …, head_M^1, head_1^2, …, head_M^2, head_1^3, …, head_M^3) W^O_down-w

MBMHA_down-h = W^O_down-h Concat(head_1^1, …, head_M^1, head_1^2, …, head_M^2, head_1^3, …, head_M^3)

where E_h^1, E_h^2 and E_h^3 are the matrices obtained by horizontally trisecting the E_h feature matrix, each of dimension (3P²+2)×(N/3); E_w^1, E_w^2 and E_w^3 are the matrices obtained by horizontally trisecting the E_w feature matrix, each of dimension (N/3)×(3P²+2); W^Q_down-h, W^K_down-h and W^V_down-h denote weight matrices of dimension (N/3)×(N/3); W^Q_down-w, W^K_down-w and W^V_down-w denote weight matrices of dimension (3P²+2)×(3P²+2); and the dimensions of W^O_down-w and W^O_down-h remain the same as in the Up-Layer.
In conclusion, MBMHA_w and MBMHA_h are the two final groups of output feature matrices: MBMHA_w comprises three feature matrices of dimension N×(3P²+2), and MBMHA_h comprises three feature matrices of dimension (3P²+2)×N.
In this example, for the Up-Layer, the horizontal embedding feature matrix input to each head has dimension 162×770 and the vertical embedding feature matrix has dimension 770×162. The 162×770 horizontal embedding matrix is mapped to Q_up-w, K_up-w and V_up-w, and the 770×162 vertical embedding matrix is mapped to Q_up-h, K_up-h and V_up-h, where W^Q_up-w, W^K_up-w and W^V_up-w have dimension 770×770 and W^Q_up-h, W^K_up-h and W^V_up-h have dimension 162×162:

Q_up-w = E_w W^Q_up-w, Q_up-h = E_h W^Q_up-h

K_up-w = E_w W^K_up-w, K_up-h = E_h W^K_up-h

V_up-w = E_w W^V_up-w, V_up-h = E_h W^V_up-h
The calculated Q_up-w, K_up-w and V_up-w have dimension 162×770, and Q_up-h, K_up-h and V_up-h have dimension 770×162. Next, the self-attention matrix Attention(Q, K, V) of a single head is calculated as follows:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

where d is a constant, taken as 64 in this example.
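A minimal sketch of this per-head computation, assuming PyTorch; the random weight matrices are for illustration only.

```python
import torch

def attention(Q, K, V, d=64):
    """softmax(Q K^T / sqrt(d)) V, as in the formula above."""
    scores = Q @ K.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V

E_w = torch.randn(162, 770)                       # horizontal embedding matrix
W_q, W_k, W_v = (torch.randn(770, 770) for _ in range(3))
Q, K, V = E_w @ W_q, E_w @ W_k, E_w @ W_v         # each 162 x 770
print(attention(Q, K, V).shape)                   # torch.Size([162, 770])
```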
The horizontal attention Attention(Q, K, V)_up-w has dimension 162×770 and the vertical attention Attention(Q, K, V)_up-h has dimension 770×162, and the per-head results are concatenated. Concatenating the horizontal feature matrices yields a 162×9240 feature matrix, which is multiplied by the 9240×770 matrix W^O_up-w to yield a 162×770 matrix. Concatenating the vertical feature matrices yields a 9240×162 feature matrix, which is multiplied by the 770×9240 matrix W^O_up-h to yield a 770×162 matrix. The horizontal and vertical feature matrices output by the Up-Layer are thus obtained.
The feature vectors of the Mid-Layer and Down-Layer branches in the dual structure are extracted in the same way, yielding two groups of feature matrices: one group of three horizontal matrices of dimension 162×770 and one group of three vertical matrices of dimension 770×162, each group containing one global feature and two local features.
It can be understood that the dual Transformer network structure extracts features from the pedestrian image in the vertical and horizontal directions respectively, strengthening the connection between pixels of the pedestrian image in both directions and improving the accuracy of pedestrian re-recognition. The multi-branch multi-head attention module extracts pedestrian features at different scales, and fusing these features greatly improves the performance of the model. In the embodiment of the invention, the computation above yields two groups of global and local feature matrices of different scales, which are fused into two fusion feature matrices: the global feature matrix and local feature matrices of the vertical branch are stitched and fused to output a vertical fusion feature matrix, and the global feature matrix and local feature matrices of the horizontal branch are stitched and fused to output a horizontal fusion feature matrix. The horizontal fusion feature matrix is projected with a horizontal weight matrix to obtain the output feature matrix of the horizontal multi-scale Transformer structure, and the vertical fusion feature matrix is projected with a vertical weight matrix to obtain the output feature matrix of the vertical multi-scale Transformer structure.
Taking the matrix dimensions of the above embodiment as an example, stitching yields a 162×2310 horizontal fusion feature matrix and a 2310×162 vertical fusion feature matrix. The 162×2310 horizontal fusion matrix is multiplied by a 2310×770 weight matrix to obtain the final output matrix of the horizontal multi-scale Transformer structure, of dimension 162×770. A 770×2310 weight matrix is multiplied by the 2310×162 vertical fusion matrix to obtain the final output matrix of the vertical multi-scale Transformer structure, of dimension 770×162. Finally, the two output matrices of the dual Transformer structure are fused to obtain the output matrix of the dual multi-scale Transformer layer: in this example, the 162×770 horizontal matrix is multiplied by the 770×162 vertical matrix to obtain a 162×162 feature matrix, which is then multiplied by a 162×770 weight matrix W to obtain the final output matrix of dimension 162×770.
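The following sketch reproduces these fusion dimensions, assuming PyTorch; the weight names W_o_w, W_o_h and W_f are illustrative.

```python
import torch

h_branches = [torch.randn(162, 770) for _ in range(3)]  # global + two local (horizontal)
v_branches = [torch.randn(770, 162) for _ in range(3)]  # global + two local (vertical)

H = torch.cat(h_branches, dim=1)   # 162 x 2310 horizontal fusion feature matrix
V = torch.cat(v_branches, dim=0)   # 2310 x 162 vertical fusion feature matrix

W_o_w = torch.randn(2310, 770)     # horizontal projection weight
W_o_h = torch.randn(770, 2310)     # vertical projection weight
H_out = H @ W_o_w                  # 162 x 770
V_out = W_o_h @ V                  # 770 x 162

W_f = torch.randn(162, 770)        # final projection weight
layer_out = (H_out @ V_out) @ W_f  # (162 x 162) @ (162 x 770) -> 162 x 770
print(layer_out.shape)
```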
In the embodiment of the invention, after processing by the dual multi-scale Transformer structure, pedestrian features are extracted by the L−1 Origin Transformer layers, and the fully-connected layer is then used to predict from the pedestrian features the recognition result of the target pedestrian image; the Origin Transformer layers and the fully-connected layer are the same as in a conventional Transformer network model and are not described again here.
In the embodiment of the invention, the Transformer network model can be trained in advance on training-set images; once training is complete, it can be used for recognition prediction on target pedestrian images.
In the embodiment of the invention, training the Transformer network model means training the whole pipeline and the model with a loss function to obtain trained model parameters. Finally, the pedestrian image to be recognized is input into the network model to obtain the recognition result.
In an embodiment of the present invention, the pre-training process of the improved Transformer network model includes:
acquiring a training pedestrian image, preprocessing the training pedestrian image, and generating a standard pedestrian image; the training pedestrian image is used for training the improved Transformer network model;
dividing a standard pedestrian image corresponding to the training pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image corresponding to the training pedestrian image to obtain a vertical feature matrix and a horizontal feature matrix;
inputting the horizontal feature matrix and the vertical feature matrix corresponding to the training pedestrian image into the improved Transformer network model, and training the parameters of the network model with a loss function;
and when the improved Transformer network model reaches the preset number of iterations or converges, training of the improved Transformer network model is complete.
The pedestrian features obtained from the training-set images through steps 101-104 can be compared with the real pedestrian features of the training-set images for training; that is, the whole pipeline and the model are trained iteratively with a loss function, which in this embodiment comprises an identity loss and a triplet loss.
The identity loss L_id is formulated as follows:

L_id = −∑_{i=1}^{c} y_i log(p_i)

where y denotes the label of the image, p_i denotes the predicted probability of identity i, and c denotes the number of identities.
The triplet loss L_t is formulated as follows:

L_t = [d(x_a, x_p) − d(x_a, x_n) + m]_+

where x_a is the reference (anchor) sample, x_p is a positive sample, x_n is a negative sample, d is a distance function, m is the triplet margin, and [z]_+ is equivalent to max(z, 0).
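A sketch of this loss, assuming PyTorch; the Euclidean distance for d(·,·) and the margin value m = 0.3 are assumptions.

```python
import torch

def triplet_loss(x_a, x_p, x_n, m=0.3):
    """[d(x_a, x_p) - d(x_a, x_n) + m]_+ with Euclidean distance (assumed)."""
    d_ap = torch.norm(x_a - x_p, dim=-1)
    d_an = torch.norm(x_a - x_n, dim=-1)
    return torch.clamp(d_ap - d_an + m, min=0.0).mean()

a, p, n = torch.randn(8, 770), torch.randn(8, 770), torch.randn(8, 770)
print(triplet_loss(a, p, n))
```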
Combining the global features with the local features, the overall loss function L is formulated as follows:

L = ∑_k (L_id^(k) + L_t^(k))

where k ranges over the global feature and the local features.
Training proceeds for a number of epochs until the loss value is essentially unchanged. In this example, 64 epochs of training were performed, with the initial learning rate set to 0.000125, SGD (stochastic gradient descent) as the optimization strategy, 4 learning-rate warm-up epochs specified, and a weight decay of 1e-7. Learning-rate warm-up is applied over the first 16 epochs, after which the learning rate becomes stable; the remaining 48 epochs train the model at a relatively stable rate. The loss value remains essentially unchanged after epoch 60, the model converges, and the parameters of the network model are finally obtained.
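A sketch of this schedule, assuming a PyTorch SGD optimizer; the linear warm-up shape is an assumption (the source gives both 4 warm-up epochs and stability after epoch 16), and the model is a stand-in.

```python
import torch

model = torch.nn.Linear(770, 751)   # stand-in for the improved Transformer network
optimizer = torch.optim.SGD(model.parameters(), lr=0.000125, weight_decay=1e-7)
base_lr, warmup_epochs, total_epochs = 0.000125, 16, 64

for epoch in range(total_epochs):
    scale = min(1.0, (epoch + 1) / warmup_epochs)   # linear warm-up, then constant
    for group in optimizer.param_groups:
        group["lr"] = base_lr * scale
    # ... one pass over the training set, minimizing the overall loss L, goes here ...
```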
In this example, the relevant parameters of the trained Transformer network model can be stored offline, and the network model can then be used to recognize target pedestrian images; the test results of the network model of the invention are shown in Table 1.
TABLE 1 Recognition test results of the Transformer network model
When querying a pedestrian image, the query image is simply fed into the network to extract its features; the distances between the query image and all images in the test set can then be quickly computed and sorted in ascending order, and the ranking result serves as the output of the pedestrian re-identification model.
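A sketch of this query step, assuming PyTorch and Euclidean distance; extract_feature stands in for a forward pass through the trained network.

```python
import torch

def rank_gallery(query_feat, gallery_feats):
    """Return gallery indices sorted by ascending Euclidean distance."""
    dists = torch.norm(gallery_feats - query_feat, dim=1)
    return torch.argsort(dists)

query_feat = torch.randn(770)              # extract_feature(query_image) in practice
gallery_feats = torch.randn(13291, 770)    # one feature per test-set image
print(rank_gallery(query_feat, gallery_feats)[:10])   # top-10 matches
```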
The embodiment of the invention also provides a pedestrian re-identification device based on the Transformer network model, which comprises:
the image acquisition module is used for acquiring a target pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
the image preprocessing module is used for preprocessing the target pedestrian image to generate a standard pedestrian image;
the image segmentation module is used for dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
the image mapping module is used for carrying out horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and the image recognition module is used for inputting the horizontal feature matrix and the vertical feature matrix into a pre-trained improved transducer network model, and predicting to obtain a recognition result of the target pedestrian image.
The embodiment of the invention also provides computer equipment, which comprises a processor and a memory, wherein the processor and the memory are connected with each other, the memory is used for storing a computer program, the computer program comprises program instructions, and the processor is used for calling the program instructions to execute the pedestrian re-identification method based on the Transformer network model.
The embodiment of the invention provides a processing device, which comprises a processor and a storage device; a processor adapted to execute each program; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement a pedestrian re-recognition method based on a Transformer network model as described above.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program to instruct related hardware, the program may be stored in a computer readable storage medium, and the storage medium may include: ROM, RAM, magnetic or optical disks, etc.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A pedestrian re-recognition method based on a Transformer network model, the method comprising:
acquiring a target pedestrian image, preprocessing the target pedestrian image, and generating a standard pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image to obtain a vertical feature matrix and a horizontal feature matrix;
and inputting the vertical feature matrix and the horizontal feature matrix into a pre-trained improved Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
2. The pedestrian re-recognition method based on a Transformer network model according to claim 1, wherein the sub-image division process of the standard pedestrian image comprises: dividing the standard pedestrian image into a plurality of non-overlapping group sub-images; and dividing each group sub-image into a plurality of square sub-images with overlapping portions using a sliding window.
3. The pedestrian re-recognition method based on a Transformer network model according to claim 2, wherein each standard pedestrian image is divided into N square sub-images with overlapping portions, where N is calculated as follows:

N = K1 × K2 × (⌊(H/K1 − P)/S⌋ + 1) × (⌊(W/K2 − P)/S⌋ + 1)

wherein H is the height of the standard pedestrian image, W is the width of the standard pedestrian image, S is the step length, P is the height and width of the square sub-images, K denotes the total number of group sub-images, and K = K1 × K2; K1 denotes the number of vertically divided sub-images and K2 denotes the number of horizontally divided sub-images.
4. The pedestrian re-recognition method based on a Transformer network model according to claim 1, wherein obtaining the horizontal feature matrix and the vertical feature matrix from the square sub-images through horizontal linear projection and vertical linear projection comprises: linearly projecting each square sub-image in a horizontal expansion mode to obtain a horizontal expansion vector of each sub-image, and stitching the N horizontal expansion vectors in the vertical direction to obtain a horizontal feature matrix of dimension N×3P²; and linearly projecting each square sub-image in a vertical expansion mode to obtain a vertical expansion vector of each sub-image, and stitching the N vertical expansion vectors in the horizontal direction to obtain a vertical feature matrix of dimension 3P²×N, where P is the height and width of the square sub-image.
5. The pedestrian re-recognition method based on a Transformer network model of claim 1, wherein the improved Transformer network model comprises a dual multi-scale Transformer structure, L−1 Origin Transformer layers and a fully-connected layer; the dual multi-scale Transformer structure is used for calculating global feature matrices and local feature matrices of different scales from the horizontal feature matrix and the vertical feature matrix, and performing feature fusion on the global feature matrices and the local feature matrices; the L−1 Origin Transformer layers are used for processing the fused feature matrix and extracting pedestrian features; and the fully-connected layer is used for processing the pedestrian features and predicting the recognition result of the target pedestrian image.
6. The pedestrian re-recognition method based on a Transformer network model according to claim 5, wherein the dual multi-scale Transformer structure comprises two multi-scale Transformer structures, namely a horizontal multi-scale Transformer structure and a vertical multi-scale Transformer structure; the horizontal multi-scale Transformer structure computes on the horizontal feature matrix, and the vertical multi-scale Transformer structure computes on the vertical feature matrix; each multi-scale Transformer structure comprises a multi-branch multi-head attention module based on a top-layer branch, a middle-layer branch and a bottom-layer branch, wherein the top-layer branch does not process the input feature matrix and directly extracts global pedestrian features, the middle-layer branch halves the input feature matrix before extracting local pedestrian features, and the bottom-layer branch trisects the input feature matrix before extracting local pedestrian features; the global features of the horizontal multi-scale Transformer structure and the local features of its two corresponding branches are fused and stitched to output a horizontal fusion feature matrix; the global features of the vertical multi-scale Transformer structure and the local features of its two corresponding branches are fused and stitched to output a vertical fusion feature matrix; the horizontal fusion feature matrix is projected with a horizontal weight matrix to obtain the feature matrix of the horizontal multi-scale Transformer structure; the vertical fusion feature matrix is projected with a vertical weight matrix to obtain the feature matrix of the vertical multi-scale Transformer structure; and the horizontal feature matrix and the vertical feature matrix are then feature-fused by weight-matrix projection to obtain the feature matrix of the first-layer dual multi-scale Transformer structure.
7. The pedestrian re-recognition method based on a Transformer network model of claim 6, wherein position and grouping information is added to the horizontal feature matrix before the horizontal feature matrix is calculated, and position and grouping information is added to the vertical feature matrix before the vertical feature matrix is calculated.
8. The pedestrian re-recognition method based on a Transformer network model according to claim 1, wherein the pre-training process of the improved Transformer network model comprises:
acquiring a training pedestrian image, preprocessing the training pedestrian image, and generating a standard pedestrian image; the training pedestrian image is used for training the improved Transformer network model;
dividing the standard pedestrian image corresponding to the training pedestrian image into a plurality of square sub-images with overlapping portions using a sliding window;
performing horizontal linear projection and vertical linear projection on each square sub-image corresponding to the training pedestrian image to obtain a vertical feature matrix and a horizontal feature matrix;
inputting the horizontal feature matrix and the vertical feature matrix corresponding to the training pedestrian image into the improved Transformer network model, and training the parameters of the network model with a loss function;
and when the improved Transformer network model reaches the preset number of iterations or converges, training of the improved Transformer network model is complete.
9. A pedestrian re-recognition device based on a Transformer network model, the device comprising:
the image acquisition module is used for acquiring a target pedestrian image; the target pedestrian image is a pedestrian image to be subjected to pedestrian re-recognition;
the image preprocessing module is used for preprocessing the target pedestrian image to generate a standard pedestrian image;
the image segmentation module is used for dividing the standard pedestrian image into a plurality of square sub-images with overlapping parts by adopting a sliding window;
the image mapping module is used for carrying out horizontal linear projection and vertical linear projection on each square sub-image to obtain a horizontal feature matrix and a vertical feature matrix;
and the image recognition module is used for inputting the horizontal feature matrix and the vertical feature matrix into a pre-trained improved Transformer network model, and predicting to obtain the recognition result of the target pedestrian image.
10. A computer device, characterized in that it comprises a processor and a memory, which are connected to each other, wherein the memory is adapted to store a computer program, which computer program comprises program instructions, which processor is adapted to invoke the program instructions to perform a pedestrian re-recognition method based on a Transformer network model according to any of the claims 1-8.
CN202310351000.5A 2023-04-04 2023-04-04 Pedestrian re-recognition method and device based on Transformer network model and computer equipment Pending CN116311379A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310351000.5A CN116311379A (en) 2023-04-04 2023-04-04 Pedestrian re-recognition method and device based on Transformer network model and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310351000.5A CN116311379A (en) 2023-04-04 2023-04-04 Pedestrian re-recognition method and device based on Transformer network model and computer equipment

Publications (1)

Publication Number Publication Date
CN116311379A true CN116311379A (en) 2023-06-23

Family

ID=86779833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310351000.5A Pending CN116311379A (en) 2023-04-04 2023-04-04 Pedestrian re-recognition method and device based on Transformer network model and computer equipment

Country Status (1)

Country Link
CN (1) CN116311379A (en)

Similar Documents

Publication Publication Date Title
JP2023003026A (en) Method for identifying rural village area classified garbage based on deep learning
CN111259850A (en) Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN110210551A (en) A kind of visual target tracking method based on adaptive main body sensitivity
CN111046821B (en) Video behavior recognition method and system and electronic equipment
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN111507370A (en) Method and device for obtaining sample image of inspection label in automatic labeling image
CN111368690A (en) Deep learning-based video image ship detection method and system under influence of sea waves
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN111476806A (en) Image processing method, image processing device, computer equipment and storage medium
CN114092487A (en) Target fruit instance segmentation method and system
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN111582091A (en) Pedestrian identification method based on multi-branch convolutional neural network
CN111582154A (en) Pedestrian re-identification method based on multitask skeleton posture division component
CN114764870A (en) Object positioning model processing method, object positioning device and computer equipment
CN114332942A (en) Night infrared pedestrian detection method and system based on improved YOLOv3
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN113569672A (en) Lightweight target detection and fault identification method, device and system
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN116311379A (en) Pedestrian re-recognition method and device based on Transformer network model and computer equipment
CN111160219B (en) Object integrity evaluation method and device, electronic equipment and storage medium
CN115131503A (en) Health monitoring method and system for iris three-dimensional recognition
CN114820723A (en) Online multi-target tracking method based on joint detection and association
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination