CN110765841A - Group pedestrian re-identification system and terminal based on mixed attention mechanism - Google Patents


Info

Publication number
CN110765841A
CN110765841A (application CN201910827177.1A)
Authority
CN
China
Prior art keywords
attention
module
feature
mixed
group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910827177.1A
Other languages
Chinese (zh)
Inventor
杨华
阳兵
许琪羚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
China Electronics Technology Group Corp CETC
Electronic Science Research Institute of CETC
Original Assignee
Shanghai Jiaotong University
China Electronics Technology Group Corp CETC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University, China Electronics Technology Group Corp CETC filed Critical Shanghai Jiaotong University
Priority to CN201910827177.1A
Publication of CN110765841A
Legal status: Pending

Classifications

    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands (under G: Physics; G06: Computing; G06V: Image or video recognition or understanding)
    • G06N 3/045: Combinations of networks (under G06N: Computing arrangements based on specific computational models; G06N 3/04: Neural network architecture)
    • G06N 3/08: Learning methods (under G06N 3/02: Neural networks)
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames (under G06V 20/40: Scene-specific elements in video content)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a group pedestrian re-identification system based on a mixed attention mechanism, comprising a backbone model module, a mixed attention model module, a feature extraction module, a feature fitting module and a feature evaluation module. Preliminary feature extraction is performed on group images with a deep convolutional neural network backbone model; the preliminarily extracted features are further refined with a mixed attention mechanism model; and the features processed by the mixed attention model are fitted and evaluated using a least-squares residual distance. A terminal for running the system is also provided. By using a mixed attention model that combines spatial attention and channel attention, the network is made to attend more closely to the key regions and features of group images, and a novel least-squares residual distance based on the least-squares algorithm better learns the metric between pairs of group images. The method adapts well to the varied challenges posed by group pedestrian images and is broadly applicable.

Description

Group pedestrian re-identification system and terminal based on mixed attention mechanism
Technical Field
The invention belongs to the technical field of computer vision, and in particular relates to a group pedestrian re-identification system and terminal based on a mixed attention mechanism, concerning pedestrian re-identification focused on groups under non-overlapping surveillance cameras.
Background
In recent years, public safety has received more and more attention, and concern for safety problems has risen to a whole new level: urban video surveillance networks have been improved, surveillance has been extended to every corner of cities, and both the quality and the quantity of surveillance data have greatly increased. Across the surveillance network, pedestrians are among the key objects of interest, given the flexibility and variability of pedestrian activity and the importance of monitoring pedestrians for protecting personal safety and for criminal investigation and case solving; video surveillance is therefore deployed wherever people frequently come and go.
At present, pedestrian surveillance mainly relies on manual monitoring of real-time camera feeds. This process requires large amounts of manpower and material resources, and the volume of video data produced by surveillance is so enormous that manual monitoring cannot fully cover it, so important information and data are easily missed. Under these circumstances, pedestrian re-identification algorithms based on deep learning and artificial intelligence become increasingly important. A pedestrian re-identification algorithm can identify the same pedestrian across different, non-overlapping surveillance cameras. The rapid development of pedestrian re-identification can bring a large accuracy improvement to video surveillance systems while saving substantial manpower and material resources, and is therefore of significant research interest.
Cross-camera pedestrian re-identification plays an important role in solving key problems in video surveillance. It is a key technology in surveillance scenarios: clear face images are scarce in surveillance networks, and the resolution of surveillance images or videos of pedestrians often cannot meet the requirements of face recognition, so face recognition is difficult to apply to footage from cameras with very low resolution or positioned far from pedestrians. Moreover, in practice a deployment of surveillance cameras leaves many blind areas that cannot be monitored; once a pedestrian leaves a camera's coverage and enters such a blind area, conventional target tracking no longer applies, and pedestrian re-identification becomes the key to cross-camera tracking.
Existing research pays close attention to single-person pedestrian re-identification but neglects the important role of group-based re-identification in pedestrian matching tasks. Group-based pedestrian re-identification can solve problems that single-person re-identification cannot: pedestrian behavior is often accompanied by group information, group behavior is very common, and the information provided by the joint movement of a group is highly valuable. For example, when a single pedestrian is occluded over a large area for a long time by other pedestrians in the group, single-person re-identification cannot achieve a match, whereas re-identification from the group perspective can solve this problem.
Group re-identification plays an important role in public safety and surveillance systems. Its goal is to identify the same group in different, non-overlapping camera views. Most existing pedestrian re-identification research is based on single persons, yet in real situations group behavior is not negligible: groups are very common and provide a large amount of useful information. Moreover, image-based group matching methods can be extended to video sequences; a video sequence carries more information than a single image, and integrating the analysis of spatio-temporal information can further improve system performance.
There are relatively few studies on group pedestrian re-identification, and each has its limitations. Some work studies group identification using information such as position, trajectory, speed and direction (see Ukita, Norimichi, Yusuke Moriguchi, and Norihiro Hagita, "People re-identification across non-overlapping cameras using group features," Computer Vision and Image Understanding 144 (2016): 228-236). Some work uses sparse feature coding and unsupervised transfer learning to transfer a sparse coding dictionary learned for single-person re-identification to the group problem (see Lisanti, Giuseppe, et al., "Group re-identification via unsupervised transfer of sparse features encoding," Proceedings of the IEEE International Conference on Computer Vision, 2017), but this method requires the outputs of DPM, ACF and R-CNN simultaneously, so preprocessing is costly. Some work extracts subsets of the group and matches them iteratively with a multi-grain matching algorithm, which fully exploits group features and handles membership changes within the group (see Xiao, Hao, et al., "Group Re-Identification: Leveraging and Integrating Multi-Grain Information," 2018 ACM Multimedia Conference, ACM, 2018), but the matching process is very time-consuming and unsuitable for large datasets and real-world scenarios.
Given the requirements of today's public safety systems, cross-camera group re-identification research plays an important role in the accuracy of surveillance systems and in achieving a high recall rate for target pedestrians. The problem of cross-camera group re-identification is therefore studied extensively in the present invention.
Disclosure of Invention
In view of the foregoing problems in the prior art, an object of the present invention is to provide a group pedestrian re-identification system and terminal based on a mixed attention mechanism. The system and terminal exploit the strengths of existing deep learning methods: group image features are extracted by deep learning and fused with a mixed attention mechanism, so that more discriminative features are extracted and greater weight is applied to key regions. The framework's attention thus shifts from the entire group image, background included, to the more discriminative key information among the group's pedestrians, improving the performance of group pedestrian re-identification under surveillance cameras.
Group behavior can provide more information for pedestrian re-identification and is of significant research value for further improving re-identification results. The invention aims at group-based re-identification: target groups can be compared and retrieved across cameras in a video surveillance system. In the target retrieval process, deep learning methods based on group re-identification and pedestrian re-identification are far superior to manual search. As the coverage of video surveillance keeps expanding, intelligent video surveillance technology performs far better than traditional manual monitoring of the network. Effective matching of target pedestrians through intelligent video surveillance can markedly improve social security and management efficiency, and research on this problem has very important practical significance for security and criminal investigation.
Multiple forms of attention, such as spatial attention and channel attention, allow the network to focus on discriminative group features. The invention designs a deep neural network framework for cross-camera group re-identification and, to address spatial variation within a group, designs an attention-model-based feature extraction algorithm that fuses global and local features to perform discriminative and efficient feature extraction and representation of groups in surveillance video. Meanwhile, considering that re-identification based on a single frame is limited to the static visual features of pedestrians and cannot handle practical challenges such as deformation and occlusion under different camera viewing angles, the invention further exploits the spatial and temporal features of pedestrians in video: on top of image feature extraction, a network based on video sequences that extracts spatial and temporal features simultaneously is designed, enabling comparison and correlation analysis of key pedestrian targets across the spatio-temporal domain.
In surveillance video, a target pedestrian is often occluded over a large area, and an occluded target is difficult to retrieve through plain pedestrian re-identification. The greatest advantage of group-based re-identification is therefore its ability to judge the re-identification of pedestrians with insufficient information by means of the group's matching information. In single-person re-identification, if a person is continuously occluded by others in the group under a certain camera's viewing angle, that person is difficult to re-identify in that camera; group-based re-identification can solve such cases. The key to matching such images is to fully exploit the image information of the pedestrians in the group who are fully visible, making them the anchors of image matching: the group is matched by matching these anchors while supplementing the information of the occluded pedestrians. Group re-identification can thus compensate for this shortcoming of single-person re-identification. Building on the group-comparison results of this research, they can further be applied to single-person re-identification. When pedestrian re-identification and group re-identification are applied together, the additional effective information provided by group comparison can improve the accuracy of pedestrian re-identification.
the attention mechanism in deep learning is substantially similar to the human selective visual attention mechanism. The core goal is to select more critical information from the multitude of information that is useful for the current task goal. Aiming at various challenges existing in group re-identification, a plurality of attention mechanisms including channel attention and space attention are adopted, and a feature vector with higher resolution can be obtained by extracting network features through a mixed attention model. The image of the group has more background information than the image of the individual pedestrian. Therefore, the local features of the pedestrian part are extracted by designing a feature extraction algorithm based on the attention model and fusing the global features and the local features, the extracted features are more focused on the pedestrian part, and the background part is given less weight. More discriminative feature vectors can be extracted using an attention mechanism. The spatial attention is attention corresponding to a position different from the picture in two dimensions of width and height of the image. In the traditional pedestrian comparison process, mismatching is easy to occur for different pedestrians with similar appearances. Under the condition, the spatial attention mechanism can better pay attention to local features with discrimination, and further improves the accuracy of overall pedestrian re-identification.
The invention is realized by the following technical scheme.
According to one aspect of the invention, a group pedestrian re-identification system based on a mixed attention mechanism is provided, comprising:
a backbone model module: forming the backbone feature extraction network P of the group pedestrian re-identification task based on a deep convolutional neural network, applying the network P, trained on image pairs, over the whole group pedestrian re-identification dataset, and generating a feature vector E for each image s of group pedestrians through the backbone feature extraction network P;
a mixed attention model module: on the basis of the backbone model feature extraction network P, adding a mixed attention mechanism network H, and further extracting the preliminarily extracted features; the mixed attention mechanism network H comprises a channel attention module C and a space attention module S, wherein the channel attention module C and the space attention module S respectively act on the features of the feature vector E in different dimensions;
a feature extraction module: capturing the global dependencies of the feature vector E through the mixed attention mechanism network H; each input feature vector E is processed by the channel attention module C and the spatial attention module S in the mixed attention network H to obtain a channel attention parameter w1 and a spatial attention parameter w2; the attention parameters w1 and w2 are combined into the overall attention weight w of the feature vector E; the feature vector E is multiplied by the attention weight w to obtain features F that attend more closely to the key regions of the group pedestrian images;
a feature fitting module: matching the extracted image features F against the features of the detection target; in the distance-measurement stage a least-squares residual distance is adopted, in which the features of the detection target and of the matching object, after extraction by the backbone feature extraction network P and the mixed attention network H, are Y and X respectively; a polynomial fitting function

f̂(X) = AW

is learned to approximate the feature Y of the true detection target, where A is the matrix obtained by expanding the matching-object feature X and W is the vector of coefficients of the polynomial fitting function; a regularization term is added to the polynomial fitting function, and the regularized function is solved by the least-squares method to find the optimal coefficients W of the polynomial fitting function;

a feature evaluation module: the function

r(X, Y) = ||Y - f̂(X)||

formed from the difference between the fitting result of the polynomial fitting function f̂(X) and the feature Y extracted by the backbone feature extraction network P is used as the distance for comparing and evaluating the features X extracted by the mixed attention network H.
Preferably, in the backbone model module, the backbone model feature extraction network P mainly consists of a deep convolutional neural network, and outputs the preliminary features of each image s of the group of pedestrians.
Preferably, in the feature extraction module:
for the channel attention module C, a global pooling operation is first applied to the feature vector:

g_k = (1/(h·w)) Σ_{i=1..h} Σ_{j=1..w} X_k(i, j),  k = 1, …, c   (1)

where X denotes the features extracted by the backbone feature extraction network P, and h, w, c respectively denote the lengths of the input to the attention module C in its three dimensions; after global pooling, the feature is reduced to a vector whose length equals the number of channels and which represents the attention weights of the different feature channels; this vector is then passed through two fully connected layers;

for the spatial attention module S, the feature vector is first pooled along the channel dimension, and a convolution layer with kernel size 1×1 and stride 1 learns the network's attention weights over the height and width dimensions; finally, the results of the spatial and channel attention are applied to the original features as weights, yielding more discriminative features.
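The channel and spatial attention branches described above can be sketched with NumPy. This is a minimal illustrative model under stated assumptions, not the patent's exact network: the hidden width `r` of the two fully connected layers, the ReLU between them, the sigmoid gating, and a single-scalar 1×1 kernel for the spatial branch are simplifications chosen for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(X, W1, W2):
    """Global-average-pool each channel (Eq. (1)), then two FC layers.
    X has shape (c, h, w); returns per-channel weights of shape (c, 1, 1)."""
    g = X.mean(axis=(1, 2))                   # (c,): pooled over h and w
    w1 = sigmoid(W2 @ np.maximum(W1 @ g, 0))  # FC -> ReLU -> FC -> sigmoid
    return w1.reshape(-1, 1, 1)

def spatial_attention(X, k):
    """Pool along the channel dimension, then weight each (i, j) position."""
    p = X.mean(axis=0)                        # (h, w): channel pooling
    return sigmoid(k * p)[None, :, :]         # a 1x1 "conv" with scalar kernel k

def mixed_attention(X, W1, W2, k):
    w = channel_attention(X, W1, W2) * spatial_attention(X, k)  # (c, h, w)
    return X * w                              # feature F = E weighted by w

rng = np.random.default_rng(0)
c, h, w, r = 8, 4, 4, 2
X = rng.normal(size=(c, h, w))
F = mixed_attention(X, rng.normal(size=(r, c)), rng.normal(size=(c, r)), 0.5)
print(F.shape)  # same shape as X, re-weighted toward salient channels and positions
```

Since each weight lies in (0, 1), the output never amplifies a feature; it only attenuates less relevant channels and positions. A trained network would learn W1, W2 and the spatial kernel jointly with the backbone.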
Preferably, in the feature extraction module, the channel attention parameter w1 has one dimension and the spatial attention parameter w2 has two; the attention parameters w1 and w2 are merged to obtain a parameter w carrying attention weights in all three dimensions h, w and c, where h, w, c respectively denote the lengths of the input to the attention module C in its three dimensions.
Preferably, in the feature extraction module, the feature F is the feature of the group image obtained after the mixed attention mechanism is added, and is obtained by multiplying the feature vector E without the mixed attention mechanism by the overall attention weight w.
Preferably, in the feature fitting module, the objective after adding the regularization term is:

min_W ||AW - Y||² + β||W||²   (3)
preferably, in the feature learning module, the feature X is (X)1,x2,x3,……,xd) The expansion into a matrix is in the form of:
Figure BDA0002189461020000062
where d is the dimension of feature X.
Preferably, in the feature fitting module, solving the regularized polynomial fitting function by the least-squares method gives the optimal coefficients W of the polynomial fitting function as:

W = (AᵀA + βI)⁻¹AᵀY   (4)
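Equation (4) can be verified numerically. The sketch below uses hypothetical toy values (feature dimension d = 6, polynomial order n = 2, β = 0.1) and a Vandermonde-style expansion for A, all of which are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, beta = 6, 2, 0.1
X = rng.normal(size=d)                    # matching-object feature
Y = rng.normal(size=d)                    # detection-target feature
A = np.vander(X, n + 1, increasing=True)  # rows [1, x_i, x_i^2], shape (d, n+1)

# Eq. (4): W = (A^T A + beta*I)^{-1} A^T Y, via a linear solve for stability
W = np.linalg.solve(A.T @ A + beta * np.eye(n + 1), A.T @ Y)

# W is the stationary point of ||AW - Y||^2 + beta*||W||^2: its gradient vanishes
grad = 2 * A.T @ (A @ W - Y) + 2 * beta * W
print(np.allclose(grad, 0))
```

Using `np.linalg.solve` avoids forming the explicit inverse; for an ill-conditioned A, an SVD-based ridge solver would be preferable.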
Preferably, in the feature evaluation module, the function is:

r(X, Y) = ||Y - AW||   (5)

and the features that have passed through the mixed attention model are compared and evaluated using this distance.
According to another aspect of the invention, a terminal is provided, comprising a memory, a processor, and a program stored on the memory and executable by the processor to implement the system provided by any of the above.
Compared with the prior art, the invention has the beneficial effects that:
1) The invention uses deep learning to extract image features and to rank and identify group identities in the group re-identification task; compared with the conventional approach in which hand-crafted features and metric learning are two completely separate stages, the deep learning method can extract more effective group image features.
2) The invention uses a mixed attention mechanism and can extract more discriminative features. To address the various challenges of group re-identification, several attention mechanisms, including channel attention and spatial attention, are adopted; extracting network features with the mixed attention model yields more discriminative feature vectors and completes the group re-identification task better.
3) Unlike traditional distance measurements, which mostly use the Euclidean or cosine distance, the invention matches pictures using the extracted features and can adopt a least-squares-based residual distance in the distance-measurement stage. The association between the features of the target image and of the image to be matched is thus learned better; using the distance between feature residuals as the distance between the features of the two pictures represents their relationship more effectively and improves re-identification accuracy.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a hybrid attention mechanism network feature extraction according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating group re-identification of challenging problems according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the least squares residual distance in accordance with an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings: the present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
The embodiment of the invention provides a group pedestrian re-identification system based on a mixed attention mechanism, which comprises the following modules:
A backbone model module: this module forms the backbone feature extraction network P of the group pedestrian re-identification task based on a deep convolutional neural network, applies the network P over the whole group pedestrian re-identification dataset, generates a feature vector E for each image s of group pedestrians through the network P, and outputs the preliminary features of the group images.
A mixed attention model module: this module adds a mixed attention network H on top of the backbone feature extraction network P to further refine the preliminarily extracted features. The mixed attention network H captures global dependencies within a single feature and makes the network attend more closely to the key regions and features of the group images. The network H is divided into a channel attention module C and a spatial attention module S, which act on different dimensions of the image features after extraction by the backbone feature extraction network P.
A feature extraction module: the mixed attention network H in the mixed attention model module captures global dependencies within a single feature; each input feature E is processed by the channel attention module C and the spatial attention module S in H to obtain attention parameters w1 and w2, representing the weights of the channel features and of the position features respectively; the attention parameters w1 and w2 are combined into the overall attention weight w of the feature vector E; the feature vector E is multiplied by the attention weight w to obtain features F that attend more closely to the key regions of the group pedestrian images;
specifically, the method comprises the following steps:
Each input feature E is processed by the channel attention module C and the spatial attention module S in the attention network H to obtain attention parameters w1 and w2, representing the weights of the channel features and of the position features respectively. For the channel attention module, a global pooling operation is first applied to the feature vector:

g_k = (1/(h·w)) Σ_{i=1..h} Σ_{j=1..w} X_k(i, j),  k = 1, …, c   (1)

where X denotes the features extracted by the backbone feature extraction network, and h, w, c respectively denote the lengths of the attention module's input in its three dimensions. After global pooling, the feature is reduced to a vector whose length equals the number of channels and which represents the attention weights of the different feature channels; this vector is then passed through two fully connected layers, an operation that increases the nonlinearity of the network and reduces the number of parameters required for training. For the spatial attention module, the feature vector is first pooled along the channel dimension, and a convolution layer with kernel size 1×1 and stride 1 learns the attention weights over the height and width dimensions. Finally, the results of the spatial and channel attention are applied to the original features as weights, yielding more discriminative features.
The channel-dimension attention parameter w1 and the position-dimension attention parameter w2 are applied to obtain the overall attention weight w of the feature E. The channel attention parameter w1 has one dimension and the spatial attention parameter w2 has two; after w1 and w2 are merged, a parameter w carrying attention weights in all three dimensions h, w and c is obtained.
The group pedestrian feature E is multiplied by the attention weight w; that is, the position and channel dimensions of the group feature are multiplied by the position-feature and channel-feature weights respectively, giving the feature F that attends to the key regions of the group image. The group-image feature F obtained after adding the mixed attention mechanism is thus the product of the feature E without the mixed attention mechanism and the attention weight w.
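The merge-and-multiply step can be illustrated with NumPy broadcasting; the shapes and weight values below are hypothetical and chosen only to make the arithmetic easy to follow:

```python
import numpy as np

c, h, w = 4, 3, 3
E = np.ones((c, h, w))                            # feature before attention
w1 = np.linspace(0.1, 0.4, c).reshape(c, 1, 1)    # channel weights: one per channel
w2 = np.full((1, h, w), 0.5)                      # spatial weights: one per position
wgt = w1 * w2                                     # merged weight over (c, h, w)
F = E * wgt                                       # feature F = E * w
print(F.shape)   # unchanged shape; each entry scaled by its channel and position weight
```

Broadcasting expands w1 across positions and w2 across channels, so the merged weight carries attention in all three dimensions h, w and c, exactly as described for the parameter w.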
A feature fitting module: the extracted image features are matched against the features of the detection target, using the least-squares residual distance in the distance-measurement stage. The features of the detection target and of the matching object, after extraction by the backbone model and the mixed attention model, are Y and X respectively. A polynomial fitting function

f̂(X) = AW

is learned to approximate the feature Y of the true detection target; a regularization term is added to the polynomial fitting function, and the regularized function is solved by the least-squares method to find the optimal coefficients W of the polynomial fitting function.
Specifically, the method comprises the following steps:
the features of the detection target and the matching object after extraction by the backbone model and the mixed attention model are Y and X respectively, and a polynomial fitting function Ŷ is learned to approximate the true detection-target feature Y. The polynomial fitting function is expressed as

Ŷ = AW (1)

where A is the feature X of the matching object expanded into matrix form: the feature X = (x1, x2, x3, ……, xd) is expanded as a polynomial,

A = [1, x1, x1², …, x1ⁿ; 1, x2, x2², …, x2ⁿ; …; 1, xd, xd², …, xdⁿ] (2)

where d is the dimension of the feature X and n is the order of the polynomial. The problem then becomes finding the W that minimizes ‖AW − Y‖², whose optimal solution follows from the least-squares formula W = (AᵀA)⁻¹AᵀY; W is the coefficient of the target polynomial fitting function.
a model obtained by considering only this optimal solution is very likely to overfit, and its predictions are poor. To solve this problem, a regularization term is added to the objective function; after adding the regularization term, the objective function becomes

L(W) = ‖AW − Y‖² + β‖W‖² (3)
the target fitting function is solved by the least-squares method to find the coefficient parameter W of the best-fitting function, using the equation

W = (AᵀA + βI)⁻¹AᵀY (4)
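The closed-form solve in equation (4) is ordinary ridge regression and can be illustrated with NumPy. The Vandermonde expansion of X, the polynomial order and the value of β below are assumptions for the sketch, not values fixed by the text:

```python
import numpy as np

def fit_residual_coeffs(X, Y, order=3, beta=0.1):
    """Solve W = (A^T A + beta*I)^{-1} A^T Y for the polynomial expansion A of X."""
    # Row i of A is (1, x_i, x_i^2, ..., x_i^order), as in equation (2)
    A = np.vander(X, N=order + 1, increasing=True)       # shape (d, order + 1)
    I = np.eye(order + 1)
    # Closed-form regularized least-squares solution, equation (4)
    W = np.linalg.solve(A.T @ A + beta * I, A.T @ Y)
    return A, W

rng = np.random.default_rng(1)
X = rng.standard_normal(16)                            # matching-object feature, d = 16
Y = 2.0 * X + 0.5 + 0.01 * rng.standard_normal(16)     # target feature to approximate
A, W = fit_residual_coeffs(X, Y)
Y_hat = A @ W                                          # fitted approximation of Y
```

Here the relation between X and Y is nearly linear by construction, so the cubic fit recovers Y up to the noise and the small bias introduced by β.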
a characteristic evaluation module: will be provided withFitting function results
Figure BDA0002189461020000104
Function formed by difference with real network extracted characteristic Y
Figure BDA0002189461020000101
The features that have been mixed attention models are compared and evaluated as distances. The final residual distance is
Figure BDA0002189461020000102
The features that have been mixed attention models are compared and evaluated as distances.
The ranking and retrieval of the target group are completed by comparing the distances between the detection target and the matching objects, and the accuracy of the results is evaluated with mAP and ranking metrics.
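Putting the pieces together, ranking gallery candidates by the least-squares residual distance might look like the following sketch; the feature dimension, polynomial order and β are illustrative assumptions:

```python
import numpy as np

def residual_distance(X, Y, order=3, beta=0.1):
    """d(X, Y) = ||Y - A W||, with W the regularized least-squares fit of Y from X."""
    A = np.vander(X, N=order + 1, increasing=True)
    W = np.linalg.solve(A.T @ A + beta * np.eye(order + 1), A.T @ Y)
    return float(np.linalg.norm(Y - A @ W))

rng = np.random.default_rng(2)
query = rng.standard_normal(32)                        # probe-group feature Y
same_id = query + 0.05 * rng.standard_normal(32)       # near-duplicate gallery feature
others = [rng.standard_normal(32) for _ in range(4)]   # unrelated gallery features
gallery = [same_id] + others

# Rank the gallery by residual distance to the query; the matching group,
# whose feature is almost linearly related to the query, should rank first.
ranked = sorted(range(len(gallery)),
                key=lambda i: residual_distance(gallery[i], query))
```

A gallery feature that is (approximately) a polynomial transform of the query yields a small residual, while an unrelated feature cannot be fitted well, which is what makes the residual usable as a matching distance.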
Based on the mixed attention mechanism-based group pedestrian re-identification system provided in the above embodiment of the invention, an embodiment of the invention also provides a terminal comprising a memory, a processor, and the system provided in the above embodiment, stored on the memory and executable by the processor.
The system and the terminal provided by the embodiments of the invention extract the features of group pedestrians in an image through the mixed attention model and, based on multiple forms of attention such as spatial attention and channel attention, can focus on the discriminative group features. A deep neural network framework is designed for cross-camera re-identification of groups; to address spatial variation within groups, a feature extraction algorithm fusing global and local features is designed on top of the attention model, producing discriminative and efficient feature representations of the groups in surveillance video. At the same time, the least-squares residual distance better learns the association between the features of the target image and the image to be matched; using the distance between feature residuals as the distance for comparing two images represents their relationship more effectively and improves re-identification accuracy.
Table 1 below gives a numerical comparison of the final recognition accuracy achieved by the system and the terminal provided in the above embodiment of the invention. The competing results are listed from top to bottom for numerical comparison with the result of the present embodiment. It can be seen that the accuracy of the above embodiment of the invention is improved.
TABLE 1
[Table 1 image: numerical accuracy comparison with prior methods]
Table 2 below compares the performance obtained when each of the two modules in the mixed attention model acts on the network separately with the experimental results of the full mixed attention module. Each attention component, acting on the network alone, already improves the results, and the full mixed attention model performs best.
TABLE 2
Method             R=1     R=5     R=10    mAP
Backbone model     80.7%   89.7%   92.6%   71.0%
Spatial attention  80.9%   89.1%   94.2%   73.4%
Channel attention  81.3%   89.9%   92.6%   73.8%
Mixed attention    82.7%   91.6%   94.6%   75.2%
In summary, the mixed attention mechanism-based group pedestrian re-identification system and terminal provided in the above embodiments of the invention add a mixed attention model to a convolutional neural network and adopt multiple attention mechanisms, including channel attention and spatial attention, to address the various challenges in group re-identification. Extracting network features through the mixed attention model yields more discriminative feature vectors and better completes the group re-identification task. The attention model captures the global dependence inside the features, corrects the features, and lets them participate in the training of the whole network in a more reasonable form; finally, the generality of the system and the terminal is improved.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A group pedestrian re-identification system based on a hybrid attentiveness mechanism, comprising:
a backbone model module: the backbone model module forms a backbone feature extraction network P for the group pedestrian re-identification task based on a deep convolutional neural network, trains the backbone feature extraction network P with image pairs over the whole group pedestrian re-identification data set, and generates a feature vector E for each group pedestrian image s through the backbone feature extraction network P;
a mixed attention model module: the mixed attention model module adds a mixed attention mechanism network H on top of the backbone feature extraction network P of the backbone model module and refines the preliminarily extracted features; the mixed attention mechanism network H comprises a channel attention module C and a spatial attention module S, which act on the feature vector E along different dimensions;
a feature extraction module: the feature extraction module captures the global dependence of the feature vector E through the mixed attention mechanism network H of the mixed attention model module; each input feature vector E is processed by the channel attention module C and the spatial attention module S in the mixed attention mechanism network H to obtain a channel attention parameter w1 and a spatial attention parameter w2; the attention parameters w1 and w2 are applied to obtain the overall attention weight w of the feature vector E; the feature vector E is multiplied by the attention weight w to obtain image features F that focus on the key regions of the group pedestrian images;
a feature fitting module: the feature fitting module matches the image features F extracted by the feature extraction module with the features of the detection target; wherein: the least-squares residual distance is adopted in the distance measurement stage; the features of the detection target and the matching object after extraction by the backbone feature extraction network P and the mixed attention mechanism network H are Y and X respectively, and a polynomial fitting function Ŷ is learned to approximate the true detection-target feature Y; the polynomial fitting function Ŷ is expressed as Ŷ = AW, where A is the feature X of the matching object expanded into matrix form and W is the coefficient of the polynomial fitting function; a regularization term is added to the polynomial fitting function Ŷ; the polynomial fitting function with the regularization term is solved by the least-squares method to find the coefficient W of the optimal polynomial fitting function;
a characteristic evaluation module: the feature evaluation module uses the function R = Y − Ŷ, formed by the difference between the fitting result of the polynomial fitting function Ŷ and the feature Y extracted by the backbone feature extraction network P, as the distance with which the features X extracted by the mixed attention mechanism network H are compared and evaluated.
2. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the backbone model module, a backbone model feature extraction network P mainly comprises a deep convolutional neural network and outputs the preliminary features of each image s of the group of pedestrians.
3. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: the feature extraction module is characterized in that:
for the channel attention module C, a global pooling operation is first performed on the feature vector; after global pooling, the feature becomes a vector whose length equals the number of channels, encoding the attention weights of the different feature channels, and the pooled feature vector is then fed through two fully connected layers;

for the spatial attention module S, the feature vector is first pooled along the channel dimension and passed through a fully connected layer to learn the network's attention weights over the length and width dimensions; finally, the spatial and channel attention results are applied to the original features as weights to obtain more discriminative features.
4. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the feature extraction module, the channel attention parameter w1 has one dimension and the spatial attention parameter w2 has two different dimensions; the attention parameters w1 and w2 are merged to obtain a parameter w with attention weights in the h, w and c dimensions; h, w and c respectively represent the sizes of the input vector of the attention module C along its three dimensions.
5. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the feature extraction module, the feature F is the feature of the group image obtained after the mixed attention mechanism is added, and is obtained by multiplying the feature vector E (without the mixed attention mechanism) by the overall attention weight w.
6. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the feature learning module, the polynomial fitting function after adding the regularization term is as follows:
Figure FDA0002189461010000021
7. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the feature fitting module, the feature X = (x1, x2, x3, ......, xd) is expanded into matrix form as:

A = [1, x1, x1², …, x1ⁿ; 1, x2, x2², …, x2ⁿ; …; 1, xd, xd², …, xdⁿ]

where d is the dimension of the feature X and n is the order of the polynomial.
8. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the feature fitting module, the equation used to solve the regularized polynomial fitting function by the least-squares method for the coefficient W of the optimal polynomial fitting function is:

W = (AᵀA + βI)⁻¹AᵀY (4)
9. The mixed attention mechanism-based group pedestrian re-identification system of claim 1, wherein: in the feature evaluation module, the function R = Y − Ŷ is the distance with which the features processed by the mixed attention model are compared and evaluated.
10. A terminal comprising a memory, a processor, and the system of any one of claims 1 to 9 stored on the memory and executable by the processor.
CN201910827177.1A 2019-09-03 2019-09-03 Group pedestrian re-identification system and terminal based on mixed attention mechanism Pending CN110765841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910827177.1A CN110765841A (en) 2019-09-03 2019-09-03 Group pedestrian re-identification system and terminal based on mixed attention mechanism


Publications (1)

Publication Number Publication Date
CN110765841A true CN110765841A (en) 2020-02-07

Family

ID=69330231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910827177.1A Pending CN110765841A (en) 2019-09-03 2019-09-03 Group pedestrian re-identification system and terminal based on mixed attention mechanism

Country Status (1)

Country Link
CN (1) CN110765841A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310633A (en) * 2020-02-10 2020-06-19 江南大学 Parallel space-time attention pedestrian re-identification method based on video
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN112489098A (en) * 2020-12-09 2021-03-12 福建农林大学 Image matching method based on spatial channel attention mechanism neural network
CN112733695A (en) * 2021-01-04 2021-04-30 电子科技大学 Unsupervised key frame selection method in pedestrian re-identification field
CN112733590A (en) * 2020-11-06 2021-04-30 哈尔滨理工大学 Pedestrian re-identification method based on second-order mixed attention
CN113239784A (en) * 2021-05-11 2021-08-10 广西科学院 Pedestrian re-identification system and method based on space sequence feature learning
CN113657534A (en) * 2021-08-24 2021-11-16 北京经纬恒润科技股份有限公司 Classification method and device based on attention mechanism
CN114581858A (en) * 2022-05-06 2022-06-03 中科智为科技(天津)有限公司 Method for identifying group of people with small shares and model training method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090303026A1 (en) * 2008-06-04 2009-12-10 Mando Corporation Apparatus, method for detecting critical areas and pedestrian detection apparatus using the same
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
US20190259284A1 (en) * 2018-02-20 2019-08-22 Krishna Khadloya Pedestrian detection for vehicle driving assistance
CN110188611A (en) * 2019-04-26 2019-08-30 华中科技大学 A kind of pedestrian recognition methods and system again introducing visual attention mechanism


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QILING XU 等: "Group Re-Identification with Hybrid Attention Model and Residual Distance", 《2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310633A (en) * 2020-02-10 2020-06-19 江南大学 Parallel space-time attention pedestrian re-identification method based on video
CN111310633B (en) * 2020-02-10 2023-05-05 江南大学 Parallel space-time attention pedestrian re-identification method based on video
CN112733590A (en) * 2020-11-06 2021-04-30 哈尔滨理工大学 Pedestrian re-identification method based on second-order mixed attention
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN112489098A (en) * 2020-12-09 2021-03-12 福建农林大学 Image matching method based on spatial channel attention mechanism neural network
CN112489098B (en) * 2020-12-09 2024-04-09 福建农林大学 Image matching method based on spatial channel attention mechanism neural network
CN112733695A (en) * 2021-01-04 2021-04-30 电子科技大学 Unsupervised key frame selection method in pedestrian re-identification field
CN113239784A (en) * 2021-05-11 2021-08-10 广西科学院 Pedestrian re-identification system and method based on space sequence feature learning
CN113657534A (en) * 2021-08-24 2021-11-16 北京经纬恒润科技股份有限公司 Classification method and device based on attention mechanism
CN114581858A (en) * 2022-05-06 2022-06-03 中科智为科技(天津)有限公司 Method for identifying group of people with small shares and model training method

Similar Documents

Publication Publication Date Title
CN110751018A (en) Group pedestrian re-identification method based on mixed attention mechanism
CN110765841A (en) Group pedestrian re-identification system and terminal based on mixed attention mechanism
Mou et al. A relation-augmented fully convolutional network for semantic segmentation in aerial scenes
Wen et al. Detection, tracking, and counting meets drones in crowds: A benchmark
Chen et al. Partition and reunion: A two-branch neural network for vehicle re-identification.
CN107153817B (en) Pedestrian re-identification data labeling method and device
CN108960141B (en) Pedestrian re-identification method based on enhanced deep convolutional neural network
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN111639564B (en) Video pedestrian re-identification method based on multi-attention heterogeneous network
Bedagkar-Gala et al. Multiple person re-identification using part based spatio-temporal color appearance model
Tang et al. Multi-modal metric learning for vehicle re-identification in traffic surveillance environment
Liu et al. Ktan: knowledge transfer adversarial network
CN110390308B (en) Video behavior identification method based on space-time confrontation generation network
CN112818790A (en) Pedestrian re-identification method based on attention mechanism and space geometric constraint
Czyżewski et al. Multi-stage video analysis framework
CN113822246A (en) Vehicle weight identification method based on global reference attention mechanism
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN113239885A (en) Face detection and recognition method and system
CN113269099B (en) Vehicle re-identification method under heterogeneous unmanned system based on graph matching
CN113326738B (en) Pedestrian target detection and re-identification method based on deep network and dictionary learning
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
Guo et al. Multi-modal human authentication using silhouettes, gait and rgb
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN115393788B (en) Multi-scale monitoring pedestrian re-identification method based on global information attention enhancement
Huang et al. Whole-body detection, recognition and identification at altitude and range

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200207