CN112183295A - Pedestrian re-identification method and device, computer equipment and storage medium


Info

Publication number
CN112183295A
Authority
CN
China
Prior art keywords
feature
layer
pedestrian
parallel branch
channel
Prior art date
Legal status
Pending
Application number
CN202011010031.7A
Other languages
Chinese (zh)
Inventor
戚风亮
Current Assignee
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202011010031.7A
Publication of CN112183295A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The application relates to a pedestrian re-identification method and device, computer equipment and a storage medium. The method comprises the following steps: inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series; extracting features of the target image through each parallel branch, and acquiring the feature maps corresponding to different scale information output by each parallel branch; performing feature fusion processing on the feature maps corresponding to different scale information output by each parallel branch through the feature fusion layer to obtain a fused feature map; and re-identifying the pedestrian according to the fused feature map. By adopting the method, the accuracy of the pedestrian re-identification result can be improved.

Description

Pedestrian re-identification method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a pedestrian re-identification method, apparatus, computer device, and storage medium.
Background
In recent years, with the rapid development of artificial intelligence technology, pedestrian re-identification (Re-ID) has been receiving more and more attention. Pedestrian re-identification is a technique that uses computer vision to determine whether a specific pedestrian is present in an image. In the prior art, the pedestrian re-identification task is generally completed using a target classification network.
Existing target classification networks include convolutional layers for extracting person features from an input image; the person features are used to determine whether the person in the input image is a specific person.
However, the convolutional layers in conventional target classification networks contain a large number of parameters, while the number of training samples available for training the pedestrian re-identification function is small, so a pedestrian re-identification model trained on such a target classification network is prone to overfitting.
Disclosure of Invention
In view of the above, it is necessary to provide a pedestrian re-identification method, apparatus, computer device and storage medium with fewer parameters and good network generalization performance.
A pedestrian re-identification method, the method comprising:
inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the plurality of parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series;
extracting features of the target image through each parallel branch, and acquiring feature maps corresponding to different scale information output by each parallel branch;
performing feature fusion processing on the feature maps corresponding to different scale information output by each parallel branch through the feature fusion layer to obtain a fused feature map;
and re-identifying the pedestrian according to the fused feature map.
In one embodiment, the depth-separable convolutional layer further comprises a channel attention layer, the channel attention layer comprising a plurality of channel attention units, each channel attention unit connected to one of the parallel branches, and each channel attention unit connected to the feature fusion layer; before feature fusion processing is performed on feature graphs corresponding to different scale information output by each parallel branch through a feature fusion layer, the method further comprises the following steps:
for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the channel weights of the feature maps corresponding to the information with different scales output by the parallel branches through a channel attention unit connected with the parallel branches;
correcting the feature graphs corresponding to the information with different scales output by the parallel branches according to the channel weight to obtain corrected channel feature graphs;
the feature fusion processing is carried out on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer, and the feature fusion processing comprises the following steps:
and performing feature fusion processing on each corrected channel feature map through a feature fusion layer.
In one embodiment, the depth-separable convolutional layer further comprises a spatial attention layer, the spatial attention layer comprising a plurality of spatial attention units, each spatial attention unit connected to one of the parallel branches, and each spatial attention unit connected to the feature fusion layer; before feature fusion processing is performed on feature graphs corresponding to different scale information output by each parallel branch through a feature fusion layer, the method further comprises the following steps:
for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the spatial weights of the feature maps corresponding to the information with different scales output by the parallel branches through a spatial attention unit connected with the parallel branches;
correcting the feature maps corresponding to the information with different scales output by the parallel branches according to the spatial weight to obtain a corrected spatial feature map;
the feature fusion processing is carried out on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer, and the feature fusion processing comprises the following steps:
and performing feature fusion processing on each corrected spatial feature map through a feature fusion layer.
In one embodiment, the depth-separable convolutional layer further comprises a channel attention layer and a spatial attention layer in cascade, wherein the channel attention layer comprises a plurality of channel attention units, the spatial attention layer comprises a plurality of spatial attention units, each channel attention unit is connected with one parallel branch, each spatial attention unit is connected with one channel attention unit, and each spatial attention unit is connected with the feature fusion layer; before feature fusion processing is performed on the feature maps corresponding to different scale information output by each parallel branch through the feature fusion layer, the method further comprises the following steps:
for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the channel weights of the feature maps corresponding to the information with different scales output by the parallel branches through a channel attention unit connected with the parallel branches;
correcting feature graphs corresponding to different scale information output by the parallel branches according to the channel weight to obtain a corrected first feature graph;
for each first feature map, calculating the spatial weight of each first feature map through each spatial attention unit;
correcting each first characteristic diagram according to the spatial weight to obtain a corrected second characteristic diagram;
the feature fusion processing is carried out on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer, and the feature fusion processing comprises the following steps:
and performing feature fusion processing on each corrected second feature map through the feature fusion layer.
In one embodiment, the feature fusion processing of the feature maps corresponding to the different scale information output by each parallel branch through the feature fusion layer includes:
and adding, element by element through the feature fusion layer, the feature maps corresponding to the information with different scales output by each parallel branch.
In one embodiment, the pedestrian re-identification network further includes a cascading pooling layer and a classification layer, the pooling layer is connected with the depth separable convolution layer, and the pedestrian re-identification is performed according to the fused feature map, including:
performing pooling processing on the fused feature map through the pooling layer to obtain a feature vector;
and carrying out pedestrian re-identification according to the feature vectors through a classification layer.
In one embodiment, the depth separable convolution module includes a cascaded point-by-point convolution layer, a depth convolution layer, a normalization layer, and an activation layer.
A pedestrian re-identification apparatus, the apparatus comprising:
the input module is used for inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series;
the characteristic extraction module is used for extracting the characteristics of the target image through each parallel branch and acquiring the characteristic graphs corresponding to different scale information output by each parallel branch;
the feature fusion module is used for performing feature fusion processing on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer to obtain a fusion feature graph;
and the re-identification module is used for re-identifying the pedestrian according to the fusion characteristic diagram.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the plurality of parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series;
extracting features of the target image through each parallel branch, and acquiring feature maps corresponding to different scale information output by each parallel branch;
performing feature fusion processing on the feature maps corresponding to different scale information output by each parallel branch through the feature fusion layer to obtain a fused feature map;
and re-identifying the pedestrian according to the fused feature map.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the plurality of parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series;
extracting features of the target image through each parallel branch, and acquiring feature maps corresponding to different scale information output by each parallel branch;
performing feature fusion processing on the feature maps corresponding to different scale information output by each parallel branch through the feature fusion layer to obtain a fused feature map;
and re-identifying the pedestrian according to the fused feature map.
According to the pedestrian re-identification method and device, computer equipment and storage medium, the target image is input into the pedestrian re-identification network, the pedestrian re-identification network comprises the depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series. Each parallel branch obtains a feature map of different scale information, and feature fusion processing is performed on the feature maps corresponding to the different scale information through the feature fusion layer to obtain a fused feature map. The fused feature map is more discriminative, so pedestrian re-identification performed on the basis of the fused feature map yields a more accurate result. Because the convolutional layer of the pedestrian re-identification network is a depth-separable convolutional layer, and the parameter quantity of the depth-separable convolutional layer is less than that of the existing target classification network, the goal of making the pedestrian re-identification network lightweight is achieved, and the generalization performance of the pedestrian re-identification network is improved.
Drawings
FIG. 1 is a flow diagram illustrating a pedestrian re-identification method in one embodiment;
FIG. 2 is a schematic diagram of a network structure of a deep separable convolutional layer in one embodiment;
FIG. 3 is a schematic diagram of a depth separable convolution module in accordance with an embodiment;
FIG. 4 is a flowchart illustrating a pedestrian re-identification method according to another embodiment;
FIG. 5 is a schematic diagram of a network structure of a deep separable convolutional layer in one embodiment;
FIG. 6 is a diagram illustrating a process for calculating channel weights by the channel attention unit in one embodiment;
FIG. 7 is a flowchart illustrating a pedestrian re-identification method according to another embodiment;
FIG. 8 is a schematic diagram of a network structure of a deep separable convolutional layer in one embodiment;
FIG. 9 is a diagram illustrating a process for calculating spatial weights for a spatial attention unit in one embodiment;
FIG. 10 is a flowchart illustrating a pedestrian re-identification method according to another embodiment;
FIG. 11 is a schematic diagram of a new depth separable convolutional layer in one embodiment;
FIG. 12 is a block diagram showing the construction of a pedestrian re-identification apparatus in one embodiment;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Pedestrian re-identification (Re-ID) is an important branch of computer vision and is widely applied in fields such as smart cities and smart traffic. Specifically, pedestrian re-identification means the following: a moving target, which may be a person or a vehicle, may appear in succession in different cameras, and computer-vision methods are used to judge whether two targets in different cameras are the same target.
At present, applying deep learning technology to the pedestrian re-identification task is the mainstream method. The deep learning technology is mainly realized based on a target classification network; in the prior art, the target classification network used for pedestrian re-identification is a convolutional neural network. The target classification network comprises convolutional layers, the convolutional layers are used for extracting person features from the input image, and the person features are used to determine whether the person in the input image is a specific person.
However, the convolutional layers in conventional target classification networks often have a complex structure and a large number of parameters, while the number of training samples available for training the pedestrian re-identification function is small, so a pedestrian re-identification model trained on such a target classification network is prone to overfitting.
Meanwhile, the existing target classification network ignores the fusion of features at different scales, so the network cannot learn more discriminative features, which affects the accuracy of pedestrian re-identification.
According to the pedestrian re-identification method, the target image is input into the pedestrian re-identification network, the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series. Because the convolutional layer of the pedestrian re-identification network is a depth-separable convolutional layer, and the parameter quantity of the depth-separable convolutional layer is less than that of the existing target classification network, the goal of making the pedestrian re-identification network lightweight is achieved, and the generalization performance of the pedestrian re-identification network is improved.
Furthermore, the depth separable convolutional layer comprises a plurality of parallel branches, each parallel branch can acquire feature maps of different scale information, feature fusion processing is carried out on the feature maps corresponding to the different scale information through the feature fusion layer cascaded with the depth separable convolutional layer to obtain a fusion feature map, the fusion feature map is more discriminative, pedestrian re-identification is carried out based on the fusion feature map, and the accuracy of the obtained pedestrian re-identification result is higher.
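The parameter saving from replacing ordinary convolution with depth-separable convolution can be illustrated with a quick count. The kernel and channel sizes below are assumptions for illustration only; the patent does not specify the network's channel widths:

```python
# Parameter counts for one 3 x 3 convolution stage, ignoring biases.

def standard_conv_params(k, c_in, c_out):
    # A standard k x k convolution mixes all input channels per output channel.
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # Depthwise: one k x k kernel per input channel.
    # Pointwise: a 1 x 1 convolution that mixes the channels.
    return k * k * c_in + c_in * c_out

std = standard_conv_params(3, 256, 256)
sep = depthwise_separable_params(3, 256, 256)
print(std, sep, round(std / sep, 1))  # 589824 67840 8.7
```

For a 3 × 3 stage with 256 input and output channels this gives roughly an 8.7-fold reduction, which is the source of the lightweighting effect described above.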
In one embodiment, as shown in fig. 1, a pedestrian re-identification method is provided, and this embodiment is exemplified by applying the method to a computer device, which may be a server, a computer, or the like. In this embodiment, the method includes the steps of:
step 101, inputting a target image into a pedestrian re-identification network by a computer device.
The pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series.
In the embodiment of the application, the computer device can perform image processing on the target image so that the size of the target image is adapted to the size of the pedestrian re-identification network.
In this embodiment, the pedestrian re-identification network may be an existing pedestrian re-identification model, and the convolution layers in the existing pedestrian re-identification model are replaced by the depth-separable convolution layers in this embodiment, where the number of the depth-separable convolution layers may be multiple, and the structure of each depth-separable convolution layer is the same.
In the embodiment of the application, the target image is an image of a person to be detected.
Depth-separable convolution refers to convolution that adopts different convolution kernels on different input channels; it decomposes an ordinary convolution operation into a point-by-point (pointwise) convolution process and a depthwise convolution process. Further, in the embodiment of the application, in order to acquire feature information of different scales of the target image, a plurality of parallel branches are designed. As shown in fig. 2, fig. 2 illustrates the network structure of a depth-separable convolutional layer, in which each blank box represents a depth-separable convolution module; 4 parallel branches are illustrated as an example. A parallel branch is an independent, parallel feature extraction branch; each parallel branch can perform feature extraction on the target image separately.
Further, each parallel branch is formed by concatenating a different number of depth-separable convolution modules. With different numbers of modules connected in series, the branches extract features corresponding to different scale information when extracting features of the target image.
Optionally, the smaller the number of depth-separable convolution modules, the smaller the scale corresponding to the features extracted from the target image; the larger the number of depth-separable convolution modules, the larger the scale corresponding to the features extracted from the target image.
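The scale relationship described above can be made concrete with a receptive-field count: each additional stride-1 3 × 3 module enlarges the region of the input image a branch "sees" by 2 pixels per side in total. A minimal sketch, assuming kernel size 3 and stride 1 for every module:

```python
def receptive_field(n_modules, k=3):
    # Each stride-1 k x k convolution grows the receptive field by k - 1.
    return 1 + n_modules * (k - 1)

for n in (1, 2, 3, 4):
    print(n, receptive_field(n))  # 1 3 / 2 5 / 3 7 / 4 9
```

So a branch with 4 modules covers a 9 × 9 region per output element while a branch with 1 module covers 3 × 3, which is what makes the branch outputs correspond to different scale information.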
Alternatively, as shown in fig. 3, fig. 3 shows a schematic structural diagram of the depth-separable convolution module. The depth-separable convolution module may be a mini 3 × 3 convolution module. As shown in fig. 3, the depth-separable convolution module may consist of a cascaded point-by-point convolution layer, depthwise convolution layer, normalization layer, and activation layer. The point-by-point convolution layer may be a Conv 1 × 1, the depthwise convolution layer may be a DW Conv 3 × 3 (depthwise separable convolution), the normalization layer may be a BatchNorm module, and the activation layer may be a ReLU nonlinear activation function.
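As a rough sketch of one such module, the following assumes the Conv 1 × 1 → DW Conv 3 × 3 → BatchNorm → ReLU order shown in fig. 3. The per-image, per-channel normalization below is a single-image stand-in for BatchNorm, and all shapes and weights are illustrative assumptions:

```python
import numpy as np

def pointwise_conv(x, w):
    # 1 x 1 convolution: mixes channels at each spatial location.
    # x: (c_in, H, W), w: (c_out, c_in) -> (c_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def depthwise_conv3x3(x, kernels):
    # One 3 x 3 kernel per channel, zero padding, stride 1.
    # x: (C, H, W), kernels: (C, 3, 3)
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                out[ch, i, j] = np.sum(padded[ch, i:i + 3, j:j + 3] * kernels[ch])
    return out

def ds_module(x, w_pw, k_dw, eps=1e-5):
    y = pointwise_conv(x, w_pw)            # Conv 1 x 1
    y = depthwise_conv3x3(y, k_dw)         # DW Conv 3 x 3
    mean = y.mean(axis=(1, 2), keepdims=True)   # stand-in for BatchNorm
    var = y.var(axis=(1, 2), keepdims=True)
    y = (y - mean) / np.sqrt(var + eps)
    return np.maximum(y, 0.0)              # ReLU

x = np.random.default_rng(0).standard_normal((3, 6, 6))   # toy 3-channel input
out = ds_module(x,
                np.random.default_rng(1).standard_normal((8, 3)),
                np.random.default_rng(2).standard_normal((8, 3, 3)))
print(out.shape)  # (8, 6, 6)
```

The spatial size is preserved and only the channel count changes, which is consistent with stacking different numbers of these modules in each parallel branch.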
In the embodiment of the application, the feature fusion layer is cascaded with the multiple parallel branches and is used for performing feature fusion on features output by the multiple parallel branches.
And 102, the computer equipment extracts the features of the target image through each parallel branch and acquires the feature maps corresponding to different scale information output by each parallel branch.
In this embodiment of the application, after the computer device inputs the target image into the pedestrian re-identification network, the computer device may obtain an initial data matrix corresponding to the target image. The initial data matrix is a matrix formed by the three channel values of each pixel of the target image; for example, the data matrix may be represented as W × H × 3, where W represents the width of the target image, H represents the height of the target image, and 3 represents the three color channels of each pixel. As shown in fig. 2, the initial data matrix first undergoes a 1 × 1 convolution operation and then passes through the plurality of parallel branches; the different numbers of depth-separable convolution modules in each parallel branch perform feature extraction on the input initial data matrix, thereby extracting information of different scales.
Wherein the feature extraction operation may include the following: the depth separable convolution module included in the parallel branch performs convolution operation on the input initial data matrix to obtain a characteristic diagram of the scale information corresponding to the parallel branch.
Optionally, taking the above example as an example, the feature map corresponding to different scale information output by each parallel branch may be represented as a W × H × 3 feature matrix.
And 103, the computer equipment performs feature fusion processing on the feature maps corresponding to the information with different scales output by each parallel branch through the feature fusion layer to obtain a fusion feature map.
In the embodiment of the application, for each parallel branch, after the parallel branch outputs the feature map of the scale information corresponding to the parallel branch, the feature map is transmitted to the feature fusion layer.
The feature fusion layer can perform feature fusion processing on the plurality of feature maps corresponding to different scale information of the target image. Optionally, the feature fusion processing may include: taking, through the feature fusion layer, the element-wise maximum of the feature maps corresponding to the different scale information output by each parallel branch. Optionally, the feature fusion processing may instead include: adding, through the feature fusion layer, the feature maps corresponding to the different scale information output by each parallel branch element by element.
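Both fusion options reduce to a simple element-wise reduction over the stacked branch outputs, assuming every branch emits a feature map of the same shape. A minimal sketch:

```python
import numpy as np

def fuse(feature_maps, mode="add"):
    # feature_maps: list of branch outputs, each of identical shape (C, H, W)
    stacked = np.stack(feature_maps)
    if mode == "add":
        return stacked.sum(axis=0)   # element-wise addition
    return stacked.max(axis=0)       # element-wise maximum

a = np.full((2, 4, 4), 1.0)   # toy output of branch 1
b = np.full((2, 4, 4), 3.0)   # toy output of branch 2
print(fuse([a, b], "add")[0, 0, 0])   # 4.0
print(fuse([a, b], "max")[0, 0, 0])   # 3.0
```

Either reduction keeps the fused map the same shape as each branch output, so the rest of the network is unaffected by the number of branches.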
In the embodiment of the application, the fused feature map is obtained by fusing a plurality of feature maps of different scale information, so the fused feature map is more discriminative than any of the feature maps corresponding to different scale information output by the individual parallel branches.
And 104, the computer equipment performs pedestrian re-identification according to the fusion feature map.
In the embodiment of the present application, the process of performing pedestrian re-identification by the computer device according to the fused feature map may include the following:
the computer device may obtain the feature map of the target person, compare the similarity of the feature map of the fusion feature map and the feature map of the target person, determine that the person in the target image and the target person are the same person if the similarity is greater than a threshold, and determine that the person in the target image and the target person are not the same person if the similarity is less than or equal to the threshold.
Optionally, the process of acquiring the feature map of the target person by the computer device may refer to the contents disclosed in step 101 and step 103.
Optionally, in this embodiment of the application, the pedestrian re-identification network may further include a cascaded pooling layer and classification layer, where the pooling layer is connected to the depth-separable convolutional layer, and the process of performing pedestrian re-identification by the computer device according to the fused feature map may further include the following:
After the depth-separable convolutional layer outputs the fused feature map, the computer device may perform pooling processing on the fused feature map through the pooling layer to obtain a feature vector. The pooling processing may include, but is not limited to, average pooling, maximum pooling, minimum pooling, overlapping pooling, and the like. The computer device may then input the feature vector into the classification layer, which is substantially a fully connected layer, and perform pedestrian re-identification through the classification layer.
According to the pedestrian re-identification method, the target image is input into the pedestrian re-identification network, the pedestrian re-identification network comprises the depth separable convolution layer, the depth separable convolution layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth separable convolution modules in series. Because the convolution layer of the pedestrian re-identification network is a depth separable convolution layer, and the parameter quantity of the depth separable convolution layer is less than that of the existing target classification network, the goal of lightening the pedestrian re-identification network is achieved, and the generalization performance of the pedestrian re-identification network is improved.
Furthermore, the depth separable convolutional layer comprises a plurality of parallel branches, each parallel branch can acquire feature maps of different scale information, feature fusion processing is carried out on the feature maps corresponding to the different scale information through the feature fusion layer cascaded with the depth separable convolutional layer to obtain a fusion feature map, the fusion feature map is more discriminative, pedestrian re-identification is carried out based on the fusion feature map, and the accuracy of the obtained pedestrian re-identification result is higher.
In one embodiment, as shown in fig. 4, the present application provides a new pedestrian re-identification method, including:
step 401, the computer device inputs the target image into a pedestrian re-identification network.
The target image is an image of the person to be detected, and the target image is resized in advance to match the input size of the pedestrian re-identification network.
The pedestrian re-identification network comprises a depth separable convolutional layer. As shown in fig. 5, fig. 5 shows a new depth separable convolutional layer, which comprises a plurality of parallel branches as well as a channel attention layer and a feature fusion layer cascaded with the plurality of parallel branches. The channel attention layer is shown as the dotted-line box in fig. 5; it comprises a plurality of channel attention units (CAM), each channel attention unit being connected with one parallel branch and with the feature fusion layer.
The computer device inputs the target image into the pedestrian re-identification network and thereby obtains an initial data matrix corresponding to the target image; the initial data matrix is formed by the three channel values of each pixel of the target image. The initial data matrix first undergoes a 1 × 1 convolution operation and then passes through the plurality of parallel branches, where the different numbers of depth separable convolution modules in each parallel branch perform feature extraction on the input, thereby extracting information at different scales.
Step 402, the computer device extracts the features of the target image through each parallel branch, and obtains feature maps corresponding to different scale information output by each parallel branch.
Reference may be made to the disclosure of step 102 in embodiments of the present application.
Step 403, for the feature map corresponding to the different scale information output by each parallel branch, the computer device calculates the channel weights of that feature map through the channel attention unit connected with the parallel branch.
In this embodiment, the computer device may calculate a channel weight for the feature map output by each parallel branch. Specifically, the feature map output by a certain parallel branch may be represented as H × W × C, where H × W represents the spatial (height × width) size and C represents the number of channels.
As shown in fig. 6, fig. 6 is a schematic diagram of the process by which a channel attention unit calculates channel weights. Specifically, in this embodiment of the present application, the feature map H × W × C is spatially compressed to obtain a one-dimensional vector of length C, which contains one value per channel aggregated over the H × W spatial positions. The length of this vector is then compressed to C/4, after which the vector of length C/4 is expanded through a fully connected layer to obtain a new one-dimensional vector of length C. This new vector holds the channel weights corresponding to the H × W × C feature map, one weight per channel.
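The squeeze-compress-expand computation of the channel attention unit can be sketched as follows. The mean over H × W for the spatial squeeze, the ReLU between the two fully connected layers, and the sigmoid at the output are assumptions of this sketch (the application does not specify them), and the weight matrices `w_down` (C/4 × C) and `w_up` (C × C/4) are hypothetical.

```python
import math

def channel_attention_weights(feature_map, w_down, w_up):
    """Spatially squeeze H x W x C to a length-C vector, compress it to
    length C/4, expand it back to length C, and squash to (0, 1)."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    # spatial squeeze: one aggregated value per channel (mean assumed)
    squeezed = [sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
                for k in range(c)]
    # compress C -> C/4 (ReLU assumed between the two layers)
    mid = [max(0.0, sum(row[k] * squeezed[k] for k in range(c))) for row in w_down]
    # expand C/4 -> C through a fully connected layer
    expanded = [sum(row[m] * mid[m] for m in range(len(mid))) for row in w_up]
    # sigmoid maps each value to a channel weight in (0, 1)
    return [1.0 / (1.0 + math.exp(-x)) for x in expanded]

# 1 x 1 spatial grid with C = 4 channels, all zero: every weight is sigmoid(0) = 0.5
fmap = [[[0.0, 0.0, 0.0, 0.0]]]
weights = channel_attention_weights(fmap,
                                    w_down=[[0.25, 0.25, 0.25, 0.25]],
                                    w_up=[[1.0], [1.0], [1.0], [1.0]])
```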
Step 404, the computer device corrects the feature map corresponding to the different scale information output by each parallel branch according to the channel weights, obtaining the corrected channel feature map.
In the embodiment of the present application, as shown in fig. 6, the process of correcting a feature map according to the channel weights is as follows: the computer device may multiply the channel weight corresponding to each channel value by that channel value to obtain a corrected channel value. The computer device may then replace the channel values in the feature map H × W × C with the corrected channel values, thereby obtaining the corrected channel feature map.
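A minimal sketch of this correction, assuming the nested-list H × W × C representation and illustrative weight values:

```python
def apply_channel_weights(feature_map, channel_weights):
    """Multiply every channel value by the weight of its channel, producing
    the corrected channel feature map."""
    return [[[v * channel_weights[k] for k, v in enumerate(pixel)]
             for pixel in row]
            for row in feature_map]

fmap = [[[2.0, 4.0]]]                                   # 1 x 1 grid, 2 channels
corrected = apply_channel_weights(fmap, [0.5, 0.25])    # [[[1.0, 1.0]]]
```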
Step 405, the computer device performs feature fusion processing on each corrected channel feature map through the feature fusion layer to obtain a fused feature map, and performs pedestrian re-identification based on the fused feature map.
In the embodiment of the present application, taking 3 parallel branches shown in fig. 5 as an example, a computer device may obtain three corrected channel feature maps.
The computer device can perform feature fusion processing on the three corrected channel feature maps through the feature fusion layer. Optionally, in this embodiment of the application, the process by which the computer device performs feature fusion processing on the three corrected channel feature maps may refer to the content disclosed in step 103.
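The fusion of the three corrected channel feature maps can be sketched as element-wise addition, which is how the feature fusion layer is described elsewhere in this application; the tiny 1 × 1 × 2 maps are illustrative only.

```python
def fuse_feature_maps(maps):
    """Element-wise addition of equally shaped H x W x C feature maps."""
    h, w, c = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    return [[[sum(m[i][j][k] for m in maps) for k in range(c)]
             for j in range(w)]
            for i in range(h)]

branch_a = [[[1.0, 2.0]]]
branch_b = [[[3.0, 4.0]]]
branch_c = [[[5.0, 6.0]]]
fused = fuse_feature_maps([branch_a, branch_b, branch_c])  # [[[9.0, 12.0]]]
```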
The process of the computer device performing pedestrian re-identification based on the fused feature map can refer to the disclosure of step 104.
In the embodiment of the application, a channel attention mechanism is introduced, and each channel attention unit in the channel attention layer corrects the channels of the feature map output by its parallel branch, redistributing the weights of the different channels. This improves the accuracy of the extracted features and thus the accuracy of pedestrian re-identification.
In one embodiment, as shown in fig. 7, the present application provides a new pedestrian re-identification method, including:
step 701, inputting a target image into a pedestrian re-identification network by a computer device.
The target image is an image of the person to be detected, and the target image is resized in advance to match the input size of the pedestrian re-identification network.
The pedestrian re-identification network comprises a depth separable convolutional layer. As shown in fig. 8, fig. 8 shows a new depth separable convolutional layer, which comprises a plurality of parallel branches as well as a spatial attention layer and a feature fusion layer cascaded with the plurality of parallel branches. The spatial attention layer is shown as the dashed box in fig. 8; it comprises a plurality of spatial attention units (SAM), each spatial attention unit being connected with one parallel branch and with the feature fusion layer.
The computer device inputs the target image into the pedestrian re-identification network and thereby obtains an initial data matrix corresponding to the target image; the initial data matrix is formed by the three channel values of each pixel of the target image. The initial data matrix first undergoes a 1 × 1 convolution operation and then passes through the plurality of parallel branches, where the different numbers of depth separable convolution modules in each parallel branch perform feature extraction on the input, thereby extracting information at different scales.
Step 702, the computer device extracts the features of the target image through each parallel branch, and obtains feature maps corresponding to different scale information output by each parallel branch.
Reference may be made to the disclosure of step 102 in embodiments of the present application.
Step 703, for the feature map corresponding to the different scale information output by each parallel branch, the computer device calculates the spatial weights of that feature map through the spatial attention unit connected with the parallel branch.
In this embodiment, the computer device may calculate a spatial weight for the feature map output by each parallel branch. Specifically, the feature map output by a certain parallel branch may be represented as H × W × C, where H × W represents the spatial (height × width) size and C represents the number of channels.
As shown in fig. 9, fig. 9 is a schematic diagram of the process by which a spatial attention unit calculates spatial weights. Specifically, in this embodiment of the present application, channel compression is performed on the feature map H × W × C to obtain a two-dimensional map of size H × W. A convolution operation with a 1 × 1 convolution kernel is then performed on this H × W map to obtain the convolved two-dimensional map, which contains the spatial weights applied to every channel; each spatial weight spans the width and height dimensions of its position.
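The channel-compression and 1 × 1 convolution just described can be sketched as follows. The mean over channels for the compression, the single scale and bias standing in for the 1 × 1 convolution on a single-channel map, and the sigmoid at the output are assumptions of this sketch, not details given in the application.

```python
import math

def spatial_attention_weights(feature_map, conv_scale=1.0, conv_bias=0.0):
    """Compress channels to an H x W map (mean over C assumed), apply a
    1 x 1 convolution (one scale and bias on a single-channel map), and
    squash each position's value to a spatial weight in (0, 1)."""
    c = len(feature_map[0][0])
    return [[1.0 / (1.0 + math.exp(-(conv_scale * sum(pixel) / c + conv_bias)))
             for pixel in row]
            for row in feature_map]

# 1 x 2 grid with 2 channels: zero-valued pixels yield sigmoid(0) = 0.5
fmap = [[[0.0, 0.0], [0.0, 0.0]]]
weights = spatial_attention_weights(fmap)   # [[0.5, 0.5]]
```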
Step 704, the computer device corrects the feature maps corresponding to the information with different scales output by the parallel branches according to the spatial weight, and obtains the corrected spatial feature map.
In the embodiment of the present application, as shown in fig. 9, the process of correcting a feature map according to the spatial weights is as follows: for each channel, the computer device may weight the values at each spatial position of that channel by the spatial weight corresponding to that position, obtaining corrected values. The computer device may then replace the original values in the feature map H × W × C with the corrected values, thereby obtaining the corrected spatial feature map.
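A minimal sketch of this correction, assuming it multiplies every channel value at position (i, j) by the spatial weight at (i, j); the map and weight values are illustrative.

```python
def apply_spatial_weights(feature_map, spatial_weights):
    """Scale all channel values at each spatial position by that position's
    spatial weight, producing the corrected spatial feature map."""
    return [[[v * spatial_weights[i][j] for v in feature_map[i][j]]
             for j in range(len(feature_map[i]))]
            for i in range(len(feature_map))]

fmap = [[[2.0, 4.0], [1.0, 1.0]]]                      # 1 x 2 grid, 2 channels
corrected = apply_spatial_weights(fmap, [[0.5, 1.0]])  # [[[1.0, 2.0], [1.0, 1.0]]]
```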
Step 705, the computer device performs feature fusion processing on each corrected spatial feature map through the feature fusion layer to obtain a fusion feature map, and performs pedestrian re-identification based on the fusion feature map.
In the embodiment of the present application, taking 3 parallel branches shown in fig. 8 as an example, a computer device may obtain three corrected spatial feature maps.
The computer equipment can perform feature fusion processing on the three corrected spatial feature maps through the feature fusion layer. Optionally, in this embodiment of the application, reference may be made to the content disclosed in step 103 for the process of performing, by the computer device, feature fusion processing on the three corrected spatial feature maps.
The process of the computer device performing pedestrian re-identification based on the fused feature map can refer to the disclosure of step 104.
In the embodiment of the application, a spatial attention mechanism is introduced, and each spatial attention unit in the spatial attention layer corrects the width and height dimensions of the feature map output by its parallel branch, redistributing the weights over different spatial positions. This improves the accuracy of the extracted features and thus the accuracy of pedestrian re-identification.
In one embodiment, as shown in fig. 10, the present application provides a new pedestrian re-identification method, including:
step 1001, a computer device inputs a target image into a pedestrian re-identification network.
The pedestrian re-identification network comprises a depth separable convolutional layer. As shown in fig. 11, fig. 11 shows a new depth separable convolutional layer, which comprises a plurality of parallel branches as well as a channel attention layer, a spatial attention layer, and a feature fusion layer cascaded with the plurality of parallel branches. The channel attention layer comprises a plurality of channel attention units and the spatial attention layer comprises a plurality of spatial attention units; each channel attention unit is connected with one parallel branch, each spatial attention unit is connected with one channel attention unit, and each spatial attention unit is connected with the feature fusion layer.
Step 1002, the computer device extracts features of the target image through each parallel branch, and obtains feature maps corresponding to different scale information output by each parallel branch.
In the embodiment of the present application, with reference to the content disclosed in step 102, the feature maps corresponding to the information of different scales output by each parallel branch may be obtained.
Step 1003, for the feature map corresponding to the different scale information output by each parallel branch, the computer device calculates the channel weights of that feature map through the channel attention unit connected with the parallel branch.
In the embodiment of the present application, referring to the content disclosed in step 403, the channel weight of the feature map output by each parallel branch may be obtained.
Step 1004, the computer device corrects the feature map corresponding to the different scale information output by each parallel branch according to the channel weights, obtaining a corrected first feature map.
In this embodiment of the application, for the feature map output by each branch and the channel weight of the feature map output by each parallel branch, the content disclosed in step 404 may be referred to correct the feature map output by each parallel branch, so as to obtain a corrected first feature map corresponding to each parallel branch.
Step 1005, for each first feature map, the computer device calculates the spatial weights of that first feature map through the corresponding spatial attention unit.
In the embodiment of the present application, the corrected first feature map may be represented as H × W × C, and the spatial weights of the first feature map may be calculated with reference to the content disclosed in step 703.
Step 1006, the computer device corrects each first feature map according to the spatial weights, obtaining a corrected second feature map.
In the embodiment of the present application, the corrected second feature map may be obtained by referring to the content disclosed in step 704.
Step 1007, the computer device performs feature fusion processing on each corrected second feature map through the feature fusion layer to obtain a fused feature map, and performs pedestrian re-identification based on the fused feature map.
In the embodiment of the present application, taking the 4 parallel branches shown in fig. 11 as an example, the computer device may obtain four corrected second feature maps.
The computer device can perform feature fusion processing on the four corrected second feature maps through the feature fusion layer. Optionally, in this embodiment of the application, reference may be made to the content disclosed in step 103 for the process of performing, by the computer device, the feature fusion processing on the four corrected second feature maps.
The process of the computer device performing pedestrian re-identification based on the fused feature map can refer to the disclosure of step 104.
In the embodiment of the application, a channel attention mechanism and a space attention mechanism are introduced, the channel of the feature map output by each parallel branch is corrected through each channel attention unit in the channel attention layer, and the width and the height of the feature map output by each parallel branch are corrected through each space attention unit in the space attention layer, so that the weight redistribution of different channels and different spaces is realized, the accuracy of the extracted features is improved, and the accuracy of pedestrian re-identification is improved.
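The channel-then-spatial correction of steps 1004 and 1006 can be sketched in one function. The per-channel and per-position multiplications are assumptions carried over from the earlier embodiments, and the weight values are illustrative.

```python
def channel_then_spatial(feature_map, channel_weights, spatial_weights):
    """Apply the channel correction first (yielding the first feature map),
    then the spatial correction (yielding the second feature map)."""
    first = [[[v * channel_weights[k] for k, v in enumerate(pixel)]
              for pixel in row]
             for row in feature_map]
    return [[[v * spatial_weights[i][j] for v in first[i][j]]
             for j in range(len(first[i]))]
            for i in range(len(first))]

fmap = [[[2.0, 2.0]]]                                     # 1 x 1 grid, 2 channels
second = channel_then_spatial(fmap, [0.5, 1.0], [[2.0]])  # [[[2.0, 4.0]]]
```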
It should be understood that although the steps in the flowcharts of figs. 1-11 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in figs. 1-11 may include multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence, and may instead be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 12, there is provided a pedestrian re-recognition apparatus including: an input module 1201, a feature extraction module 1202, a feature fusion module 1203, and a re-recognition module 1204, wherein:
the input module 1201 is used for inputting the target image into a pedestrian re-identification network, the pedestrian re-identification network comprises a depth separable convolution layer, the depth separable convolution layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting depth separable convolution modules in series in different numbers;
the feature extraction module 1202 is configured to perform feature extraction on the target image through each parallel branch, and obtain feature maps corresponding to different scale information output by each parallel branch;
a feature fusion module 1203, configured to perform feature fusion processing on the feature maps corresponding to the information with different scales output by each parallel branch through a feature fusion layer, so as to obtain a fusion feature map;
and a re-identification module 1204, configured to perform pedestrian re-identification according to the fused feature map.
In one embodiment of the present application, the depth-separable convolutional layer further comprises a channel attention layer, the channel attention layer comprising a plurality of channel attention units, each channel attention unit connected to one of the parallel branches, and each channel attention unit connected to the feature fusion layer; the feature extraction module 1202 is further configured to calculate, for the feature map corresponding to the different scale information output by each parallel branch, a channel weight of the feature map corresponding to the different scale information output by the parallel branch through a channel attention unit connected to the parallel branch;
correcting the feature graphs corresponding to the information with different scales output by the parallel branches according to the channel weight to obtain corrected channel feature graphs;
the feature fusion processing is carried out on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer, and the feature fusion processing comprises the following steps:
and performing feature fusion processing on each corrected channel feature map through a feature fusion layer.
In one embodiment of the present application, the depth-separable convolutional layer further comprises a spatial attention layer, the spatial attention layer comprising a plurality of spatial attention cells, each spatial attention cell connected to one of the parallel branches, and each spatial attention cell connected to the feature fusion layer; the feature extraction module 1202 is further configured to calculate, for the feature map corresponding to the different scale information output by each parallel branch, a spatial weight of the feature map corresponding to the different scale information output by the parallel branch through the spatial attention unit connected to the parallel branch;
correcting the feature maps corresponding to the information with different scales output by the parallel branches according to the spatial weight to obtain a corrected spatial feature map;
the feature fusion processing is carried out on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer, and the feature fusion processing comprises the following steps:
and performing feature fusion processing on each corrected spatial feature map through a feature fusion layer.
In one embodiment of the present application, the depth-separable convolutional layer further comprises a channel attention layer and a spatial attention layer in cascade, wherein the channel attention layer comprises a plurality of channel attention cells, the spatial attention layer comprises a plurality of spatial attention cells, each channel attention cell is connected with one parallel branch, and each spatial attention cell is connected with one channel attention cell, each spatial attention cell is connected with a feature fusion layer; the feature extraction module 1202 is further configured to calculate, for the feature map corresponding to the different scale information output by each parallel branch, a channel weight of the feature map corresponding to the different scale information output by the parallel branch through a channel attention unit connected to the parallel branch;
correcting feature graphs corresponding to different scale information output by the parallel branches according to the channel weight to obtain a corrected first feature graph;
for each first feature map, calculating the spatial weight of each first feature map through each spatial attention unit;
correcting each first feature map according to the spatial weight to obtain a corrected second feature map;
the feature fusion processing is carried out on the feature graphs corresponding to the information with different scales output by each parallel branch through the feature fusion layer, and the feature fusion processing comprises the following steps:
and performing feature fusion processing on each corrected second feature map through the feature fusion layer.
In an embodiment of the present application, the feature fusion module 1203 is further configured to add, element-wise (bitwise) through the feature fusion layer, the feature maps corresponding to the different scale information output by the parallel branches.
In an embodiment of the present application, the pedestrian re-identification network further includes a cascading pooling layer and a classification layer, the pooling layer is connected to the depth separable convolution layer, and the re-identification module 1204 is further configured to perform pooling processing on the fusion feature map through the pooling layer to obtain a feature vector;
and carrying out pedestrian re-identification according to the feature vectors through a classification layer.
In one embodiment of the present application, the depth separable convolution module includes a cascaded point-by-point convolution layer, a depth convolution layer, a normalization layer, and an activation layer.
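The two convolutions of the module can be sketched in pure Python. The normalization and activation layers are omitted for brevity, zero padding and stride 1 are assumed for the 3 × 3 depthwise convolution (kernel size is not specified in this passage), and the kernel values are illustrative.

```python
def pointwise_conv(feature_map, weights):
    """Point-by-point (1 x 1) convolution: mixes channels at each position.
    weights has shape C_out x C_in."""
    return [[[sum(w[k] * pixel[k] for k in range(len(pixel))) for w in weights]
             for pixel in row]
            for row in feature_map]

def depthwise_conv3x3(feature_map, kernels):
    """3 x 3 depthwise convolution: one kernel per channel, no channel
    mixing; zero padding and stride 1 assumed. kernels has shape C x 3 x 3."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    out = [[[0.0] * c for _ in range(w)] for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for k in range(c):
                acc = 0.0
                for di in (-1, 0, 1):
                    for dj in (-1, 0, 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < h and 0 <= jj < w:
                            acc += kernels[k][di + 1][dj + 1] * feature_map[ii][jj][k]
                out[i][j][k] = acc
    return out

fmap = [[[1.0, 2.0]]]                                  # 1 x 1 grid, 2 channels
mixed = pointwise_conv(fmap, [[1.0, 1.0]])             # [[[3.0]]]
identity_kernel = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
scaled = depthwise_conv3x3(mixed, [identity_kernel])   # [[[6.0]]]
```

Splitting the convolution this way is what keeps the parameter count low relative to a standard convolution of the same channel width.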
For specific definition of the pedestrian re-identification device, reference may be made to the above definition of the pedestrian re-identification method, and details are not repeated here. The modules in the pedestrian re-identification device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing a pedestrian re-identification network. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a pedestrian re-identification method.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth separable convolution layer, the depth separable convolution layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the plurality of parallel branches, and each parallel branch is formed by connecting different numbers of depth separable convolution modules in series; extracting the characteristics of the target image through each parallel branch, and acquiring characteristic graphs corresponding to different scale information output by each parallel branch; performing feature fusion processing on feature maps corresponding to different scale information output by each parallel branch through a feature fusion layer to obtain fusion feature maps; and re-identifying the pedestrian according to the fusion feature map.
In one embodiment, the depth-separable convolutional layer further comprises a channel attention layer, the channel attention layer comprising a plurality of channel attention units, each channel attention unit connected to one of the parallel branches, and each channel attention unit connected to the feature fusion layer, the processor when executing the computer program further performs the steps of: for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the channel weights of the feature maps corresponding to the information with different scales output by the parallel branches through a channel attention unit connected with the parallel branches; correcting the feature graphs corresponding to the information with different scales output by the parallel branches according to the channel weight to obtain corrected channel feature graphs; and performing feature fusion processing on each corrected channel feature map through a feature fusion layer.
In one embodiment, the depth-separable convolutional layer further comprises a spatial attention layer, the spatial attention layer comprising a plurality of spatial attention units, each spatial attention unit connected to one of the parallel branches, and each spatial attention unit connected to the feature fusion layer; the processor, when executing the computer program, further performs the steps of: for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the spatial weights of the feature maps corresponding to the information with different scales output by the parallel branches through the spatial attention unit connected with the parallel branches; correcting the feature maps corresponding to the information with different scales output by the parallel branches according to the spatial weight to obtain a corrected spatial feature map; and performing feature fusion processing on each corrected spatial feature map through a feature fusion layer.
In one embodiment, the depth-separable convolutional layer further comprises a channel attention layer and a spatial attention layer in cascade, wherein the channel attention layer comprises a plurality of channel attention cells, the spatial attention layer comprises a plurality of spatial attention cells, each channel attention cell is connected with one parallel branch, each spatial attention cell is connected with one channel attention cell, and each spatial attention cell is connected with a feature fusion layer; the processor, when executing the computer program, further performs the steps of: for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the channel weights of the feature maps corresponding to the information with different scales output by the parallel branches through a channel attention unit connected with the parallel branches; correcting feature graphs corresponding to different scale information output by the parallel branches according to the channel weight to obtain a corrected first feature graph; for each first feature map, calculating the spatial weight of each first feature map through each spatial attention unit; correcting each first characteristic diagram according to the spatial weight to obtain a corrected second characteristic diagram; and performing feature fusion processing on each corrected second feature map through the feature fusion layer.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and adding the feature graphs corresponding to the information with different scales output by each parallel branch according to bits through a feature fusion layer.
In one embodiment, the pedestrian re-identification network further comprises a cascading pooling layer and a classification layer, the pooling layer being connected to the deep separable convolutional layer, the processor when executing the computer program further implementing the steps of: performing pooling treatment on the fusion characteristic diagram through a pooling layer to obtain a characteristic vector; and carrying out pedestrian re-identification according to the feature vectors through a classification layer.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth separable convolution layer, the depth separable convolution layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the plurality of parallel branches, and each parallel branch is formed by connecting different numbers of depth separable convolution modules in series; extracting the characteristics of the target image through each parallel branch, and acquiring characteristic graphs corresponding to different scale information output by each parallel branch; performing feature fusion processing on feature maps corresponding to different scale information output by each parallel branch through a feature fusion layer to obtain fusion feature maps; and re-identifying the pedestrian according to the fusion feature map.
In one embodiment, the depth-separable convolutional layer further comprises a channel attention layer, the channel attention layer comprising a plurality of channel attention units, each channel attention unit connected to one of the parallel branches, and each channel attention unit connected to the feature fusion layer, the computer program when executed by the processor further performs the steps of: for the feature maps corresponding to the information with different scales output by each parallel branch, calculating the channel weights of the feature maps corresponding to the information with different scales output by the parallel branches through a channel attention unit connected with the parallel branches; correcting the feature graphs corresponding to the information with different scales output by the parallel branches according to the channel weight to obtain corrected channel feature graphs; and performing feature fusion processing on each corrected channel feature map through a feature fusion layer.
In one embodiment, the depth-separable convolutional layer further comprises a spatial attention layer, the spatial attention layer comprising a plurality of spatial attention units, each spatial attention unit being connected to one of the parallel branches and to the feature fusion layer. The computer program, when executed by the processor, further implements the following steps: for the feature maps corresponding to different scale information output by each parallel branch, calculating, through the spatial attention unit connected with the parallel branch, the spatial weights of those feature maps; correcting, according to the spatial weights, the feature maps corresponding to the different scale information output by the parallel branch to obtain corrected spatial feature maps; and performing feature fusion processing on each corrected spatial feature map through the feature fusion layer.
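For illustration, a spatial attention unit can be sketched by pooling across channels and gating every spatial position. The average-plus-max descriptor and the sigmoid gate below are CBAM-style assumptions (a learned convolution over the pooled maps is usual in practice), not details from this disclosure:

```python
import numpy as np

def spatial_attention(x):
    # x: (C, H, W) feature map output by one parallel branch.
    avg = x.mean(axis=0)                         # (H, W) average over channels
    mx = x.max(axis=0)                           # (H, W) max over channels
    weights = 1.0 / (1.0 + np.exp(-(avg + mx)))  # sigmoid gate per spatial position
    corrected = x * weights[None, :, :]          # correct the map: rescale all channels
    return corrected, weights
```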
In one embodiment, the depth-separable convolutional layer further comprises a channel attention layer and a spatial attention layer in cascade, wherein the channel attention layer comprises a plurality of channel attention units, the spatial attention layer comprises a plurality of spatial attention units, each channel attention unit is connected with one parallel branch, each spatial attention unit is connected with one channel attention unit, and each spatial attention unit is connected with the feature fusion layer. The computer program, when executed by the processor, further implements the following steps: for the feature maps corresponding to different scale information output by each parallel branch, calculating, through the channel attention unit connected with the parallel branch, the channel weights of those feature maps; correcting, according to the channel weights, the feature maps corresponding to the different scale information output by the parallel branch to obtain corrected first feature maps; for each first feature map, calculating its spatial weights through the corresponding spatial attention unit; correcting each first feature map according to the spatial weights to obtain corrected second feature maps; and performing feature fusion processing on each corrected second feature map through the feature fusion layer.
In one embodiment, the computer program, when executed by the processor, further implements the following step: adding, element-wise through the feature fusion layer, the feature maps corresponding to the different scale information output by each parallel branch.
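Element-wise addition in the feature fusion layer only requires the branch outputs to share one shape. A minimal example with three hypothetical branch outputs:

```python
import numpy as np

# Outputs of three hypothetical parallel branches; element-wise addition
# requires them to share one shape.
f1 = np.array([[1.0, 2.0], [3.0, 4.0]])
f2 = np.array([[0.5, 0.5], [0.5, 0.5]])
f3 = np.zeros((2, 2))

fused = f1 + f2 + f3   # fused == [[1.5, 2.5], [3.5, 4.5]]
```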
In one embodiment, the pedestrian re-identification network further comprises a pooling layer and a classification layer in cascade, the pooling layer being connected to the depth-separable convolutional layer. The computer program, when executed by the processor, further implements the following steps: performing pooling processing on the fusion feature map through the pooling layer to obtain a feature vector; and performing pedestrian re-identification according to the feature vector through the classification layer.
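The pooling and classification steps can be sketched as global average pooling down to a feature vector, followed by a linear softmax classifier over identity classes. The pooling type and the classifier form are illustrative assumptions, not details from this disclosure:

```python
import numpy as np

def global_avg_pool(fused):
    # (C, H, W) fusion feature map -> (C,) feature vector
    return fused.mean(axis=(1, 2))

def classify(vec, W, b):
    # Linear classifier over identity classes followed by a softmax.
    logits = W @ vec + b
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()
```

At inference time, re-identification systems often compare the pooled feature vectors directly (e.g. by cosine distance between gallery and query) rather than using the classification head, which is mainly needed during training.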
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A pedestrian re-identification method, the method comprising:
inputting a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolutional modules in series;
extracting the features of the target image through each parallel branch, and acquiring feature maps corresponding to different scale information output by each parallel branch;
performing feature fusion processing on feature maps corresponding to different scale information output by each parallel branch through the feature fusion layer to obtain fusion feature maps;
and re-identifying the pedestrian according to the fusion feature map.
2. The method of claim 1, wherein the depth-separable convolutional layer further comprises a channel attention layer, the channel attention layer comprising a plurality of channel attention units, each channel attention unit being connected to one of the parallel branches, and each channel attention unit being connected to the feature fusion layer; before the feature fusion processing is performed, through the feature fusion layer, on the feature maps corresponding to the different scale information output by each parallel branch, the method further comprises:
for the feature maps corresponding to different scale information output by each parallel branch, calculating, through the channel attention unit connected with the parallel branch, the channel weights of the feature maps corresponding to the different scale information output by the parallel branch;
correcting, according to the channel weights, the feature maps corresponding to the different scale information output by the parallel branch to obtain corrected channel feature maps;
and the performing, through the feature fusion layer, feature fusion processing on the feature maps corresponding to the different scale information output by each parallel branch comprises:
performing feature fusion processing on each corrected channel feature map through the feature fusion layer.
3. The method of claim 1, wherein the depth-separable convolutional layer further comprises a spatial attention layer, the spatial attention layer comprising a plurality of spatial attention units, each spatial attention unit being connected to one of the parallel branches, and each spatial attention unit being connected to the feature fusion layer; before the feature fusion processing is performed, through the feature fusion layer, on the feature maps corresponding to the different scale information output by each parallel branch, the method further comprises:
for the feature maps corresponding to different scale information output by each parallel branch, calculating, through the spatial attention unit connected with the parallel branch, the spatial weights of the feature maps corresponding to the different scale information output by the parallel branch;
correcting, according to the spatial weights, the feature maps corresponding to the different scale information output by the parallel branch to obtain corrected spatial feature maps;
and the performing, through the feature fusion layer, feature fusion processing on the feature maps corresponding to the different scale information output by each parallel branch comprises:
performing feature fusion processing on each corrected spatial feature map through the feature fusion layer.
4. The method of claim 1, wherein the depth-separable convolutional layer further comprises a channel attention layer and a spatial attention layer in cascade, wherein the channel attention layer comprises a plurality of channel attention units, the spatial attention layer comprises a plurality of spatial attention units, each channel attention unit is connected to one of the parallel branches, each spatial attention unit is connected to one channel attention unit, and each spatial attention unit is connected to the feature fusion layer; before the feature fusion processing is performed, through the feature fusion layer, on the feature maps corresponding to the different scale information output by each parallel branch, the method further comprises:
for the feature maps corresponding to different scale information output by each parallel branch, calculating, through the channel attention unit connected with the parallel branch, the channel weights of the feature maps corresponding to the different scale information output by the parallel branch;
correcting, according to the channel weights, the feature maps corresponding to the different scale information output by the parallel branch to obtain corrected first feature maps;
for each first feature map, calculating the spatial weights of the first feature map through the corresponding spatial attention unit;
correcting each first feature map according to the spatial weights to obtain corrected second feature maps;
and the performing, through the feature fusion layer, feature fusion processing on the feature maps corresponding to the different scale information output by each parallel branch comprises:
performing feature fusion processing on each corrected second feature map through the feature fusion layer.
5. The method of claim 1, wherein the performing, through the feature fusion layer, feature fusion processing on the feature maps corresponding to the different scale information output by each parallel branch comprises:
adding, element-wise through the feature fusion layer, the feature maps corresponding to the different scale information output by each parallel branch.
6. The method of claim 1, wherein the pedestrian re-identification network further comprises a pooling layer and a classification layer in cascade, the pooling layer being connected to the depth-separable convolutional layer, and the performing pedestrian re-identification according to the fusion feature map comprises:
performing pooling processing on the fusion feature map through the pooling layer to obtain a feature vector;
and performing pedestrian re-identification according to the feature vector through the classification layer.
7. The method of claim 1, wherein each depth-separable convolution module comprises a pointwise convolution layer, a depthwise convolution layer, a normalization layer, and an activation layer connected in series.
8. A pedestrian re-identification apparatus, the apparatus comprising:
an input module, configured to input a target image into a pedestrian re-identification network, wherein the pedestrian re-identification network comprises a depth-separable convolutional layer, the depth-separable convolutional layer comprises a plurality of parallel branches and a feature fusion layer cascaded with the parallel branches, and each parallel branch is formed by connecting different numbers of depth-separable convolution modules in series;
a feature extraction module, configured to extract features of the target image through each parallel branch and obtain feature maps corresponding to different scale information output by each parallel branch;
a feature fusion module, configured to perform, through the feature fusion layer, feature fusion processing on the feature maps corresponding to the different scale information output by each parallel branch to obtain a fusion feature map;
and a re-identification module, configured to perform pedestrian re-identification according to the fusion feature map.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011010031.7A 2020-09-23 2020-09-23 Pedestrian re-identification method and device, computer equipment and storage medium Pending CN112183295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011010031.7A CN112183295A (en) 2020-09-23 2020-09-23 Pedestrian re-identification method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112183295A true CN112183295A (en) 2021-01-05

Family

ID=73955881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011010031.7A Pending CN112183295A (en) 2020-09-23 2020-09-23 Pedestrian re-identification method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112183295A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200184256A1 (en) * 2018-12-10 2020-06-11 Canon Kabushiki Kaisha Method, system and apparatus for performing re-identification in images
CN111291633A (en) * 2020-01-17 2020-06-16 复旦大学 Real-time pedestrian re-identification method and device
CN111259850A (en) * 2020-01-23 2020-06-09 同济大学 Pedestrian re-identification method integrating random batch mask and multi-scale representation learning
CN111325111A (en) * 2020-01-23 2020-06-23 同济大学 Pedestrian re-identification method integrating inverse attention and multi-scale deep supervision
CN111523470A (en) * 2020-04-23 2020-08-11 苏州浪潮智能科技有限公司 Feature fusion block, convolutional neural network, pedestrian re-identification method and related equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHI, JIE; ZHOU, YALI; ZHANG, QIZHI: "Object Recognition System for Service Robots Based on Improved Mask RCNN and Kinect", Chinese Journal of Scientific Instrument, no. 04, 15 April 2019 (2019-04-15) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112766176A (en) * 2021-01-21 2021-05-07 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN112766176B (en) * 2021-01-21 2023-12-01 深圳市安软科技股份有限公司 Training method of lightweight convolutional neural network and face attribute recognition method
CN113158780A (en) * 2021-03-09 2021-07-23 中国科学院深圳先进技术研究院 Regional crowd density estimation method, electronic device and storage medium
CN113158780B (en) * 2021-03-09 2023-10-27 中国科学院深圳先进技术研究院 Regional crowd density estimation method, electronic equipment and storage medium
CN113887373A (en) * 2021-09-27 2022-01-04 中关村科学城城市大脑股份有限公司 Attitude identification method and system based on urban intelligent sports parallel fusion network
WO2023077998A1 (en) * 2021-11-05 2023-05-11 通号通信信息集团有限公司 Method and system for adaptive feature fusion in convolutional neural network
WO2023169369A1 (en) * 2022-03-11 2023-09-14 浪潮(北京)电子信息产业有限公司 Pedestrian re-identification method, system, apparatus and device, and medium
CN114677661A (en) * 2022-03-24 2022-06-28 智道网联科技(北京)有限公司 Roadside identifier identification method and device and electronic equipment
CN115511968A (en) * 2022-11-21 2022-12-23 珠海亿智电子科技有限公司 Two-dimensional hand posture estimation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination