CN114049609A - Multilevel aggregation pedestrian re-identification method based on neural architecture search

Info

Publication number: CN114049609A
Application number: CN202111407584.0A
Authority: CN
Inventors: 王胜法; 杜亮
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2021-11-24
Filing date: 2021-11-24
Publication date: 2022-02-15

Abstract

The invention belongs to the field of image retrieval and computer vision, and provides a multi-level aggregation pedestrian re-identification method based on neural architecture search. Most current work designs the feature extraction framework manually, which requires a great deal of prior knowledge. To address this problem, the invention considers how to imitate the process of human visual perception and filter background noise, so that the model focuses its attention on the human body and on local identity-related information. Following this idea, a multi-level aggregation framework is first proposed to capture and aggregate more discriminative pedestrian features from different levels. A search space specific to pedestrian re-identification is then adopted, comprising two search units and efficient operators. Finally, a collaborative search strategy is introduced to explore the search space at both the unit level and the operation level and obtain an optimized architecture. Extensive experiments show that the architecture found by the method achieves state-of-the-art performance on four pedestrian re-identification benchmarks.

Description

Multilevel aggregation pedestrian re-identification method based on neural architecture search

Technical Field

The invention belongs to the field of image retrieval and computer vision, and relates to a multi-level aggregation pedestrian re-identification method based on neural architecture search.

Background

With the development of deep learning in computer vision, pedestrian re-identification has attracted wide attention from academia and industry as a basic task of cross-camera pedestrian tracking and intelligent security monitoring. The basic flow of pedestrian re-identification is as follows: given a query image of a person, other images of the same person are retrieved from a gallery whose candidate images are captured by non-overlapping cameras in different scenes. However, pedestrian images suffer from partial occlusion, pose variation, complex lighting conditions, background clutter and similar problems. Pedestrian re-identification therefore remains a challenging task.

In response to these challenges, existing pedestrian re-identification methods mainly match and distinguish pedestrian images by extracting diverse discriminative features. In particular, Yifan Sun et al., in "Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline)" published in 2018, integrate global and local features by dividing an image vertically into several equal parts. The method achieves good performance, but does not consider the correlation between the parts of the pedestrian. Some methods then began to learn more robust features using prior knowledge. "Omni-Scale Feature Learning for Person Re-Identification", published in 2019 by Kaiyang Zhou et al., proposes an aggregation gate to dynamically fuse features of different scales and channel features. In the same year, Pengfei Fang et al. published "Bilinear Attention Networks for Person Retrieval", which simulates human perception with an attention module to obtain richer feature representations. These methods all explore how to extract the final features, and neglect the extraction and use of multi-stage features.

Subsequent work proposed multi-level aggregation network architectures for pedestrian re-identification. Skip-connection aggregation is one of the most common methods and can aggregate multi-level features at once. But this straightforward approach introduces a lot of low-level noise, which increases the number of features without improving their quality. "Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training", published by Feng Zheng et al., uses a top-down feature pyramid structure with lateral connections. In this way a high-level semantic feature map can be constructed, but a large number of irrelevant interfering features are also introduced. In general, low-level features naturally reflect detailed local information of the image, while high-level features retain more semantic information. Recent research shows that more robust discriminative information can be extracted by effectively fusing low-level and high-level features. The multi-branch structure has therefore naturally become a development trend in pedestrian re-identification, and a large body of work based on it achieves better performance.

Current multi-branch designs fall roughly into two types: one applies various attention modules in the multi-branch network to improve each branch's ability to extract specific features; the other designs complex multi-branch connection patterns so that more features can be fused with each other. These manually designed architectures depend heavily on prior knowledge, such as tuning experience and extensive experimentation, and even when experts in the field are involved, a design may take months. To reduce labor and time costs, some work has proposed methods that automatically search convolutional neural network architectures. "Auto-ReID: Searching for a Part-Aware ConvNet for Person Re-Identification", published in 2019 by Ruijie Quan et al., introduces a part-aware module into the search space and searches out a lightweight network suited to pedestrian re-identification. It was the first method to address pedestrian re-identification through automatic neural architecture search. Subsequently, in 2021, Hanjun Li et al. published "Combined Depth Space based Architecture Search For Person Re-identification", which proposes a new combined depth search space and a low-cost search strategy to search network architectures. Although these two methods can find network architectures with few parameters, they ignore the advantages of multi-branch architectures. As a result they lack the ability to extract latent features and fuse multi-level discriminative information, which limits the potential of neural architecture search, and the searched networks may not achieve better performance.

Disclosure of Invention

We propose a method for automatically searching for an optimal multi-level aggregation architecture for the pedestrian re-identification task. Specifically, we first build a multi-level aggregation architecture that captures and aggregates more discriminative features from different levels. This process is intended to mimic human visual perception, i.e., filtering out background noise and focusing attention on mining latent information. We then design a new search space that includes two search units and six efficient convolution operations. In addition, an efficient collaborative search strategy is adopted to explore the search space, and an optimal end-to-end pedestrian re-identification architecture is finally found.

The specific scheme comprises the following steps:

A multi-level aggregation pedestrian re-identification method based on neural architecture search comprises the following steps:

firstly, constructing a multi-level aggregation architecture;

the multi-level aggregation architecture consists of two parts, namely a backbone network for extracting global information and a branch network for integrating the features of multiple levels: the backbone network is ResNet50 and is responsible for extracting semantic information at four different levels; the branch network consists of three search branches, and each search branch consists of four search modules; the macrostructure of each search module is obtained by searching the unit-level search space, and its microstructure is obtained by exploring the operation-level search space; because the features of each level differ, the search modules found under retrieval-loss supervision also differ, so the detailed composition of the three search branches differs as well, the semantic information is processed and integrated more comprehensively, and the final branch network can extract richer and more robust features;

multi-stage features are extracted by the backbone network and then integrated and processed by the search branches of the branch network; each search branch processes the features extracted by the corresponding backbone layer, and its output is aggregated with the features output by the next backbone layer to serve as the input of the next search branch, thereby providing prior knowledge for feature integration at the next scale; the final output feature is the aggregation of the features of the last backbone stage and the features processed by the third search branch;

secondly, designing a search space for building the branch network, wherein the search space comprises a unit-level search space and an operation-level search space; two units are designed at the unit level, namely a distillation unit and a fine-grained unit; six efficient convolution modules best suited to the pedestrian re-identification task are introduced at the operation level;

the distillation unit is a graph structure composed mainly of four nodes and nine edges, wherein the nodes represent features and the edges represent candidate operations obtained by searching the operation-level search space; seven of the nine edges are edges to be searched, and two are fixed-operation edges; the nodes are denoted n₁, n₂, n₃ and n₄; every node except n₄ has two edges, one connected to the following node and the other connected to the feature combination module, wherein the feature combination module is used to join multiple features, specifically by bringing features of different dimensions to a uniform dimension and aggregating them to obtain rich and comprehensive features; the input features enter at n₁, undergo the candidate operations on the 7 edges, are aggregated by the feature combination module, and are concatenated with the original input features to form the output features;

the fine-grained unit consists mainly of three parts, namely feature splitting, parallel streams and feature fusion; the parallel streams are first defined as four streams, each stream has three nodes representing features, and the edges connecting the feature nodes are operations to be searched, obtained from the operation-level search space; the fine-grained unit thus comprises eight edges to be searched; the input features are split into 4 horizontal stripe features by the feature splitting operation, the stripes are then processed by the parallel stream structure, and the processed features are restored by the feature fusion operation into new features of the same size as the input;

thirdly, exploring the search space with an efficient collaborative search strategy, and finally finding an optimal end-to-end pedestrian re-identification network architecture under the guidance of a retrieval loss function;

the units and operations that form the optimal architecture are searched with an efficient collaborative search strategy, which essentially solves a bilevel optimization problem;

wherein α = {α_o, α_c} denotes the hyper-parameters of the unit level and the operation level, ω is the weight parameter, and the two loss terms are the training loss and the validation loss, respectively.

Under the supervision of the retrieval loss, units are searched iteratively in the unit-level space and the operations composing the units are searched in the operation-level space, finally yielding the optimal network architecture for the pedestrian re-identification task.
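The equation of the bilevel problem is not reproduced in this text. As a hedged reconstruction, assuming it follows the usual differentiable architecture search formulation, and using \mathcal{L}_{train} and \mathcal{L}_{val} as assumed names for the training and validation losses, it would take a form such as:

```latex
\min_{\alpha}\; \mathcal{L}_{val}\bigl(\omega^{*}(\alpha),\, \alpha\bigr)
\quad \text{s.t.} \quad
\omega^{*}(\alpha) = \arg\min_{\omega}\; \mathcal{L}_{train}(\omega,\, \alpha),
\qquad \alpha = \{\alpha_o,\, \alpha_c\}
```

The inner problem fits the network weights ω on the training data for a fixed architecture, and the outer problem chooses the architecture hyper-parameters α that minimize the validation loss of those fitted weights.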

The invention has the beneficial effects that:

1) The invention provides a new pedestrian re-identification model from the perspective of neural architecture search: a more reasonable multi-level aggregation base framework is constructed, and the detailed components of the network are searched automatically following the idea of neural architecture search, finally forming the optimal pedestrian re-identification model. The algorithm was also tested on four pedestrian re-identification benchmarks, where it performs excellently, and ablation studies verify the discriminative power of the features at each level and the effectiveness of the search space.

2) The proposed algorithm performs excellently at extracting pedestrian features, is highly robust to background noise, and can accurately locate the target region even when body parts of other pedestrians appear in the image. It also runs fast and is efficient.

Drawings

FIG. 1 is a schematic diagram of a network framework for the method of the present invention;

FIG. 2 is a schematic diagram of the search space of the present invention, (a) representing a unit level search space, and (b) representing an operation level search space;

FIG. 3 shows feature attention regions for pedestrian images according to the invention; (a) and (b) show cases with only one pedestrian in the image, and (c) and (d) show cases with body parts of several pedestrians in the image;

FIG. 4 shows results of the invention on images captured by large-scale outdoor cameras.

Detailed Description

The invention is based on the idea of neural architecture search. It takes a pedestrian image as input, uses a multi-level aggregation architecture to aggregate global features and multi-level local features, and constructs a new unit-level search space and operation-level search space. An efficient collaborative search strategy then explores the search space and automatically finds the optimal structural combination, focusing the model's attention on more discriminative regions and thereby improving retrieval accuracy. Specific embodiments of the invention are further described below with reference to the drawings and the technical solution.

The first step, constructing the multi-level aggregation architecture:

We take a ResNet50 pre-trained on ImageNet as the global branch to extract multi-stage features, because of the wide applicability and good performance of ResNet50. Unlike other works, we change the stride of the last down-sampling in layer 4 of ResNet50 to 1, which preserves more spatial information. The global branch contains 4 residual blocks, denoted Block1, Block2, Block3 and Block4. These residual blocks yield features of different scales and semantic information, denoted f1, f2, f3 and f4. A search module is then inserted at each stage of the hierarchical features to integrate and process them and to provide prior knowledge for feature integration at the next scale; each search module is built from the two units through automatic search. By connecting the features of the previous level with those of the current level, an initial multi-level aggregation branch is established. The branch gradually learns more discriminative semantic features and details, imitating the human habit of observation, namely attending to pedestrians in the foreground amid irrelevant background interference and mining finer-grained clues. Finally, end-to-end training and the search for the optimal architecture are carried out using the retrieval loss.
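The data flow of the multi-level aggregation branch can be illustrated with a minimal Python sketch. It assumes toy features represented as flat lists and concatenation as the aggregation operator; the names `aggregate` and `multi_level_forward` are hypothetical, and a real network would also have to match the spatial resolution of the features before aggregating them.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # toy stand-in for a feature map


def aggregate(a: Feature, b: Feature) -> Feature:
    """Assumed aggregation operator: plain concatenation of two features."""
    return list(a) + list(b)


def multi_level_forward(
    backbone_feats: Sequence[Feature],
    branches: Sequence[Callable[[Feature], Feature]],
) -> Feature:
    """Chain three search branches over the backbone features f1..f4.

    Branch k processes its input, and the result is aggregated with the
    output of the next backbone stage to form the input of branch k+1.
    The last aggregation (third branch output with f4) is the final feature.
    """
    f1, f2, f3, f4 = backbone_feats
    x = f1
    for branch, f_next in zip(branches, (f2, f3, f4)):
        x = aggregate(branch(x), f_next)
    return x


if __name__ == "__main__":
    add_one = lambda f: [v + 1.0 for v in f]  # placeholder for a searched branch
    feats = [[1.0], [10.0], [100.0], [1000.0]]
    print(multi_level_forward(feats, [add_one, add_one, add_one]))
```

In this sketch each branch only ever sees the features of its own level plus what earlier branches passed on, which is the "prior knowledge for the next scale" described above.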

Secondly, designing a new search space:

The search spaces of traditional neural architecture search techniques were originally designed for classification tasks; for pedestrian re-identification, such search spaces ignore fine-grained and multi-scale semantic information. To solve this problem, we design a new search space at both the macro and micro levels. At the macro level it comprises two units, a distillation unit and a fine-grained unit.

The distillation unit consists of 4 nodes, and this structure can be described as:

Here the two outputs at the i-th layer are its distilled features and its refined features, respectively, and DE and RE denote the distillation edges and the refinement edges, respectively. Finally, all distilled features are concatenated as the final output of the distillation unit.
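Since the structural equation is not reproduced in this text, the following Python sketch shows only one plausible reading of the distillation unit. It assumes that each of the four nodes sends one distillation edge (DE) to the feature combination module, that three refinement edges (RE) link consecutive nodes, and that the combined result is concatenated with the input; this gives seven candidate edges, but the exact wiring is an assumption, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Feature = List[float]  # toy stand-in for a feature map
Op = Callable[[Feature], Feature]


def combine(features: Sequence[Feature]) -> Feature:
    """Assumed feature combination module: concatenate all distilled features."""
    return [v for f in features for v in f]


def distillation_unit(
    x: Feature,
    distill_ops: Sequence[Op],  # 4 assumed DE edges, one per node n1..n4
    refine_ops: Sequence[Op],   # 3 assumed RE edges: n1->n2, n2->n3, n3->n4
) -> Feature:
    node = x
    distilled = []
    for i in range(4):
        distilled.append(distill_ops[i](node))  # DE: send part to the combiner
        if i < 3:
            node = refine_ops[i](node)          # RE: pass refined features on
    # Combined distilled features are concatenated with the original input.
    return combine(distilled) + list(x)


if __name__ == "__main__":
    keep_first = lambda f: f[:1]              # placeholder distillation op
    double = lambda f: [2.0 * v for v in f]   # placeholder refinement op
    print(distillation_unit([1.0, 2.0], [keep_first] * 4, [double] * 3))
```

In the search, each placeholder op would be replaced by whichever candidate operation the operation-level search selects for that edge.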

The fine-grained unit consists of three parts, namely feature splitting, parallel streams and feature fusion. First, a feature map F is given as the input of the unit and is split into N horizontal stripes, yielding local features that contain fine-grained structural clues. Several neural modules are then connected in series within each stream to process the local features and further mine latent information. Finally, all stripes are fused, as the inverse of the splitting operation, to obtain the final output.
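The split, process and fuse steps can be sketched in a few lines of Python. The sketch assumes a feature map stored as a list of rows whose height is divisible by N, and stream operations that preserve the stripe shape; `split_stripes`, `fuse_stripes` and `fine_grained_unit` are hypothetical names.

```python
from typing import Callable, List, Sequence

Row = List[float]
FeatureMap = List[Row]  # toy feature map: a list of rows (height x width)
Op = Callable[[FeatureMap], FeatureMap]


def split_stripes(feature_map: FeatureMap, n: int) -> List[FeatureMap]:
    """Split the map into n horizontal stripes of equal height."""
    height = len(feature_map)
    if height % n != 0:
        raise ValueError("height must be divisible by the number of stripes")
    step = height // n
    return [feature_map[i * step:(i + 1) * step] for i in range(n)]


def fuse_stripes(stripes: Sequence[FeatureMap]) -> FeatureMap:
    """Inverse of split_stripes: stack the stripes back in order."""
    return [row for stripe in stripes for row in stripe]


def fine_grained_unit(
    feature_map: FeatureMap,
    stream_ops: Sequence[Sequence[Op]],  # one op sequence per stream
    n: int = 4,
) -> FeatureMap:
    processed = []
    for stripe, ops in zip(split_stripes(feature_map, n), stream_ops):
        for op in ops:  # ops are applied in series within a stream
            stripe = op(stripe)
        processed.append(stripe)
    return fuse_stripes(processed)


if __name__ == "__main__":
    fmap = [[float(r), float(r)] for r in range(8)]  # 8 rows, 2 columns
    double = lambda s: [[2.0 * v for v in row] for row in s]
    print(fine_grained_unit(fmap, [[double, double]] * 4))
```

With four streams of two operations each, the sketch has eight operation slots, matching the eight edges to be searched that the disclosure attributes to the fine-grained unit.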

The candidate operations are several efficient convolution modules: a 3 × 3 dilated convolution, a 3 × 3 depthwise separable convolution, a residual convolution module, a dense convolution module, a spatial attention module, and a channel attention module.
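How one edge might choose among the six candidates can be sketched as follows. The sketch assumes a softmax relaxation of the kind used in differentiable architecture search, which this text does not state explicitly; the operation outputs are toy scalars, and `CANDIDATE_OPS`, `mixed_output` and `select_op` are hypothetical names.

```python
import math
from typing import List, Sequence

# The six candidate operations named above (identifiers are illustrative).
CANDIDATE_OPS = [
    "dilated_conv_3x3",
    "depthwise_separable_conv_3x3",
    "residual_conv",
    "dense_conv",
    "spatial_attention",
    "channel_attention",
]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the operation-level parameters."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_output(op_outputs: Sequence[float], alpha: Sequence[float]) -> float:
    """During search: softmax-weighted sum of all candidate outputs (assumed)."""
    return sum(w * o for w, o in zip(softmax(alpha), op_outputs))


def select_op(alpha: Sequence[float]) -> str:
    """After search: keep the candidate with the largest parameter."""
    best = max(range(len(alpha)), key=lambda i: alpha[i])
    return CANDIDATE_OPS[best]


if __name__ == "__main__":
    alpha = [0.0, 0.0, 0.0, 0.0, 2.0, 0.0]
    print(select_op(alpha))
```

Under this assumption every edge carries one parameter per candidate during the search, and only the strongest candidate is kept in the final architecture.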

Thirdly, searching the search space by adopting an efficient collaborative search strategy:

First, some basic notation is defined: α = {α_o, α_c} denotes the hyper-parameters of the unit level and the operation level, and ω is the weight parameter. D_train and D_val are the training set and validation set, respectively.

The specific algorithm flow of the collaborative search strategy is as follows:
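The algorithm listing itself is not reproduced in this text. As a rough illustration only, the following Python sketch shows a generic first-order alternating scheme for a bilevel problem on a toy scalar loss: the weights are updated on the training split, then the unit-level and operation-level parameters are updated in turn on the validation split. The loss, the update order, the learning rate and all names are assumptions, not the patent's algorithm.

```python
from typing import Tuple


def loss_grads(omega: float, a_unit: float, a_op: float, target: float) -> Tuple[float, float, float]:
    """Gradients of the toy loss (a_unit * a_op * omega - target) ** 2."""
    err = a_unit * a_op * omega - target
    return (
        2.0 * err * a_unit * a_op,  # d loss / d omega
        2.0 * err * a_op * omega,   # d loss / d a_unit
        2.0 * err * a_unit * omega, # d loss / d a_op
    )


def collaborative_search(
    train_target: float,
    val_target: float,
    steps: int = 200,
    lr: float = 0.1,
) -> Tuple[float, float, float]:
    omega, a_unit, a_op = 1.0, 1.0, 1.0
    for _step in range(steps):
        # (1) Update the network weights on the training split.
        omega -= lr * loss_grads(omega, a_unit, a_op, train_target)[0]
        # (2) Update the unit-level parameter on the validation split.
        a_unit -= lr * loss_grads(omega, a_unit, a_op, val_target)[1]
        # (3) Update the operation-level parameter on the validation split.
        a_op -= lr * loss_grads(omega, a_unit, a_op, val_target)[2]
    return omega, a_unit, a_op


if __name__ == "__main__":
    omega, a_unit, a_op = collaborative_search(2.0, 2.0)
    print(omega, a_unit, a_op, omega * a_unit * a_op)
```

The point of the sketch is the interleaving: neither level of the search space is fixed while the other is optimized, and the weights keep adapting to the current architecture parameters.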

Loss function.

We combine the classification loss and the triplet loss into the retrieval loss, which guides the training and architecture search of the network. The training loss function can be expressed as,

where p_i is the prediction for the i-th pedestrian label and q_i is the i-th ground-truth pedestrian label; d_pos and d_neg denote the distances of the positive and negative sample pairs, respectively; and [d]₊ denotes max(0, d).
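The loss equation is not reproduced in this text. A minimal Python sketch of a combined loss of this kind, assuming a standard cross-entropy classification term and a hinge-style triplet term, is given below; the margin value 0.3, the unweighted sum, and the function names are assumptions rather than values taken from the source.

```python
import math
from typing import Sequence


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """Classification term: -sum_i q_i * log(p_i) over the identity labels."""
    return -sum(qi * math.log(pi) for pi, qi in zip(p, q) if qi > 0.0)


def triplet(d_pos: float, d_neg: float, margin: float = 0.3) -> float:
    """Triplet term [d_pos - d_neg + margin]_+ ; the margin is an assumed value."""
    return max(0.0, d_pos - d_neg + margin)


def retrieval_loss(
    p: Sequence[float],
    q: Sequence[float],
    d_pos: float,
    d_neg: float,
    margin: float = 0.3,
) -> float:
    """Assumed combination: unweighted sum of the two terms."""
    return cross_entropy(p, q) + triplet(d_pos, d_neg, margin)


if __name__ == "__main__":
    p = [0.5, 0.5]  # predicted identity probabilities
    q = [1.0, 0.0]  # one-hot ground-truth label
    print(retrieval_loss(p, q, d_pos=1.0, d_neg=2.0))
```

The triplet term vanishes once the negative pair is farther away than the positive pair by at least the margin, so in that regime only the classification term drives the update.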

Claims

1. A multi-level aggregation pedestrian re-identification method based on neural architecture search, characterized by comprising the following steps:

firstly, constructing a multi-level aggregation architecture;

the multi-level aggregation architecture consists of two parts, namely a backbone network for extracting global information and a branch network for integrating the features of multiple levels: the backbone network is ResNet50 and is responsible for extracting semantic information at four different levels; the branch network consists of three search branches, and each search branch consists of four search modules; the macrostructure of each search module is obtained by searching the unit-level search space, and its microstructure is obtained by exploring the operation-level search space; because the features of each level differ, the search modules found under retrieval-loss supervision also differ, so the detailed composition of the three search branches differs as well;

wherein α = {α_o, α_c} denotes the hyper-parameters of the unit level and the operation level, ω is the weight parameter, and the two loss terms are the training loss and the validation loss, respectively;