CN112071075B - Escaping vehicle re-identification method - Google Patents

Escaping vehicle re-identification method

Info

Publication number
CN112071075B
CN112071075B (application CN202010595381.8A / CN202010595381A)
Authority
CN
China
Prior art keywords
view
vehicle
global
path
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010595381.8A
Other languages
Chinese (zh)
Other versions
CN112071075A (en)
Inventor
孙伟
代广昭
戴亮
张旭
常鹏帅
张国策
陈旋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202010595381.8A priority Critical patent/CN112071075B/en
Publication of CN112071075A publication Critical patent/CN112071075A/en
Application granted granted Critical
Publication of CN112071075B publication Critical patent/CN112071075B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/01 Detecting movement of traffic to be counted or controlled
    • G08G1/017 Detecting movement of traffic to be counted or controlled identifying vehicles
    • G08G1/0175 Detecting movement of traffic to be counted or controlled identifying vehicles by photographing vehicles, e.g. when violating traffic rules
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/18 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • H04N7/181 Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast for receiving images from a plurality of remote sources
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a method for re-identifying an escaping vehicle, which comprises the following steps: (1) constructing a target camera topological network and predicting the trajectory across associated cameras; (2) based on view-aware metric learning, learning two different depth metrics for S-view (same-view) and D-view (cross-view) sample pairs; (3) vehicle re-identification under adaptive attention based on dual paths. The dual paths comprise a global path and a local path; dual-path vehicle re-identification is carried out respectively in the S-view same-view and D-view cross-view feature spaces of step (2), the global path extracts the global features of the picture, and the local path supplements the global features. According to the method, a key monitoring area with the optimal time sequence is obtained by constructing a topological network of cameras around the suspicious vehicle; different loss functions are applied through depth metric learning, an adaptive attention model is added, the re-identification task is carried out, the travelling track of the vehicle is obtained, and the accuracy of re-identification of escaping vehicles is improved.

Description

Re-identification method for escaping vehicles
Technical Field
The invention relates to a vehicle re-identification method, in particular to an escaping vehicle re-identification method.
Background
With the development of science and technology and the improvement of people's living standards, the use frequency and ownership of automobiles have gradually increased, and people's awareness of traffic safety and the schemes for handling accidents have improved accordingly. Once a traffic accident occurs, how to apply artificial-intelligence recognition so that the accident can be handled quickly, accurately, and in a standardized and intelligent manner is very important.
In recent years, with the introduction of large data sets and the development of deep learning algorithms, as well as the widespread use of traffic cameras, vehicle re-identification based on deep learning has enjoyed significant success over the past decade. The vehicle re-identification technology has great application potential in the fields of urban safety monitoring and intelligent traffic monitoring, particularly in the task of accurately and quickly re-identifying vehicles causing traffic accidents and escaping.
In view of the non-obvious differences between different vehicles, vehicle re-identification remains a very challenging task, especially with large data volumes. This work faces significant challenges. First, appearance-based approaches tend to yield unsatisfactory results because differences between different vehicle classes taken from similar viewpoints are small, while differences within the same vehicle class taken from different viewpoints are large. Although depth metric learning has been successful in learning features under viewpoint changes, extreme viewpoint changes of vehicles (e.g., 180°) remain very challenging, and the impact of viewpoint changes on the accuracy of the re-identification task is significant. Second, subtle cues such as tire models, window stickers, window borders and custom decorations inside the car are difficult to capture in a global appearance representation. In the extreme, different vehicles may possess similar colors and shapes; in particular, vehicles from the same manufacturer in different years have similar appearances that differ only in small local decorations. Therefore, when the vehicle re-identification model makes a decision, feature acquisition under viewpoint changes and adaptive attention to the discriminative parts for re-identification are very important.
In vehicle re-identification, not all key points provide identification information, and the orientation of the vehicle in the query picture is a determining factor for selecting key points. The recognition process of the vehicle re-identification model therefore needs to learn view-aware metrics and focus attention on the discriminative parts.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a method for re-identifying escaping vehicles that combines vehicle re-identification with rapid localization of hit-and-run vehicles in traffic accidents.
Technical scheme: the invention discloses a vehicle re-identification method, which comprises the following steps: 1. constructing a target camera topological network and predicting the key monitoring area of a vehicle escaping after a hit-and-run accident; 2. based on view-aware metric learning, learning the depth metrics under two different view constraints in the S-view same-view samples and the D-view cross-view samples respectively; 3. vehicle re-identification under adaptive attention based on dual paths. The dual paths in step 3 comprise a global path and a local path; vehicle re-identification along the global path and the local path is carried out respectively in the S-view same-view and D-view cross-view feature spaces of step 2, the global path extracts global picture features, and the local path extracts local discriminative features through adaptive attention to supplement the global features.
In the step 1, the monitoring detection range of the target is narrowed through the time transition probability among the cameras, wherein the monitoring detection range is a key monitoring area, and the method specifically comprises the following steps:
step 1.1: establishing road section information of a vehicle monitoring scene to be inquired and a network topological structure of multiple cameras through a map and an actual camera view in data;
step 1.2: the suspicious vehicle in the monitoring circle is tracked by the monitoring system; the key point is that, after the hit-and-run vehicle is observed at its initial position, the positions of the camera or cameras where it will appear next need to be determined, and the cameras where it may appear are associated;
step 1.3: the probabilities of the hit-and-run vehicle to be queried appearing in the associated camera set are analyzed and sorted, and a small number of cameras with the optimal time-sequence relation are found as key monitoring areas.
After step 1.3 is completed, steps 2 and 3 are executed, and the key monitoring circle is updated after the vehicle is re-identified.
In step 2, a two-branch network is provided which maps the input vehicle image into two feature spaces; the step specifically comprises the following steps:
step 2.1: inputting a picture of the vehicle to be queried, first predicting the absolute view of each image with a view classifier and dividing the views into front, side and rear; if an image pair comes from the same or a similar view, it is classified as an S-view pair, otherwise as a D-view pair;
step 2.2: images classified as S-view pairs are sent into the S-view feature space for same-view constraint training, and images classified as D-view pairs are sent into the D-view feature space for cross-view constraint training;
step 2.3: attention feature fusion is performed in the two feature spaces S-view and D-view respectively, yielding a fused attention model for the S-view feature space and a fused attention model for the D-view feature space.
In step 3, a dual-path adaptive attention model is added to the S-view and D-view feature spaces respectively for vehicle re-identification; the global appearance path captures global features of the vehicle appearance, the orientation-constrained local appearance path learns to capture local discriminative features, and vehicles inconsistent with the appearance of the query vehicle are filtered out. The step specifically comprises the following steps:
step 3.1: the backbone network uses ResNet-50 and ResNet-101 simultaneously as baseline models, is pre-trained on the VehicleID data set, and then extracts the global feature f_g of the vehicle;
Step 3.2: using a two-stage model to estimate key points and orientations of the vehicle, and comprising the following two steps of:
step 3.2.1: the convolutional network based on VGG-19 is used to make a rough hot spot map estimate for 21 classes, 21 classes include 20 key points and 1 background, the convolutional network based on VCG-19 is trained using a pixel-by-pixel multi-class cross entropy loss function, the loss function is:
Figure GDA0003730056710000021
in the formula I i,j Is a vector of corresponding pixel locations (i, j) on all output channels,
Figure GDA0003730056710000022
is a ground truth label of each pixel position, H and W respectively represent the height and width of the hot spot diagram, x i,j (k) A predictor representing the corresponding pixel location (i, j) on all output channels;
step 3.2.2: down-sampling the input image through HRNet and refining the rough key points and orientation from step 3.2.1;
step 3.3: adaptively selecting key points and extracting subtle local features; the orientations of the vehicle are divided into 8 classes: front, rear, left, left-front, left-rear, right, right-front and right-rear; a key-point selector is designed which adaptively selects key points based on the predicted orientation;
step 3.4: the adaptive-attention appearance detection models trained in steps 3.1, 3.2 and 3.3 are added to the S-view and D-view feature spaces of step 2 respectively for joint optimization.
Beneficial effects: compared with the prior art, the invention has the following remarkable effects: 1. a topological network of cameras around the suspicious vehicle is constructed, and a key monitoring area with the optimal time sequence is obtained by calculating the transition frequency between associated cameras; 2. different loss functions are used for depth metric learning at different views, and an adaptive attention model is added to the process for accurate search, so that not only can the re-identification task be performed, but the travelling track of the vehicle can also be obtained; 3. the network performs well, generalizes strongly, and effectively improves the accuracy of re-identification of escaping vehicles; 4. an escaping vehicle re-identification method is provided.
Drawings
FIG. 1 is a schematic diagram of the identification process of the present invention;
FIG. 2 is a diagram of a partial camera node-arc model of the present invention;
FIG. 3 is a diagram of a model for dual-path appearance detection based on adaptive attention strategy according to the present invention;
FIG. 4 is a diagram of an adaptive attention model under the learning based on the perspective perception metric according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
1. Building network structures and models
Vehicle re-identification from the artificial intelligence field is combined with rapid localization of the vehicle that caused the traffic accident. Given a picture of the suspected vehicle, a pre-trained network and a vehicle re-identification algorithm accurately identify matching images in the suspect-vehicle database.
The camera network is combined with a geographic information system (GIS) to construct a camera-network topological structure; during identification, the camera group with the optimal time-sequence relation is obtained as the key monitoring area by calculating the transfer-time frequency between cameras, so that the fields of view in which the vehicle will appear can be actively predicted.
The method comprises a view-aware adaptive-attention metric learning model with two constraint modes, a same-view constraint (S-view) and a cross-view constraint (D-view), which are fused with an adaptive attention model and trained with triplet loss and cross-entropy loss.
During appearance detection, a dual-path adaptive attention model is adopted. First, the macroscopic features of the vehicle are extracted through the global appearance path; local discriminative features are then captured through the orientation-constrained local appearance path, and finally the global and local features are jointly optimized.
2. Implementation procedure
Step 1, constructing a target camera topological network
Step 1.1, a road network is used to express the reachable space of the city, and the road network is then expressed with a node-arc model. The center line of a road section represents a road (i.e., an arc segment), and a node represents an intersection of road sections. The mathematical model is expressed as Q(N, A), where N = (n_1, n_2, ..., n_m) represents the set of nodes and A = (a_1, a_2, ..., a_m) the set of arc segments. In practice, owing to the complexity of camera placement and the uncertainty of view boundaries, camera views may overlap, be adjacent or be non-adjacent, which increases the difficulty of building the camera topology. In the invention, camera fields of view are divided into 4 types according to the monitored scene: intersection area (A), one-way lane (B), single roadway (C) and two-way roadway (D). A camera together with its field of view is treated as a node, and each camera node is given 3 attributes: arc (the street where the camera is located), node (the nearest intersection the camera belongs to) and offset (the distance between the camera node and the nearest intersection). According to the above definition, the actual road camera views are added to the Google map to construct the camera topology for hit-and-run vehicles.
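For illustration only, the following minimal Python sketch shows one way the node-arc model Q(N, A) and the camera attributes (arc, node, offset, view type) described above could be represented; the class names, field names and example road data are assumptions, not part of the patent.

```python
# Illustrative sketch of the node-arc road model and camera nodes.
# All names and example values below are assumptions, not from the patent.
from dataclasses import dataclass, field

@dataclass
class CameraNode:
    cam_id: str        # camera identifier
    arc: str           # street (arc segment) where the camera is located
    node: str          # nearest intersection (node) the camera belongs to
    offset: float      # distance from the camera to that intersection, in metres
    view_type: str     # 'A' intersection, 'B' one-way lane, 'C' single roadway, 'D' two-way roadway

@dataclass
class RoadNetwork:
    nodes: set = field(default_factory=set)       # N = {n1, ..., nm}: intersections
    arcs: dict = field(default_factory=dict)      # A = {arc_id: (n_from, n_to, length_m)}
    cameras: dict = field(default_factory=dict)   # cam_id -> CameraNode

    def add_arc(self, arc_id, n_from, n_to, length_m):
        self.nodes.update({n_from, n_to})
        self.arcs[arc_id] = (n_from, n_to, length_m)

    def add_camera(self, cam):
        self.cameras[cam.cam_id] = cam

# Example: two intersections joined by one street, each end monitored by a camera.
q = RoadNetwork()
q.add_arc("a1", "n1", "n2", length_m=450.0)
q.add_camera(CameraNode("C1", arc="a1", node="n1", offset=30.0, view_type="A"))
q.add_camera(CameraNode("C2", arc="a1", node="n2", offset=55.0, view_type="B"))
```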
Step 1.2, defining a search surrounding ring of the camera as O:
O = (F_Ψ, G(g_s))   (1)

In formula (1), G(g_s) is the set of camera-view nodes associated with the starting camera node g_s; formula (2) defines this set in terms of g_e, an associated camera-view node reached by the search, and the road termination points. In formula (3), F_Ψ gives the number of paths between the starting camera and all cameras, taken over the set of all paths within the monitoring enclosure between g_s and g_e; a single path from g_s to g_e must start and end at camera-node positions and may contain no camera nodes other than its start and end nodes.
The definition shows that the searching is started from the initial camera where the hit-and-run vehicle is located, and the monitoring enclosure is formed along different searching paths until the node of the searching camera or the road termination point; if the hit-and-run vehicle reaches the end of the road along the path for installing the camera or stops in the monitoring enclosure, the monitoring enclosure is regarded as a run-away vehicle area.
Step 1.3, carrying out collaborative analysis between cameras, wherein the specific process can be divided into three steps:
step 1.3.1, determining the initial position of the vehicle where the hit-and-run vehicle is found, then sorting according to the shortest path between the start-stop camera and the associated camera, and obtaining the spatial relationship between the cameras according to the Google map.
Step 1.3.2, calculate probability function
P(g_e) = F_{g_e} / F_ψ   (4)

In formula (4), F_{g_e} is the number of paths between camera g_e and the starting camera, and F_ψ is the number of paths between the starting camera and all cameras.
P(g_e) indicates the probability that the hit-and-run vehicle, starting from the initial camera, appears at an associated camera at the next moment; the higher the probability, the more likely the hit-and-run vehicle is to appear in that camera's field of view. The first 6 camera nodes with the largest probabilities are found and sorted in order, and the area where these camera nodes are located is regarded as the key monitoring area.
Step 1.3.3, according to the path lengths and the vehicle speed, calculating the time from the initial camera to each key-monitoring-area camera obtained in step 1.3.2, and sorting the cameras from smallest to largest. Suppose the first sorting result is C_2 > C_5 > C_8 > C_9 > C_10 > C_15 ... C_n, and that travelling from the initial camera to camera C_2 requires time t_0. If, after t_0 has elapsed, the fleeing vehicle has not appeared at camera C_2, the possibility of it appearing at C_2 is excluded; the times of all cameras are simultaneously reduced by t_0 and the cameras are re-sorted, giving a second group of key monitoring cameras C_5 > C_8 > C_9 > C_10 > C_15 > C_20 ... C_n, which continue to be observed. If the hit-and-run vehicle appears in a camera's field of view, tracking is successful; that camera replaces the initial camera, the process returns to step 1.3.1 and the cycle continues. If no hit-and-run vehicle appears in any camera area, the hit-and-run vehicle has disappeared from the search range. Finally, after the escape-vehicle camera topological network has been constructed, steps 2 and 3 are carried out and the vehicle re-identification models are trained.
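For illustration, the following hedged Python sketch mimics steps 1.3.1 to 1.3.3: it ranks associated cameras by the transition probability P(g_e) = F_{g_e}/F_ψ, keeps the top 6 as the key monitoring area, estimates arrival times from path length and an assumed speed, and shifts the remaining times by t_0 when the first camera is ruled out. The function names, sample counts and speed value are illustrative assumptions.

```python
# Hedged sketch of steps 1.3.1-1.3.3; all inputs below are illustrative.

def rank_key_cameras(path_counts, top_k=6):
    """path_counts: {cam_id: number of enclosure paths from the start camera} (F_ge)."""
    total = sum(path_counts.values())                               # F_psi
    probs = {cam: n / total for cam, n in path_counts.items()}      # P(g_e), formula (4)
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:top_k], probs

def rerank_by_time(path_len_m, speed_mps, cameras):
    """Estimated arrival time = path length / assumed vehicle speed."""
    eta = {cam: path_len_m[cam] / speed_mps for cam in cameras}
    return sorted(cameras, key=eta.get), eta

# Toy example with assumed path counts and lengths
counts = {"C2": 5, "C5": 4, "C8": 3, "C9": 3, "C10": 2, "C15": 2, "C20": 1}
key_cams, probs = rank_key_cameras(counts)
lengths = {c: 300.0 * (i + 1) for i, c in enumerate(key_cams)}
order, eta = rerank_by_time(lengths, 12.0, key_cams)

# Step 1.3.3: if the vehicle does not appear at the first camera within t_0 seconds,
# drop it, shift the remaining ETAs by t_0 and re-rank the rest.
t_0 = eta[order[0]]
remaining = order[1:]
eta = {c: eta[c] - t_0 for c in remaining}
second_group = sorted(remaining, key=eta.get)
```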
Step 2, view-aware metric learning
The commonly used triplet loss is adopted to establish a baseline for metric learning, cross-entropy loss is added for vehicle classification, and the loss function is jointly optimized with the triplet loss and the cross-entropy loss. In the different feature spaces, i.e., same-view S-view and cross-view D-view, two different depth metrics are learned for the input images. Denote the dataset by X and an image pair by I = (x_i, x_j), where x_i, x_j ∈ X; the function f denotes the mapping from the original image to the feature space, namely f_s and f_d; D is the Euclidean distance between features, D(I) = D(x_i, x_j) = ||f(x_i) - f(x_j)||_2. The sample distances D_s(I) and D_d(I) in the respective feature spaces S-view and D-view are:

D_s(I) = ||f_s(x_i) - f_s(x_j)||_2   (5)

D_d(I) = ||f_d(x_i) - f_d(x_j)||_2   (6)

Given three samples x, x^+, x^-, a triplet is constructed, where x and x^+ are samples of the same category (the same vehicle): P_s^+ denotes an S-view positive sample pair and P_d^+ a D-view positive sample pair; x^- is a sample of a different category: P_s^- denotes an S-view negative sample pair and P_d^- a D-view negative sample pair. With the positive pair I^+ = (x, x^+) and the negative pair I^- = (x, x^-), the triplet loss is defined as:

max( D(I^+) + α - D(I^-), 0 )   (7)

In formula (7), α is the minimum margin between the two Euclidean distances D(I^+) and D(I^-).
Step 2.1, triplets are constructed on the VeRi-776 data set. To identify the view relation within each image pair (i.e., each triplet), the view of the vehicle is computed first: a view classifier roughly divides the vehicle views in all images into 3 classes: front, side and rear. The view classifier adopts RegNet as its base network and is trained with a cross-entropy loss (at this stage the view is only coarsely classified; finer classification is performed in the adaptive-attention vehicle re-identification stage). An image pair is classified as an S-view pair if it comes from the same or a similar view, otherwise as a D-view pair.
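As an illustrative sketch of step 2.1 (not the patent's actual implementation), the snippet below builds a 3-class RegNet view classifier with torchvision and uses it to label an image pair as S-view or D-view; the specific RegNet variant, input size and training details are assumptions.

```python
# Hedged sketch of the coarse view classifier and S-view / D-view pair labelling.
# The RegNet variant (regnet_y_400mf) and the 224x224 input size are assumptions.
import torch
import torchvision

view_classifier = torchvision.models.regnet_y_400mf(num_classes=3)  # front, side, rear
ce_loss = torch.nn.CrossEntropyLoss()                                # trained with cross-entropy

@torch.no_grad()
def pair_type(img_a, img_b):
    """Return 'S-view' if both images receive the same coarse view label, else 'D-view'."""
    view_classifier.eval()
    logits = view_classifier(torch.stack([img_a, img_b]))  # (2, 3)
    va, vb = logits.argmax(dim=1).tolist()
    return "S-view" if va == vb else "D-view"

# Example with random tensors standing in for two vehicle crops
a, b = torch.rand(3, 224, 224), torch.rand(3, 224, 224)
print(pair_type(a, b))
```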
Step 2.2, the N view-aware images obtained in step 2.1 are mapped through a series of stacked convolutional layers, and two convolutional branches (different feature spaces) are then added so that all pictures are converted into 2N; the sample distances in the corresponding feature spaces are computed as D_s(I) = ||f_s(x_i) - f_s(x_j)||_2 and D_d(I) = ||f_d(x_i) - f_d(x_j)||_2. For view-aware metric learning, two constraints are employed: the same-view constraint and the cross-view constraint.
Same-view constraint: D(P^+) < D(P^-) must always hold in both feature spaces for sample pairs at the same view; the triplet loss functions L_s and L_d in the S-view and D-view feature spaces are respectively:

L_s = Σ max( D_s(P_s^+) + α - D_s(P_s^-), 0 )   (8)

L_d = Σ max( D_d(P_d^+) + α - D_d(P_d^-), 0 )   (9)

Cross-view constraint: when the image pairs come from different views, D(P^+) < D(P^-) must still hold for sample pairs taken in the respective feature spaces; the corresponding triplet loss function L_cross is:

L_cross = Σ max( D_d(P_d^+) + α - D_s(P_s^-), 0 )   (10)

Step 2.3, the triplet loss functions of steps 2.1 and 2.2 are combined:

L_triplet = L_s + L_d + L_cross   (11)
In the data set X there are N vehicle categories (IDs). For a given input picture x corresponding to a vehicle with label y, cross-entropy loss is used to penalize wrong identity predictions, improving the accuracy of vehicle identity prediction. The corresponding loss function is:

L_softmax = - Σ_{i=1}^{N} p_i log( p̂_i )   (12)

In formula (12), p_i is the ground-truth label corresponding to the input sample picture x, and p̂_i is the predicted value.

The loss function is jointly optimized using the triplet loss and the cross-entropy loss together:

L_view = ω·L_softmax + (1 - ω)·L_triplet   (13)

In formula (13), L_view is the view-based loss function and the weight ω = 0.25.
Step 3, vehicle re-identification based on adaptive attention
Adaptive attention models are added to the two different S-view and D-view feature spaces obtained in step 2 respectively to re-identify the vehicle.
Step 3.1, global feature extraction (f_g). ResNet-50 and ResNet-101 are used as the backbone networks and also as the baseline models; the backbones are pre-trained on the VehicleID data set, while the data set used for vehicle re-identification is VeRi-776. The 2048-dimensional feature vector from the last convolutional layer is fed into a shallow multi-layer perceptron trained with an L2 softmax loss.
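A minimal sketch of the step 3.1 global path, assuming a torchvision ResNet-50 backbone whose pooled 2048-dimensional output is taken as f_g and fed to a small MLP with an L2-softmax-style loss (features L2-normalised and scaled before cross-entropy); the scale factor, MLP width and identity count are assumptions.

```python
# Hedged sketch of the global appearance path; scale, MLP width and num_ids are assumptions.
import torch
import torch.nn as nn
import torchvision

class GlobalPath(nn.Module):
    def __init__(self, num_ids=576, scale=16.0):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # up to global pooling
        self.mlp = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, num_ids))
        self.scale = scale

    def forward(self, x):
        f_g = self.features(x).flatten(1)                 # (B, 2048) global feature f_g
        # L2-softmax style: normalise and scale the feature before classification
        logits = self.mlp(self.scale * nn.functional.normalize(f_g, dim=1))
        return f_g, logits

model = GlobalPath()
imgs = torch.rand(2, 3, 224, 224)
f_g, logits = model(imgs)
loss = nn.functional.cross_entropy(logits, torch.tensor([0, 1]))
```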
Step 3.2, in the step, key points and orientation estimation in the attention strategy required by step 3.1 are obtained, and the specific steps can be divided into two steps:
step 3.2.1, using a full convolution network based on VGG-19 to perform a rough H × W (64 × 64) hot spot map estimation on the picture, the result is 21 types (N) 1 =21, which includes 20 keypoints and 1 background), the network is trained using a pixel-by-pixel multi-class cross-entropy loss function, the loss function being:
L_kp = -(1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} Σ_{k=1}^{N_1} l*_{i,j}(k) · log( exp(x_{i,j}(k)) / Σ_{k'=1}^{N_1} exp(x_{i,j}(k')) )   (14)

In formula (14), l_{i,j} is the vector of outputs over all channels at pixel location (i, j), l*_{i,j} is the one-hot ground-truth label of each pixel position, H and W respectively denote the height and width of the hotspot map, and x_{i,j}(k) is the predicted value for channel k at pixel location (i, j).
Step 3.2.2, the HRNet network is fine-tuned. Compared with stacked hourglass structures and other methods that recover high-resolution representations from low-resolution ones, HRNet retains the high-resolution representation and gradually adds high-to-low-resolution subnetworks through multi-scale fusion. The predicted key points and hotspot maps are therefore spatially more accurate, and at the same time the view predicted by the view classifier in the view-aware metric learning is refined.
The input image is down sampled by HRNet and the coarse keypoint and direction estimates are redefined in step 3.2.1. The refine rear vehicle perspective results can be refined into 8 categories: rear, left, left front, left rear, right, right front and right rear. To train the redefinement of the hotspot graph and the estimation of the directional branches in step 3.2.2, the mean square error and cross entropy loss functions were used respectively but excluding the training of the background picture.
The loss function in step 3.2.2 is:

L_refine = L_hm + μ·L_dir   (15)

where L_hm denotes the regression loss of the hotspot map:

L_hm = (1/(H·W·N_2)) Σ_{k=1}^{N_2} Σ_{i=1}^{H} Σ_{j=1}^{W} ( h_k(i, j) - h*_k(i, j) )²   (16)

and L_dir denotes the orientation classification loss function:

L_dir = -log p(p*)   (17)

In formula (15), μ is a hyper-parameter balancing the two losses, with value 11. In formula (16), H and W respectively denote the height and width of the hotspot map, N_2 = N_1 - 1 = 20, and h_k(i, j) and h*_k(i, j) are respectively the predicted hotspot map and the real hotspot map of the k-th key point at (i, j) in step 3.2.2. In formula (17), p, p* and N_p respectively denote the predicted orientation vector, the corresponding true orientation and the number of orientation classes; p(p*) is the probability that the predicted orientation is the true orientation, and p(i) is the probability of one orientation in the predicted orientation vector.
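A hedged sketch of the refinement losses (15)-(17): mean-square error over the 20 refined key-point heat maps plus cross-entropy over the 8 orientations, balanced by μ. Which of the two terms μ multiplies, and the tensor shapes, are assumptions.

```python
# Hedged sketch of formulas (15)-(17); shapes and the placement of mu are assumptions.
import torch
import torch.nn.functional as F

def refinement_loss(pred_hm, gt_hm, dir_logits, dir_label, mu=11.0):
    l_hm = F.mse_loss(pred_hm, gt_hm)                 # formula (16): heat-map regression
    l_dir = F.cross_entropy(dir_logits, dir_label)    # formula (17): orientation classification
    return l_hm + mu * l_dir                          # formula (15); mu's placement is assumed

pred_hm = torch.rand(2, 20, 64, 64)    # refined key-point heat maps (background excluded)
gt_hm = torch.rand(2, 20, 64, 64)
dir_logits = torch.randn(2, 8)         # front, rear, left, left-front, left-rear, right, ...
dir_label = torch.tensor([3, 7])
loss = refinement_loss(pred_hm, gt_hm, dir_logits, dir_label)
```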
Step 3.3, according to the statistics of view and orientation, the vehicle orientations are divided into 8 classes: front, rear, left, left-front, left-rear, right, right-front and right-rear. There is, however, no clear boundary between two adjacent orientations, although the key points they expose differ. To overcome this problem, a key-point selector is designed that adaptively selects key points based on the predicted orientation likelihood. It works as follows: for each orientation set, the 8 most informative key points are counted; given the most likely orientation (computed in step 3.2), the key-point selector picks the 8 key points for that orientation. The hotspot maps of these 8 key points are then input, and the deeper network blocks (Res3, Res4 and Res5) of another ResNet are used to extract the local feature f_l. Finally, the local feature f_l and the global feature f_g are concatenated and jointly optimized through the multi-layer perceptron with L2 softmax loss; the corresponding optimization function is given as formula (18).
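The sketch below illustrates the idea of step 3.3: a lookup from predicted orientation to 8 key points, attention pooling of a mid-level feature map with the selected heat maps to obtain f_l, and concatenation with f_g for joint optimization. The orientation-to-keypoint table, feature sizes and pooling scheme are illustrative assumptions rather than the patent's exact design.

```python
# Hedged sketch of the adaptive key-point selector and local feature f_l.
# KEYPOINTS_PER_ORIENTATION and all dimensions are hypothetical placeholders.
import torch
import torch.nn as nn

# Hypothetical lookup: for each of the 8 orientations, the 8 most informative key points.
KEYPOINTS_PER_ORIENTATION = {o: list(range(o, o + 8)) for o in range(8)}

class LocalPath(nn.Module):
    def __init__(self, feat_dim=2048, local_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, local_dim)

    def forward(self, feat_map, heatmaps, orientation):
        # feat_map: (B, C, H, W) mid-level features; heatmaps: (B, 20, H, W) key-point maps
        f_l = []
        for b in range(feat_map.size(0)):
            idx = KEYPOINTS_PER_ORIENTATION[int(orientation[b])]     # 8 selected key points
            attn = heatmaps[b, idx].sum(dim=0, keepdim=True)         # (1, H, W) attention map
            attn = attn / (attn.sum() + 1e-6)
            pooled = (feat_map[b] * attn).sum(dim=(1, 2))            # (C,) attention-pooled feature
            f_l.append(pooled)
        return self.proj(torch.stack(f_l))                           # (B, local_dim) local feature f_l

local_path = LocalPath()
feat_map = torch.rand(2, 2048, 16, 16)
heatmaps = torch.rand(2, 20, 16, 16)
orientation = torch.tensor([0, 5])
f_l = local_path(feat_map, heatmaps, orientation)
f_g = torch.rand(2, 2048)
fused = torch.cat([f_g, f_l], dim=1)   # concatenated feature for the L2-softmax MLP
```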
step 3.4, extracting global feature f g Then extracting the local feature f of the orientation constraint l (by adaptive attention strategies); the 2048-dimensional feature vectors from the last convolutional layer are input into a shallow multi-layered perceptron trained using L2 softmax loss.
Step 3.5, the adaptive-attention appearance detection models trained in steps 3.1, 3.2 and 3.3 are added to the S-view and D-view feature spaces of step 2 (view-aware metric learning) respectively, and joint optimization is finally performed; the overall optimization function L is given as formula (19).
and finally, predicting the most probable position of the target vehicle through the multi-camera topological structure in the step 1, and acquiring a key monitoring area. And (4) applying the vehicle re-identification model trained in the step (2) and the step (3) to a suspicious vehicle database to identify the vehicle again, and returning to the step (1) to update the camera topological network to obtain the key monitoring area for continuously tracking the vehicle after the hit vehicle and the position (camera position) are found.

Claims (4)

1. A method for re-identifying an escaping vehicle, which is characterized by comprising the following steps: (1) constructing a target camera topological network and predicting the key monitoring area of a vehicle escaping after an accident; (2) based on view-aware metric learning, learning the depth metrics under two different view constraints in S-view same-view samples and D-view cross-view samples respectively; (3) vehicle re-identification under adaptive attention based on dual paths; the dual paths in step (3) comprise a global path and a local path, vehicle re-identification along the global path and the local path being carried out respectively in the S-view same-view and D-view cross-view feature spaces based on step (2), the global path extracting global picture features, and the local path extracting local discriminative features through adaptive attention to supplement the global features;
in the step (1), the monitoring detection range of the target is narrowed through the time transition probability between the cameras, the monitoring detection range is a key monitoring area, and the method specifically comprises the following steps:
step 1.1: establishing road section information of a vehicle monitoring scene to be inquired and a network topological structure of multiple cameras through a map and an actual camera view in data;
step 1.2: the suspicious vehicle in the monitoring circle is tracked by the monitoring system; the key point is that, after the hit-and-run vehicle is observed at its initial position, the positions of the camera or cameras where it will appear next need to be determined, and the cameras where it may appear are associated;
step 1.3: the probabilities of the hit-and-run vehicle to be queried appearing in the associated camera set are analyzed and sorted, and a small number of cameras with the optimal time-sequence relation are found as the key monitoring areas.
2. The escaping vehicle re-identification method according to claim 1, characterized in that after step (1.3) is completed, steps (2) and (3) are executed, and the key monitoring circle is updated after the vehicle is re-identified.
3. The escaping vehicle re-identification method according to claim 1, characterized in that a two-branch network is provided in step (2) to map the input vehicle image into two feature spaces, specifically comprising the steps of:
step 2.1: inputting a picture of a vehicle to be inquired, firstly predicting an absolute visual angle of each image by using a visual angle classifier, and dividing the visual angle into front, side or rear; if the image pair is from the same/similar view angle, classifying as S-view pair, otherwise, D-view pair;
step 2.2: sending the image classified into the S-view pair into an S-view characteristic space for S-view same-view constraint training, and sending the image classified into the D-view pair into a D-view characteristic space for D-view cross-view constraint training;
step 2.3: and respectively carrying out attention feature fusion in the two feature spaces S-view and D-view to respectively obtain a fusion attention model of the feature space S-view and a fusion attention model of the feature space D-view.
4. The escaping vehicle re-identification method according to claim 1, characterized in that in step (3), a dual-path adaptive attention model is added to the S-view and D-view feature spaces respectively for vehicle re-identification, the global appearance path capturing global features of the vehicle appearance, the orientation-constrained local appearance path learning to capture local discriminative features, and vehicles which do not accord with the appearance of the query vehicle being filtered out; the method specifically comprises the following steps:
step 3.1: the backbone network uses ResNet-50 and ResNet-101 as baseline models, is pre-trained on the VehicleID data set, and is then used to extract the global feature f_g of the vehicle;
Step 3.2: the method comprises the following steps of (1) estimating key points and orientations of a vehicle by using a two-stage model:
step 3.2.1: the VGG-19-based convolutional network is used to make a rough hotspot-map estimate for 21 classes (20 key points and 1 background); the VGG-19-based convolutional network is trained with a pixel-wise multi-class cross-entropy loss function:

L_kp = -(1/(H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} Σ_{k=1}^{21} l*_{i,j}(k) · log( exp(x_{i,j}(k)) / Σ_{k'=1}^{21} exp(x_{i,j}(k')) )

where l_{i,j} is the vector of outputs over all channels at pixel location (i, j), l*_{i,j} is the one-hot ground-truth label of each pixel position, H and W respectively denote the height and width of the hotspot map, and x_{i,j}(k) is the predicted value for channel k at pixel location (i, j);
step 3.2.2: down-sampling the input image by HRNet and refining the coarse key points and orientation from step 3.2.1;
step 3.3: adaptively selecting key points and extracting subtle local features; the orientations of the vehicle are divided into 8 classes: front, rear, left, left-front, left-rear, right, right-front and right-rear; a key-point selector is designed which adaptively selects key points based on the predicted orientation;
step 3.4: the adaptive-attention appearance detection models trained in steps 3.1, 3.2 and 3.3 are added to the S-view and D-view feature spaces of step (2) respectively for joint optimization.
CN202010595381.8A 2020-06-28 2020-06-28 Escaping vehicle re-identification method Active CN112071075B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010595381.8A CN112071075B (en) Escaping vehicle re-identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010595381.8A CN112071075B (en) Escaping vehicle re-identification method

Publications (2)

Publication Number Publication Date
CN112071075A CN112071075A (en) 2020-12-11
CN112071075B true CN112071075B (en) 2022-10-14

Family

ID=73656156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010595381.8A Active CN112071075B (en) Escaping vehicle re-identification method

Country Status (1)

Country Link
CN (1) CN112071075B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091548A (en) * 2021-09-23 2022-02-25 昆明理工大学 Vehicle cross-domain re-identification method based on key point and graph matching
CN114399537B (en) * 2022-03-23 2022-07-01 东莞先知大数据有限公司 Vehicle tracking method and system for target personnel

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171247B (en) * 2017-12-21 2020-10-27 北京大学 Vehicle re-identification method and system
CN109165589B (en) * 2018-08-14 2021-02-23 北京颂泽科技有限公司 Vehicle weight recognition method and device based on deep learning
CN109740653B (en) * 2018-12-25 2020-10-09 北京航空航天大学 Vehicle re-identification method integrating visual appearance and space-time constraint
CN110795580B (en) * 2019-10-23 2023-12-08 武汉理工大学 Vehicle weight identification method based on space-time constraint model optimization
CN110826484A (en) * 2019-11-05 2020-02-21 上海眼控科技股份有限公司 Vehicle weight recognition method and device, computer equipment and model training method

Also Published As

Publication number Publication date
CN112071075A (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN110020651B (en) License plate detection and positioning method based on deep learning network
Laddha et al. Map-supervised road detection
Kühnl et al. Monocular road segmentation using slow feature analysis
CN105930833B (en) A kind of vehicle tracking and dividing method based on video monitoring
CN101701818B (en) Method for detecting long-distance barrier
CN110910378B (en) Bimodal image visibility detection method based on depth fusion network
CN110956651A (en) Terrain semantic perception method based on fusion of vision and vibrotactile sense
CN104239867B (en) License plate locating method and system
CN112071075B (en) Escaping vehicle re-identification method
CN110889398B (en) Multi-modal image visibility detection method based on similarity network
CN110795580B (en) Vehicle weight identification method based on space-time constraint model optimization
CN111310728B (en) Pedestrian re-identification system based on monitoring camera and wireless positioning
CN111968046B (en) Target association fusion method for radar photoelectric sensor based on topological structure
CN112863186B (en) Vehicle-mounted unmanned aerial vehicle-based escaping vehicle rapid identification and tracking method
CN112419317B (en) Visual loop detection method based on self-coding network
Xue et al. A novel multi-layer framework for tiny obstacle discovery
Liao et al. Lr-cnn: Local-aware region cnn for vehicle detection in aerial imagery
CN114998993A (en) Combined pedestrian target detection and tracking combined method in automatic driving scene
Zhang et al. Front vehicle detection based on multi-sensor fusion for autonomous vehicle
CN107705327B (en) Candidate target extraction method of multi-camera network space-time model
CN116485894A (en) Video scene mapping and positioning method and device, electronic equipment and storage medium
CN116206297A (en) Video stream real-time license plate recognition system and method based on cascade neural network
Li et al. Enhancing feature fusion using attention for small object detection
CN115565157A (en) Multi-camera multi-target vehicle tracking method and system
CN115294560A (en) Vehicle tracking method and system based on attribute matching and motion trail prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant