CN114419588A - Vehicle detection method and device, edge device and storage medium - Google Patents

Vehicle detection method and device, edge device and storage medium

Info

Publication number
CN114419588A
Authority
CN
China
Prior art keywords
feature map
image feature
vehicle
network
image
Prior art date
Legal status
Withdrawn
Application number
CN202210049635.5A
Other languages
Chinese (zh)
Inventor
王振亚
艾国
杨作兴
房汝明
向志宏
Current Assignee
Hangzhou Yanji Microelectronics Co ltd
Original Assignee
Hangzhou Yanji Microelectronics Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Yanji Microelectronics Co ltd filed Critical Hangzhou Yanji Microelectronics Co ltd
Priority to CN202210049635.5A
Publication of CN114419588A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The embodiment of the invention provides a vehicle detection method, a vehicle detection device, an edge device and a storage medium. The method is performed by an edge device and comprises: extracting a first image feature map based on an original image containing a vehicle; generating a second image feature map based on the first image feature map, wherein the receptive field of the second image feature map is larger than that of the first image feature map; performing feature fusion processing on the first image feature map and the second image feature map to generate a third image feature map after feature fusion processing; and detecting the vehicle based on the third image feature map. The embodiment of the invention fuses network features of different levels and optimizes the network structure to suit the edge side, significantly reducing the parameter count and running time, making it suitable for edge devices with limited computing resources.

Description

Vehicle detection method and device, edge device and storage medium
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a vehicle detection method, a vehicle detection device, edge equipment and a storage medium.
Background
Artificial intelligence techniques have become closely related to people's daily life. In traffic monitoring systems, vehicle detection and vehicle re-identification have become very important basic tasks. Vehicle detection detects the vehicle in an image, and vehicle re-identification can serve to locate and supervise a target vehicle. Vehicle re-identification has wide applications, such as target vehicle search, cross-border vehicle tracking, traffic behavior analysis, automatic toll collection systems, parking lot entrances, vehicle counting and the like. With the rise of deep neural networks and the introduction of large data sets, improving the accuracy of vehicle re-identification has become a research hotspot in the fields of computer vision and multimedia in recent years.
In current practical applications, network structures for vehicle detection and related applications (such as vehicle re-identification) have a large number of parameters and slow inference speed, and are therefore not suitable for edge devices (such as cameras) with limited computing resources.
Disclosure of Invention
The embodiment of the invention provides a vehicle detection method, a vehicle detection device, edge equipment and a storage medium.
The technical scheme of the embodiment of the invention is as follows:
a vehicle detection method, the method performed by an edge device, comprising:
extracting a first image feature map based on an original image containing a vehicle;
generating a second image feature map based on the first image feature map, wherein the receptive field of the second image feature map is larger than that of the first image feature map;
performing feature fusion processing on the first image feature map and the second image feature map to generate a third image feature map after feature fusion processing;
and detecting the vehicle based on the third image feature map.
In an exemplary embodiment, the generating a second image feature map based on the first image feature map comprises:
inputting the first image feature map into a spatial pyramid pooling network to generate the second image feature map, the pyramid pooling network comprising a plurality of convolution kernels each adapted to transform the first image feature map into a respective image feature map having a respective receptive field, and a stitching section adapted to stitch the respective image feature map and the first image feature map provided by each of the plurality of convolution kernels into the second image feature map.
In an exemplary embodiment, the performing the feature fusion process on the first image feature map and the second image feature map comprises:
inputting the first image feature map and the second image feature map into a feature pyramid network to perform a feature fusion operation, the feature pyramid network comprising a downsampling section and an upsampling section, wherein there is a horizontal connection between the downsampling section and the upsampling section, the downsampling section being adapted to perform a downsampling process on the first image feature map and the second image feature map to generate a feature layer comprising high-level features, and the upsampling section being adapted to perform an upsampling process on the feature layer.
In an exemplary embodiment, the detecting the vehicle based on the third image feature map includes: outputting the position of the center point of the vehicle based on the third image feature map.
In an exemplary embodiment, the method further comprises at least one of:
calibrating the position of the center point;
predicting the width and height of the detection frame taking the calibrated central point as the center;
determining a vehicle characteristic adapted to re-identify the vehicle.
In one exemplary embodiment, the detecting the vehicle based on the third image feature map is: outputting, in a first portion of a network output layer, a location of a center point of the vehicle based on the third image feature map;
the method further comprises the following steps:
calibrating the location of the center point in a second portion of the network output layer;
predicting, in a third portion of the network output layer, a width and a height of a detection box centered on the calibrated center point;
in a fourth portion of the network output layer, determining a vehicle characteristic adapted to re-identify the vehicle;
based on a loss function LtotalTraining an overall network comprising the pyramid pooling network, the feature pyramid network and the network output layer, wherein
Figure BDA0003473499070000031
Figure BDA0003473499070000032
Wherein L isheatIs a center point loss function; l isoffCalibrating a loss function for the center point; l issizeA loss function for the width and height of the detection box; l isthA loss function for determining a characteristic of the vehicle; e is the base of the natural logarithm; w1 is a first predetermined weight, and w2 is a second predetermined weight.
A vehicle detection apparatus, the apparatus being incorporated in an edge device, comprising:
an extraction module configured to extract a first image feature map based on an original image containing a vehicle;
a generating module configured to generate a second image feature map based on the first image feature map, wherein a receptive field of the second image feature map is larger than a receptive field of the first image feature map;
a fusion module configured to perform feature fusion processing on the first image feature map and the second image feature map to generate a third image feature map after feature fusion processing;
a detection module configured to detect the vehicle based on the third image feature map.
In an exemplary embodiment, the generating module is configured to input the first image feature map into a spatial pyramid pooling network to generate the second image feature map, the pyramid pooling network comprises a plurality of convolution kernels and a stitching section, wherein each convolution kernel is adapted to convert the first image feature map into a respective image feature map having a respective receptive field, and the stitching section is adapted to stitch the respective image feature map and the first image feature map provided by each of the plurality of convolution kernels into the second image feature map.
In an exemplary embodiment, the fusion module is configured to input the first image feature map and the second image feature map into a feature pyramid network to perform a feature fusion operation, the feature pyramid network comprising a down-sampling part and an up-sampling part, wherein the down-sampling part and the up-sampling part have a horizontal connection therebetween, the down-sampling part is adapted to perform a down-sampling process on the first image feature map and the second image feature map to generate a feature layer comprising a higher-level feature, and the up-sampling part is adapted to perform an up-sampling process on the feature layer.
In an exemplary embodiment, the detection module is configured to output a position of a center point of the vehicle based on the third image feature map.
In an exemplary embodiment, the detection module is further configured to perform at least one of:
calibrating the position of the center point;
predicting the width and height of the detection frame taking the calibrated central point as the center;
determining a vehicle characteristic adapted to re-identify the vehicle.
In an exemplary embodiment, the detection module is configured to output, in a first portion of a network output layer, a location of a center point of the vehicle based on the third image feature map; calibrating the location of the center point in a second portion of the network output layer; predicting, in a third portion of the network output layer, a width and a height of a detection box centered on the calibrated center point; in a fourth portion of the network output layer, determining a vehicle characteristic adapted to re-identify the vehicle; the device also includes:
a training module configured to be based on a loss function LtotalTraining an overall network comprising the pyramid pooling network, the feature pyramid network and the network output layer, wherein
Figure BDA0003473499070000041
Figure BDA0003473499070000042
Wherein L isheatIs a center point loss function; l isoffCalibrating a loss function for the center point; l issizeA loss function for the width and height of the detection box; l isthA loss function for determining a characteristic of the vehicle; e is the base of the natural logarithm; w1 is a first predetermined weight, and w2 is a second predetermined weight.
An edge device, comprising:
a memory;
a processor;
wherein the memory has stored therein an application executable by the processor for causing the processor to execute the vehicle detection method as defined in any one of the above.
A computer readable storage medium having computer readable instructions stored therein for performing a vehicle detection method as described in any one of the above.
As can be seen from the above technical solutions, in the embodiment of the present invention, a first image feature map is extracted based on an original image containing a vehicle; a second image feature map is generated based on the first image feature map, wherein the receptive field of the second image feature map is larger than that of the first image feature map; feature fusion processing is performed on the first image feature map and the second image feature map to generate a third image feature map after the feature fusion processing; and the vehicle is detected based on the third image feature map. Therefore, the embodiment of the invention fuses network features of different levels and optimizes the network structure to suit the edge side, significantly reducing the parameter count and running time, making it suitable for edge devices with limited computing resources.
Drawings
Fig. 1 is an exemplary flowchart of a vehicle detection method according to an embodiment of the present invention.
Fig. 2 is a processing diagram of a backbone network according to an embodiment of the present invention.
Fig. 3 is a processing diagram of a Spatial Pyramid Pooling (SPP) structure according to an embodiment of the present invention.
Fig. 4 is a processing diagram of a Feature Pyramid Network (FPN) structure according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of a vehicle detection and re-identification process according to an embodiment of the invention.
Fig. 6 is a structural diagram of a vehicle detection device according to an embodiment of the present invention.
FIG. 7 is an exemplary block diagram of a vehicle detection device having a memory-processor architecture in accordance with the present invention.
Fig. 8 is an exemplary configuration diagram of the vehicle detecting device of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings.
For simplicity and clarity of description, the invention is described below through several representative embodiments. Numerous details of the embodiments are set forth to provide an understanding of the principles of the invention. It will be apparent, however, that the invention may be practiced without these specific details. Some embodiments are not described in detail, but are only outlined as frameworks, in order to avoid unnecessarily obscuring aspects of the invention. Hereinafter, "including" means "including but not limited to", and "according to …" means "at least according to …, but not limited to … only". In view of the conventions of the Chinese language, where the following description does not specify the number of a component, the component may be one or more, i.e., may be understood as at least one.
Hereinafter, terms related to the embodiments of the present disclosure are explained.
Vehicle detection: the detection of the position of the vehicle in the image is a key step in vehicle analysis, and is the basis for subsequent vehicle type recognition, vehicle logo recognition, license plate recognition, vehicle characteristic recognition and the like.
Vehicle re-identification (ReID): determining which vehicles among a set of vehicle images are the same vehicle.
Anchor-based (anchor box) detection: in anchor-based detection, the object detection problem is typically modeled as classifying and regressing candidate regions. In a single-stage detector, the candidate regions are anchor boxes generated by a sliding-window method; in a two-stage detector, the candidate regions are proposals generated by a Region Proposal Network (RPN), but the RPN itself still classifies and regresses anchor boxes generated in a sliding-window manner.
Anchor-free detection: anchor-free detectors mainly fall into two types: (1) keypoint-based detection algorithms, which detect the upper-left and lower-right corners of the target and combine the corners into a detection box; (2) center-based detection algorithms, which directly detect the center region and boundary information of the object, decoupling classification and regression into two sub-networks.
Backbone network: the network used to perform feature extraction, generally used at the front end to extract image information and generate a feature map for use by subsequent networks. For example, it may be implemented as VGGNet, ResNet, or the like.
Head network: the network that produces the output, making predictions using the previously extracted features.
Neck network: arranged between the backbone network and the head network in order to make better use of the features extracted by the backbone.
Receptive Field: in a convolutional neural network, the receptive field is defined as the size of the region of the input image that a pixel on the feature map output by each layer maps back to. In general, a point on the feature map corresponds to a region of the input image.
Considering that current edge devices have limited computing power and cannot effectively run vehicle detection and related applications (for example, they cannot run vehicle detection and re-identification networks at the same time), the embodiment of the invention redesigns and optimizes the network structure, selects a structure better suited to the edge side, and integrates detection and re-identification so that both kinds of information are obtained with a single inference pass, significantly reducing the amount of computation.
The embodiments of the present invention preferably use CenterNet as the baseline for modification. CenterNet has a simple structure and adapts well to different hardware devices. The anchor-based detection networks commonly used in the prior art are not well suited to extracting features: in an anchor-based network, an object may be covered and detected by multiple anchor boxes, causing severe ambiguity in the network. When detecting dense objects, the boxes of neighboring objects overlap heavily, so the information inside a box largely contains several surrounding objects, which is unfavorable for feature extraction; a center-point-based approach alleviates this situation. In addition, for edge devices the CPU side is under heavy computational pressure, and different chips handle different structures (for example the SSD and YOLOv3 commonly used in industry) differently, making it hard to obtain consistent results. CenterNet, as a one-stage anchor-free detection network, has no Non-Maximum Suppression (NMS) post-processing, which reduces the pressure on the CPU side, and it contains no special structures, making deployment friendlier.
Fig. 1 is an exemplary flowchart of a vehicle detection method according to an embodiment of the present invention. The method is performed by an edge device.
As shown in fig. 1, the method includes:
step 101: a first image feature map is extracted based on an original image containing a vehicle.
Here, the original image may be captured by a camera included in the edge device, or may be captured from another image source (e.g., a cloud).
Here, the first image feature map may be extracted from the original image using a backbone network. The backbone network can be implemented as an edge-friendly MobileNet network (e.g., MobileNetV2). MobileNet adopts depthwise separable convolutions and an inverted residual structure, significantly reducing the parameter count while maintaining accuracy. Optionally, the backbone network may also be implemented as AlexNet, VGG, ResNet, or the like.
Fig. 2 is a processing diagram of a backbone network according to an embodiment of the present invention. In fig. 2, the backbone network is implemented as a MobileNetV2 network.
In fig. 2, the left sub-figure shows the processing of the backbone network when the stride equals 1, and the right sub-figure shows the processing when the stride equals 2. When the stride equals 1, the input and output features are connected by an element-wise sum; when the stride equals 2, there is no shortcut connection between the input and output features. A convolution is usually followed by a Rectified Linear Unit (ReLU) for nonlinear activation. As shown in fig. 2, ReLU6 is used in MobileNet. In ReLU6, the maximum output is limited to 6, in order to retain good numerical resolution even on low-precision mobile devices.
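For illustration only (not part of the patent text), the following is a minimal PyTorch sketch of a MobileNetV2-style inverted residual block with ReLU6 as described above; the expansion factor of 6 and the layer names are assumptions.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, in_ch, out_ch, stride, expand=6):
        super().__init__()
        hidden = in_ch * expand
        # Shortcut (element-wise sum) only when stride == 1 and shapes match.
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),       # 1x1 expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                         # ReLU6 caps the output at 6
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),           # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),       # 1x1 linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_shortcut else out
```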
Step 102: and generating a second image feature map based on the first image feature map, wherein the receptive field of the second image feature map is larger than that of the first image feature map.
In a typical CNN structure, a fully connected layer usually follows the convolutional layers. The number of features of the fully connected layer is fixed, so the network input size must also be fixed. In reality, however, input images do not always have the required size, and the typical processing means are cropping (crop) and stretching (warp), which change the aspect ratio and the size of the input image and thereby distort the original image. Spatial Pyramid Pooling (SPP) solves this image distortion problem well.
In an exemplary embodiment, the step 102 of generating a second image feature map based on the first image feature map comprises: the first image feature map is input into an SPP network to generate a second image feature map, the SPP network comprising a plurality of convolution kernels each adapted to transform the first image feature map into a respective image feature map having a respective receptive field, and a stitching section adapted to stitch the respective image feature map and the first image feature map provided by each of the plurality of convolution kernels into the second image feature map.
FIG. 3 is a schematic processing diagram of an SPP structure according to an embodiment of the present invention.
As shown in fig. 3, the input feature map (corresponding to the first image feature map) of 16 (height) × 16 (width) × 256 (channels) is passed through a 3 × 3 convolution kernel, a 5 × 5 convolution kernel and a 7 × 7 convolution kernel respectively, to obtain three 16 × 16 × 256 feature maps with 3 × 3, 5 × 5 and 7 × 7 receptive fields. These three feature maps and the original 16 × 16 × 256 input are then stitched into a 16 × 16 × 1024 feature map (corresponding to the second image feature map).
The above exemplary description describes a typical example of generating the second image feature map based on the first image feature map, and those skilled in the art will appreciate that this description is only exemplary and is not intended to limit the scope of the embodiments of the present invention.
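As an illustration of this SPP structure, the following sketch assumes stride-1 convolutions with padding chosen so that each branch keeps the 16 × 16 spatial size and only the receptive field differs; the channel counts follow the figure (4 × 256 = 1024). It is a sketch under those assumptions, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, channels=256, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per kernel size; "same" padding keeps the spatial size.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, stride=1, padding=k // 2)
            for k in kernel_sizes
        )

    def forward(self, x):
        # Stitch the original map and each branch output along the channel axis.
        return torch.cat([x] + [branch(x) for branch in self.branches], dim=1)

spp = SPP()
feat = torch.randn(1, 256, 16, 16)   # first image feature map
print(spp(feat).shape)               # -> torch.Size([1, 1024, 16, 16])
```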
Step 103: and performing feature fusion processing on the first image feature map and the second image feature map to generate a third image feature map after feature fusion processing.
The Feature Pyramid Network (FPN) introduces a feature pyramid into target detection, which improves detection accuracy, especially for small objects. FPN naturally utilizes the pyramidal form of Convolutional Neural Network (CNN) hierarchical features while generating a feature pyramid with strong semantic information at all scales. The FPN structure comprises a top-down pathway and lateral connections, so that shallow, high-resolution features are fused with deep, semantically rich features; from a single input image at a single scale, a feature pyramid with strong semantic information at all scales is built quickly without significant cost.
In an exemplary embodiment, the step 103 of performing the feature fusion process on the first image feature map and the second image feature map includes: inputting the first image feature map and the second image feature map into an FPN network to perform a feature fusion operation, the FPN network comprising a down-sampling part and an up-sampling part, wherein the down-sampling part and the up-sampling part have a horizontal connection therebetween, the down-sampling part is adapted to perform a down-sampling process on the first image feature map and the second image feature map to generate a feature layer comprising a high-level feature, and the up-sampling part is adapted to perform an up-sampling process on the feature layer.
FIG. 4 is a process diagram of an FPN structure according to an embodiment of the invention. In fig. 4, there is a transversal connection between the down-sampling part 41 and the up-sampling part 42. The downsampling section 41 may perform downsampling processing on the input feature map to generate a feature layer containing high-layer features; the upsampling section 42 may perform upsampling processing on a feature layer containing a higher layer feature.
In the FPN structure, the down-sampling part 41 forms the bottom-up pathway, i.e. the forward propagation of the network, in which the feature map becomes progressively smaller; the output of the last layer of each stage is selected as the reference feature map for classification and regression. The up-sampling part 42 forms the top-down pathway: the high-level feature map is upsampled (e.g., by nearest-neighbor upsampling) and then laterally connected to the feature of the previous level, so that the high-level features are enhanced and the low-level positioning details can be utilized.
The up-sampling section 42 and the down-sampling section 41 are joined by horizontal (lateral) connections: the feature map of the corresponding layer is processed by a convolution kernel and added to the upsampled map pixel by pixel. This process is iterated until a refined feature map is generated. After the refined feature map is obtained, the fused feature map is passed through a further convolution kernel to eliminate the aliasing effect of upsampling, generating the finally required feature map.
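A minimal FPN sketch following the description above; the 1 × 1 lateral and 3 × 3 smoothing kernel sizes are the usual FPN choices and are assumptions here, as are the channel counts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # Lateral 1x1 convolutions align the channel counts of each level.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions after fusion suppress the aliasing effect of upsampling.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1) for _ in in_channels
        )

    def forward(self, feats):
        # feats: bottom-up feature maps, ordered from high resolution to low.
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down pathway: upsample the coarser map and add it pixel-wise.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(l) for s, l in zip(self.smooth, laterals)]
```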
Step 104: and detecting the vehicle based on the third image feature map.
In one exemplary embodiment, the detecting the vehicle based on the third image feature map in step 104 includes: and outputting the position of the central point of the vehicle based on the third image feature map.
In an exemplary embodiment, the method further comprises at least one of:
(1) calibrating the position of the central point;
(2) predicting the width and height of a detection frame taking the calibrated central point as the center;
(3) determining a vehicle characteristic adapted to re-identify the vehicle.
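The optional steps listed above can be illustrated with a CenterNet-style decoding sketch (an assumption for illustration, not the patent's exact post-processing): heatmap peaks give candidate centers, the offset branch calibrates them, and the size branch gives the box width and height.

```python
import torch

def decode_centers(heatmap, offset, size, score_thresh=0.3):
    # heatmap: (1, H, W) after sigmoid; offset, size: (2, H, W), feature-map units.
    # A full decoder would also keep only local maxima of the heatmap (peak picking).
    scores = heatmap[0]
    ys, xs = torch.nonzero(scores > score_thresh, as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        cx = x + offset[0, y, x].item()   # calibrated center x
        cy = y + offset[1, y, x].item()   # calibrated center y
        w, h = size[0, y, x].item(), size[1, y, x].item()
        # Boxes are in feature-map coordinates; multiply by the downsampling factor
        # (4) to map back onto the original image. The re-identification feature for
        # this detection would simply be read from the embedding map at (y, x).
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                      scores[y, x].item()))
    return boxes
```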
In one exemplary embodiment, the detecting the vehicle based on the third image feature map in step 104 includes: outputting, in a first portion of a network output layer, the location of the center point of the vehicle based on the third image feature map. The method further comprises: calibrating the position of the center point in a second portion of the network output layer; predicting, in a third portion of the network output layer, the width and height of a detection box centered on the calibrated center point; determining, in a fourth portion of the network output layer, a vehicle feature adapted to re-identify the vehicle; and training, based on a loss function $L_{total}$, an overall network comprising the pyramid pooling network, the feature pyramid network and the network output layer, wherein

$$L_{total} = \frac{1}{2}\left(\frac{1}{e^{w1}}\left(L_{heat} + L_{off} + L_{size}\right) + \frac{1}{e^{w2}}L_{th} + w1 + w2\right)$$

where $L_{heat}$ is the center point loss function; $L_{off}$ is the center point calibration loss function; $L_{size}$ is the loss function for the width and height of the detection box; $L_{th}$ is the loss function for determining the vehicle feature; e is the base of the natural logarithm; w1 is a first predetermined weight, and w2 is a second predetermined weight.
FIG. 5 is a schematic diagram of a vehicle detection and re-identification process according to an embodiment of the invention.
First, a vehicle image is obtained; the image is then input into the backbone network to extract image features; the upper- and lower-layer semantic features are then combined through the SPP and FPN structures, and the processed features are passed through different output branches to obtain the position information, feature information, etc. of the vehicle. The backbone network can be implemented as the more edge-friendly MobileNetV2, whose depthwise separable convolutions and inverted residual structure significantly reduce the number of parameters while maintaining accuracy.
Furthermore, the improvement of the embodiment of the invention for the neck network is as follows: the prior-art network has no neck part, while the invention performs enhanced feature fusion mainly in the neck part. Multi-layer feature fusion is necessary because the re-identification part of the network cannot rely only on the semantic information of higher layers, but must also moderately include the color and texture information of lower layers. The embodiment of the invention first adopts SPP to obtain, through convolution kernels of different sizes, feature maps with different receptive fields and the same size, which are then stitched together; an FPN module follows, which naturally utilizes the pyramidal form of CNN hierarchical features while generating a feature pyramid with strong semantic information at all scales. The FPN structure quickly builds a feature pyramid with strong semantic information at all scales from a single-scale input image, without significant cost. In the FPN structure, top-layer features are fused with lower-layer features through upsampling, and prediction is finally performed on the output bottom layer.
The improvement of the embodiment of the invention for the head network is as follows. The output may contain four parts: (1) a center-point position (HeatMap) part of size (W/4, H/4, 1), where W and H are the width and height of the network input respectively, outputting the position of the center point of each class of object; (2) an Offset part of size (W/4, H/4, 2), refining the HeatMap output to improve positioning accuracy; (3) a detection-box Height & Width part of size (W/4, H/4, 2), predicting the width and height of the detection box centered on the calibrated center point; (4) a vehicle feature (Re-ID Embedding) part of shape (H, W, 128), i.e. each vehicle is characterized by a 128-dimensional vector.
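The four output branches can be sketched as follows, assuming each branch is a small convolutional head on the fused feature map; the intermediate channel width (64) and kernel sizes are illustrative assumptions.

```python
import torch.nn as nn

def head(in_ch, out_ch, mid_ch=64):
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, 1),
    )

class DetectionHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=1, emb_dim=128):
        super().__init__()
        self.heatmap = head(in_ch, num_classes)  # center-point heatmap (W/4, H/4, 1)
        self.offset = head(in_ch, 2)             # center-point calibration (W/4, H/4, 2)
        self.size = head(in_ch, 2)               # box width and height (W/4, H/4, 2)
        self.reid = head(in_ch, emb_dim)         # 128-d re-identification embedding

    def forward(self, x):
        return {
            "heatmap": self.heatmap(x).sigmoid(),
            "offset": self.offset(x),
            "size": self.size(x),
            "reid": self.reid(x),
        }
```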
With respect to the loss function (Loss) part, the multiple tasks are combined using a multi-task uncertainty loss.
(1) The center-point classification loss $L_{heat}$ uses Focal Loss, a variant of cross-entropy loss, where α and β are hyper-parameters of the Focal Loss, and N is the number of keypoints in the image, used to normalize the Focal Loss of all positive samples to 1:
$$L_{heat} = -\frac{1}{N}\sum_{xyc}\begin{cases}\left(1-\hat{Y}_{xyc}\right)^{\alpha}\log\left(\hat{Y}_{xyc}\right), & Y_{xyc}=1\\ \left(1-Y_{xyc}\right)^{\beta}\left(\hat{Y}_{xyc}\right)^{\alpha}\log\left(1-\hat{Y}_{xyc}\right), & \text{otherwise}\end{cases}$$

For example, in practical use, α and β are taken to be 2 and 4 respectively. Focal Loss is mainly used to address the severe imbalance between positive and negative samples in one-stage object detection: it reduces the weight of the large number of easy negative samples in training, which can also be understood as a form of hard example mining. Here N represents the number of pixel points in the output heatmap feature map, $Y_{xyc}$ is the ground-truth category value, and $\hat{Y}_{xyc}$ is the predicted category probability.
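A sketch of this focal loss in PyTorch, with α = 2 and β = 4 as stated in the text; pred is the sigmoid heatmap and gt the ground-truth heatmap (assumed shapes (N, C, H, W)).

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    pred = pred.clamp(eps, 1 - eps)
    pos = gt.eq(1).float()                      # positive (keypoint) locations
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred) * neg
    num_pos = pos.sum().clamp(min=1.0)          # normalize by the number of keypoints
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```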
(2) Center-point offset loss $L_{off}$: since the image is downsampled by a factor of four, the feature map introduces a precision error when mapped back onto the original image, so an additional offset is used to compensate each center point. The center points of all classes share the same offset, and this offset value is trained with the L1 loss function:

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right)\right|$$

where R = 4; $\hat{O}_{\tilde{p}}$ is the predicted offset; N represents the number of pixel points in the output offset feature map; p represents the center point of the target box; $\tilde{p} = \lfloor p/R \rfloor$ is its integer value; and $\frac{p}{R} - \tilde{p}$ represents the offset value caused by the downsampling and rounding.
(3) Target size loss $L_{size}$: to reduce the difficulty of regression, the L1 loss function is used for this prediction:

$$L_{size} = \frac{1}{N}\sum_{k}\left|\hat{s}_{k} - s_{k}\right|$$

As with the preceding $L_{off}$, N represents the number of pixel points in the output width-and-height feature map, $s_{k}$ is the size of the actual object, and $\hat{s}_{k}$ is the predicted size of the target.
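Both $L_{off}$ and $L_{size}$ are masked L1 losses evaluated at the center-point locations, so a single sketch covers them (tensor shapes are assumptions for illustration):

```python
import torch

def l1_reg_loss(pred, target, mask):
    # pred, target: (B, 2, H, W); mask: (B, H, W) with 1 at object center locations.
    mask = mask.unsqueeze(1).float()
    num = mask.sum().clamp(min=1.0)
    return (torch.abs(pred - target) * mask).sum() / num
```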
(4) Vehicle re-identification (ReID) loss $L_{th}$: the hard-sample triplet loss (hereinafter the TriHard loss) is an improved version of the triplet loss. The traditional triplet loss randomly samples three pictures from the training data; this is simple, but most of the resulting pairs are easy, easily distinguished sample pairs, and if a large proportion of training pairs are easy pairs, the network cannot learn good features. It has been found that training the network with harder samples improves its generalization ability, and there are many ways to sample hard sample pairs. The core idea of the TriHard loss is: for each training batch, P identities are randomly selected, and K different pictures are randomly selected for each identity, so that one batch contains P × K pictures. Then, for each picture a in the batch, the hardest positive sample and the hardest negative sample are selected to form a triplet with a.
$$L_{th} = \frac{1}{P\times K}\sum_{a\in \mathrm{batch}}\left(\max_{p\in A_{a}} d_{a,p} - \min_{n\in B_{a}} d_{a,n} + \alpha\right)_{+}$$

where α is a manually set threshold (margin) parameter, and $A_{a}$ and $B_{a}$ are the positive- and negative-sample sets of picture a. The TriHard loss computes the Euclidean distance d in feature space between a and every picture in the batch, then selects the positive sample p farthest from (least similar to) a and the negative sample n closest to (most similar to) a to compute the triplet loss.
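A sketch of the TriHard loss on a batch of P × K embeddings; the margin value and the handling of the self-distance are illustrative assumptions.

```python
import torch

def trihard_loss(features, labels, margin=0.3):
    dist = torch.cdist(features, features, p=2)           # pairwise Euclidean distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1))     # same-identity mask
    # Hardest positive: farthest sample with the same identity (d(a, a) = 0,
    # so the anchor itself never becomes the maximum).
    hardest_pos = (dist * same.float()).max(dim=1).values
    # Hardest negative: closest sample with a different identity.
    inf = torch.full_like(dist, float("inf"))
    hardest_neg = torch.where(same, inf, dist).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```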
Because the detection branch focuses on deep semantic features while the vehicle re-identification part needs shallow information such as texture and color, the embodiment of the invention performs multi-layer feature fusion in the network structure to combine features of different layers. The detection and re-identification losses cannot simply be added directly, as that would bias the network excessively toward one task; therefore, the multi-task uncertainty loss method is adopted to combine the branches and adaptively adjust the weight of each task.
Thus, the total loss function $L_{total}$ is as follows:

$$L_{total} = \frac{1}{2}\left(\frac{1}{e^{w1}}\left(L_{heat} + L_{off} + L_{size}\right) + \frac{1}{e^{w2}}L_{th} + w1 + w2\right)$$

where w1 and w2 are adaptive weights.
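Under the reconstruction above (a FairMOT-style uncertainty weighting, which is an assumption based on the symbols defined in the text), w1 and w2 can be implemented as learnable parameters optimized jointly with the network:

```python
import torch
import torch.nn as nn

class UncertaintyLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.zeros(1))   # detection-branch weight parameter
        self.w2 = nn.Parameter(torch.zeros(1))   # re-identification weight parameter

    def forward(self, l_heat, l_off, l_size, l_th):
        l_det = l_heat + l_off + l_size
        return 0.5 * (torch.exp(-self.w1) * l_det
                      + torch.exp(-self.w2) * l_th
                      + self.w1 + self.w2)
```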
The training process is illustrated as follows. During network training, 300,000 images from surveillance scenes were used, where each picture contains the position information of all clearly visible vehicles and 1/3 of the data additionally contains vehicle identity information. If an image contains no identity information, the parameters corresponding to the identification branch are not updated for it during training. After training for 120 epochs on the whole data set, the average precision (AP) of the detection part of the final test reaches 89.9, and the Rank-1 accuracy of the vehicle re-identification part is 93.5. With a 512 × 256 input at test time, the running time on the edge device is 30 milliseconds (ms), saving about 50 ms compared with the two-network inference of YOLOv3 + ResNet18; this improves speed, reduces the consumption of hardware resources, and is friendlier to edge devices with limited computing power.
Fig. 6 is a structural diagram of a vehicle detection device according to an embodiment of the present invention. The apparatus 600 is included in an edge device, the apparatus 600 comprising:
an extraction module 601 configured to extract a first image feature map based on an original image containing a vehicle;
a generating module 602 configured to generate a second image feature map based on the first image feature map, wherein a receptive field of the second image feature map is larger than a receptive field of the first image feature map;
a fusion module 603 configured to perform a feature fusion process on the first image feature map and the second image feature map to generate a third image feature map after the feature fusion process;
a detection module 604 configured to detect a vehicle based on the third image feature map.
In an exemplary embodiment, the generating module 602 is configured to input the first image feature map into a spatial pyramid pooling network to generate the second image feature map, the pyramid pooling network comprising a plurality of convolution kernels and a stitching section, wherein each convolution kernel is adapted to convert the first image feature map into a respective image feature map having a respective receptive field, and the stitching section is adapted to stitch the respective image feature map and the first image feature map provided by each of the plurality of convolution kernels into the second image feature map.
In an exemplary embodiment, the fusion module 603 is configured to input the first image feature map and the second image feature map into a feature pyramid network to perform a feature fusion operation, the feature pyramid network comprising a down-sampling part and an up-sampling part, wherein there is a horizontal connection between the down-sampling part and the up-sampling part, the down-sampling part is adapted to perform a down-sampling process on the first image feature map and the second image feature map to generate a feature layer comprising a higher-level feature, and the up-sampling part is adapted to perform an up-sampling process on the feature layer.
In an exemplary embodiment, the detecting module 604 is configured to output a position of a center point of the vehicle based on the third image feature map.
In an exemplary embodiment, the detection module 604 is further configured to perform at least one of the following: calibrating the position of the center point; predicting the width and height of the detection box centered on the calibrated center point; and determining a vehicle feature adapted to re-identify the vehicle.
In an exemplary embodiment, the detection module 604 is configured to output, in a first portion of the network output layer, the location of the center point of the vehicle based on the third image feature map; calibrate the position of the center point in a second portion of the network output layer; predict, in a third portion of the network output layer, the width and height of a detection box centered on the calibrated center point; and determine, in a fourth portion of the network output layer, a vehicle feature adapted to re-identify the vehicle. The apparatus 600 further comprises: a training module 606 configured to train, based on a loss function $L_{total}$, the overall network 605 comprising the pyramid pooling network, the feature pyramid network and the network output layer, wherein

$$L_{total} = \frac{1}{2}\left(\frac{1}{e^{w1}}\left(L_{heat} + L_{off} + L_{size}\right) + \frac{1}{e^{w2}}L_{th} + w1 + w2\right)$$

where $L_{heat}$ is the center point loss function; $L_{off}$ is the center point calibration loss function; $L_{size}$ is the loss function for the width and height of the detection box; $L_{th}$ is the loss function for determining the vehicle feature; e is the base of the natural logarithm; w1 is a first predetermined weight, and w2 is a second predetermined weight.
FIG. 7 is an exemplary block diagram of a vehicle detection device having a memory-processor architecture in accordance with the present invention.
As shown in fig. 7, the vehicle detection device includes: a processor 701; a memory 702; in which a memory 702 has stored therein an application program executable by the processor 701 for causing the processor 701 to execute the vehicle detection method according to the above embodiment.
The memory 702 may be embodied as various storage media such as an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Flash memory, and a Programmable Read-Only Memory (PROM). The processor 701 may be implemented to include one or more central processors, or one or more field-programmable gate arrays that integrate one or more central processor cores. In particular, the central processor or central processor core may be implemented as a CPU, an MCU, or a Digital Signal Processor (DSP).
Fig. 8 is an exemplary configuration diagram of the vehicle detecting device of the present invention. In general, the vehicle detection apparatus 800 is an edge device including: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 801 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a Graphics Processing Unit (GPU) which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 801 may also include an AI processor for processing computational operations related to machine learning. For example, the AI processor may be implemented as a neural network processor.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices.
In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the vehicle detection methods provided by various embodiments of the present disclosure. In some embodiments, the vehicle detection device 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802, and peripheral interface 803 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral devices include: at least one of a radio frequency circuit 804, a touch screen display 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one Input/Output (I/O) related peripheral to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 804 is used for receiving and transmitting Radio Frequency (RF) signals, also called electromagnetic signals. The radio frequency circuitry 804 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 804 converts an electrical signal into an electromagnetic signal to be transmitted, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or Wireless Fidelity (Wi-Fi) networks. In some embodiments, the radio frequency circuit 804 may also include Near Field Communication (NFC) related circuitry, which is not limited by this disclosure.
Display 805 is used to display a User Interface (UI). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above the surface of the display 805. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 805 may be one, disposed on the front panel of the vehicle detection device 800; in other embodiments, the number of the display screens 805 may be at least two, and each of the at least two display screens may be disposed on a different surface of the vehicle detection apparatus 800 or may be of a foldable design; in some embodiments, display 805 may be a flexible display disposed on a curved surface or a folded surface of vehicle detection device 800. Even further, the display 805 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 805 may be made of Liquid Crystal Display (LCD), Organic Light-Emitting Diode (OLED), or the like.
The camera assembly 806 is used to capture images or video. Optionally, camera assembly 806 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and a Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, camera assembly 806 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp refers to a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different positions of the vehicle detection apparatus 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic location of the vehicle detection device 800 to implement navigation or Location Based Services (LBS). The positioning component 808 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 809 is used to supply power to various components in the vehicle detection device 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power source 809 comprises a rechargeable battery, the rechargeable battery may support wired or wireless charging.
Those skilled in the art will appreciate that the above-described configuration is not intended to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
It should be noted that not all steps and modules in the above flows and structures are necessary, and some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and can be adjusted as required. The division of each module is only for convenience of describing adopted functional division, and in actual implementation, one module may be divided into multiple modules, and the functions of multiple modules may also be implemented by the same module, and these modules may be located in the same device or in different devices.
The hardware modules in the various embodiments may be implemented mechanically or electronically. For example, a hardware module may include a specially designed permanent circuit or logic device (e.g., a special purpose processor such as an FPGA or ASIC) for performing specific operations. A hardware module may also include programmable logic devices or circuits (e.g., including a general-purpose processor or other programmable processor) that are temporarily configured by software to perform certain operations. The implementation of the hardware module in a mechanical manner, or in a dedicated permanent circuit, or in a temporarily configured circuit (e.g., configured by software), may be determined based on cost and time considerations.
The present invention also provides a machine-readable storage medium storing instructions for causing a machine to perform a method according to the present application. Specifically, a system or an apparatus equipped with a storage medium on which a software program code that realizes the functions of any of the embodiments described above is stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program code stored in the storage medium. Further, part or all of the actual operations may be performed by an operating system or the like operating on the computer by instructions based on the program code. The functions of any of the above-described embodiments may also be implemented by writing the program code read out from the storage medium to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causing a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on the instructions of the program code.
Examples of the storage medium for supplying the program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-R, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs, DVD + RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or the cloud by a communication network.
"exemplary" means "serving as an example, instance, or illustration" herein, and any illustration, embodiment, or steps described as "exemplary" herein should not be construed as a preferred or advantageous alternative. For the sake of simplicity, the drawings are only schematic representations of the parts relevant to the invention, and do not represent the actual structure of the product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "a" does not mean that the number of the relevant portions of the present invention is limited to "only one", and "a" does not mean that the number of the relevant portions of the present invention "more than one" is excluded. In this document, "upper", "lower", "front", "rear", "left", "right", "inner", "outer", and the like are used only to indicate relative positional relationships between relevant portions, and do not limit absolute positions of the relevant portions.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (14)

1. A vehicle detection method, characterized in that the method is performed by an edge device, comprising:
extracting a first image feature map based on an original image containing a vehicle;
generating a second image feature map based on the first image feature map, wherein the receptive field of the second image feature map is larger than that of the first image feature map;
performing feature fusion processing on the first image feature map and the second image feature map to generate a third image feature map after feature fusion processing;
and detecting the vehicle based on the third image feature map.
2. The vehicle detecting method according to claim 1,
the generating a second image feature map based on the first image feature map comprises:
inputting the first image feature map into a spatial pyramid pooling network to generate the second image feature map, the pyramid pooling network comprising a plurality of convolution kernels each adapted to transform the first image feature map into a respective image feature map having a respective receptive field, and a stitching section adapted to stitch the respective image feature map and the first image feature map provided by each of the plurality of convolution kernels into the second image feature map.
3. The vehicle detecting method according to claim 2,
the performing of the feature fusion process on the first image feature map and the second image feature map comprises:
inputting the first image feature map and the second image feature map into a feature pyramid network to perform a feature fusion operation, the feature pyramid network comprising a downsampling section and an upsampling section, wherein there is a horizontal connection between the downsampling section and the upsampling section, the downsampling section being adapted to perform a downsampling process on the first image feature map and the second image feature map to generate a feature layer comprising high-level features, and the upsampling section being adapted to perform an upsampling process on the feature layer.
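As an illustration of the fusion in claim 3, a minimal sketch assuming a single downsampling step, a single upsampling step and one horizontal (lateral) connection, with both inputs already at the same channel count; the depths, channel handling and interpolation mode are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class FusionFPN(nn.Module):
    """Sketch: downsampling yields a high-level feature layer, upsampling
    restores resolution, and a horizontal connection mixes the first
    feature map back in."""
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.lateral = nn.Conv2d(channels, channels, 1)

    def forward(self, feat1, feat2):
        high = self.down(feat2)                                          # high-level feature layer
        up = F.interpolate(high, size=feat1.shape[-2:], mode="nearest")  # upsample back
        return up + self.lateral(feat1)                                  # horizontal connection
```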
4. The vehicle detection method according to claim 3, wherein the detecting the vehicle based on the third image feature map includes: outputting the position of the center point of the vehicle based on the third image feature map.
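For claim 4, a minimal sketch of one way the center-point positions could be read out of a per-pixel heatmap (a CenterNet-style top-k decode, which is an assumption; the claim only requires that the position of the center point be output):

```python
import torch

def decode_center_points(heatmap, k=10):
    """Sketch: take the k highest peaks of the center-point heatmap as
    candidate vehicle center positions (x, y) with their scores."""
    b, c, h, w = heatmap.shape
    scores, idx = torch.topk(heatmap.reshape(b, -1), k)   # flatten and keep the k best peaks
    idx = idx % (h * w)                                   # position within a channel
    ys = torch.div(idx, w, rounding_mode="floor")         # row of each peak
    xs = idx % w                                          # column of each peak
    return xs, ys, scores
```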
5. The vehicle detection method of claim 4, further comprising at least one of:
calibrating the position of the center point;
predicting the width and height of the detection frame taking the calibrated central point as the center;
determining a vehicle characteristic adapted to re-identify the vehicle.
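A minimal sketch of how the outputs listed in claims 4 and 5 could be produced by separate heads over the fused feature map; the head depths, channel counts and re-identification embedding size are assumptions:

```python
import torch.nn as nn

class OutputHeads(nn.Module):
    """Sketch: four parallel heads for the center-point heatmap, the center
    point calibration (offset), the detection box width/height, and a
    re-identification feature for the vehicle."""
    def __init__(self, channels, reid_dim=128):
        super().__init__()
        def head(out_channels):
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, out_channels, 1),
            )
        self.heatmap = head(1)      # vehicle center-point heatmap
        self.offset = head(2)       # calibration of the center point position
        self.size = head(2)         # width and height of the detection box
        self.reid = head(reid_dim)  # vehicle feature used for re-identification

    def forward(self, feat3):
        return self.heatmap(feat3), self.offset(feat3), self.size(feat3), self.reid(feat3)
```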
6. The vehicle detecting method according to claim 4,
the detecting the vehicle based on the third image feature map is: outputting, in a first portion of a network output layer, a location of a center point of the vehicle based on the third image feature map;
the method further comprises the following steps:
calibrating the location of the center point in a second portion of the network output layer;
predicting, in a third portion of the network output layer, a width and a height of a detection box centered on the calibrated center point;
in a fourth portion of the network output layer, determining a vehicle characteristic adapted to re-identify the vehicle;
training, based on a loss function L_total, an overall network comprising the pyramid pooling network, the feature pyramid network and the network output layer, wherein
[The definition of L_total is given by two formulas shown as images FDA0003473499060000021 and FDA0003473499060000022, which are not reproduced in the text.]
wherein L_heat is the center point loss function; L_off is the center point calibration loss function; L_size is the loss function for the width and height of the detection box; L_th is the loss function for determining the vehicle feature; e is the base of the natural logarithm; w1 is a first predetermined weight, and w2 is a second predetermined weight.
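The exact form of L_total appears only in the formula images above. Purely as an illustration of how four such losses and two weights are commonly combined (an uncertainty-weighted sum of a detection term and a re-identification term, as used by FairMOT-style trackers), a sketch follows; this combination is an assumption, not the patent's formula:

```python
import torch
import torch.nn as nn

class TotalLoss(nn.Module):
    """Illustrative combination only; the patent's L_total is defined in
    formula images that are not reproduced in the text."""
    def __init__(self):
        super().__init__()
        self.w1 = nn.Parameter(torch.zeros(1))  # first weight
        self.w2 = nn.Parameter(torch.zeros(1))  # second weight

    def forward(self, l_heat, l_off, l_size, l_th):
        l_det = l_heat + l_off + l_size                    # detection-branch losses
        return 0.5 * (torch.exp(-self.w1) * l_det          # e: base of the natural logarithm
                      + torch.exp(-self.w2) * l_th
                      + self.w1 + self.w2)
```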
7. A vehicle detection apparatus, the apparatus being incorporated in an edge device, comprising:
an extraction module configured to extract a first image feature map based on an original image containing a vehicle;
a generating module configured to generate a second image feature map based on the first image feature map, wherein a receptive field of the second image feature map is larger than a receptive field of the first image feature map;
a fusion module configured to perform feature fusion processing on the first image feature map and the second image feature map to generate a third image feature map after feature fusion processing;
a detection module configured to detect the vehicle based on the third image feature map.
8. The vehicle detecting apparatus according to claim 7,
the generation module is configured to input the first image feature map into a spatial pyramid pooling network to generate the second image feature map, the pyramid pooling network comprising a plurality of convolution kernels and a stitching section, wherein each convolution kernel is adapted to convert the first image feature map into a respective image feature map having a respective receptive field, the stitching section is adapted to stitch the respective image feature map and the first image feature map provided by each of the plurality of convolution kernels into the second image feature map.
9. The vehicle detecting apparatus according to claim 8,
the fusion module is configured to input the first image feature map and the second image feature map into a feature pyramid network to perform a feature fusion operation, the feature pyramid network comprising a downsampling part and an upsampling part, wherein the downsampling part and the upsampling part have a horizontal connection therebetween, the downsampling part is adapted to perform downsampling processing on the first image feature map and the second image feature map to generate a feature layer comprising a high-level feature, and the upsampling part is adapted to perform upsampling processing on the feature layer.
10. The vehicle detecting apparatus according to claim 9,
the detection module is configured to output the position of the center point of the vehicle based on the third image feature map.
11. The vehicle detecting apparatus according to claim 10,
the detection module further configured to perform at least one of:
calibrating the position of the center point;
predicting the width and height of the detection frame taking the calibrated central point as the center;
determining a vehicle characteristic adapted to re-identify the vehicle.
12. The vehicle detecting apparatus according to claim 10,
the detection module is configured to output the position of the center point of the vehicle based on the third image feature map in a first part of a network output layer; calibrating the location of the center point in a second portion of the network output layer; predicting, in a third portion of the network output layer, a width and a height of a detection box centered on the calibrated center point; in a fourth portion of the network output layer, determining a vehicle characteristic adapted to re-identify the vehicle; the device further comprises:
a training module configured to train, based on a loss function L_total, an overall network comprising the pyramid pooling network, the feature pyramid network and the network output layer, wherein
[The definition of L_total is given by two formulas shown as images FDA0003473499060000041 and FDA0003473499060000042, which are not reproduced in the text.]
wherein L_heat is the center point loss function; L_off is the center point calibration loss function; L_size is the loss function for the width and height of the detection box; L_th is the loss function for determining the vehicle feature; e is the base of the natural logarithm; w1 is a first predetermined weight, and w2 is a second predetermined weight.
13. An edge device, comprising:
a memory;
a processor;
wherein the memory has stored therein an application program executable by the processor for causing the processor to execute the vehicle detection method according to any one of claims 1 to 6.
14. A computer-readable storage medium having stored therein computer-readable instructions for performing the vehicle detection method according to any one of claims 1 to 6.
CN202210049635.5A 2022-01-17 2022-01-17 Vehicle detection method and device, edge device and storage medium Withdrawn CN114419588A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210049635.5A CN114419588A (en) 2022-01-17 2022-01-17 Vehicle detection method and device, edge device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210049635.5A CN114419588A (en) 2022-01-17 2022-01-17 Vehicle detection method and device, edge device and storage medium

Publications (1)

Publication Number Publication Date
CN114419588A true CN114419588A (en) 2022-04-29

Family

ID=81274381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210049635.5A Withdrawn CN114419588A (en) 2022-01-17 2022-01-17 Vehicle detection method and device, edge device and storage medium

Country Status (1)

Country Link
CN (1) CN114419588A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115187982A (en) * 2022-07-12 2022-10-14 河北华清环境科技集团股份有限公司 Algae detection method and device and terminal equipment
CN115228092A (en) * 2022-09-22 2022-10-25 腾讯科技(深圳)有限公司 Game battle force evaluation method, device and computer readable storage medium
CN115228092B (en) * 2022-09-22 2022-12-23 腾讯科技(深圳)有限公司 Game battle force evaluation method, device and computer readable storage medium
CN117218454A (en) * 2023-11-06 2023-12-12 成都合能创越软件有限公司 Small target detection method and device based on feature map information and negative sample training

Similar Documents

Publication Publication Date Title
CN109410220B (en) Image segmentation method and device, computer equipment and storage medium
CN111079576B (en) Living body detection method, living body detection device, living body detection equipment and storage medium
CN114419588A (en) Vehicle detection method and device, edge device and storage medium
CN109299315B (en) Multimedia resource classification method and device, computer equipment and storage medium
KR101528017B1 (en) Image feature detection based on application of multiple feature detectors
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN112215802B (en) Skin detection method and electronic equipment
CN110097576B (en) Motion information determination method of image feature point, task execution method and equipment
CN110807361A (en) Human body recognition method and device, computer equipment and storage medium
CN111429517A (en) Relocation method, relocation device, storage medium and electronic device
CN111126182A (en) Lane line detection method, lane line detection device, electronic device, and storage medium
CN115471662B (en) Training method, recognition method, device and storage medium for semantic segmentation model
CN114820633A (en) Semantic segmentation method, training device and training equipment of semantic segmentation model
CN113066048A (en) Segmentation map confidence determination method and device
CN111104893A (en) Target detection method and device, computer equipment and storage medium
CN113711123B (en) Focusing method and device and electronic equipment
CN110503159B (en) Character recognition method, device, equipment and medium
CN112132070A (en) Driving behavior analysis method, device, equipment and storage medium
CN114511864A (en) Text information extraction method, target model acquisition method, device and equipment
CN116206100A (en) Image processing method based on semantic information and electronic equipment
CN111709993B (en) Object pose information determining method, device, terminal and storage medium
CN110232417B (en) Image recognition method and device, computer equipment and computer readable storage medium
CN111444749B (en) Method and device for identifying road surface guide mark and storage medium
KR101806066B1 (en) Camera module with function of parking guidance
CN113378705B (en) Lane line detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220429