CN116110008A - Vehicle detection method, vehicle flow statistics method and device - Google Patents
- Publication number
- CN116110008A (application CN202310084691.7A)
- Authority
- CN
- China
- Prior art keywords
- vehicle
- module
- network
- fusion
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/54—Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G08—SIGNALLING
- G08G—TRAFFIC CONTROL SYSTEMS
- G08G1/00—Traffic control systems for road vehicles
- G08G1/01—Detecting movement of traffic to be counted or controlled
- G08G1/017—Detecting movement of traffic to be counted or controlled identifying vehicles
- G08G1/0175—Detecting movement of traffic to be counted or controlled identifying vehicles by photographing vehicles, e.g. when violating traffic rules
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/08—Detecting or categorising vehicles
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
The disclosure provides a vehicle detection method, a vehicle flow statistics method and a device, wherein a vehicle detection model is applied, the vehicle detection model comprising a backbone network, a fusion network and a prediction network, and the method comprises the following steps: extracting multi-scale image features of a vehicle image by using the backbone network, wherein the backbone network comprises N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2; performing feature fusion on the multi-scale image features by using the fusion network to obtain a plurality of fusion features; and processing each fusion feature by using the prediction network to obtain a plurality of vehicle detection results. The method provided by the exemplary embodiments of the disclosure can not only accurately classify and track the vehicles in the vehicle image, but also accurately count the traffic flow.
Description
Technical Field
The present invention relates to the field of vehicle identification technologies, and in particular, to a vehicle detection method, a vehicle flow statistics method and a device.
Background
With the rapid development of the national economy and the continuing acceleration of urbanization, vehicle ownership in China keeps growing quickly, which has caused a series of urban traffic problems. The rapid development of intelligent transportation systems offers an effective strategy for solving these traffic problems, saves a great deal of manpower and material resources, and has become the development direction of future traffic.
Traffic flow detection is an important component of an intelligent transportation system: it allows traffic resources to be allocated more effectively, prevents and alleviates urban congestion, and improves the efficiency of road transportation. Early traffic flow statistics required manual analysis of surveillance video, which consumed a large amount of human resources and was inefficient, while traditional detection methods are easily affected by external factors such as weather and illumination, reducing the accuracy of traffic flow detection. To overcome these shortcomings of traditional vehicle detection methods, deep-learning-based detection can accurately locate vehicles, acquire their position information, track them continuously, and thereby complete video-based traffic flow detection.
Disclosure of Invention
According to an aspect of the present disclosure, there is provided a vehicle detection method applying a vehicle detection model, the vehicle detection model including a backbone network, a fusion network, and a prediction network, the method including:
extracting multi-scale image features of a vehicle image by using the backbone network, wherein the backbone network comprises N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2;
performing feature fusion on the multi-scale image features by using the fusion network to obtain a plurality of fusion features;
and processing each fusion feature by using the prediction network to obtain a plurality of vehicle detection results.
According to another aspect of the present disclosure, there is provided a traffic flow statistical method, including:
extracting an initial frame image from a vehicle video;
determining a plurality of vehicle detection results of the initial frame image by using the vehicle detection method according to the exemplary embodiment of the present disclosure;
carrying out vehicle tracking on the vehicle video based on each vehicle detection result to obtain a corresponding vehicle tracking result;
and determining the vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results.
According to another aspect of the present disclosure, there is provided a vehicle detection apparatus applying a vehicle detection model, the vehicle detection model including a backbone network, a fusion network, and a prediction network, the apparatus including:
the first extraction module is used for extracting multi-scale image features of a vehicle image by utilizing the backbone network, wherein the backbone network comprises N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2;
the fusion module is used for carrying out feature fusion on the multi-scale image features by utilizing the fusion network to obtain a plurality of fusion features;
and the processing module is used for processing each fusion feature by utilizing the prediction network to obtain a plurality of vehicle detection results.
According to another aspect of the present disclosure, there is provided a traffic flow statistics apparatus including:
the second extraction module is used for extracting an initial frame image from the vehicle video;
a first determining module configured to determine a plurality of vehicle detection results of the initial frame image using the method according to the exemplary embodiment of the present disclosure;
the tracking module is used for tracking the vehicle video based on each vehicle detection result to obtain a corresponding vehicle tracking result;
and the second determining module is used for determining the vehicle flow based on the vehicle tracking results corresponding to the vehicle detection results.
According to another aspect of the present disclosure, there is provided an electronic device including:
a processor; and,
a memory storing a program;
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the method according to an exemplary embodiment of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method according to the exemplary embodiments of the present disclosure.
One or more aspects provided in the exemplary embodiments of the present disclosure apply a vehicle detection model comprising a backbone network, a fusion network, and a prediction network to detect vehicles. During detection, the multi-scale image features of a vehicle image are extracted by the backbone network, which comprises N C3 modules whose hierarchical structure is a Swin Transformer structure, N being an integer greater than or equal to 2; the multi-scale image features are then fused by the fusion network to obtain a plurality of fusion features; finally, each fusion feature is processed by the prediction network to obtain a plurality of vehicle detection results. The methods of the exemplary embodiments of the disclosure thus introduce the Swin Transformer structure into the C3 module. Because the Swin Transformer structure has good generality, adapts to target sizes that vary over a wide range, supports dense prediction tasks, and integrates local features with global information, it gives the model good performance and greatly improves its expressive capability. Introducing the Swin Transformer structure into the C3 module therefore allows the backbone network to process the input vehicle image more accurately, so that the plurality of fusion features deviate less from the vehicle image, the prediction results are more accurate, and ultimately the classification and localization of vehicles are more accurate.
On this basis, when the method of the embodiments of the disclosure is applied to traffic flow statistics, a plurality of vehicle detection results of an initial frame image can be determined using the vehicle detection method of the embodiments of the disclosure, vehicle tracking can then be performed on the vehicle video according to each vehicle detection result to obtain a corresponding vehicle tracking result, and the vehicle flow can be determined based on the vehicle tracking results corresponding to the plurality of vehicle detection results. When performing traffic flow statistics in this way, the more accurate vehicle detection model effectively reduces the missed-detection and false-detection rates and, to a certain extent, reduces identity switches during tracking, making the traffic flow statistics more accurate. Therefore, the methods provided by the exemplary embodiments of the disclosure can not only accurately classify and track the vehicles in the vehicle image, but also accurately count the vehicle flow.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the present disclosure and together with the description serve to explain the present disclosure. In the drawings:
FIG. 1 illustrates a schematic diagram of an example system in which various methods described herein may be implemented, according to an example embodiment of the present disclosure;
FIG. 2 illustrates a flow chart of a vehicle detection method of an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a traffic flow statistics method of an exemplary embodiment of the present disclosure;
FIG. 4 shows a functional block diagram of a vehicle detection apparatus according to an exemplary embodiment of the present disclosure;
FIG. 5 shows a schematic block diagram of the functional modules of a traffic flow statistics device according to an exemplary embodiment of the disclosure;
FIG. 6 shows a schematic block diagram of a chip according to an exemplary embodiment of the present disclosure;
fig. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below. It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "one" is to be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Before describing embodiments of the present disclosure, definitions are first provided for the relevant terms involved in the embodiments of the present disclosure:
Deep learning, a form of machine learning whose concept stems from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, in order to discover distributed feature representations of data. It is a new field in machine learning research whose motivation is to build and simulate neural networks that perform analysis and learning like the human brain, mimicking the mechanisms by which the human brain interprets data.
YOLOv5s network model. YOLOv5, released by Ultralytics and based on the PyTorch framework, is currently the most commonly used lightweight object detection model; it comprises five model versions: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. YOLOv5s runs fast and has a small model size, making it easy to deploy in embedded devices for production use, as the loading sketch below illustrates.
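As a point of reference only (not part of the patent), the stock YOLOv5s model can be loaded through torch.hub as published by Ultralytics; the image path below is hypothetical.

```python
# Illustrative sketch: loading the stock YOLOv5s model via torch.hub.
# This uses the public Ultralytics YOLOv5 hub API, not the patent's
# modified model.
import torch

model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
results = model('vehicle_image.jpg')  # hypothetical input path
results.print()  # summary of detected boxes, classes, and confidences
```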
The coordinate attention mechanism (Coordinate Attention, abbreviated CA) embeds position information into channel attention, alleviating the problem that the SE attention mechanism's global pooling operation ignores position information, and enabling the network to attend over a larger scope.
The CBAM (Convolutional Block Attention Module) attention module comprises two consecutive sub-modules: a channel attention module (Channel Attention Module, CAM) and a spatial attention module (Spatial Attention Module, SAM). It integrates weight information in the channel and spatial dimensions and greatly improves the connection of different features across space and channels.
The GAM (Global Attention Mechanism) attention module is divided into two sub-modules: a spatial attention sub-module and a channel attention sub-module. Channel attention learns weights for the different channels and uses these weights to reweight them; spatial attention focuses on the position information of objects in the image and selectively attends to the features of each spatial location through weighting of the spatial features.
The DeepSORT target tracking algorithm is one of the most common multi-target tracking algorithms at present; DeepSORT is an upgraded version of the SORT algorithm. "Deep" refers to the deep learning network used in the algorithm, while the SORT part predicts the state of a detection box in the next frame using a Kalman filter and matches it with the detection results of that frame to achieve vehicle tracking.
A multi-layer perceptron (Multilayer Perceptron, abbreviated MLP) is a feed-forward artificial neural network that maps a set of input vectors to a set of output vectors.
At present, traffic flow detection based on surveillance video is gradually becoming a research hotspot. Its difficulty lies in accurately and rapidly identifying the vehicle targets contained in each frame of the video under relatively complex conditions, so as to avoid the missed and false detections, and the resulting inaccurate traffic flow statistics, caused by problems such as vehicle occlusion, size change, and identity switches.
In the related art, video streams or frame images can be fed into the original YOLOv5s target detection algorithm to complete the classification and detection of target vehicles, vehicle tracking is then completed in combination with the DeepSORT tracking algorithm, and finally the vehicle flow is counted by adding a virtual detection line. However, when small targets are detected with the YOLOv5s model, the recognition effect is poor, false and missed detections occur easily, and the detection accuracy is not high. When the conventional DeepSORT target tracking algorithm is used for vehicles, it was originally designed for tracking pedestrians, so its input size is not suitable for vehicle tracking. For traffic flow statistics, a physical coil method is generally adopted, which requires embedding inductive sensors under the road surface, and therefore suffers from problems such as high installation and maintenance costs and road surface damage.
To address these problems, the invention improves the existing vehicle detection model by integrating the Swin Transformer into the backbone network and introducing an attention module into the fusion network; the improved vehicle detection model is trained on a vehicle detection data set to extract the important features of vehicles, obtain accurate vehicle positions through category probability values, and classify vehicle types. The obtained vehicle detection results are then input into the DeepSORT tracking method, whose feature extraction network is optimized and adjusted to suit the size of vehicle detections, realizing real-time tracking of vehicles of different types. This effectively strengthens the feature identification of target vehicles, improves detection accuracy while remaining real-time, and effectively reduces the missed-detection and false-detection rates, making vehicle detection more accurate and efficient. On this basis, traffic flow statistics can be performed rapidly and accurately by combining the vehicle detection results with a virtual detection line.
Fig. 1 illustrates a schematic diagram of an example system in which various methods described herein may be implemented, according to an example embodiment of the present disclosure. As shown in fig. 1, a system 100 of an exemplary embodiment of the present disclosure may include: user device 110, computing device 120, and data storage system 130.
As shown in fig. 1, the user device 110 may communicate with the computing device 120 over a communication network. The communication network may be a wired communication network or a wireless communication network. The wired communication network may be a communication network based on power line carrier technology; the wireless communication network may be a local area wireless network, such as a WiFi or ZigBee network, or a wide area wireless network, such as a mobile communication network or a satellite communication network.
As shown in fig. 1, the user equipment 110 may include a computer, a mobile phone, or an intelligent terminal such as an information processing center, and may act as the initiator of a model training operation or a parameter determining operation, initiating a request to the computing device 120. The computing device 120 may be a cloud server, a network server, an application server, a management server, or another server with data processing functions, for implementing the training method and the generating method. The server may be configured with a deep learning processor, which may be a single-core deep learning processor (Deep Learning Processor-Singlecore, abbreviated DLP-S) or a multi-core deep learning processor (Deep Learning Processor-Multicore, abbreviated DLP-M). The DLP-M is a multi-core extension of the DLP-S: multiple DLP-S cores are interconnected through a network-on-chip (NoC) and communicate via protocols such as multicast and inter-core synchronization to complete the vehicle detection task and the traffic flow statistics task.
As shown in FIG. 1, the data storage system 130 is a generic term covering databases that store historical data, which may reside locally on the computing device 120 or on other network servers. The data storage system 130 may be separate from the computing device 120 or may be integrated within the computing device 120.
In practical applications, the user equipment may upload a vehicle video to the computing device through the communication network; after receiving the vehicle image or vehicle video, the computing device detects the vehicle information it contains, tracks the vehicles based on the detection results, completes the traffic flow statistics based on the tracking results, and then feeds the corresponding vehicle information and traffic flow information back to the user equipment through the communication network.
The vehicle detection method of the exemplary embodiments of the present disclosure may be applied to a server or a chip in a server, and the method of the exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 2 shows a flow chart of a vehicle detection method according to an exemplary embodiment of the present disclosure. The vehicle detection method of the exemplary embodiment of the present disclosure applies a vehicle detection model including a backbone network, a fusion network, and a prediction network, the vehicle detection method including:
Step 201: extracting the multi-scale image features of a vehicle image by using the backbone network, wherein the backbone network comprises N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2. The multi-scale image features of the vehicle image may be the result of sampling the vehicle image at different sizes.
For example, the above-described vehicle detection model may be a YOLOv5s target detection model, which may be used for detecting, classifying, and identifying the vehicles contained in a vehicle image. For example, suitable vehicle images may be selected from a large number of vehicle images as a training set and a validation set, and the vehicles they contain labeled as five types: Car, SUV, Bus, Truck-I (fewer than three axles), and Truck-II (more than three axles), for subsequent training of the vehicle detection model. It should be appreciated that the labeling tool may be LabelImg, Labelme, CVAT, etc.
For example, when small targets are identified with the conventional YOLOv5s target detection model, the recognition effect is poor, false and missed detections occur easily, and the detection accuracy is not high. The exemplary embodiments of the disclosure therefore improve the feature extraction network of the YOLOv5s target detection model by integrating a Swin Transformer module into its backbone network to replace the hierarchical structure part of the C3 module. It should be understood that the Swin Transformer structure can serve as a general-purpose backbone for computer vision: it absorbs the advantages of the Transformer, such as strong versatility and an ultra-long field of view, while retaining the translational invariance, locality, and hierarchy of CNNs, and it exceeds the performance of CNN-based networks in various computer vision applications such as classification, object detection, semantic segmentation, and instance segmentation. Therefore, after the Swin Transformer structure is integrated into the backbone network, the vehicle detection data set at the input is processed more accurately, and vehicle classification and localization are more accurate.
The backbone network of the exemplary embodiments of the present disclosure may further include a convolution module, and the plurality of C3 modules includes a first C3 module to an N-th C3 module connected in series in sequence, wherein the convolution module is connected to the first C3 module, and the first to N-th C3 modules are all connected to the fusion network. The convolution module may be used for extracting the vehicle image features of a vehicle image; the first C3 module processes the vehicle image features based on a moving-window attention mechanism to obtain first-scale image features; and the n-th C3 module processes the (n-1)-th-scale image features based on the moving-window attention mechanism to obtain n-th-scale image features, where 2 ≤ n ≤ N, as the sketch below illustrates.
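To make the serial layout concrete, the following is a minimal sketch (with assumed module interfaces, not the patent's code) of a backbone whose convolution stem feeds N chained C3 modules, each output doubling as one scale for the fusion network.

```python
# Minimal sketch of the described backbone layout: a convolution stem
# followed by N serially connected C3 modules, with every intermediate
# output also routed to the fusion network as one image scale.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    def __init__(self, stem: nn.Module, c3_modules: nn.ModuleList):
        super().__init__()
        self.stem = stem              # convolution module extracting base features
        self.c3_modules = c3_modules  # first to N-th C3 module, in series

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        feats = []
        x = self.stem(x)              # vehicle image features
        for c3 in self.c3_modules:    # the n-th module consumes the (n-1)-th scale
            x = c3(x)
            feats.append(x)           # every scale feeds the fusion network
        return feats
```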
For example, the overall structure of the Swin Transformer is much like a convolutional hierarchy: each stage halves the resolution and doubles the number of channels. First, image patching is the operation, taken from the vision transformer (ViT) image classification model, of dividing the image into equal small blocks; the network is then divided into several stages, each comprising two parts, a patch merging mechanism and a windowed multi-head attention mechanism. In the Swin Transformer architecture, a regular window partitioning scheme can be used to compute self-attention within each window; a shifted window partitioning scheme may also be employed, in which the partition is displaced relative to the regular one to create new windows, and self-attention computed in the new windows crosses the boundaries of the previous windows in the layer, providing connections between them. It should be appreciated that patch merging is an operation similar to pooling, except that pooling loses information whereas patch merging does not.
As described above, the Swin Transformer adopts a hierarchical design: starting from small image patches, it builds a hierarchical representation by gradually merging adjacent patches in deeper Transformer layers. With these hierarchical feature maps, the Swin Transformer model can conveniently make dense predictions using advanced techniques such as a Feature Pyramid Network (FPN) or U-Net. Because self-attention is computed only within each local window, it has linear computational complexity in the input image size, so it can serve as a general backbone for image classification and dense recognition tasks. In addition, the Swin Transformer introduces the concept of shifted (moving) windows: computing self-attention only within windows effectively reduces the sequence length, and shifting the windows enables information interaction between adjacent windows, embodying the multi-scale idea. Moreover, with the window merging operation, the receptive field is continuously enlarged while self-attention is computed, so that global information is aggregated: local features are attended to while global information is also taken into account.
Based on this, each C3 module incorporating the Swin Transformer may comprise an embedding module and a moving-window-based attention module. When the C3 module is the first C3 module, the embedding module performs feature representation on the vehicle image to obtain first feature representation information, and the attention module processes the first feature representation information based on the moving-window attention mechanism to obtain first-scale image features; when the C3 module is the n-th C3 module, the embedding module performs feature representation to obtain n-th feature representation information, and the attention module processes it based on the moving-window attention mechanism to obtain n-th-scale image features. In this way, images at multiple scales, including smaller scales, can be obtained through the plurality of C3 modules (see the sketch below), which improves the accuracy of vehicle detection and recognition across scales and avoids the missed and false detections caused by small-scale vehicles in the vehicle image.
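As an illustration of windowed self-attention inside such a block, the following hedged sketch partitions the feature map into non-overlapping windows and attends within each; the channel count, window size, and head count are assumptions, and the shift step and the C3 residual branches are omitted.

```python
# Hedged sketch of a C3-style block whose hierarchy is replaced by a
# Swin-style step: a feature-embedding convolution followed by
# window-based self-attention. Requires h and w divisible by the window.
import torch
import torch.nn as nn

class SwinC3Block(nn.Module):
    def __init__(self, dim: int = 96, window: int = 7, heads: int = 3):
        super().__init__()
        self.embed = nn.Conv2d(dim, dim, kernel_size=1)  # embedding module
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        ws = self.window
        x = self.embed(x)
        # partition into non-overlapping ws x ws windows and flatten each
        x = x.view(b, c, h // ws, ws, w // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        x = self.norm(x)
        x, _ = self.attn(x, x, x)  # self-attention only inside each window
        # undo the window partition back to (b, c, h, w)
        x = x.reshape(b, h // ws, w // ws, ws, ws, c)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
```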
Therefore, through the improved backbone network, multi-scale images of the vehicle image can be obtained at several scales, and effective features can be extracted for small target vehicles that occupy few pixels against a complex background, so that vehicle recognition and detection in the vehicle image are more accurate, the false and missed detections caused by losing small-scale vehicles are effectively avoided, and the integrity of the vehicle image at the macroscopic level is preserved.
Step 202: performing feature fusion on the multi-scale image features by using the fusion network to obtain a plurality of fusion features. The fusion network may include a pyramid network structure for fusing the multi-scale image features into a plurality of fusion features. The pyramid network structure can consist of a series of vehicle images at different scales and is mainly used for sampling the vehicle image; the fusion network may comprise one or more pyramid network structures, fusing the features of images at different scales, enlarging the receptive field, and combining features at multiple scales, thereby enhancing the vehicle detection model's ability to detect small targets in vehicle images. It should be appreciated that the fusion network may include only an upsampling pyramid network structure, only a downsampling pyramid network structure, or both, as the fusion sketch below suggests.
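For instance, a single top-down fusion step in pyramid style might look like the following sketch; two scales are assumed, and the actual fusion network may chain several such steps in both the upsampling and downsampling directions.

```python
# Illustrative top-down pyramid fusion step (FPN-style), assuming two
# backbone scales.
import torch
import torch.nn.functional as F

def fuse_top_down(deep_feat: torch.Tensor, shallow_feat: torch.Tensor) -> torch.Tensor:
    # upsample the deeper (smaller) feature map to the shallow resolution
    up = F.interpolate(deep_feat, size=shallow_feat.shape[-2:], mode='nearest')
    # concatenate along channels to fuse the two scales
    return torch.cat([up, shallow_feat], dim=1)
```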
For example, the fusion network may further include an attention module connected to the pyramid network structure, the attention module being configured to process, based on an attention mechanism, the smallest-scale image feature among the multi-scale image features; the attention mechanism may be at least one of the CA attention mechanism, the CBAM attention mechanism, and the GAM attention mechanism. It is understood that introducing an attention mechanism gives the vehicle detection model a better detection effect, captures the important features of targets of different sizes, and improves the detection accuracy of the vehicle detection model.
For example, the exemplary embodiments of the present disclosure select the GAM attention mechanism to incorporate into the fusion network of YOLOv5s. GAM is a global attention mechanism that reduces information dispersion and enlarges global interactive representations to improve deep neural network performance, and it includes a channel attention mechanism as well as a spatial attention mechanism. The channel attention sub-module uses a three-dimensional permutation to retain information across three dimensions and a two-layer MLP to amplify the cross-dimensional channel-space dependence, so that the GAM attention mechanism pays more attention to spatial information; in the spatial attention sub-module, two convolution layers perform spatial information fusion, allowing the GAM attention mechanism to capture more relations between pixel locations.
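The following is a hedged sketch of a GAM-style block consistent with that description; the reduction ratio, kernel size, and normalization choices are assumptions rather than the patent's exact configuration.

```python
# Hedged GAM-style attention sketch: a channel sub-module applying a
# two-layer MLP over a permuted 3-D layout, then a spatial sub-module
# built from two convolution layers. r=4 and 7x7 kernels are assumed.
import torch
import torch.nn as nn

class GAM(nn.Module):
    def __init__(self, c: int, r: int = 4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(c, c // r), nn.ReLU(inplace=True), nn.Linear(c // r, c))
        self.spatial = nn.Sequential(
            nn.Conv2d(c, c // r, 7, padding=3), nn.BatchNorm2d(c // r),
            nn.ReLU(inplace=True), nn.Conv2d(c // r, c, 7, padding=3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # channel attention: permute to (B, H, W, C) so the MLP mixes channels
        att = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(att)
        # spatial attention: two conv layers fuse spatial information
        return x * torch.sigmoid(self.spatial(x))
```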
Therefore, when the GAM attention mechanism is integrated into the fusion network, small targets can be detected in a larger range, the detection capability of the vehicle detection model is improved, and the characteristics extracted by the fusion network can be more comprehensive and rich.
Step 203: processing each fusion feature by using the prediction network to obtain a plurality of vehicle detection results.
For example, the prediction network predicts the important feature information and the position information of the vehicles in the vehicle image using the plurality of fusion features obtained by the fusion network.
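Purely as an illustration of such a head, the sketch below maps each fused feature map to per-anchor box, objectness, and class scores with a 1x1 convolution; the anchor count is assumed, the five vehicle classes follow the labeling described earlier, and the patent does not specify the head internals.

```python
# Minimal sketch of a YOLO-style prediction head: one 1x1 convolution per
# fused scale producing per-anchor box, objectness, and class scores.
import torch
import torch.nn as nn

class DetectHead(nn.Module):
    def __init__(self, channels: list[int], num_classes: int = 5, anchors: int = 3):
        super().__init__()
        out = anchors * (5 + num_classes)  # (x, y, w, h, obj) + class probs
        self.heads = nn.ModuleList(nn.Conv2d(c, out, 1) for c in channels)

    def forward(self, fused_feats: list[torch.Tensor]) -> list[torch.Tensor]:
        # one raw prediction map per fusion scale; box decoding follows
        return [head(f) for head, f in zip(self.heads, fused_feats)]
```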
Based on this, the vehicle detection model provided by the embodiments of the disclosure strengthens the detection of small target vehicles and can effectively extract the features of small target vehicles that occupy few pixels against a complex background. Introducing an attention mechanism in the fusion network strengthens the identification of important vehicle features, suppresses general or useless features, better captures feature context information, and enlarges the model's receptive field, giving the vehicle detection model a better detection effect: it can capture the important features of vehicles contained in vehicle images of different sizes, which improves the detection accuracy of the vehicle detection model, realizes accurate detection of the vehicles in the vehicle image, and in turn yields the accurate position of the vehicle and its vehicle type classification through the category probability values.
The traffic flow statistics method of the exemplary embodiments of the present disclosure may be applied to a server or a chip in a server, and the method of the exemplary embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 3 shows a flow chart of a traffic flow statistics method according to an exemplary embodiment of the present disclosure. The traffic flow statistics method of the exemplary embodiment of the present disclosure includes:
Step 301: extracting an initial frame image from the vehicle video.
The vehicle video may be a plurality of video segments shot by fixed cameras above highways, urban arterial roads, and the like, covering different conditions such as different illumination intensities, shooting angles, traffic congestion levels, and resolutions.
For example, a certain frame is selected as the initial frame image from the consecutive frames extracted from the vehicle video. The initial frame image may be the first frame of a given vehicle video, or any intermediate frame of that video, as in the extraction sketch below.
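A minimal frame-extraction sketch with OpenCV; the video path and frame index are hypothetical.

```python
# Illustrative extraction of an initial frame with OpenCV; index 0 selects
# the first frame, but any frame index may be chosen as described above.
import cv2

cap = cv2.VideoCapture('vehicle_video.mp4')  # hypothetical video path
cap.set(cv2.CAP_PROP_POS_FRAMES, 0)          # seek to the chosen frame
ok, initial_frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError('could not read the initial frame')
```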
Step 302: determining a plurality of vehicle detection results of the initial frame image using the vehicle detection method of the exemplary embodiments of the present disclosure.
Step 303: performing vehicle tracking on the vehicle video based on each vehicle detection result to obtain a corresponding vehicle tracking result.
For example, the vehicle tracking method adopted by the exemplary embodiments of the present disclosure may be the DeepSORT tracking method, through which a plurality of vehicle tracking results may be obtained. Vehicle tracking is performed on the vehicle images to be detected based on the DeepSORT tracking method, obtaining the position information of the same vehicle across different images and thereby realizing real-time tracking of vehicles of different types. Here, not only can the same vehicle be tracked in real time, but vehicles of different types can also be tracked in real time simultaneously.
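For intuition, the following sketch shows the kind of constant-velocity Kalman prediction step such a tracker applies to each track between frames; the state layout and noise values are common DeepSORT conventions assumed here, not the patent's exact parameters.

```python
# Sketch of the constant-velocity Kalman prediction applied per track
# between frames; the state holds positions followed by their velocities
# (e.g. x, y, aspect, height and their rates), an assumed convention.
import numpy as np

def kalman_predict(mean: np.ndarray, cov: np.ndarray, dt: float = 1.0):
    n = mean.shape[0] // 2
    F = np.eye(2 * n)
    F[:n, n:] = dt * np.eye(n)  # position += velocity * dt
    Q = 1e-2 * np.eye(2 * n)    # assumed process noise
    return F @ mean, F @ cov @ F.T + Q
```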
In practical applications, the original DeepSORT tracking algorithm is mainly used for pedestrian detection and tracking, and its original input size is not suitable for the detection and identification of vehicles. Accordingly, the DeepSORT tracking method further includes a preprocessing network for preprocessing the initial frame image to resize it to the target image size.
Illustratively, the exemplary embodiments of the present disclosure optimize the DeepSORT tracking method for vehicle appearance characteristics, modifying the original input image size of 128 (H) × 64 (W) to 128 (H) × 128 (W) to make it more suitable for the detection and identification of vehicle targets, as in the sketch below.
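A one-line sketch of that preprocessing change, assuming OpenCV image crops.

```python
# Sketch of the described preprocessing change: resizing appearance crops
# from the pedestrian-oriented 128x64 (HxW) to a squarer 128x128 that
# better matches vehicle proportions.
import cv2

def preprocess_crop(crop):
    # cv2.resize takes (width, height); 128 wide replaces the original 64
    return cv2.resize(crop, (128, 128))
```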
Step 304: determining the vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results.
For example, the exemplary embodiments of the present disclosure may set a virtual detection line in the vehicle video and determine the traffic flow over a certain period based on the number of times vehicles cross the virtual detection line.
In practical applications, the plurality of vehicle videos may be acquired under different scenes, such as daytime, overcast and rainy weather, or dusk, and the virtual detection line may be set in the vehicle video according to the scene; its position may be adjusted manually according to real-time conditions. The virtual detection line can be used to count vehicles of different types not only in a unidirectional lane but also in bidirectional lanes, as the counting sketch below illustrates.
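A hedged sketch of the counting logic such a virtual detection line implies; the horizontal-line orientation and the box-centre crossing criterion are assumptions, not the patent's exact rule.

```python
# Count a track once when its box centre crosses a horizontal virtual
# detection line between consecutive frames.
def count_crossings(tracks: dict, prev_centers: dict,
                    line_y: float, counted: set) -> int:
    new = 0
    for track_id, (cx, cy) in tracks.items():  # track_id -> current centre
        py = prev_centers.get(track_id)
        if py is not None and track_id not in counted:
            if (py[1] - line_y) * (cy - line_y) < 0:  # centre moved across the line
                counted.add(track_id)
                new += 1
        prev_centers[track_id] = (cx, cy)
    return new
```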
One or more aspects provided in the exemplary embodiments of the present disclosure apply a vehicle detection model comprising a backbone network, a fusion network, and a prediction network to detect vehicles. During detection, the multi-scale image features of a vehicle image are extracted by the backbone network, which comprises N C3 modules whose hierarchical structure is a Swin Transformer structure, N being an integer greater than or equal to 2; the multi-scale image features are then fused by the fusion network to obtain a plurality of fusion features; finally, each fusion feature is processed by the prediction network to obtain a plurality of vehicle detection results. The methods of the exemplary embodiments of the disclosure thus introduce the Swin Transformer structure into the C3 module. Because the Swin Transformer structure has good generality, adapts to target sizes that vary over a wide range, supports dense prediction tasks, and integrates local features with global information, it gives the model good performance and greatly improves its expressive capability. Introducing the Swin Transformer structure into the C3 module therefore allows the backbone network to process the input vehicle image more accurately, so that the plurality of fusion features deviate less from the vehicle image, the prediction results are more accurate, and ultimately the classification and localization of vehicles are more accurate.
On this basis, when the method of the embodiments of the disclosure is applied to traffic flow statistics, a plurality of vehicle detection results of an initial frame image can be determined using the vehicle detection method of the embodiments of the disclosure, vehicle tracking can then be performed on the vehicle video according to each vehicle detection result to obtain a corresponding vehicle tracking result, and the vehicle flow can be determined based on the vehicle tracking results corresponding to the plurality of vehicle detection results. When performing traffic flow statistics in this way, the more accurate vehicle detection model effectively reduces the missed-detection and false-detection rates and, to a certain extent, reduces identity switches during tracking, making the traffic flow statistics more accurate. Therefore, the methods provided by the exemplary embodiments of the disclosure can not only accurately classify and track the vehicles in the vehicle image, but also accurately count the vehicle flow.
In summary, the exemplary embodiments of the present disclosure fuse the Swin Transformer structure into the YOLOv5s network structure and introduce an attention mechanism, strengthening the identification of important vehicle features, suppressing general or useless features, better capturing feature context information, enlarging the model's receptive field, strengthening the detection of small target vehicles, and improving the model's detection accuracy. In addition, the feature extraction network part of the DeepSORT tracking algorithm is adjusted so that it is suitable for tracking vehicles, improving the tracking effect and avoiding the missed and false detections, and the resulting inaccurate traffic flow statistics, caused by problems such as vehicle occlusion, size change, and identity switches.
The foregoing description of the solution provided by the embodiments of the present disclosure has been mainly presented from the perspective of a server. It will be appreciated that the server, in order to implement the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The embodiments of the present disclosure may divide functional units of a server according to the above method examples, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present disclosure, the division of the modules is merely a logic function division, and other division manners may be implemented in actual practice.
In the case where each functional module is divided according to its corresponding function, the exemplary embodiments of the present disclosure provide a vehicle detection apparatus, which may be a server or a chip applied in a server. Fig. 4 shows a functional block diagram of a vehicle detection apparatus according to an exemplary embodiment of the present disclosure. As shown in fig. 4, the vehicle detection apparatus 400 applies a vehicle detection model including a backbone network, a fusion network, and a prediction network, and includes:
a first extraction module 401, configured to extract multi-scale image features of a vehicle image by using the backbone network, where the backbone network contains N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2;
a fusion module 402, configured to perform feature fusion on the multi-scale image features by using the fusion network to obtain a plurality of fusion features;
and the processing module 403 is configured to process each of the fusion features by using the prediction network, so as to obtain a plurality of vehicle detection results.
In one possible implementation, the backbone network further includes a convolution module, where the plurality of C3 modules includes a first C3 module to an N-th C3 module connected in series in sequence, the convolution module is connected to the first C3 module, and all of the first to N-th C3 modules are connected to the fusion network; the convolution module is used for extracting the vehicle image features of the vehicle image, the first C3 module is used for processing the vehicle image features based on the moving-window attention mechanism to obtain first-scale image features, and the n-th C3 module is used for processing the (n-1)-th-scale image features based on the moving-window attention mechanism to obtain n-th-scale image features, where 2 ≤ n ≤ N.
In one possible implementation, each of the C3 modules includes: an embedding module and a moving window based attention module; when the C3 module is a first C3 module, the embedding module is used for carrying out feature representation on the vehicle image to obtain first feature representation information, and the attention module is used for processing the first feature representation information based on an attention mechanism of a moving window to obtain first scale image features; when the C3 module is the nth C3 module, the embedded module is used for carrying out feature representation on the vehicle image to obtain nth feature representation information, and the attention module is used for processing the nth feature representation information based on an attention mechanism of a moving window to obtain nth scale image features.
In one possible implementation, the fusion network includes a pyramid network structure for fusing the multi-scale image features to obtain a plurality of fused features.
In a possible implementation manner, the fusion network further comprises an attention module, wherein the attention module is connected with the pyramid network structure and is used for processing the image feature with the smallest scale in the multi-scale image features based on an attention mechanism; the attention mechanism is at least one of CA attention mechanism, CBAM attention mechanism and GAM attention mechanism.
In the case where each functional module is divided according to its corresponding function, the exemplary embodiments of the present disclosure provide a traffic flow statistics device, which may be a server or a chip applied in a server. Fig. 5 shows a schematic block diagram of the functional modules of a traffic flow statistics device according to an exemplary embodiment of the present disclosure. As shown in fig. 5, the traffic flow statistics device 500 includes:
a second extraction module 501 for extracting an initial frame image from a vehicle video;
a first determining module 502, configured to determine a plurality of vehicle detection results of the initial frame image by using the vehicle detection method according to the exemplary embodiment of the present disclosure;
a tracking module 503, configured to perform vehicle tracking on the vehicle video based on each vehicle detection result, and obtain a corresponding vehicle tracking result;
a second determining module 504, configured to determine a vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results.
In one possible implementation, the vehicle tracking method includes the DeepSORT tracking method, and obtaining the vehicle tracking result includes: performing vehicle tracking on the vehicle images to be detected based on the DeepSORT tracking method.
In one possible implementation, the DeepSORT tracking method further includes a preprocessing network, and the method further includes: preprocessing the initial frame image to resize it to the target image size.
In one possible implementation, determining the vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results includes: setting a virtual detection line in the vehicle video; and determining the traffic flow based on the number of times vehicles cross the virtual detection line.
Fig. 6 shows a schematic block diagram of a chip according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the chip 600 includes one or more (including two) processors 601 and a communication interface 602. The communication interface 602 may support a server to perform the data transceiving steps of the method described above, and the processor 601 may support the server to perform the data processing steps of the method described above.
Optionally, as shown in fig. 6, the chip 600 further includes a memory 603, and the memory 603 may include a read only memory and a random access memory, and provides operation instructions and data to the processor. A portion of the memory may also include non-volatile random access memory (non-volatile random access memory, NVRAM).
In some embodiments, as shown in fig. 6, the processor 601 performs the corresponding operations by invoking operating instructions stored in memory (which may be stored in an operating system). The processor 601 controls the processing operations of any one of the terminal devices and may also be referred to as a central processing unit (CPU). The memory 603 may include read-only memory and random access memory and provides instructions and data to the processor 601; a portion of the memory 603 may also include NVRAM. The memory, the communication interface, and the processor are coupled together by a bus system, which may include a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 604 in fig. 6.
The methods disclosed in the embodiments of the disclosure may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above methods may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the disclosure. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, etc. The steps of the methods disclosed in connection with the embodiments of the disclosure may be embodied directly in a hardware processor, or in a combination of hardware and software modules in the processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above methods.
The exemplary embodiments of the present disclosure also provide an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor for causing the electronic device to perform a method according to embodiments of the present disclosure when executed by the at least one processor.
The present disclosure also provides a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform a method according to an embodiment of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, causes the computer to perform a method according to embodiments of the present disclosure.
With reference to fig. 7, a block diagram of an electronic device 700 will now be described; the device may be a server or a client of the present disclosure and is an example of a hardware device to which aspects of the present disclosure may be applied. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 708 may include, but is not limited to, magnetic disks and optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through computer networks such as the internet and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers, and/or chipsets, such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communication devices, and the like.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above. For example, in some embodiments, the methods of the exemplary embodiments of the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. In some embodiments, the computing unit 701 may be configured to perform the methods of the exemplary embodiments of the present disclosure by any other suitable means (e.g., by means of firmware).
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a terminal, a user equipment, or other programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium, e.g., a floppy disk, hard disk, or tape; an optical medium, such as a digital video disc (DVD); or a semiconductor medium, such as a solid-state drive (SSD).
Although the present disclosure has been described in connection with specific features and embodiments thereof, it will be apparent to those skilled in the art that various modifications, variations, and combinations can be made without departing from the spirit or scope of the disclosure. Accordingly, the specification and drawings are merely exemplary illustrations of the disclosure as defined by the appended claims, and the disclosure is intended to cover any and all such modifications, variations, combinations, and equivalents that come within the scope of the appended claims and their equivalents.
Claims (13)
1. A vehicle detection method, characterized by applying a vehicle detection model comprising a backbone network, a fusion network, and a prediction network, the method comprising:
extracting multi-scale image features of a vehicle image by using the backbone network, wherein the backbone network comprises N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2;
performing feature fusion on the multi-scale image features by using the fusion network to obtain a plurality of fusion features;
and processing each fusion feature by using the prediction network to obtain a plurality of vehicle detection results.
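A minimal PyTorch sketch of this three-network pipeline follows; all class and parameter names are illustrative placeholders, not names from the patent:

```python
import torch
import torch.nn as nn

class VehicleDetectionModel(nn.Module):
    """Backbone -> fusion network -> prediction network, per claim 1."""
    def __init__(self, backbone: nn.Module, fusion: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # N C3 modules with a Swin Transformer structure
        self.fusion = fusion      # fuses the extracted scales into fusion features
        self.head = head          # prediction network, applied to each fused scale

    def forward(self, image: torch.Tensor) -> list[torch.Tensor]:
        multi_scale = self.backbone(image)    # list of multi-scale image features
        fused = self.fusion(multi_scale)      # list of fusion features
        return [self.head(f) for f in fused]  # one vehicle detection result per scale
```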
2. The method of claim 1, wherein the backbone network further comprises a convolution module, the N C3 modules comprise a first C3 module through an Nth C3 module connected in series, the convolution module is connected to the first C3 module, and the first C3 module through the Nth C3 module are connected to the fusion network;
the convolution module is used for extracting vehicle image features from the vehicle image, and the first C3 module is used for processing the vehicle image features based on a moving-window attention mechanism to obtain first-scale image features;
the nth C3 module is used for processing the (n-1)th-scale image features based on the moving-window attention mechanism to obtain nth-scale image features, wherein n is an integer satisfying 2 ≤ n ≤ N.
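A hedged sketch of this backbone layout, assuming a 3-channel input and placeholder names; each stage may be any block that maps the previous scale's features to the next (one concrete possibility is sketched under claim 3 below):

```python
import torch.nn as nn

class SwinC3Backbone(nn.Module):
    """Convolution module followed by a first through Nth C3 module in series;
    every stage output is kept as one scale for the fusion network."""
    def __init__(self, c3_modules: list[nn.Module], stem_out: int = 64):
        super().__init__()
        # convolution module: extracts the initial vehicle image features
        self.stem = nn.Conv2d(3, stem_out, kernel_size=6, stride=2, padding=2)
        self.stages = nn.ModuleList(c3_modules)  # first .. Nth C3 module

    def forward(self, image):
        x = self.stem(image)        # vehicle image features
        scales = []
        for stage in self.stages:   # the nth module consumes the (n-1)th scale
            x = stage(x)
            scales.append(x)        # nth-scale image features
        return scales               # multi-scale features for the fusion network
```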
3. The method of claim 1, wherein each of the C3 modules comprises an embedding module and a moving-window-based attention module;
when the C3 module is the first C3 module, the embedding module is used for performing feature representation on the vehicle image to obtain first feature representation information, and the attention module is used for processing the first feature representation information based on the moving-window attention mechanism to obtain first-scale image features;
when the C3 module is the nth C3 module, the embedding module is used for performing feature representation on the vehicle image to obtain nth feature representation information, and the attention module is used for processing the nth feature representation information based on the moving-window attention mechanism to obtain nth-scale image features.
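One plausible reading of a single C3 module pairs a convolutional embedding with window-based multi-head self-attention. The sketch below is a simplification of the Swin design (the alternating window shift, relative position bias, and residual/MLP sub-layers are omitted), and it assumes the embedded feature map's height and width are divisible by `window` and that `dim` is divisible by `heads`:

```python
import torch
import torch.nn as nn

class C3WindowAttention(nn.Module):
    """Embedding module followed by per-window self-attention."""
    def __init__(self, in_ch: int, dim: int, window: int = 7, heads: int = 4):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, dim, kernel_size=2, stride=2)  # downsample + embed
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.embed(x)                          # feature representation information
        B, C, H, W = x.shape
        ws = self.window
        x = x.permute(0, 2, 3, 1)                  # (B, H, W, C)
        # partition into non-overlapping ws x ws windows
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)
        x = self.norm(x)
        x, _ = self.attn(x, x, x)                  # attention within each window
        # reverse the partition back to a (B, C, H, W) feature map
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
        return x.permute(0, 3, 1, 2)               # nth-scale image features
```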
4. The method of claim 1, wherein the fusion network comprises a pyramid network structure for fusing the multi-scale image features to obtain the plurality of fusion features.
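The claim specifies only "a pyramid network structure"; an FPN-style top-down fusion is one common instance of it. A minimal sketch under that assumption (the channel counts are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Top-down pyramid fusion over the backbone scales; `channels` lists
    the input channel count per scale, fine to coarse."""
    def __init__(self, channels: list[int]):   # e.g. [128, 256, 512]
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, channels[0], 1) for c in channels)

    def forward(self, feats):                   # feats: fine -> coarse
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            up = F.interpolate(laterals[i], size=laterals[i - 1].shape[-2:],
                               mode="nearest")
            laterals[i - 1] = laterals[i - 1] + up  # fuse coarse into fine
        return laterals                          # one fusion feature per scale
```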
5. The method of claim 4, wherein the fusion network further comprises an attention module, the attention module being coupled to the pyramid network structure;
the attention module is used for processing the smallest-scale image features among the multi-scale image features based on an attention mechanism;
the attention mechanism is at least one of a CA attention mechanism, a CBAM attention mechanism, and a GAM attention mechanism.
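Of the three mechanisms the claim lists, CBAM is sketched below (channel attention from pooled statistics, then spatial attention), as it could be applied to the smallest-scale feature map before fusion; the reduction ratio is an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class CBAMAttention(nn.Module):
    """CBAM-style channel-then-spatial attention over a (B, C, H, W) map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(                 # shared MLP for channel attention
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))        # average-pooled channel descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))         # max-pooled channel descriptor
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        s = torch.cat([x.mean(dim=1, keepdim=True),       # spatial attention from
                       x.amax(dim=1, keepdim=True)], dim=1)  # channel statistics
        return x * torch.sigmoid(self.spatial(s))
```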
6. A method of traffic flow statistics, the method comprising:
extracting an initial frame image from a vehicle video;
determining a plurality of vehicle detection results of the initial frame image using the method of any one of claims 1 to 5;
carrying out vehicle tracking on the vehicle video based on each vehicle detection result to obtain a corresponding vehicle tracking result;
and determining the vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results.
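These four steps amount to a per-frame detect-and-track loop followed by a counting pass. A hedged sketch using OpenCV for video decoding; `detect`, `tracker`, and `count_line_crossings` are hypothetical stand-ins for the claim-1 detector, a DeepSORT-style tracker, and the virtual-line counter of claim 9:

```python
import cv2

def traffic_flow(video_path, detect, tracker, count_line_crossings):
    """Detect and track vehicles frame by frame, then count the tracks."""
    cap = cv2.VideoCapture(video_path)
    track_history = []                      # (frame_idx, track_id, bbox) records
    frame_idx = 0
    while True:
        ok, frame = cap.read()              # first iteration reads the initial frame
        if not ok:
            break
        detections = detect(frame)          # vehicle detection results (claim 1)
        for track_id, bbox in tracker.update(detections, frame):
            track_history.append((frame_idx, track_id, bbox))
        frame_idx += 1
    cap.release()
    return count_line_crossings(track_history)  # vehicle flow (claim 9)
```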
7. The method of claim 6, wherein the vehicle tracking uses a DeepSORT tracking method, and obtaining a vehicle tracking result comprises:
carrying out vehicle tracking on the vehicle image to be detected based on the DeepSORT tracking method.
8. The method of claim 7, wherein the DeepSORT tracking method further comprises a preprocessing network, and the method further comprises:
preprocessing the initial frame image to resize it to a target image size.
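A minimal sketch of this preprocessing step; the 640x640 target size and the 1/255 normalization are assumptions, not values taken from the patent:

```python
import cv2

def preprocess(frame, target=(640, 640)):
    """Resize a frame to the network's target image size and normalize it."""
    resized = cv2.resize(frame, target, interpolation=cv2.INTER_LINEAR)
    return resized.astype("float32") / 255.0
```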
9. The method of claim 6, wherein determining the vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results comprises:
setting a virtual detection line in the vehicle video;
and determining the traffic flow based on the number of collisions of the vehicle with the virtual detection line.
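One way to detect a collision with the virtual detection line is to check whether a track's centre point changes sides of the line between consecutive frames (the sign of a cross product flips). A minimal sketch that treats the line as infinite; a fuller version would also check the segment extent and count each track at most once:

```python
def _side(a, b, p):
    """Sign of the cross product: which side of line a-b point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def vehicle_flow(tracks, line_a, line_b):
    """tracks: {track_id: [centre points in frame order]}. Each sign change
    between consecutive frames counts as one collision with the line."""
    flow = 0
    for points in tracks.values():
        for p_prev, p_curr in zip(points, points[1:]):
            if _side(line_a, line_b, p_prev) * _side(line_a, line_b, p_curr) < 0:
                flow += 1
    return flow
```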
10. A vehicle detection apparatus, characterized by applying a vehicle detection model including a backbone network, a fusion network, and a prediction network, the vehicle detection apparatus comprising:
the first extraction module is used for extracting multi-scale image features of a vehicle image by utilizing the backbone network, wherein the backbone network comprises N C3 modules, the hierarchical structure of each C3 module is a Swin Transformer structure, and N is an integer greater than or equal to 2;
the fusion module is used for carrying out feature fusion on the multi-scale image features by utilizing the fusion network to obtain a plurality of fusion features;
and the processing module is used for processing each fusion feature by utilizing the prediction network to obtain a plurality of vehicle detection results.
11. A traffic flow statistic device, characterized in that the traffic flow statistic device comprises:
the second extraction module is used for extracting an initial frame image from the vehicle video;
the first determining module is used for determining a plurality of vehicle detection results of the initial frame image using the method of any one of claims 1 to 5;
the tracking module is used for tracking the vehicle video based on each vehicle detection result to obtain a corresponding vehicle tracking result;
and the second determining module is used for determining the vehicle flow based on the vehicle tracking results corresponding to the plurality of vehicle detection results.
12. An electronic device comprising a memory and a processor, the memory being for storing computer instructions, wherein the computer instructions are executed by the processor to implement the method of any one of claims 1-9.
13. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310084691.7A CN116110008A (en) | 2023-02-08 | 2023-02-08 | Vehicle detection method, vehicle flow statistics method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116110008A (en) | 2023-05-12
Family
ID=86263457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310084691.7A (Pending) | Vehicle detection method, vehicle flow statistics method and device | 2023-02-08 | 2023-02-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116110008A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |