CN115953744A - Vehicle identification tracking method based on deep learning - Google Patents


Publication number
CN115953744A
CN115953744A
Authority
CN
China
Prior art keywords
vehicle
license plate
model
network model
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211698806.3A
Other languages
Chinese (zh)
Inventor
张舟洋
张文强
冯晋
王朝兴
贾莉芳
裘璐
董晓龙
冯兴盼
田友航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202211698806.3A priority Critical patent/CN115953744A/en
Publication of CN115953744A publication Critical patent/CN115953744A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a vehicle identification tracking method based on deep learning. A monitoring camera or an unmanned aerial vehicle camera is equipped with a deep learning computer vision algorithm that analyzes the real-time video stream, detects and tracks the vehicles appearing in the video, screens out a specific vehicle according to the input information, captures it, and uploads the captured images to a background system for recording. The invention makes full use of existing camera hardware and achieves high identification precision.

Description

Vehicle identification tracking method based on deep learning
Technical Field
The invention belongs to the technical field of deep learning and machine learning, and particularly relates to a vehicle identification tracking method based on deep learning.
Background
Recognition and tracking of vehicles is of great significance for pursuing vehicles involved in crime, and the rapid development of computer technology has promoted intelligent applications of video images, among which moving-object tracking is a current research hotspot. At present, research on moving targets mainly focuses on single-camera tracking. With a single camera, however, vehicles easily occlude one another, and the monitored area is small when tracking license plates, which easily leads to low tracking precision or inaccurate identification and therefore to many defects in practical application.
Disclosure of Invention
According to the invention, images are acquired in real time from the existing monitoring cameras, a deep learning algorithm is applied, the video stream returned in real time is analyzed, detection and tracking of the target vehicle across cameras are realized, and alarm information is sent to the manager. Within the monitored area and time the method meets practical requirements well, improves the working efficiency of the staff, and strikes against illegal and criminal behavior, thereby protecting the safety of the nation and the people.
The purpose of the invention is realized by the following technical scheme:
a vehicle identification tracking method based on deep learning comprises the following steps:
the method comprises the following steps: collecting vehicle pictures in a real traffic scene, marking vehicles, license plate positions on the vehicles and license plate text information in the vehicle pictures, and constructing a vehicle detection data set, a license plate detection data set and a license plate identification data set;
step two: constructing and training a vehicle detection network model; the input of the vehicle detection network model is a picture in a real traffic scene, the output is a prediction frame coordinate and category information of a vehicle, the prediction frame coordinate and the category information are stored, and then a vehicle image is cut out according to the prediction frame coordinate of the vehicle;
constructing and training a license plate detection network model, wherein the input of the license plate detection network model is a cut vehicle image, the output of the license plate detection network model is a prediction frame coordinate of a license plate, the prediction frame coordinate is stored, and the image of the license plate is cut according to the prediction frame coordinate of the license plate;
constructing and training a license plate recognition network model, wherein the input of the license plate recognition network model is a cut license plate image, and character information in a license plate is output;
step three: and matching the recognition result of the license plate recognition network model with the license plate to be tracked, if the matching is successful, considering the vehicle as the tracked vehicle, capturing the image containing the vehicle to be tracked, determining the position information of the corresponding vehicle according to the output result of the vehicle detection network model in the step two, and performing frame marking and alarming on the tracked vehicle.
Further, the vehicle detection network model comprises a backbone network and a detection head; the backbone network performs five times of downsampling on the input image through convolution on the basis of resnet, and feature maps of the last three scales are reserved; an SPPF module is also inserted into the backbone network and is used for serially processing input features through a plurality of maximum pooling layers with the size of 5x 5; the detection head fuses the obtained three-scale feature maps in a FPN + PAN mode, so that shallow features firstly pass through the FPN and then are combined with low-level features through upsampling to generate higher-level features, and more accurate position information is transmitted.
Further, all activation functions in the backbone network of the vehicle detection network model are SiLU activation functions.
Further, a detection head of the vehicle detection network model generates a corresponding prediction frame through a preset anchor frame with fixed size and length-width ratio by adopting the idea of frame regression; and calculating the central coordinate of the prediction frame through the central coordinate of the anchor frame, and calculating the position information of the prediction frame through the scaling of the anchor frame.
Furthermore, the license plate detection network model comprises a backbone network and a detection head, wherein the backbone network of the license plate detection network model is based on DarkNet, detection is carried out on feature maps of four scales, and an attention mechanism is introduced into convolution.
Further, the license plate recognition network model comprises a visual model, a text model and a fusion model;
the vision model firstly carries out feature extraction on an input license plate image through a ResNet + Transformer structure to obtain an output feature map; the text model is based on RNN and Transformer, takes the probability vector of the character as input, and outputs the probability distribution of the expected character; the fusion model splices the results of the visual model and the text model together, then learns a weight value, and adjusts the influence of the visual model and the text model on the final prediction result.
Furthermore, the fusion model outputs a segment of character position serial number codes and the visual model result to be sent to the text model together for correction, and then the language model is repeatedly executed for multiple rounds through the idea of iterative correction, so that the recognition effect is gradually corrected, and the final output result is obtained.
Further, Mosaic data augmentation is applied to the vehicle pictures in the real traffic scene: 4 pictures are spliced by random scaling, random cropping, and random arrangement to form a new picture, and the labeled box of the new picture is obtained from the labeled boxes of the original four pictures; the new picture is used to train the vehicle recognition tracking model.
An electronic device, comprising:
one or more processors;
a storage device to store one or more programs that, when executed by the electronic device, cause the electronic device to implement the above vehicle identification tracking method based on deep learning.
A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a deep learning-based vehicle identification tracking method.
The invention has the following beneficial effects:
(1) The vehicle detection network model, the license plate detection network model, and the license plate recognition network model are connected in series. SPPF is adopted to perform multi-scale fusion on the image; compared with SPP, which must specify the pooling-kernel size three times, computation is faster when the module pools and concatenates the data three times. An attention mechanism is introduced so that the network extracts features better, and for a small target such as a license plate a feature map of one more scale is added, improving detection precision. In the text-recognition part, bidirectional representation is realized through a multi-modal model, a spelling-correction language model is built based on the cloze (fill-in-the-blank) concept, and the recognition result is further corrected by repeatedly executing the language model for multiple rounds, improving character-recognition precision.
(2) When detecting vehicles, a multi-scale fusion method is adopted, so vehicle targets can be detected well even in scenes where vehicles occlude one another. For small targets such as license plates, adding a feature map extracts their features better and improves detection precision. In addition, for the difficulty of recognizing Chinese characters during license plate recognition, the recognition result is gradually corrected by repeatedly executing the language model for multiple rounds to obtain the final output, improving recognition precision.
Drawings
Fig. 1 is a flowchart of a vehicle identification and tracking method according to the present invention.
Fig. 2 is a schematic diagram of the CarNet backbone network.
Fig. 3 is a schematic diagram of an LPDNet network structure.
Fig. 4 is a flow chart of a specific implementation of the attention mechanism.
Fig. 5 is a schematic diagram of an LPRNet network structure.
Fig. 6 is a picture input to CarNet in the embodiment.
FIG. 7 is a schematic diagram of the detection boxes output by CarNet.
Fig. 8 is a vehicle image cropped from a prediction box output by CarNet.
FIG. 9 is a schematic diagram of the detection box output by LPDNet.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon" or "when" or "in response to determining", depending on the context.
As shown in fig. 1, in the vehicle identification and tracking method based on deep learning of the present invention, a monitoring camera or an unmanned aerial vehicle camera is equipped with a deep learning computer vision algorithm, so that the real-time video stream can be analyzed, the vehicles appearing in the video can be detected and tracked, a specific vehicle can be screened out through the input information and captured, and the captured images are uploaded to a background system for recording. The method comprises the following steps:
the method comprises the following steps: the method comprises the steps of collecting vehicle pictures in a real traffic scene, marking vehicles, license plate positions on the vehicles and license plate text information in the vehicle pictures, and constructing a vehicle detection data set, a license plate detection data set and a license plate identification data set.
Step two: constructing and training a vehicle detection network model; the input of the vehicle detection network model is a picture in a real traffic scene, the output is a prediction frame coordinate and category information (including car, bus, truck and the like) of a vehicle, the prediction frame coordinate and the category information are stored, and then a vehicle image is cut out according to the prediction frame coordinate of the vehicle;
constructing and training a license plate detection network model, wherein the input of the license plate detection network model is a cut vehicle image, the output of the license plate detection network model is a prediction frame coordinate of a license plate, the prediction frame coordinate is stored, and the image of the license plate is cut according to the prediction frame coordinate of the license plate;
constructing and training a license plate recognition network model, wherein the input of the license plate recognition network model is a cut license plate image, and outputting character information in a license plate;
step three: and matching the recognition result of the license plate recognition network model with the license plate to be tracked, if the matching is successful, considering the vehicle as the tracked vehicle, capturing the image containing the vehicle to be tracked, determining the position information of the corresponding vehicle according to the output result of the vehicle detection network model in the second step, and performing frame marking and alarming on the tracked vehicle.
First, four input pictures are spliced by the Mosaic data augmentation method through random scaling, random cropping, and random arrangement. Each picture carries its own labeled boxes, and splicing the four pictures yields a new picture whose labeled boxes are derived from the labeled boxes of the original four. The generated pictures are passed to the vehicle recognition tracking model for training. Because four images are received at once, the model's ability to detect against the background improves, the poor detection of small targets during training is effectively mitigated, and the robustness of the model increases; moreover, the BN operation can process four images simultaneously, reducing training resources.
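The box bookkeeping of the Mosaic step above can be sketched as follows. This is a minimal NumPy sketch that only places four equally sized pictures' boxes into the quadrants of a doubled canvas; the random scaling, cropping, and arrangement described in the text are omitted, and all names are illustrative:

```python
import numpy as np

def mosaic_boxes(boxes_per_img, img_size=320):
    """Shift each image's boxes into its quadrant of a 2x mosaic canvas.

    boxes_per_img: list of 4 arrays of shape (N_i, 4) in (x1, y1, x2, y2)
    pixel coordinates, one array per source image of size img_size x img_size.
    Returns a single (sum N_i, 4) array valid on the 2*img_size canvas.
    """
    offsets = [(0, 0), (img_size, 0), (0, img_size), (img_size, img_size)]
    out = []
    for boxes, (dx, dy) in zip(boxes_per_img, offsets):
        # Shift both corners of every box by the quadrant offset.
        shifted = np.asarray(boxes, dtype=float) + np.array([dx, dy, dx, dy])
        out.append(shifted)
    return np.concatenate(out, axis=0)
```

The combined array is what would be fed to training as the labeled boxes of the new mosaic picture.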
(I) Vehicle detection network model CarNet
CarNet comprises a Backbone network (Backbone) and a detection Head (Head); the network is shown schematically in fig. 2. The backbone performs five downsamplings by convolution on the basis of ResNet, and the feature maps of the last three scales are retained; for an input image of size 640x640, feature maps of three scales, 80x80, 40x40, and 20x20, are finally obtained. All activation functions in the backbone are SiLU, which has no upper bound but a lower bound, is smooth, and is non-monotonic, and which performs better in the network model than the Leaky_ReLU activation. As shown in fig. 2, the basic CBS module denotes convolution, BN, and SiLU activation; the Basic1 module denotes two CBS modules in series; the Basic2 module denotes adding the output and input of a Basic1 module together through a shortcut; the Layer module denotes two Basic2 modules in series. The Focus module splits the data into four parts, each of which is a 2x downsampling of the input, concatenates them along the channel dimension, and finally applies a convolution. The great advantage of this is that information loss during downsampling is minimized.
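The slicing performed by the Focus module can be sketched in NumPy. The trailing convolution of the real module is omitted, and the function name is illustrative:

```python
import numpy as np

def focus(x):
    """Focus slicing: split a (C, H, W) map into four 2x-downsampled parts
    (every second pixel, with four different offsets) and concatenate them
    along the channel axis, giving (4C, H/2, W/2) with no pixel discarded."""
    return np.concatenate(
        [x[:, 0::2, 0::2], x[:, 1::2, 0::2], x[:, 0::2, 1::2], x[:, 1::2, 1::2]],
        axis=0,
    )
```

Because every input pixel lands in exactly one slice, the spatial resolution halves while all information is kept in the channels.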
In order to better extract fused features, an SPPF module is inserted into the Backbone. The conventional SPP (Spatial Pyramid Pooling) module performs multi-scale fusion on the picture by max pooling with windows of 1x1, 5x5, 9x9, and 13x13, whereas the SPPF structure processes the input features serially through several 5x5 MaxPool layers: passing through two 5x5 MaxPool layers in series gives the same result as one 9x9 MaxPool layer, and passing through three 5x5 MaxPool layers in series gives the same result as one 13x13 MaxPool layer. SPPF needs to specify only one kernel, the output of each pooling becoming the input of the next; compared with SPP, which must specify the kernel size three times, computation is faster when the module pools the data three times and concatenates the results. SPPF aims at enhancing the expressive power of the feature map; it effectively avoids the image deformation caused by cropping and scaling and alleviates the convolutional network's extraction of repeated features, thereby accelerating candidate-box generation and saving computation.
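The stated equivalence between serial 5x5 poolings and single larger poolings can be checked numerically. Below is a stride-1, same-padded max pool written in plain NumPy purely for illustration; it is a sketch, not the patent's implementation:

```python
import numpy as np

def maxpool_same(x, k):
    """Stride-1 max pooling with 'same' (-inf) padding on a 2-D float array."""
    p = k // 2
    xp = np.pad(x, p, mode="constant", constant_values=-np.inf)
    h, w = x.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = xp[i:i + k, j:j + k].max()
    return out

x = np.random.default_rng(0).standard_normal((12, 12))
p5 = maxpool_same(x, 5)
# Two serial 5x5 pools equal one 9x9 pool; three equal one 13x13 pool.
assert np.allclose(maxpool_same(p5, 5), maxpool_same(x, 9))
assert np.allclose(maxpool_same(maxpool_same(p5, 5), 5), maxpool_same(x, 13))
```

The equivalence holds because taking a max of maxima over overlapping 5x5 windows is the same as one max over the union of those windows.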
In order to improve detection precision, the obtained feature maps of three scales are fused in the FPN + PAN manner. FPN adopts a top-down approach: the upsampled feature map is combined with low-level features to generate higher-level features and transmit more accurate position information. The PAN structure contains both bottom-up and top-down links. Because the FPN structure transmits shallow features upward through tens or even hundreds of layers, much shallow information would be lost, whereas the bottom-up path is generally short; shallow features therefore first pass through the FPN and are then connected to the upper layers through upsampling, preventing excessive information loss. The high-level and low-level feature maps are thus combined into a new feature map that contains both rich semantic information and many pixels, so the image can be detected better.
The detection head adopts the idea of box regression: prediction boxes are formed from preset anchor boxes of fixed size and aspect ratio, and each prediction box can be regarded as a fine adjustment of its anchor. Each anchor box has a corresponding prediction box; the centre coordinate of the prediction box is calculated from the centre coordinate of the anchor box, and the position information of the prediction box is calculated from the scaling of the anchor box.
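The centre-and-scale decoding described above can be sketched as follows. The patent does not give the exact formulas, so the common R-CNN/YOLO-style parameterization is assumed here:

```python
import numpy as np

def decode_box(anchor, deltas):
    """Decode one predicted box from its anchor (cx, cy, w, h) and the
    regression deltas (tx, ty, tw, th): the centre is shifted relative to
    the anchor size, and the size is obtained by scaling the anchor."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = deltas
    pcx = cx + tx * w          # prediction centre from anchor centre
    pcy = cy + ty * h
    pw = w * np.exp(tw)        # prediction size from anchor scaling
    ph = h * np.exp(th)
    return np.array([pcx, pcy, pw, ph])
```

With all-zero deltas the prediction box coincides with its anchor, matching the "fine adjustment on the basis of the anchor" reading.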
(II) license plate detection network model LPDNet
And collecting vehicle pictures in a real traffic scene, and labeling license plates in the images, wherein the labeling information comprises license plate text information besides position information. And then establishing a license plate detection database according to the marked position information.
A schematic diagram of the license plate detection network model LPDNet is shown in FIG. 3. The LPDNet network consists of a Backbone network and a Head network. It differs from the CarNet network in that, because a license plate is a small target under the monitoring view angle, the backbone network and the detection head are improved to a certain extent to raise detection accuracy. The backbone of LPDNet is based on DarkNet and performs detection on feature maps of four scales, improving the detection of small-target features; at the same time, an attention mechanism is introduced into the detection system to improve its performance.
Here, the basic CBS module denotes convolution, BN, and SiLU activation; the Bottleneck_F module denotes two CBS modules in series; the Bottleneck module denotes connecting the output and input of a Bottleneck_F module through a shortcut.
After a CBS module, the C3_x module splits into two branches: one is processed by x Bottleneck modules, the other passes through one CBS module; the two results are concatenated, and finally the channel count is adjusted by a CBS module. The C3_x_F module has the same structure as the C3_x module except that its Bottleneck modules are replaced by Bottleneck_F modules. The SPPF structure processes the input features serially through several 5x5 MaxPool layers to perform feature fusion on the picture.
The attention mechanism can be applied as an additional network within the convolutions of the network, selecting specific inputs or assigning different weights to features. Through the attention mechanism the neural network can not only learn by itself but also reveal which features it relies on, better improving model performance.
The basic idea of the attention mechanism is to identify the key features in the image by a new weighting method and then, through training, let the network learn which parts of each image deserve attention, thereby forming attention. The specific implementation is shown in fig. 4: the input feature map first undergoes global average pooling, is then sent through a series of fully connected layers to a Sigmoid activation to obtain the channel-attention weights, and finally the channel-attention weights are multiplied element-wise with the input feature map.
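The pool-excite-scale pipeline of fig. 4 can be sketched in NumPy. Biases and the exact bottleneck ratio are omitted as assumptions, and the weight shapes are illustrative:

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Channel attention as described: global average pooling over H and W,
    a small fully connected bottleneck with ReLU, a sigmoid, then channel-wise
    rescaling of the input feature map x of shape (C, H, W).
    w1 has shape (C, C//r) and w2 has shape (C//r, C)."""
    s = x.mean(axis=(1, 2))                 # global average pool -> (C,)
    z = np.maximum(s @ w1, 0.0)             # FC bottleneck + ReLU
    a = 1.0 / (1.0 + np.exp(-(z @ w2)))     # sigmoid -> weights in (0, 1)
    return x * a[:, None, None]             # scale each channel
```

Each channel is scaled by a learned weight between 0 and 1, so informative channels can be emphasized and others suppressed.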
(III) license plate recognition network LPRNet
A dictionary is established from the character information appearing on license plates; its contents include Chinese characters, capital English letters, and digits. The license plate recognition network LPRNet is mainly responsible for recognizing the character information in a license plate and storing the output characters. The input of LPRNet is the license plate image detected by the license plate detection network model LPDNet, cropped from the original image, while the position information of the license plate is retained; the label is the index of each character in the dictionary. Mapping the output of LPRNet to the dictionary yields the corresponding characters. The LPRNet structure, shown in FIG. 5, mainly comprises a visual model, a text model, and a fusion model: features are extracted by the visual model and corrected by the text model, and the fusion model fuses the results of the two models, computing the final recognition result with adaptive weights.
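The dictionary mapping can be sketched in plain Python; the character set below is illustrative, not the patent's actual dictionary:

```python
# The network emits one class index per character slot; the indices are
# mapped back to characters through the dictionary.
PLATE_CHARS = ["浙", "京", "沪"] \
    + [chr(c) for c in range(ord("A"), ord("Z") + 1)] \
    + [str(d) for d in range(10)]
IDX2CHAR = dict(enumerate(PLATE_CHARS))
CHAR2IDX = {c: i for i, c in IDX2CHAR.items()}

def indices_to_plate(indices):
    """Map a sequence of predicted class indices to the plate string."""
    return "".join(IDX2CHAR[i] for i in indices)
```

The labels used for training are simply the reverse mapping, each character replaced by its dictionary index.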
The backbone of the visual model first performs feature extraction on the input text picture through a ResNet + Transformer structure to obtain an output feature map. Unlike conventional Self-Attention, which is used directly, the Transformer here uses Position Attention: Q is a self-generated position code whose initial value resembles a sine-cosine code plus one linear layer, K represents the vector relating each piece of information to the others, and V represents the information-content vector; Position Attention thus fixes the information of each letter position.
For general OCR character recognition, the output of the visual module could be connected directly to a multi-class loss function, but adding a text model to form a multi-modal model effectively improves recognition accuracy. The network architecture of the text model is based on RNN and Transformer; it takes the output of the visual model and the character-position serial numbers as input and outputs the indices of the predicted characters.
The input of the text model is the output of the visual model together with the character-position serial numbers, followed by a softmax function that leaves the input dimension unchanged; a dimension-raising operation is then performed to obtain more information; finally the model infers what the current position should be from its context, masking the information of the current position when computing Attention, and realizes re-correction by updating the input.
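The masking of the current position during Attention can be illustrated with a small NumPy sketch; the score matrix and shapes are illustrative:

```python
import numpy as np

def masked_attention_weights(scores):
    """Mask the diagonal of an (L, L) attention-score matrix so that each
    position cannot attend to itself, then softmax each row. This sketches
    the 'shield the information of the current position' step: every slot
    must be inferred from its context only."""
    s = np.asarray(scores, dtype=float).copy()
    np.fill_diagonal(s, -np.inf)           # forbid self-attention
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

After masking, the weight a position assigns to itself is exactly zero, which is what forces the cloze-style correction behaviour.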
The fusion model splices the results of the visual model and the text model together, then learns a weight value, and adjusts the influence of the visual model and the text model on the final prediction result. The fusion model outputs a segment of character position serial number codes and visual model results, and the segment of character position serial number codes and the visual model results are sent to the text model for correction, and then the language model is repeatedly executed for multiple rounds through the idea of iterative correction, so that the recognition effect is gradually corrected, and the final output result is obtained.
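Since the patent does not give the fusion formula, one plausible reading — concatenation followed by a learned sigmoid gate that blends the two predictions — can be sketched as follows; the weight names and shapes are assumptions:

```python
import numpy as np

def fuse(vis_logits, txt_logits, wg, bg):
    """Gated fusion sketch: concatenate the visual and text predictions,
    compute a per-class gate g = sigmoid([vis; txt] @ wg + bg), and blend
    g * vis + (1 - g) * txt. Shapes: logits (L, C), wg (2C, C), bg (C,)."""
    cat = np.concatenate([vis_logits, txt_logits], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(cat @ wg + bg)))   # learned adaptive weight
    return g * vis_logits + (1.0 - g) * txt_logits
```

With a zero-initialized gate the two models contribute equally; training then shifts the weight toward whichever model is more reliable for each prediction.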
The loss function of the license plate recognition network LPRNet is a general multi-class cross-entropy loss; correspondingly, the losses of three parts should be considered: the loss of the visual model, the loss of the text model, and the loss of the fusion model.
Step three: and matching the recognition result of the license plate recognition network model with the license plate to be tracked, if the matching is successful, considering the vehicle as the tracked vehicle, capturing the image containing the vehicle to be tracked, determining the position information of the corresponding vehicle according to the output result of the vehicle detection network model in the second step, and performing frame marking and alarming on the tracked vehicle.
Corresponding to the embodiment of the vehicle identification and tracking method, the invention also provides corresponding electronic equipment. The electronic device includes one or more processors and a storage device storing one or more programs that, when executed by the electronic device, cause it to implement the deep learning based vehicle identification tracking method. The electronic device may be any device with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of the two. Taking software implementation as an example, as a device in the logical sense, the processor of any device with data processing capability reads the corresponding computer program instructions from non-volatile memory into memory for operation.
For the device embodiment, since it basically corresponds to the method embodiment, reference may be made to the partial description of the method embodiment for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement without inventive effort.
Embodiments of the present invention further provide a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the deep learning-based vehicle identification and tracking method in the foregoing embodiments.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any device with data processing capability described in the foregoing embodiments. It may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash memory card (Flash card). Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. It is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
The following is a specific application example of the vehicle identification and tracking method of the present invention. The picture shown in fig. 6 is input into the vehicle detection network model canet, and the model outputs the prediction frame coordinates, category information, and confidence of each vehicle, as shown in fig. 7. As can be seen from fig. 7, the vehicle detection network model accurately identifies the category of each vehicle and outputs prediction boxes of accurate size and position. FIG. 8 shows a vehicle image cropped according to one of the prediction boxes of fig. 7; this image is input into the license plate detection network model, which outputs the license plate detection box and confidence shown in fig. 9. The license plate image cropped according to the detection box in fig. 9 is then input into the license plate recognition network model, which performs OCR recognition and maps the result to a dictionary to obtain the corresponding characters.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations that follow the general principles of the application, including such departures from the present disclosure as come within known or customary practice in the art. The specification and examples are to be considered exemplary only, with the true scope and spirit of the application indicated by the following claims.
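The final dictionary-mapping step of the application example can be sketched as follows. The character set and the per-position class indices are illustrative assumptions; the patent does not specify the OCR alphabet or output encoding.

```python
# Illustrative sketch of the OCR post-processing step: map predicted
# class indices to license plate characters via a dictionary.
# CHARSET is a hypothetical alphabet, not taken from the patent.
CHARSET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"

def decode_plate(index_sequence, charset=CHARSET):
    """Map a sequence of predicted class indices to plate characters."""
    return "".join(charset[i] for i in index_sequence)

print(decode_plate([10, 1, 2, 3, 4, 5]))  # prints A12345
```

In a real system the index sequence would come from the recognition network's per-position argmax; here it is supplied directly.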
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A vehicle identification and tracking method based on deep learning is characterized by comprising the following steps:
step one: collecting vehicle pictures of real traffic scenes; labeling, in each picture, the vehicles, the license plate positions on the vehicles, and the license plate text information; and constructing a vehicle detection data set, a license plate detection data set, and a license plate recognition data set;
step two: constructing and training a vehicle detection network model; its input is a picture of a real traffic scene and its output is the prediction frame coordinates and category information of each vehicle, which are stored; a vehicle image is then cropped out according to the vehicle's prediction frame coordinates;
constructing and training a license plate detection network model; its input is the cropped vehicle image and its output is the prediction frame coordinates of the license plate, which are stored; the license plate image is then cropped out according to these coordinates;
constructing and training a license plate recognition network model; its input is the cropped license plate image and its output is the character information on the license plate;
step three: matching the recognition result of the license plate recognition network model against the license plate to be tracked; if the match succeeds, the vehicle is considered the tracked vehicle, the image containing it is captured, its position information is determined from the output of the vehicle detection network model in step two, and the tracked vehicle is marked with a box and an alarm is raised.
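The three-stage cascade of claim 1 can be illustrated as a function that wires the three models together. The detector and recognizer callables, crop geometry, and return types here are assumptions for the sketch, not the patent's actual interfaces:

```python
def crop(image, box):
    """Crop an (x1, y1, x2, y2) box from an image given as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def track_vehicle(frame, target_plate, detect_vehicles, detect_plate, recognize_plate):
    """Cascade of claim 1: detect vehicles, locate the plate in each
    vehicle crop, OCR it, and return the boxes whose text matches."""
    alarms = []
    for box in detect_vehicles(frame):          # step two: vehicle boxes
        vehicle = crop(frame, box)
        plate_box = detect_plate(vehicle)
        if plate_box is None:                   # no plate found in this crop
            continue
        plate_text = recognize_plate(crop(vehicle, plate_box))
        if plate_text == target_plate:          # step three: match and alarm
            alarms.append(box)
    return alarms
```

Each stage only sees the crop produced by the previous stage, which is why the claim stores the vehicle prediction boxes: they map a matched plate back to a position in the original frame.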
2. The deep learning-based vehicle identification and tracking method according to claim 1, wherein the vehicle detection network model comprises a backbone network and a detection head; the backbone network, based on ResNet, downsamples the input image five times through convolution and retains the feature maps of the last three scales; an SPPF module is also inserted into the backbone network, which processes the input features serially through several 5×5 maximum pooling layers; the detection head fuses the feature maps of the three scales in an FPN+PAN manner, so that shallow features first pass through the FPN and are then combined via upsampling with lower-level features to generate higher-level features, transmitting more accurate position information.
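The serial-then-concatenate data flow of the SPPF module can be caricatured in one dimension. A real implementation applies 5×5 2-D max pools to feature tensors; this pure-Python sketch only shows why serial pooling with one kernel size approximates parallel pools of growing sizes:

```python
def max_pool(seq, size=5):
    """Stride-1, 'same'-padded 1-D max pool (stand-in for the 5x5 2-D pool)."""
    half = size // 2
    n = len(seq)
    return [max(seq[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

def sppf(features, pools=3):
    """Serial pooling: each pool consumes the previous pool's output; the
    input plus all pooled maps are kept for channel concatenation."""
    outputs = [features]
    for _ in range(pools):
        outputs.append(max_pool(outputs[-1]))
    return outputs  # 4 maps -> concatenated along the channel axis
```

Two chained 5-wide pools cover the same receptive field as one 9-wide pool, so the serial form trades redundant computation for reuse of intermediate results.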
3. The deep learning-based vehicle identification and tracking method according to claim 2, wherein all activation functions of the backbone network of the vehicle detection network model are SiLU activation functions.
4. The deep learning-based vehicle identification and tracking method according to claim 3, wherein the detection head of the vehicle detection network model adopts a bounding-box regression approach, generating corresponding prediction boxes from preset anchor boxes of fixed size and aspect ratio; the center coordinates of a prediction box are computed from the center coordinates of its anchor box, and its position information is computed by scaling the anchor box.
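The anchor-based decoding of claim 4 can be sketched as below. The exact parameterization (linear center offsets, exponential scale factors) is a common convention assumed here; the claim does not fix one:

```python
import math

def decode_box(anchor, pred):
    """Decode one prediction against one anchor: the predicted offsets
    move the anchor center, and exponentiated scale factors resize its
    width and height. anchor = (cx, cy, w, h); pred = (tx, ty, tw, th)."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = pred
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))
```

A zero prediction reproduces the anchor itself, which is what makes anchors useful priors: the network only learns small corrections.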
5. The deep learning-based vehicle identification and tracking method according to claim 1, wherein the license plate detection network model comprises a backbone network and a detection head; the backbone network of the license plate detection network model is based on DarkNet, performs detection on feature maps of four scales, and introduces an attention mechanism into its convolutions.
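Claim 5 only states that an attention mechanism is introduced into the convolutions; squeeze-and-excitation-style channel attention is one assumed concrete example, sketched below in pure Python:

```python
import math

def channel_attention(feature_maps):
    """SE-style channel attention sketch: reweight each channel by a
    squashed global-average response, so channels with stronger mean
    activation are emphasized. feature_maps is a list of 2-D lists."""
    out = []
    for ch in feature_maps:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        weight = 1.0 / (1.0 + math.exp(-mean))   # sigmoid gate in (0, 1)
        out.append([[weight * v for v in row] for row in ch])
    return out
```

Real SE blocks learn the squeeze-to-weight mapping with small fully connected layers; the fixed sigmoid here just makes the reweighting effect visible.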
6. The deep learning-based vehicle recognition and tracking method according to claim 1, wherein the license plate recognition network model comprises a visual model, a text model and a fusion model;
the visual model first performs feature extraction on the input license plate image through a ResNet+Transformer structure to obtain an output feature map; the text model, based on an RNN and a Transformer, takes the character probability vectors as input and outputs the probability distribution over the expected characters; the fusion model concatenates the results of the visual model and the text model, then learns a weight that adjusts the influence of each on the final prediction result.
7. The deep learning-based vehicle identification and tracking method according to claim 6, wherein the fusion model outputs a character-position index encoding together with the visual model result, which are fed into the text model for correction; the language model is then run for multiple rounds following an iterative-correction scheme, gradually refining the recognition result to obtain the final output.
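The fusion and iterative-correction loop of claims 6 and 7 can be sketched as follows. For simplicity the per-position predictions are scalars and the learned weight is a single gate; a real fusion model operates on full per-character probability distributions:

```python
def fuse(visual_probs, text_probs, gate):
    """Claim 6: a learned weight (here a scalar gate in [0, 1]) balances
    the visual and language predictions position by position."""
    return [gate * v + (1 - gate) * t for v, t in zip(visual_probs, text_probs)]

def iterative_recognition(visual_probs, language_model, gate=0.5, rounds=3):
    """Claim 7: feed the fused prediction back through the text model for
    several rounds, gradually correcting the recognition result."""
    probs = visual_probs
    for _ in range(rounds):
        probs = fuse(visual_probs, language_model(probs), gate)
    return probs
```

Keeping the visual prediction fixed while re-running the language model lets linguistic context repair characters the visual stream read poorly, without drifting away from the image evidence.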
8. The deep learning-based vehicle identification and tracking method according to claim 1, wherein Mosaic data augmentation is applied to the vehicle pictures of real traffic scenes: four pictures are spliced into one new picture through random scaling, random cropping, and random arrangement, and the annotation boxes of the new picture are derived from the annotation boxes of the four original pictures; the vehicle identification and tracking model is trained with the new pictures.
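The box bookkeeping behind Mosaic augmentation can be sketched with a fixed 2×2 tiling. The random scaling and cropping of claim 8 are omitted here so the coordinate shift of the annotation boxes stays visible:

```python
def mosaic(images, boxes_per_image):
    """Simplified Mosaic: tile four equally sized images (lists of rows)
    into a 2x2 grid and shift each image's (x1, y1, x2, y2) boxes by the
    offset of the tile it lands in."""
    h, w = len(images[0]), len(images[0][0])
    offsets = [(0, 0), (w, 0), (0, h), (w, h)]       # top-left of each tile
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    new_boxes = []
    for img, boxes, (ox, oy) in zip(images, boxes_per_image, offsets):
        for y, row in enumerate(img):
            for x, value in enumerate(row):
                canvas[oy + y][ox + x] = value
        for (x1, y1, x2, y2) in boxes:
            new_boxes.append((x1 + ox, y1 + oy, x2 + ox, y2 + oy))
    return canvas, new_boxes
```

Because every box is shifted by its tile offset, one composite picture carries the annotations of all four sources, which is what lets a single training sample expose the detector to more object instances and scale variety.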
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs that, when executed by the electronic device, cause the electronic device to implement the deep learning-based vehicle identification tracking method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that a program is stored thereon, which when executed by a processor, implements the deep learning-based vehicle identification tracking method according to any one of claims 1 to 8.
CN202211698806.3A 2022-12-28 2022-12-28 Vehicle identification tracking method based on deep learning Pending CN115953744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211698806.3A CN115953744A (en) 2022-12-28 2022-12-28 Vehicle identification tracking method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211698806.3A CN115953744A (en) 2022-12-28 2022-12-28 Vehicle identification tracking method based on deep learning

Publications (1)

Publication Number Publication Date
CN115953744A true CN115953744A (en) 2023-04-11

Family

ID=87288765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211698806.3A Pending CN115953744A (en) 2022-12-28 2022-12-28 Vehicle identification tracking method based on deep learning

Country Status (1)

Country Link
CN (1) CN115953744A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392659A (en) * 2023-12-12 2024-01-12 深圳市城市交通规划设计研究中心股份有限公司 Vehicle license plate positioning method based on parameter-free attention mechanism optimization


Similar Documents

Publication Publication Date Title
US20180114071A1 (en) Method for analysing media content
CN107944450B (en) License plate recognition method and device
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN113255659B (en) License plate correction detection and identification method based on MSAFF-yolk 3
CN112541448B (en) Pedestrian re-identification method and device, electronic equipment and storage medium
CN114155527A (en) Scene text recognition method and device
CN111950424A (en) Video data processing method and device, computer and readable storage medium
CN110555420A (en) fusion model network and method based on pedestrian regional feature extraction and re-identification
Cho et al. Semantic segmentation with low light images by modified CycleGAN-based image enhancement
CN111401322A (en) Station entering and exiting identification method and device, terminal and storage medium
CN116311214B (en) License plate recognition method and device
CN111027526A (en) Method for improving vehicle target detection, identification and detection efficiency
CN113378675A (en) Face recognition method for simultaneous detection and feature extraction
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
CN112613434A (en) Road target detection method, device and storage medium
KR102197930B1 (en) System and method for recognizing license plates
CN115953744A (en) Vehicle identification tracking method based on deep learning
Liu et al. SLPR: A deep learning based chinese ship license plate recognition framework
EP4332910A1 (en) Behavior detection method, electronic device, and computer readable storage medium
CN115131826B (en) Article detection and identification method, and network model training method and device
KR102026280B1 (en) Method and system for scene text detection using deep learning
CN113269038B (en) Multi-scale-based pedestrian detection method
CN114927236A (en) Detection method and system for multiple target images
CN114494355A (en) Trajectory analysis method and device based on artificial intelligence, terminal equipment and medium
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination