CN115937591A - Method and system for improving traffic scene target detection classification precision - Google Patents

Method and system for improving traffic scene target detection classification precision

Info

Publication number
CN115937591A
CN115937591A (application CN202211596741.1A)
Authority
CN
China
Prior art keywords
traffic
module
classification
video
image
Prior art date
Legal status
Pending
Application number
CN202211596741.1A
Other languages
Chinese (zh)
Inventor
莫王忠
吴劲峰
宋腾飞
张巧焕
陈瑞生
蒋栋奇
Current Assignee
Zhejiang Supcon Information Industry Co Ltd
Original Assignee
Zhejiang Supcon Information Industry Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Supcon Information Industry Co Ltd filed Critical Zhejiang Supcon Information Industry Co Ltd
Priority to CN202211596741.1A priority Critical patent/CN115937591A/en
Publication of CN115937591A publication Critical patent/CN115937591A/en
Pending legal-status Critical Current

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T — CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 — Road transport of goods or passengers
    • Y02T 10/10 — Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 — Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for improving the classification precision of traffic scene target detection. The method comprises the following steps: S1: acquiring a traffic video; S2: converting the traffic video into traffic images; S3: identifying each traffic image with a convolutional neural network to obtain its classification result; S4: issuing an early warning based on the classification result. The beneficial effects of the invention are that the convergence rate of the model is improved, training of the model is accelerated, and the recognition accuracy of the model is increased.

Description

Method and system for improving traffic scene target detection classification precision
Technical Field
The invention relates to the technical field of traffic detection, in particular to a method and a system for improving the classification precision of traffic scene target detection.
Background
In recent years, target detection based on deep learning has been applied ever more widely in traffic scenes. Early work was concerned mainly with the position of traffic targets, but as the technology becomes more closely integrated with daily life, the demand for fine-grained target categories keeps growing.
In the prior art, fine-grained classification is achieved mainly by adding an attention module, so that the network refines intermediate-layer features and thereby improves classification precision. However, adding an attention module increases both the complexity and the inference time of the model, which hinders deployment. The inference-speed requirements in traffic scenes are especially strict, and when the model is deployed on devices such as Jetson, it must remain lightweight. The prior art therefore suffers from long model recognition time.
For example, the Chinese patent document "Pedestrian traffic light identification method based on traffic light geometric attributes" (publication number CN113011251A, filed in 2021) recognizes the dynamic state of a traffic light from the traffic-light frame image and the recognized shape of the light, providing more accurate guidance for visually impaired people; however, this method still suffers from long model recognition time and low accuracy.
Disclosure of Invention
To address the shortcomings of long model recognition time and low accuracy in the prior art, the invention provides a method and a system for improving the classification accuracy of traffic scene target detection, which improve the convergence speed of the model, accelerate its training, and raise its recognition accuracy.
The invention discloses a method for improving the classification precision of traffic scene target detection, which comprises the following steps:
s1: acquiring a traffic video;
s2: converting the traffic video into a traffic image;
s3: identifying the traffic image based on the convolutional neural network to obtain a classification result of the traffic image;
s4: and carrying out early warning based on the classification result of the traffic image.
In this scheme, the traffic video is collected by the video acquisition module and converted into traffic images, so that the convolutional neural network can identify and classify each image to obtain its classification result, and an early warning is issued based on that result. Classification information is thus obtained directly from the traffic video, and both the convergence speed and the recognition accuracy of the model are improved.
Preferably, S3 comprises the steps of:
s31: inputting a traffic image, calculating an anchor frame and enhancing data;
s32: performing five times of downsampling on the traffic image by using a CSP structure;
s33: respectively extracting and fusing the features of the feature maps obtained by the third, fourth and fifth downsampling to obtain the feature maps T3, T4 and T5;
s34: passing T3, T4 and T5 each through a GSConv structure; the generated feature maps are denoted P3, P4 and P5 respectively;
s35: upsampling P5 by a factor of 2 and channel-superposing it with P4, then further refining the fused features through a C3 structure; upsampling the result by a factor of 2 again and channel-superposing it with P3, then further refining the fused features through a C3 structure to obtain feature M3; downsampling M3 by a factor of 2 and channel-superposing it with P4, then further refining the fused features through a C3 structure to obtain feature M4;
downsampling M4 by a factor of 2, channel-superposing it with P5, and further refining the fused features through a C3 structure to obtain feature M5;
s36: m3, M4 and M5 respectively obtain characteristic graphs through a conv + BN + SiLU convolution structure block, and the characteristic graphs are respectively marked as Q3, Q4 and Q5;
s37: respectively inputting Q3, Q4 and Q5 into the detection head;
s38: and predicting Q3, Q4 and Q5, generating a boundary box and predicting the classification of the traffic image.
In this scheme, the traffic image is downsampled five times; the feature maps obtained by downsampling are processed through a GSConv structure and channel-superposed according to the specified rules to obtain refined fusion features; new feature maps are then produced by standard convolution and fed to the detection head for prediction, yielding the classification of the traffic image. This improves both the convergence speed and the recognition accuracy of the model.
Preferably, in step S31, the traffic image is scaled proportionally to the input image size of the convolutional neural network model.
In this scheme, scaling the traffic image proportionally to the input size preset by the convolutional neural network model lets the image enter the model for identification without deforming its content, which improves the recognition precision of the model.
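As an illustration, the proportional scaling can be computed as below. This is a minimal sketch; the helper name and the 640-pixel model input size are assumptions for illustration, not taken from the patent.

```python
def scaled_size(width, height, target=640):
    """Aspect-preserving size so that the longer side matches the model input size."""
    ratio = target / max(width, height)
    return round(width * ratio), round(height * ratio)

# A 1920x1080 traffic image scaled for an assumed 640-pixel model input:
new_size = scaled_size(1920, 1080)  # -> (640, 360)
```

Because both sides shrink by the same ratio, the image content is not deformed; the shorter side is typically padded up to the square input size before inference.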
Preferably, in step S33, feature extraction and feature fusion are performed using 1 × 1 convolution.
In this scheme, a 1x1 convolution is equivalent to a fully-connected computation applied at every spatial position; adding a nonlinear activation function increases the nonlinearity of the network, so that it can express more complex features. In model design, 1x1 convolutions also serve to optimize the model and reduce its parameters, improving the recognition precision of the model.
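The claim that a 1x1 convolution is a per-position fully-connected computation can be checked numerically. The NumPy sketch below is illustrative only; the shapes and names are not from the patent.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x has shape (C_in, H, W), w has shape (C_out, C_in)."""
    return np.tensordot(w, x, axes=([1], [0]))  # result: (C_out, H, W)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))   # C_in = 8 feature map on a 4x4 spatial grid
w = rng.standard_normal((16, 8))     # weights of a 16-unit fully-connected layer

y = conv1x1(x, w)

# The same result, computed as a fully-connected layer applied at every pixel:
y_fc = np.empty_like(y)
for i in range(4):
    for j in range(4):
        y_fc[:, i, j] = w @ x[:, i, j]

assert np.allclose(y, y_fc)          # identical up to floating-point error
```

This is why a 1x1 convolution can mix channels (and shrink or grow the channel count) without touching spatial structure.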
Preferably, in step S34, the GSConv structure is: a feature map C1 is input; C1 passes through a conv + BN + SiLU convolution block to give feature map C21; C21 passes through a conv + BN + SiLU convolution block again to give feature map C22; C21 and C22 are channel-superposed, a shuffle is applied to re-order the channels, and feature map C2 is output.
In this scheme, after the feature map is processed by the GSConv structure, it carries more complex features and yields higher precision in identification and classification.
Preferably, the sample allocation strategy in the convolutional neural network model comprises the following steps:
s301: matching anchors and GT, and determining positive samples anchors of the current feature map;
s302: allocating positive samples of the current feature map to corresponding grid;
s303: calculating regression and classification loss of each positive sample to each GT, and acquiring a cost matrix and an IoU matrix;
s304: based on the IoU matrix, sorting and selecting the first ten candidate frames;
s305: adding the IoUs of the ten candidate frames and rounding down to obtain the number k of the candidate frames;
s306: and picking out the first k candidate frames according to the cost matrix and removing the repeated candidate frames.
In the scheme, positive and negative samples are balanced without increasing a data set, and the precision of the model is improved.
Preferably, in step S4, if the classification result matches an early-warning classification, an early warning is issued through text, sound and light.
In this scheme, the traffic-image classes that require early warning are designated as early-warning classifications; an early-warning classification comprises one or more classification results, and the same or different early-warning schemes are set for different results. An early-warning scheme includes warning text, warning sound, warning light and the like, i.e., the classification result is warned of through text, sound and light. The system can thus raise warnings for specified classification results, improving its applicability.
Preferably, in step S2, the traffic video is converted into traffic images every 5 to 20 frames.
In this scheme, because one traffic video contains far too many frames, it is unnecessary to convert every frame into a traffic image; converting one image every 10 frames improves the recognition speed of the model.
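The frame-interval conversion can be sketched as follows; the function and the decoded-frame list are illustrative stand-ins for the video processing module, not names from the patent.

```python
def sample_frames(frames, step=10):
    """Keep one frame out of every `step` decoded video frames."""
    return [frame for index, frame in enumerate(frames) if index % step == 0]

# For a 25-frame clip with step 10, frames 0, 10 and 20 become traffic images:
kept = sample_frames(list(range(25)), step=10)
```

With `step` anywhere in the 5-to-20 range named above, a 25 fps stream still yields one to five images per second, which is ample for per-vehicle classification while cutting inference load by the same factor.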
A system for improving the accuracy of traffic scene target detection classification comprises a video acquisition module, an operation and maintenance transmission module, a video storage module, a video processing module and a display module. The video acquisition module is connected with the operation and maintenance transmission module; the operation and maintenance transmission module is connected with the video storage module; the video storage module is connected with the video processing module and the display module; the video processing module is connected with a target detection module; the target detection module is connected with a target storage module and a target early-warning module; and the target storage module is connected with the display module.
In this scheme, the video acquisition module collects traffic videos; the operation and maintenance transmission module transmits the traffic videos to the database; the video storage module stores the traffic videos; the video processing module extracts image frames from the traffic videos; the target detection module detects, identifies and classifies the traffic images; the target storage module stores the classification results of the traffic images; the target early-warning module raises warnings on the classification results; and the display module displays the traffic videos and the classification results.
Preferably, the video acquisition module is mounted on the pole carrying the traffic signal lamp or on monitoring poles on both sides of the road.
In this scheme, the video acquisition module can conveniently capture traffic videos of the lanes.
The invention has the beneficial effects that: the recognition precision of the model is improved at minimal time cost; positive and negative samples are balanced without enlarging the data set, improving the accuracy of the model; and the convergence rate of the model is improved, accelerating its training.
Drawings
FIG. 1 is a schematic diagram of a system for improving the classification accuracy of traffic scene target detection according to the present invention.
FIG. 2 is a flowchart of a method for improving classification accuracy of traffic scene object detection according to the present invention.
FIG. 3 is a model data transfer diagram of a method for improving traffic scene target detection classification accuracy.
Fig. 4 is a GSConv structure diagram of a method for improving the classification accuracy of traffic scene target detection according to the present invention.
In the figures: 1. video acquisition module; 2. operation and maintenance transmission module; 3. video storage module; 4. video processing module; 5. target detection module; 6. target storage module; 7. target early-warning module; 8. display module.
Detailed Description
The technical scheme of the invention is further specifically described by the following embodiments and the accompanying drawings.
Embodiment: as shown in fig. 1, a system for improving classification accuracy of traffic scene object detection includes:
the video acquisition module 1 is used for acquiring traffic videos;
the operation and maintenance transmission module 2 is used for transmitting the traffic video to a database and is connected with the video acquisition module 1;
the video storage module 3 is used for storing traffic videos and is connected with the operation and maintenance transmission module 2;
the video processing module 4 is used for extracting image frames from the traffic video and is connected with the video storage module 3;
the target detection module 5 is used for detecting, identifying and classifying the traffic images and is connected with the video processing module 4;
the target storage module 6 is used for storing the classification result of the traffic image and is connected with the target detection module 5;
the target early warning module 7 is used for early warning the traffic image classification result and is connected with the target detection module 5;
and the display module 8 is used for displaying the traffic videos and the classification results and is connected with the video storage module 3 and the target storage module 6.
And the video acquisition module 1 is used for acquiring traffic videos and is connected with the operation and maintenance transmission module 2. The video acquisition module 1 can be a camera or a monitor, generally mounted on the pole carrying the traffic signal lamp or on monitoring poles on both sides of the road. It monitors vehicles and other objects travelling on the road, making traffic analysis from the videos convenient.
And the operation and maintenance transmission module 2 is used for transmitting the traffic video to the database and is connected with the video storage module 3 and the video acquisition module 1. The operation and maintenance transmission module 2 is wirelessly connected with the video storage module 3.
And the video storage module 3 is used for storing traffic videos and is connected with the operation and maintenance transmission module 2 and the video processing module 4. The video storage module 3 may be a database. When the video storage module 3 stores traffic videos, the traffic videos of different video acquisition modules 1 are respectively stored into different data tables, or the traffic videos of different video acquisition modules 1 are stored in the same data table, and the sources of the traffic videos are distinguished according to the numbers of the video acquisition modules 1.
And the video processing module 4 is used for extracting image frames from the traffic video and is connected with the video storage module 3 and the target detection module 5. The video processing module 4 extracts the traffic video from the video storage module 3, converts it into traffic images, and stores the images into a corresponding folder; the converted traffic images are used for detection and identification by the target detection module 5.
And the target detection module 5 is used for detecting, identifying and classifying the traffic images and is connected with the video processing module 4. The target detection module 5 adopts a yolov5 convolutional neural network to identify and classify the traffic images: the traffic image is input into the network, anchor boxes are adaptively calculated, and data enhancement is applied to the image; the feature-extraction backbone is a fully convolutional network that downsamples the traffic image five times using a CSP structure; 1x1 convolutions perform feature extraction and feature fusion on the feature maps from the third, fourth and fifth downsampling, giving feature maps T3, T4 and T5; T3, T4 and T5 each pass through a GSConv structure, giving feature maps P3, P4 and P5; P5 is upsampled by a factor of 2 and channel-superposed with P4, and the fused features are further refined through a C3 structure; the result is upsampled by a factor of 2 again and channel-superposed with P3, and the fused features are further refined through a C3 structure to give feature M3; M3 is downsampled by a factor of 2 and channel-superposed with P4, and the fused features are further refined through a C3 structure to give feature M4; M4 is downsampled by a factor of 2 and channel-superposed with P5, and the fused features are further refined through a C3 structure to give feature M5; M3, M4 and M5 each pass through a conv + BN + SiLU convolution block, giving feature maps Q3, Q4 and Q5; Q3, Q4 and Q5 are fed into the detection heads, which predict from them, generate bounding boxes and predict the classification of the traffic image. The recognition accuracy of the model is thus improved at minimal time cost.
And the target storage module 6 is used for storing the classification result of the traffic image predicted by the convolutional neural network and is connected with the target detection module 5. The target storage module 6 stores the classification result of the traffic image, so that the classification condition of the traffic image can be conveniently counted.
And the target early-warning module 7 is used for raising warnings according to the traffic-image classification results of the target detection module 5 and is connected with the target detection module 5. The traffic-image classes that require warning are designated as early-warning classifications; an early-warning classification comprises one or more classification results, and the same or different warning schemes are set for different results. A warning scheme includes warning text, warning sound, warning light and the like, i.e., the classification result is warned of through text, sound and light. When the classification result of the target detection module 5 is an early-warning classification, the target early-warning module 7 executes the warning scheme.
And the display module 8 is used for displaying the traffic video and the classification result and is connected with the video storage module 3 and the target storage module 6. The display module 8 extracts and displays the traffic video from the video storage module 3, and extracts and displays the classification result from the target storage module 6. The traffic condition can be conveniently and accurately checked.
As shown in fig. 2, a method for improving the classification accuracy of traffic scene target detection includes the following steps:
s1: acquiring a traffic video;
s2: converting the traffic video into a traffic image;
s3: identifying the traffic image based on the convolutional neural network to obtain a classification result of the traffic image;
s4: and carrying out early warning based on the classification result of the traffic image.
The traffic video is collected by the video acquisition module 1 and converted into traffic images, so that the convolutional neural network can conveniently identify and classify each image to obtain its classification result, and an early warning is issued based on that result.
S1: and acquiring a traffic video.
Specifically, a camera or a monitor is arranged on a support rod where the traffic signal lamp is located or monitoring rods on two sides of a road to collect traffic videos. And the collected traffic video is transmitted to a database through an operation and maintenance transmission module 2 which is in wireless connection with the video storage module 3.
S2: and converting the traffic video into a traffic image.
Specifically, the video processing module 4 extracts the traffic video from the video storage module 3, converts it into traffic images, and stores the images in a corresponding folder, so that the converted images can be detected and identified by the target detection module 5. The traffic images are 1920 × 1080; keeping the images large improves the detection of small targets and strengthens small-target feature extraction. Because one traffic video contains far too many frames, it is unnecessary to convert every frame into a traffic image, so one image is converted every 10 frames.
S3: and identifying the traffic image by the convolutional neural network to obtain the classification of the traffic image.
Specifically, image recognition and classification are performed with a yolov5 convolutional neural network; the data flow inside the model is shown in fig. 3 and fig. 4. The recognition accuracy of the model is improved at minimal time cost. The method comprises the following steps:
s31: and inputting the traffic image into a convolutional neural network, adaptively calculating an anchor frame, and performing data enhancement processing on the traffic image. The input traffic image is scaled according to the input image size.
S32: the feature extraction backbone network adopts a full convolution network, and performs five times of downsampling on the traffic image by using a CSP structure.
S33: and respectively performing feature extraction and feature fusion on the feature maps obtained by the third, fourth and fifth downsampling by adopting 1x1 convolution to obtain feature maps which are respectively marked as a feature map T3, a feature map T4 and a feature map T5.
S34: the feature map T3, the feature map T4, and the feature map T5 are respectively connected to the GSConv structure, and the generated feature maps are respectively denoted as a feature map P3, a feature map P4, and a feature map P5.
The GSConv structure is: a feature map C1 is input; it passes through a conv + BN + SiLU convolution block to give hidden feature map C21; C21 passes through a conv + BN + SiLU convolution block again to give hidden feature map C22; C21 and C22 are channel-superposed, a shuffle is applied to re-order the channels, and feature map C2 is output.
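A minimal PyTorch sketch of the GSConv structure described above. The layer widths, kernel sizes and the exact shuffle pattern are assumptions (the patent does not fix them); the depthwise second block follows common GSConv variants.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """The conv + BN + SiLU block used throughout the network."""
    def __init__(self, c_in, c_out, k=1, s=1, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class GSConv(nn.Module):
    """C1 -> conv block -> C21 -> conv block -> C22; concat C21, C22; shuffle -> C2."""
    def __init__(self, c_in, c_out):
        super().__init__()
        half = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, half, k=1)
        # second block assumed depthwise (groups=half), as in common GSConv variants
        self.cv2 = ConvBNSiLU(half, half, k=5, groups=half)

    def forward(self, x):
        c21 = self.cv1(x)
        c22 = self.cv2(c21)
        y = torch.cat((c21, c22), dim=1)        # channel superposition
        b, c, h, w = y.shape                    # shuffle: interleave the two halves
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

p3 = GSConv(256, 256)(torch.randn(1, 256, 80, 80))  # e.g. T3 -> P3
```

The shuffle mixes the cheap depthwise channels with the dense ones so the following C3 blocks see both, which is what lets GSConv approach standard-convolution accuracy at lower cost.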
S35: feature map P5 is upsampled by a factor of 2 and channel-superposed with feature map P4, and the fused features are further refined through a C3 structure; the result is upsampled by a factor of 2 again and channel-superposed with feature map P3, and the fused features are further refined through a C3 structure to give feature M3;
feature M3 is downsampled by a factor of 2 and channel-superposed with feature map P4, and the fused features are further refined through a C3 structure to give feature M4;
feature M4 is downsampled by a factor of 2 and channel-superposed with feature map P5, and the fused features are further refined through a C3 structure to give feature M5;
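The shape bookkeeping of this top-down/bottom-up fusion can be illustrated as below. This is a sketch only: the C3 refinement block is abstracted to a placeholder 1x1 convolution, the downsampling to max-pooling, and all channel counts are assumptions.

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode="nearest")
down = nn.MaxPool2d(kernel_size=2)                   # stand-in for 2x downsampling
c3 = lambda c_in, c_out: nn.Conv2d(c_in, c_out, 1)   # placeholder for the C3 block

p3 = torch.randn(1, 128, 80, 80)
p4 = torch.randn(1, 256, 40, 40)
p5 = torch.randn(1, 512, 20, 20)

t  = c3(512 + 256, 256)(torch.cat((up(p5), p4), dim=1))    # top-down: fuse with P4
m3 = c3(256 + 128, 128)(torch.cat((up(t), p3), dim=1))     # fuse with P3 -> M3
m4 = c3(128 + 256, 256)(torch.cat((down(m3), p4), dim=1))  # bottom-up -> M4
m5 = c3(256 + 512, 512)(torch.cat((down(m4), p5), dim=1))  # bottom-up -> M5
```

Channel superposition is simply concatenation along the channel dimension; the spatial sizes must first be matched by the 2x up- or downsampling, which is why each fusion pairs maps exactly one stride apart.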
s36: the feature M3, the feature M4 and the feature M5 respectively obtain a feature map Q3, a feature map Q4 and a feature map Q5 through a conv + BN + SiLU convolution structure block;
s37: inputting the characteristic diagram Q3, the characteristic diagram Q4 and the characteristic diagram Q5 into a detection head respectively;
s38: and predicting the characteristic diagram Q3, the characteristic diagram Q4 and the characteristic diagram Q5 to generate a boundary frame and predict the classification of the traffic image.
The sample distribution strategy in the convolutional neural network model balances positive and negative samples without increasing a data set, improves the precision of the model, and comprises the following steps:
s301: matching the prior boxes (anchors) with GT, and determining positive samples anchors of the current feature map;
s302: assigning positive samples of the current feature map to a corresponding grid (grid);
s303: calculating the regression and classification loss of each positive sample against each GT, and obtaining a cost matrix and an IoU (Intersection over Union) matrix;
s304: sorting by the IoU matrix and selecting the top ten candidate boxes;
s305: summing the IoUs of these ten candidate boxes and rounding down to obtain the number k of candidate boxes assigned to the target box;
s306: picking out the k lowest-cost candidate boxes according to the cost matrix, and removing duplicated candidate boxes.
S4: and carrying out early warning based on the classification result of the traffic image.
Specifically, the traffic-image classes requiring warning are designated as early-warning classifications; an early-warning classification comprises one or more classification results, and the same or different warning schemes are set for different results. A warning scheme includes warning text, warning sound, warning light and the like, i.e., the classification result is warned of through text, sound and light. When the classification result of the target detection module 5 is an early-warning classification, the target early-warning module 7 executes the warning scheme.
The recognition precision of the model is improved at minimal time cost; positive and negative samples are balanced without enlarging the data set, improving the accuracy of the model; and the convergence rate of the model is improved, accelerating its training.

Claims (10)

1. A method for improving the classification precision of traffic scene target detection is characterized by comprising the following steps:
s1: acquiring a traffic video;
s2: converting the traffic video into a traffic image;
s3: identifying the traffic image based on the convolutional neural network to obtain a classification result of the traffic image;
s4: and carrying out early warning based on the classification result of the traffic image.
2. The method for improving traffic scene target detection classification precision according to claim 1, wherein step S3 comprises the following steps:
S31: inputting the traffic image, computing anchor boxes and performing data augmentation;
S32: downsampling the traffic image five times using a CSP structure;
S33: performing feature extraction and feature fusion on the feature maps obtained from the third, fourth and fifth downsamplings, respectively, to obtain feature maps T3, T4 and T5;
S34: attaching a GSConv structure after each of T3, T4 and T5, and denoting the generated feature maps as P3, P4 and P5, respectively;
S35: upsampling P5 by a factor of 2 and concatenating it channel-wise with P4, then further refining the fused features through a C3 structure; upsampling the result by a factor of 2 and concatenating it channel-wise with P3, then further refining the fused features through a C3 structure to obtain feature M3; downsampling M3 by a factor of 2 and concatenating it channel-wise with P4, then further refining the fused features through a C3 structure to obtain feature M4; downsampling M4 by a factor of 2 and concatenating it channel-wise with P5, then further refining the fused features through a C3 structure to obtain feature M5;
S36: passing M3, M4 and M5 each through a conv + BN + SiLU convolution block to obtain feature maps denoted Q3, Q4 and Q5, respectively;
S37: feeding Q3, Q4 and Q5 into the detection heads, respectively;
S38: making predictions on Q3, Q4 and Q5, generating bounding boxes and predicting the classification of the traffic image.
3. The method of claim 2, wherein in step S31, the traffic image is scaled to the input image size of the convolutional neural network model.
4. The method for improving traffic scene target detection classification precision according to claim 2 or 3, wherein in step S33, 1×1 convolutions are used for feature extraction and feature fusion.
5. The method for improving traffic scene target detection classification precision according to claim 2 or 3, wherein in step S34, the GSConv structure is as follows: a feature map C1 is input; C1 passes through a conv + BN + SiLU convolution block to obtain feature map C21; C21 passes through a further conv + BN + SiLU convolution block to obtain feature map C22; the channels of C21 and C22 are concatenated, a channel shuffle is applied, the channels are then split, and feature map C2 is output.
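As an illustrative (non-claimed) sketch of the channel shuffle applied after the concatenation in the GSConv structure of claim 5, with NumPy arrays standing in for framework tensors (the conv + BN + SiLU blocks themselves are omitted from this sketch):

```python
import numpy as np

# Channel shuffle: interleave the channels of the two concatenated halves so
# that features originating from C21 and C22 alternate. Shapes are (C, H, W).
def channel_shuffle(x, groups=2):
    c, h, w = x.shape
    assert c % groups == 0
    # (groups, C//groups, H, W) -> swap the first two axes -> flatten back
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# Toy example: 2 channels standing in for "C21" (constant values 0, 1) followed
# by 2 channels standing in for "C22" (constant values 2, 3), concatenated.
x = np.stack([np.full((2, 2), v, dtype=float) for v in range(4)])
y = channel_shuffle(x, groups=2)
```

After shuffling, the channel order is 0, 2, 1, 3: each original half contributes every other channel, which mixes the information from the two convolution paths before the subsequent channel split.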
6. The method for improving traffic scene target detection classification precision according to claim 1, wherein the sample assignment strategy in the convolutional neural network model comprises the following steps:
S301: matching anchors to ground-truth (GT) boxes and determining the positive-sample anchors of the current feature map;
S302: assigning the positive samples of the current feature map to the corresponding grid cells;
S303: computing the regression and classification losses of each positive sample with respect to each GT to obtain a cost matrix and an IoU matrix;
S304: sorting by the IoU matrix and selecting the top ten candidate boxes;
S305: summing the IoUs of the ten candidate boxes and rounding down to obtain the number k of candidate boxes;
S306: selecting the top k candidate boxes according to the cost matrix and removing duplicate candidate boxes.
7. The method for improving traffic scene target detection classification precision according to claim 1, wherein in step S4, if the classification result matches an early-warning class, an alert is issued through text, sound and light.
8. The method of claim 1, wherein in step S2, one traffic image is extracted from the traffic video every 5 to 20 frames.
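The frame sampling of claim 8 can be sketched as follows. This is an index-only illustration (in practice the frames would be decoded with a video library such as OpenCV's `VideoCapture`; the 100-frame clip and interval of 10 are assumptions for the example):

```python
# Illustrative frame sampling for step S2: keep one frame out of every
# `interval` frames of a video, with the interval between 5 and 20 per claim 8.
def sampled_frame_indices(total_frames, interval):
    assert 5 <= interval <= 20, "claim 8 specifies an interval of 5 to 20 frames"
    return list(range(0, total_frames, interval))

# e.g. a 100-frame clip sampled every 10 frames yields 10 traffic images
indices = sampled_frame_indices(100, 10)
```

Sampling at an interval rather than converting every frame keeps consecutive traffic images sufficiently different while bounding the detection workload per second of video.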
9. A system for improving traffic scene target detection classification precision, applying the method for improving traffic scene target detection classification precision according to any one of claims 1 to 8, characterized by comprising: a video acquisition module, an operation and maintenance transmission module, a video storage module, a video processing module, a target detection module, a target storage module, a target early warning module and a display module, wherein the video acquisition module is connected to the operation and maintenance transmission module; the operation and maintenance transmission module is connected to the video storage module; the video storage module is connected to the video processing module and the display module; the video processing module is connected to the target detection module; the target detection module is connected to the target storage module and the target early warning module; and the target storage module is connected to the display module.
10. The system for improving traffic scene target detection classification precision according to claim 9, wherein the video acquisition module is mounted on the support pole carrying a traffic signal lamp or on monitoring poles on both sides of a road.
CN202211596741.1A 2022-12-12 2022-12-12 Method and system for improving traffic scene target detection classification precision Pending CN115937591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211596741.1A CN115937591A (en) 2022-12-12 2022-12-12 Method and system for improving traffic scene target detection classification precision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211596741.1A CN115937591A (en) 2022-12-12 2022-12-12 Method and system for improving traffic scene target detection classification precision

Publications (1)

Publication Number Publication Date
CN115937591A true CN115937591A (en) 2023-04-07

Family

ID=86697491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211596741.1A Pending CN115937591A (en) 2022-12-12 2022-12-12 Method and system for improving traffic scene target detection classification precision

Country Status (1)

Country Link
CN (1) CN115937591A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117612139A (en) * 2023-12-19 2024-02-27 昆明盛嗳谐好科技有限公司 Scene target detection method and system based on deep learning and electronic equipment


Similar Documents

Publication Publication Date Title
CN112183313B (en) SlowFast-based power operation field action identification method
CN108039044B (en) Vehicle intelligent queuing system and method based on multi-scale convolutional neural network
CN104504377B (en) A kind of passenger on public transport degree of crowding identifying system and method
Cao et al. MCS-YOLO: A multiscale object detection method for autonomous driving road environment recognition
KR102035592B1 (en) A supporting system and method that assist partial inspections of suspicious objects in cctv video streams by using multi-level object recognition technology to reduce workload of human-eye based inspectors
CN110969130A (en) Driver dangerous action identification method and system based on YOLOV3
CN102915432B (en) A kind of vehicle-mounted microcomputer image/video data extraction method and device
CN111553209B (en) Driver behavior recognition method based on convolutional neural network and time sequence diagram
CN109558792B (en) Method and system for detecting internet logo content based on samples and features
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN109063667A (en) A kind of video identification method optimizing and method for pushing based on scene
CN110516622A (en) A kind of gender of occupant, age and emotional intelligence recognition methods and system
CN115937591A (en) Method and system for improving traffic scene target detection classification precision
CN111964763B (en) Method for detecting intermittent driving behavior of automobile in weighing area of dynamic flat-plate scale
CN114882440A (en) Human head detection method and system
CN115984537A (en) Image processing method and device and related equipment
CN112288701A (en) Intelligent traffic image detection method
CN117373058A (en) Identification method for small-difference classroom behaviors
CN115719475A (en) Three-stage trackside equipment fault automatic detection method based on deep learning
CN106202274A (en) A kind of defective data automatic abstract sorting technique based on Bayesian network
CN202815869U (en) Vehicle microcomputer image and video data extraction apparatus
CN112132839B (en) Multi-scale rapid face segmentation method based on deep convolution cascade network
CN117789077A (en) Method for predicting people and vehicles for video structuring in general scene
CN112418020A (en) Attention mechanism-based YOLOv3 illegal billboard intelligent detection method
CN114005054A (en) AI intelligence system of grading

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination