CN111797769B - Small-target-sensitive vehicle detection system - Google Patents


Info

Publication number
CN111797769B
CN111797769B CN202010639920.3A CN202010639920A
Authority
CN
China
Prior art keywords
module
network
training
frame
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010639920.3A
Other languages
Chinese (zh)
Other versions
CN111797769A (en)
Inventor
毕远国
黄子烜
刘威
尹晓宇
郭茹博
刘纪康
Original Assignee
东北大学 (Northeastern University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 东北大学 (Northeastern University)
Priority to CN202010639920.3A priority Critical patent/CN111797769B/en
Publication of CN111797769A publication Critical patent/CN111797769A/en
Application granted granted Critical
Publication of CN111797769B publication Critical patent/CN111797769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08: Detecting or categorising vehicles
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision applications and provides a small-target-sensitive vehicle detection system. The system comprises a data module, a network structure module, a system configuration module, a training module, a testing module, a function support module, a log acquisition module, an effect analysis module, a detection module and an interaction module. The core algorithm of the invention, a small-target-sensitive fully convolutional neural network, redesigns the existing CNN on the basis of R-FCN and proposes a new layer, the small-target-sensitive pooling layer, which enriches the features of small target vehicles so that small-size vehicles can be detected more accurately. A new voting mechanism is also designed in the system so that occluded vehicles are detected more accurately. Finally, the detection network is further streamlined so that detection runs in a more timely manner and meets the real-time requirement.

Description

Small-target-sensitive vehicle detection system
Technical Field
The invention belongs to the technical field of computer vision applications and relates to a small-target-sensitive vehicle detection system.
Background
In recent years, theory and technology related to images and computer vision have been applied ever more widely, and deep learning has matured within computer vision research. Accordingly, the demand for artificial intelligence that assists people with complex work has become urgent. Autonomous driving is an important application of computer vision, and the related technologies have become a research focus of many universities and institutions.
Autonomous driving technology draws on the theory of multiple disciplines, such as microcomputers, automatic control, and information fusion. Such a system collects and analyzes information along the vehicle's route in real time and is mainly used to assist driving decisions and thereby avoid traffic accidents. Driverless-car technology has been actively developed since around 1970. Compared with human driving, autonomous driving is safer and uses roads more efficiently. In China, Baidu long ago started a dedicated program for driverless research. In later years, Baidu announced that its newly developed autonomous vehicle had completed on-road tests, putting autonomous driving into application. Baidu's Apollo car subsequently drove unmanned across the Hong Kong-Zhuhai-Macao Bridge, and Baidu has also proposed early deployment of driverless technology in the Xiong'an New Area.
Meanwhile, deep learning is advancing rapidly, which in turn has driven great progress in computer vision. The invention therefore provides a new deep-learning-based system whose core algorithm is the small-target-sensitive fully convolutional neural network (Small-Sensitive Fully Convolutional Neural Network, SS-FCN). First, the algorithm redesigns the existing CNN on the basis of R-FCN and proposes a new layer, the small-target-sensitive pooling layer, which enriches the features of small target vehicles so that small-size vehicles are detected more accurately. Meanwhile, a new voting mechanism is designed in the system so that occluded vehicles are detected more accurately. Finally, the detection network is further streamlined so that detection runs in a more timely manner and meets the real-time requirement.
Disclosure of Invention
The invention provides a small-target-sensitive vehicle detection system built on this algorithm. First, the invention analyzes the existing problem of accurately identifying small target vehicles in pictures or video. In road scenes, vehicle scale varies greatly with viewing distance, the response of a CNN changes as the size of the vehicle changes, and the features extracted from differently sized objects of the same class can differ considerably. Moreover, for a vehicle of particularly small size (for example, 30 pixels or fewer), only single-digit numbers of pixels may remain after feature extraction (which typically reduces object size by a factor of 8 or 16). Detection over so few pixels performs very poorly, so conventional vehicle detection algorithms easily treat small vehicle targets as background. Another problem in applying CNNs to vehicle detection is that occlusion is severe in traffic scenes and is difficult for conventional CNNs to handle. When a vehicle is occluded, the feature map computed from it contains a great deal of interference; if the feature maps of occluded and non-occluded vehicles are fed into the same detection network, the detection accuracy suffers. In addition, timely detection in traffic scenes must be considered: owing to the persistence of vision, a vehicle remains visible to the human eye for 0.1-0.4 seconds after it disappears, so real-time vehicle detection cannot run slower than about 10 frames per second. Therefore, the invention proposes a small-target-sensitive pooling layer to address the large scale span of vehicles in traffic scenes and the low detection accuracy for small target vehicles; designs a voting mechanism to mitigate the occlusion problem; and designs a more streamlined detection network structure to meet the real-time requirement.
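As a quick check of the scale argument above, the following illustrative snippet computes how many feature-map pixels a roughly 30-pixel vehicle occupies after the typical 8x or 16x backbone downsampling mentioned in the text (the snippet itself is not part of the patent):

```python
# Illustrative arithmetic: feature extraction downsamples by 8x or 16x,
# so a ~30 px vehicle leaves only single-digit feature-map pixels.
for stride in (8, 16):
    print(f"stride {stride}: a 30 px vehicle spans ~{30 / stride:.1f} feature-map pixels")
```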
The technical scheme of the invention is as follows:
the small-target sensitive vehicle detection system comprises a data module, a network structure module, a system configuration module, a training module, a test module, a function support module, a log acquisition module, an effect analysis module, a detection module and an interaction module;
The data module: used for storing pictures and annotation data, preprocessing the images and annotation information, and passing the processed data to the training module and testing module;
The network structure module: the SS-FCN is two-stage, i.e., it uses a region proposal network (RPN) dedicated to generating candidate regions. It mainly comprises a feature extraction network and a detection network. The feature extraction network adopts PVANET, which is faster with little loss of accuracy, and its parallel structure detects small targets better; the detection network is based on R-FCN, with the fully connected layer removed, so a fully convolutional network predicts the vehicles. The invention modularizes the network structure for ease of updating, as extensibility requires: when the structure is updated, only the network structure definition file needs to be replaced, and no other part changes. The basic flow of the network structure module (see the sketch after this module list) is as follows: first, a complete picture is input; second, features are extracted from the picture by five convolution groups, the feature extraction network adopting PVANET and mainly comprising C.ReLU, Inception, residual structures and parallel structures, which makes it faster with little loss of accuracy and better at detecting small targets; third, the output of the fourth convolution group is taken as the input of the RPN, whose output is a set of candidate boxes; fourth, two branch channels are led out of the fifth convolution group and convolved separately to obtain a regression score feature map and a classification score feature map, which are input, together with the candidate boxes obtained in the third step, into the small-target pooling layer; fifth, prediction through the voting mechanism yields the final classification and regression results.
The system configuration module: used for defining the parameters for system training or testing;
The training module: used for completing network training. The training method of the core algorithm SS-FCN is supervised learning: a set of labeled samples is used to adjust the convolution kernel weights of the detection network until, over repeated iterations, the required performance is reached. A predicted value is first computed through the network and compared with the true value (read in from the annotation file); a loss value is computed through the loss function and back-propagated to update the network. The optimization method of the vehicle detection system is stochastic gradient descent (SGD). The loss function expresses, through a mapping relation, the difference between the network's predicted value and the true value, and network training essentially updates the convolution kernel weights continuously so as to minimize the loss function. In the invention, each example consists of a picture and an annotation file storing the relevant information of all vehicles in that picture; the true value is this annotation file, so the choice of loss function is closely tied to the effect of the final network training. The system fully encapsulates the training configuration and the complex flow, and a user can start training by entering a single command at the terminal;
The testing module: used for completing the effect-testing function, including speed and accuracy tests; all details and configuration of the test process are encapsulated, and a user can start a test by entering one line of command at the terminal;
The function support module: used for providing functional support to the logic part and the network structure part in the form of encapsulated layer files or packages;
The log acquisition module: used for collecting and recording important information during training and testing; the user need not invoke this module explicitly, as the system executes it automatically during training or testing;
The effect analysis module: used for analyzing system performance from the test results and the data collected in the logs;
The detection module: completes detection on the input pictures;
The interaction module: used for providing a graphical interface and command-line tools, completing the encapsulation of the logic part, increasing the usability of the system, and verifying user permissions, including registration and login functions.
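The following is a minimal, runnable sketch of the five-step flow of the network structure module described above. The layer stand-ins (plain downsampling for the PVANET convolution groups, a stubbed RPN, and tanh placeholders for the two branch convolutions) are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def conv_group(x):
    # stand-in for one PVANET convolution group: halve the spatial resolution
    return x[:, ::2, ::2]

def ss_fcn_forward(image):
    # step 1: a complete picture is input; image: (channels, H, W)
    feats, f = [], image
    for _ in range(5):                  # step 2: five convolution groups
        f = conv_group(f)
        feats.append(f)
    c4, c5 = feats[3], feats[4]
    # step 3: the RPN runs on the fourth group's output (stubbed as one box)
    proposals = [(0, 0, c4.shape[2] - 1, c4.shape[1] - 1)]
    # step 4: two branches: regression score map from the shallower c4 and
    # classification score map from the deeper c5 (tanh stands in for the convs)
    reg_map, cls_map = np.tanh(c4), np.tanh(c5)
    # step 5: SSPooling + voting (sketched later in this document) would turn
    # (proposals, reg_map, cls_map) into the final classification/regression results
    return proposals, reg_map, cls_map

proposals, reg_map, cls_map = ss_fcn_forward(np.zeros((3, 640, 640)))
```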
Taking the classification score map as an example, voting is carried out after the score map is generated. For each region of interest, the score map in each dimension is averaged to obtain a (C+1)-dimensional vector, denoted $r_c$, where C is the number of categories; since the system only predicts vehicles, C is 1. The probability that the prediction box is the target is then calculated:

$$s_c = \frac{e^{r_c}}{\sum_{c'=0}^{C} e^{r_{c'}}}$$

This value is used as the loss input during training and to rank the regions of interest during prediction. Similarly, in frame regression, C+1 is replaced by 4×2, representing frame prediction for the vehicle and the background, with 4 regression values predicted per class. In the invention, the anchor boxes in the RPN are set to the three aspect ratios 1:2, 1:1 and 2:1 and the three scales 92, 196 and 384, i.e., each anchor point predicts 9 anchor boxes. The invention uses two channels to generate the classification score map and the regression score map separately: the regression score map is generated from the output of the slightly shallower fourth convolution group of the feature extraction network, and the classification score map from the output of the deeper fifth convolution group.
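A minimal numpy rendering of this vote, averaging each class plane of the pooled score map and applying a softmax as in R-FCN, might look as follows; the array shapes are illustrative assumptions:

```python
import numpy as np

def roi_class_probability(pooled):
    # pooled: (C+1, k, k) position-sensitive classification scores for one RoI
    r = pooled.reshape(pooled.shape[0], -1).mean(axis=1)  # r_c: average the k*k bins per class
    e = np.exp(r - r.max())                               # numerically stable softmax
    return e / e.sum()                                    # (C+1,) class probabilities

# C = 1 (vehicle vs. background), k = 5 pooling bins
probs = roi_class_probability(np.random.rand(2, 5, 5))
print(probs)  # e.g. [p_background, p_vehicle], sums to 1
```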
One characteristic of vehicles in traffic scenes is their large size span (vehicles close to the camera appear large in the picture, and distant vehicles appear small). As a result, the features that large and small vehicles yield through a convolutional neural network differ greatly: the features extracted from a large vehicle are rich, while those from a small vehicle are scarce, with the smallest vehicle's features occupying only 4 pixels. Feeding such widely differing features into the same detection network tends to reduce accuracy.
One main line of attack on this problem is the multi-scale idea. The input size can be enlarged with an image pyramid to enrich the features of small target vehicles, or the feature pyramid idea can be used, i.e., shallow high-resolution features predict small target vehicles. Feature pyramid algorithms are widely applied: target features are classified by size, and features of different sizes are sent to different detection networks, which improves detection accuracy for small-size vehicles. However, such methods still have drawbacks. Multi-branch prediction remains time-consuming: although it is somewhat faster than the image pyramid, the time overhead grows, and the memory overhead is excessive because each branch amounts to one more detection network. Moreover, when a feature map is smaller than a certain size, the pooling operation damages the structure of the original feature map: after vehicles of different sizes pass through the same pooling layer, the pooled features of normal-size vehicles remain rich, while the feature map of a small vehicle is destroyed by pooling. If these two kinds of features are fed into the same detection network, a small target vehicle is easily mistaken for background, and the fact that two very different features both represent vehicles also makes training difficult. Therefore, after analyzing through feature visualization why small-size vehicles are detected inaccurately, a small-target-sensitive pooling layer is proposed to extract features for small-size vehicle candidate boxes after the feature map is generated. Its inputs are the candidate boxes generated by the RPN and the feature maps, and its outputs are the features of the candidate boxes.
The small-target pooling layer of the network structure module operates in the following steps:
The first step: thresholds on the height and width of the rectangular boxes are set as the classification standard;
The second step: the received rectangular boxes are classified against the thresholds: a candidate box whose height or width is smaller than the threshold is classified as a small candidate box, and one whose height and width are larger than the thresholds is classified as a normal candidate box;
The third step: normal candidate boxes are processed with the position-sensitive region-of-interest pooling operation; for small candidate boxes, the corresponding regression and classification score feature maps are first enlarged two-fold by deconvolution, the new score maps are then pooled position-sensitively, and all outputs are passed uniformly into the detection module, yielding a classification score and a regression score;
The above process is expressed by the following formulas:

y(i,j) = F(z*)  (1)

z* = deconv(z, proposal)  (2)

where y(i,j) denotes the result of the region-of-interest pooling operation; z is the score map computed by the feature extraction network, i.e., the score map input to SSPooling, either the location score map or the classification score map; deconv(z, proposal) is the deconvolution operation applied to z; proposal is the candidate box generated by the RPN, whose size is the judging condition: the deconvolution is performed when an edge of the candidate box is smaller than the threshold set by the system, and otherwise z* = z directly; and F(z*) is the region-of-interest pooling operation.
Regarding the occlusion problem in traffic scenes: a conventional detection network handles occlusion poorly, since the features extracted from an occluded vehicle contain a great deal of interference. If the feature maps of occluded and non-occluded vehicles are fed into the same detection network, detection accuracy suffers.
The voting mechanism of the system is based on the voting rule of R-FCN. In the original voting algorithm, the detection network applies global average pooling or global max pooling to the PSRoIPooling result: average pooling takes the mean of the weights within the region, and max pooling takes the maximum within the region. The pooled result is a score, which is the prediction. Experiments show, however, that average pooling easily misses occluded targets: only the unoccluded part has high confidence while the occluded part has low confidence, and after averaging, the final score is dragged down by the occluded part's confidence, so the target is misjudged as a negative sample. Max pooling, on the other hand, easily produces false detections on negative samples: if some part of a negative sample strongly resembles a vehicle, that part scores very high, the max-pooled output is that part's score, and the scores of the other parts cannot influence the output, so the negative sample is misjudged as positive.
The voting mechanism of the network structure module of the system is therefore as follows:
The average-pooling and max-pooling voting results are fused, giving a new way of computing the score:

S = α × S_ave + (1 - α) × S_max  (3)

where S is the scoring result, S_ave is the average-pooled result, S_max is the max-pooled result, and α is the weight; experiments on the system show the best effect when α is 0.65.
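A one-line numpy sketch of formula (3), applied to the k×k bin scores of one class for one region of interest (the shapes and the α default are taken from the text; the function name is illustrative):

```python
import numpy as np

def fused_score(bin_scores, alpha=0.65):
    # formula (3): blend average pooling (robust to look-alike negatives)
    # with max pooling (robust to partial occlusion)
    return alpha * bin_scores.mean() + (1 - alpha) * bin_scores.max()

print(fused_score(np.array([0.9, 0.8, 0.1, 0.2, 0.85])))  # bins of a partly occluded target
```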
The loss function expresses, through a mapping relation, the difference between the network's predicted value and the true value. In the system, each instance consists of a picture and an annotation file storing the relevant information of all vehicles in the picture; the true value is this annotation file. Training essentially updates the weights continuously to minimize the loss function, so the choice of loss function is closely related to the effect of the final network training.
Before the loss function is calculated, a label is set for each anchor box according to the following criteria:
1) An anchor box is considered a positive sample if its intersection-over-union (IoU) with the box of some labeled target is the largest among all anchor boxes, or if its IoU with the box of any labeled target exceeds 0.7;
2) if the IoU is smaller than 0.3, it is judged a negative sample;
3) anchor boxes with IoU between 0.3 and 0.7 do not participate in training.
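A minimal sketch of these three labeling rules, with boxes as (x1, y1, x2, y2) tuples; the helper names are illustrative:

```python
import numpy as np

def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def label_anchors(anchors, gt_boxes, lo=0.3, hi=0.7):
    ious = np.array([[iou(a, g) for g in gt_boxes] for a in anchors])
    labels = np.full(len(anchors), -1)      # rule 3: ignored by default
    labels[ious.max(axis=1) < lo] = 0       # rule 2: negative below 0.3
    labels[ious.max(axis=1) > hi] = 1       # rule 1: positive above 0.7 ...
    labels[ious.argmax(axis=0)] = 1         # ... or best-matching anchor per target
    return labels

print(label_anchors([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 11, 11)]))  # [1 0]
```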
The loss function takes the following form:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{4}$$

where $N_{cls}$ and $N_{reg}$ are normalization terms; $i$ is the index of the anchor box; $p_i$ is the probability that the $i$-th anchor box is foreground; $p_i^*$ is the probability (1 or 0) that the real box is foreground; $t_i$ is the position representation of the $i$-th anchor box (a 4-dimensional array); and $\lambda$ is the weight between the classification loss and the regression loss. The classification loss function $L_{cls}$ is a binary loss function of the form:

$$L_{cls}(p_i, p_i^*) = -\log\bigl[p_i^* p_i + (1 - p_i^*)(1 - p_i)\bigr] \tag{5}$$

where the term $p_i^* p_i + (1 - p_i^*)(1 - p_i)$ expresses whether the predicted class matches the real class $p^*$; formula (5) is the negative logarithm of this value, which tends to 1 when the prediction matches and to 0 otherwise.

$L_{reg}$, the regression loss function, is expressed as follows:

$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x,y,w,h\}} \mathrm{smooth}_{L1}\bigl(t_i^j - t_i^{*j}\bigr) \tag{6}$$

where $\mathrm{smooth}_{L1}$ is defined as follows:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

In formula (6), $t_i$ is the position representation of the $i$-th anchor box (a 4-dimensional array) and $t_i^*$ is the position representation of the real box (a 4-dimensional array); the 4-dimensional data contains the following 4 items:

t_x = (x - x_a)/w_a
t_y = (y - y_a)/h_a
t_w = log(w/w_a)
t_h = log(h/h_a)  (8)

where t_x, t_y, t_w and t_h are the offsets of the predicted candidate box relative to the anchor box in the horizontal direction, the vertical direction, the width and the height; x and y are the center coordinates of the predicted box and w and h are its width and height; and w_a and h_a are the width and height of the anchor box, with x_a, y_a its center coordinates.
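A sketch of formulas (7)-(8) as code: the smooth-L1 term and the box-offset encoding, plus the inverse decoding used at prediction time. Boxes are (cx, cy, w, h) tuples and the function names are illustrative:

```python
import numpy as np

def smooth_l1(x):
    # formula (7)
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

def encode(box, anchor):
    # formula (8): offsets of the predicted box relative to its anchor
    (x, y, w, h), (xa, ya, wa, ha) = box, anchor
    return np.array([(x - xa) / wa, (y - ya) / ha, np.log(w / wa), np.log(h / ha)])

def decode(t, anchor):
    # inverse mapping: recover a box from predicted offsets and the anchor
    xa, ya, wa, ha = anchor
    return (t[0] * wa + xa, t[1] * ha + ya, wa * np.exp(t[2]), ha * np.exp(t[3]))

t = encode((105.0, 48.0, 60.0, 40.0), (100.0, 50.0, 64.0, 48.0))
print(smooth_l1(t).sum())  # formula (6) for one positive anchor, taking t* = 0
```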
The structure of the detection network is simplified to meet the system's real-time requirement, a necessary consideration in vehicle detection. To this end, the system redesigns the structure of the vehicle detection network to make it more compact, meeting the real-time requirement of traffic scenes, i.e., raising the per-frame detection speed, without affecting accuracy. The flow of the detection network in R-FCN is shown in the left diagram of FIG. 6: a convolution is first applied to the output of the feature extraction network to generate a new feature map, which is then convolved to generate a regression score map for predicting the target-box regression values and a classification score map for predicting the target category, producing the final result. The detection network is improved as follows.
The structure proposed by the invention is shown in the right diagram of FIG. 6: two channels generate the regression score map and the classification score map separately. The regression score map determines the target's position information and needs relatively rich feature-map information, while the classification score map determines the target's category information and needs relatively accurate semantic information. The invention therefore generates the regression score map from the output of the slightly shallower fourth convolution group of the feature extraction network, and the classification score map from the output of the deeper fifth convolution group. Note that convolution essentially downsamples the feature map, shrinking it. This design targets small objects: the feature map output by the fourth convolution group is relatively rich in information and makes the target's position easier to determine, whereas the feature map after the fifth convolution group is smaller and correspondingly poorer in information, so misses arise more easily. The classification score map, by contrast, must capture the target's semantics accurately, and the deeper feature map produced after five convolution groups works better for it.
Compared with the R-FCN algorithm, the extra convolution layer between the feature extraction network and the score maps is unnecessary in vehicle detection. First, that layer does not participate in pre-training; if the structure is too complex, the network overfits and the time cost rises. Second, generating the regression score map and the classification score map simultaneously from the same feature map output by the same convolution group is contradictory: the former needs the relatively rich feature-map information of a shallow network while the latter needs the relatively accurate semantics of a deep network, so convolving the output of a single layer inevitably leaves one of the two maps short of its requirements. Therefore, the extra convolution layer between the feature extraction network and the score maps is removed here, and the classification and regression score maps are generated from different feature maps; experiments show that this does not reduce accuracy while it increases speed.
The specific flow of the training module of the system is as follows: first, the deep learning framework is used to prepare the training model, initializing the network training and configuration.
Second, before training, a pre-training model is loaded to initialize the base network part of the SS-FCN; the pre-training model is a classification model trained by the PVANET network on the ImageNet dataset.
Third, Xavier initialization is applied to the detection network part.
Fourth, the image data and annotation files are loaded, the images and annotations are preprocessed, and the deep learning framework automatically reads the converted data from the data files. The image data is then fed into the network for training.
Fifth, network training is performed. The training process is first a forward propagation process, which includes convolution layers and pooling layers. The input layer of the system is realized by convolution, so the vehicle picture can directly undergo forward-propagation convolution; the pooling layer mainly adjusts the output scale of the previous layer. During training, the deep learning framework generates the loss-function information, and back-propagation is performed according to the data generated in the previous step; this operation is the process by which the system updates the weight of each convolution kernel of the network, iterating until the system's accuracy meets our requirements.
Sixth, the updated weights are saved to the designated location.
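Under the caffe framework named later in the embodiment, this six-step flow reduces to a few pycaffe calls. The sketch below uses placeholder file names (solver.prototxt, the PVANET ImageNet weights, the output path), which are assumptions rather than the patent's actual paths:

```python
import caffe

caffe.set_mode_gpu()
caffe.set_device(0)

solver = caffe.SGDSolver('solver.prototxt')         # steps 1/3/4: the net, Xavier init and
                                                    # data layers are declared in the prototxt
solver.net.copy_from('pvanet_imagenet.caffemodel')  # step 2: ImageNet-pretrained PVANET
for _ in range(1000):
    solver.step(1)                                  # step 5: forward, backprop, SGD update
solver.net.save('ss_fcn.caffemodel')                # step 6: save the updated weights
```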
The specific flow of the test module of the system is as follows: first, the test network and configuration are initialized; the test network employs the trained PVANET network and some test-parameter configuration files. The weights obtained from training are then loaded into the test network. The images must then be preprocessed, mainly format conversion, so that the test module recognizes them correctly. After the data is fed into the test network, a final predicted value is generated through a series of forward-propagation operations such as convolution and pooling, and the generated values are stored in a file. All test data is tested in turn until every picture has been tested.
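A matching pycaffe sketch of this test flow; the prototxt/weights paths and the 'data' input blob name are assumptions for illustration:

```python
import numpy as np
import caffe

caffe.set_mode_gpu()
net = caffe.Net('test.prototxt', 'ss_fcn.caffemodel', caffe.TEST)  # init net, load weights

img = np.zeros((3, 640, 640), dtype=np.float32)  # one preprocessed (format-converted) image
net.blobs['data'].reshape(1, *img.shape)
net.blobs['data'].data[0] = img
outputs = net.forward()                          # convolution/pooling forward passes
# 'outputs' maps output blob names to predicted values, which would then be saved to file
```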
The beneficial effects of the invention are as follows: the algorithmic basis of the invention is R-FCN, on which a small-target-sensitive fully convolutional neural network algorithm is designed and implemented. The innovation mainly comprises three parts: the small-target-sensitive pooling layer, the voting mechanism, and the streamlined detection network. Experiments show that detection accuracy improves greatly on small-size and occluded objects, confirming that the invention's improvements are well targeted. Comparison of the training trends of different algorithms shows that the small-target-sensitive pooling layer contributes most to the accuracy gain at a slight cost in speed, while the streamlined network structure compensates for that cost, so the invention achieves a large gain in accuracy without sacrificing speed. The direction of improvement of the invention is effective.
Drawings
Fig. 1 is a training flow diagram.
FIG. 2 is a flow chart of a test module.
Fig. 3 is a graph of performance comparisons.
Fig. 4 is an SS-FCN network architecture.
Fig. 5 is a SSpooling flowchart.
Fig. 6 is a diagram showing a comparison of network structures.
Detailed Description
The following describes the specific implementation of the present invention in detail.
In the method of the embodiment, the software environment is an Ubuntu system and the simulation environment is the caffe framework.
Step one: build the dependency environment.
Before caffe is built, the graphics-card driver is installed first; Ubuntu provides a very convenient installation environment. Find the update option in the built-in settings, switch to the driver tab, select the appropriate version, and start automatic installation; restart the system when it finishes. After the driver is set up, the CUDA parallel computing platform is built next. First select and download the appropriate version from the CUDA official website, install it according to the instructions, and restart the system if the installation completes without error; the CUDA installer provides a verification program to check whether CUDA installed successfully. Enter the directory of the test files, compile the test program there to generate an executable, and run it after compilation to test the CUDA platform. The next step after success is to configure cuDNN, CUDA's graphics-card acceleration library, which improves performance, increases usability and reduces memory overhead. It is essentially a set of library files; with them, the GPU can perform deep learning computation, so the graphics card computes in parallel faster than the CPU. Setting up cuDNN is simple: download its library files from the official website, decompress them to obtain several library files, and copy those files into the system library. Next configure the Python environment. caffe depends on Python; here Ubuntu's own Python is not used, and Anaconda is chosen instead of the system's native Python. Anaconda integrates a large number of computing packages on top of native Python, so the various dependency packages need not be installed one by one, which is quite convenient. After Anaconda is configured, some dependencies are set up, such as opencv and some other dependency packages; opencv 3.3 is selected here for compatibility with the CUDA 9.1 installed in the system, and is installed following the prompts after downloading from the official website. After installation and a restart, entering the command that checks the opencv version at the terminal displays the installed opencv version number, indicating a successful installation. Other dependency packages include libprotobuf, libleveldb, libsnappy, libopencv, and so on. Once the above dependency environment is installed, the next step is to configure caffe.
Step two: install caffe.
With the dependency environment installed, caffe itself can be installed. caffe is open source, written in C++ and CUDA, with a Python interface. First pull the caffe source files with git clone from the GitHub repository. Enter caffe's root directory and copy the example installation configuration file, removing the trailing "example" from its name so that the configuration file takes effect. The installation configuration file is modified for this system as follows: after opening it, remove the comment marker in front of the CUDNN option, meaning the system uses the cuDNN computing platform; then delete the comment on the opencv version-3 line to fix the opencv version, because the configuration file's default version is 2; add the configuration for using a Python layer, meaning the system uses Python layers, so the Python interface provided by caffe can be used; then change the default Python path to the Anaconda path; finally, change the default library-file path to the system's library-file path. With all configuration finished, compile caffe: enter the caffe directory and execute the compile command to start compilation, letting the link/installation step read the header and source files for compiling. Then enter the tools directory; seeing the various generated caffe executables shows that compilation succeeded.
Step three: the model is trained.
The training link is the basis of testing and detection, and the primary step after the data set is processed is training.
The training flow is shown in figure 1: the caffe framework is first used to prepare the training model, initializing the network training and configuration.
Second, before training, a pre-training model is loaded to initialize the base network part of the SS-FCN; the pre-training model is a classification model trained by the PVANET network on the ImageNet dataset.
Third, Xavier initialization is applied to the detection network part.
Fourth, the training data, comprising image data and annotation files, is loaded; the images and annotations are preprocessed, and the converted data is placed under the data files so that the caffe framework reads it automatically. The system's training parameters and network structure paths are set, and the system reads according to the configured parameters, including the training network structure, the optimizer, the image data, the labels, and so on. The configuration file is stored in the same directory as the network structure file. The configuration of the system is shown in Table 1 below:
table 1 system configuration parameters
(Table 1 appears only as an image in the original publication and is not reproduced here.)
Fifth, network training. In the trained network, a pre-training model is loaded to initialize the base network part of the SS-FCN; the pre-training model is the classification model trained by the PVANET network on the ImageNet dataset. In the small-target pooling layer, the first step sets the rectangular-box height and width thresholds, here set as (96, 160); the second step classifies the received rectangular boxes against the thresholds, classifying a candidate box whose height or width is smaller than the threshold as a small candidate box and one whose height and width are larger as a normal candidate box; the third step processes normal candidate boxes with position-sensitive region-of-interest pooling (Position-sensitive Region of Interest pooling, PSRoIPooling), while for small candidate boxes the corresponding regression and classification score feature maps are first enlarged two-fold by deconvolution and the new score maps are then pooled position-sensitively, with all outputs passed uniformly into the detection module, yielding a 1×2×5×5 classification score (since only vehicles are detected, there are only vehicle and background) and a 1×8×5×5 regression score;
The above process is expressed by the following formulas:

y(i,j) = F(z*)  (1)

z* = deconv(z, proposal)  (2)

where y(i,j) denotes the result of the region-of-interest pooling operation; z is the score map computed by the feature extraction network, i.e., the score map input to SSPooling, either the location score map or the classification score map; deconv(z, proposal) is the deconvolution operation applied to z; proposal is the candidate box generated by the RPN (region proposal network), whose size is the judging condition: the deconvolution is performed when an edge of the candidate box is smaller than the threshold set by the system, and otherwise z* = z directly; and F(z*) is the region-of-interest pooling operation.
The score calculation in the voting mechanism is set as follows:

S = α × S_ave + (1 - α) × S_max  (3)

where S is the scoring result, S_ave is the average-pooled result, S_max is the max-pooled result, and α is the weight, taken here as 0.65.
The anchor-box labeling criteria for the loss function are as follows:
1) An anchor box is considered a positive sample if its intersection-over-union (IoU) with the box of some labeled target is the largest among all anchor boxes, or if its IoU with the box of any labeled target exceeds 0.7;
2) if the IoU is smaller than 0.3, it is judged a negative sample;
3) anchor boxes with IoU between 0.3 and 0.7 do not participate in training.
The loss function takes the following form:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{4}$$

where $N_{cls}$ and $N_{reg}$ are normalization terms; $i$ is the index of the anchor box; $p_i$ is the probability that the $i$-th anchor box is foreground; $p_i^*$ is the probability (1 or 0) that the real box is foreground; $t_i$ is the position representation of the $i$-th anchor box (a 4-dimensional array); and $\lambda$ is the weight between the classification loss and the regression loss. The classification loss function $L_{cls}$ is a binary loss function of the form:

$$L_{cls}(p_i, p_i^*) = -\log\bigl[p_i^* p_i + (1 - p_i^*)(1 - p_i)\bigr] \tag{5}$$

where the term $p_i^* p_i + (1 - p_i^*)(1 - p_i)$ expresses whether the predicted class matches the real class $p^*$; formula (5) is the negative logarithm of this value, which tends to 1 when the prediction matches and to 0 otherwise.

$L_{reg}$, the regression loss function, is expressed as follows:

$$L_{reg}(t_i, t_i^*) = \sum_{j \in \{x,y,w,h\}} \mathrm{smooth}_{L1}\bigl(t_i^j - t_i^{*j}\bigr) \tag{6}$$

where $\mathrm{smooth}_{L1}$ is defined as follows:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{7}$$

In formula (6), $t_i$ is the position representation of the $i$-th anchor box (a 4-dimensional array) and $t_i^*$ is the position representation of the real box (a 4-dimensional array); the 4-dimensional data contains the following 4 items:

t_x = (x - x_a)/w_a
t_y = (y - y_a)/h_a
t_w = log(w/w_a)
t_h = log(h/h_a)  (8)

where t_x, t_y, t_w and t_h are the offsets of the predicted candidate box relative to the anchor box in the horizontal direction, the vertical direction, the width and the height; x and y are the center coordinates of the predicted box and w and h are its width and height; and w_a and h_a are the width and height of the anchor box, with x_a, y_a its center coordinates.
The network training process is first a forward propagation process, which includes convolution layers and pooling layers. The input layer of the system is realized by convolution, so the vehicle picture can directly undergo forward-propagation convolution; the pooling layer mainly adjusts the output scale of the previous layer. During training, the deep learning framework generates the loss-function information, and back-propagation is performed according to the data generated in the previous step; this is the process by which the system updates the weight of each convolution kernel, iterating until the system's accuracy meets our requirements.
Sixth, the updated weights are saved to the designated location.
For convenience, the training process is packaged into a script; starting training first calls the packaged training script, in which the command-line parameters are set, including the dataset to use, the paths of the supporting files and tool classes required for training, the pre-training weights, and so on. The save path of the log file is then set in the script. Once packaged this way, the user can call the script directly without knowing the underlying details. The system completes the reading of the configuration file and the command-line parameters in the Python file, then completes the core training process by calling the training function. The parameters the training function takes and their meanings are shown in Table 2 below; the results are finally saved under the designated directory.
Table 2 training parameters
(Table 2 appears only as an image in the original publication and is not reproduced here.)
Step four: the test module.
The purpose of the test link is to obtain predicted values on the test images, which is the basis of effect analysis. The flow of the test module is shown in fig. 2: the caffe framework prepares the test model, initializing the network and configuration; the trained weights are loaded into the network; and the obtained test images are then preprocessed, mainly by format conversion, so that the test module recognizes them correctly. After the test data is fed into the test network, final predicted values are generated through a series of forward-propagation operations such as convolution and pooling, and the generated values are stored in a file. All test data is tested in turn until every picture has been tested.
For convenience, the invention packages the test steps as a script. Testing starts by calling the test script under the root directory, in which the command-line parameters are set, including the dataset to use, the path of the test network-structure file, the path of the trained weights, other test-parameter configuration files, and so on. The save path of the log file is then set in the script, and the test script is invoked to start testing. The parameters configured in the parameter configuration file cover two main aspects: first, whether the region proposal network is included is set; then whether only foreground/background prediction is performed is declared. Both field values are set to true in this system.
The system completes the reading of the configuration file and the command-line parameters for the test process in the Python file, then completes the core test process by calling the test function. The parameters the test function takes and their meanings are shown in Table 3 below. Prediction then begins, and the predicted results are finally stored in a pkl file under the designated directory, which the subsequent evaluation module can read directly.
Table 3 test parameters
(Table 3 appears only as an image in the original publication and is not reproduced here.)
Step five: performance analysis of the results
The main function of this module is to analyze system performance from the test results. The design and implementation of the module are described below:
(1) The input is the test result, i.e., the test-result file generated by the test module under the designated directory. The file stores a 2 × 3769 × (5 × k) structure. The first coordinate is the category; its size is 2 because the system only detects vehicles, so there are only the two categories vehicle and background. The second coordinate is the picture number, identifying the picture; a total of 3769 test pictures are used in this system, so its size is 3769. The third is a variable-length array of 5 × k values, where k is the number of targets of the current category in the current picture. For each target, 5 values are predicted in all: the first 4 are the predicted box values indicating the target's position in the picture, and the last is the confidence, representing the probability of being a target. Knowing the meaning of the test-result file, the required information is extracted from it and stored as text for further evaluation.
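A sketch of reading such a result file; the pkl path and the class/image nesting order are assumptions based on the description above:

```python
import pickle
import numpy as np

with open('detections.pkl', 'rb') as f:
    all_boxes = pickle.load(f)            # assumed layout: all_boxes[class][image] -> (k, 5)

dets = np.asarray(all_boxes[1][0])        # class 1 = vehicle, first of the 3769 test images
for x1, y1, x2, y2, conf in dets:         # 4 box values + 1 confidence per target
    if conf > 0.5:
        print(f"vehicle at ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}), confidence {conf:.2f}")
```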
After the txt prediction files of all pictures are obtained, they can be compared with the annotation files to obtain the system's accuracy. The KITTI official website divides the evaluation standard into three levels by detection difficulty: easy, moderate and hard. The grading criteria use three indicators: the target vehicle's minimum height, maximum occlusion level, and maximum truncation ratio; targets that do not meet these criteria are not involved in the evaluation. The specific division is shown in Table 4 below:
Table 4 rating scale
(Table 4 appears only as an image in the original publication and is not reproduced here.)
KITTI officially provides the evaluation source code, which this system modifies: for example, the target class is changed to vehicle only, and the storage paths of the ground-truth labels and prediction files are modified. Compilation generates an executable file that directly evaluates the prediction files generated in the previous step, finally producing the evaluation result. In this system, the recall range from 0 to 1 is divided into 40 equal points, and for each recall the precision under the three standards (easy, moderate and hard) is generated. The average precision (mean average precision, mAP) is the mean of the precision over all recall points. A picture is also generated as a graphical display of the result.
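A sketch of the 40-point interpolated average precision described above; the sampling grid of 40 evenly spaced recall thresholds follows KITTI's convention, and the inputs are assumed to be the precision/recall pairs of one difficulty level:

```python
import numpy as np

def average_precision_40(recalls, precisions):
    # sample interpolated precision at 40 evenly spaced recall points and average
    ap = 0.0
    for t in np.linspace(1.0 / 40, 1.0, 40):
        above = precisions[recalls >= t]
        ap += above.max() if above.size else 0.0
    return ap / 40

r = np.array([0.1, 0.4, 0.7, 0.9])
p = np.array([1.0, 0.9, 0.8, 0.6])
print(average_precision_40(r, p))
```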
Fig. 3 displays the final result. Comparing the trends of different algorithms during training shows that the system's SS-FCN algorithm trains more stably, which indirectly indicates that the features processed by the small-target pooling layer are easier to learn and converge faster, with higher final accuracy. After stabilizing, SS-FCN is 3 to 4 percentage points higher than the baseline algorithm, and considering this embodiment's huge dataset, with more than 20,000 vehicle targets in the pictures, this improvement is significant.

Claims (7)

1. A small-target-sensitive vehicle detection system, characterized by comprising a data module, a network structure module, a system configuration module, a training module, a testing module, a function support module, a log acquisition module, an effect analysis module, a detection module and an interaction module;
the data module: used for storing pictures and annotation data, preprocessing the images and annotation information, and passing the processed data to the training module and testing module;
the network structure module: mainly comprises a feature extraction network and a detection network; the feature extraction network adopts PVANET; the detection network is based on R-FCN, with the fully connected layer removed, so a fully convolutional network predicts the vehicles; the basic flow of the network structure module is as follows: first, a complete picture is input; second, features are extracted from the picture by five convolution groups, the feature extraction network adopting PVANET and mainly comprising C.ReLU, Inception, residual structures and parallel structures; third, the output of the fourth convolution group is taken as the input of the RPN, whose output is a set of candidate boxes; fourth, two branch channels are led out of the fifth convolution group and convolved separately to obtain a regression score feature map and a classification score feature map, which are input, together with the candidate boxes obtained in the third step, into the small-target pooling layer; fifth, prediction through the voting mechanism yields the final classification and regression results;
the system configuration module: used for defining the parameters for system training or testing;
the training module: used for completing network training; a predicted value is computed through the network and compared with the true value; a loss value is computed through the loss function and back-propagated to update the network, continuously updating the convolution kernel weights to minimize the loss function; the system fully encapsulates the training configuration and the complex flow, and a user can start training by entering a single command at the terminal;
the testing module: used for completing the effect-testing function, including speed and accuracy tests; all details and configuration of the test process are encapsulated, and a user can start a test by entering one line of command at the terminal;
the function support module: used for providing functional support to the logic part and the network structure part in the form of encapsulated layer files or packages;
the log acquisition module: used for collecting and recording important information during training and testing; the user need not invoke this module explicitly, as the system executes it automatically during training or testing;
the effect analysis module: used for analyzing system performance from the test results and the data collected in the logs;
the detection module: completes detection on the input pictures;
the interaction module: used for providing a graphical interface and command-line tools, completing the encapsulation of the logic part, increasing the usability of the system, and verifying user permissions, including registration and login functions.
2. The small-target-sensitive vehicle detection system of claim 1, wherein, in the small target pooling layer:
first step: thresholds for the height and the width of the candidate boxes are set as the classification standard;
second step: the received candidate boxes are classified against the thresholds; a candidate box whose height or width is smaller than the threshold is classified as a small candidate box, and a candidate box whose height and width both exceed the thresholds is classified as a normal candidate box;
third step: normal candidate boxes are processed by the region-of-interest pooling operation; for small candidate boxes, the corresponding regression and classification score feature maps are first expanded twofold by deconvolution, the position-sensitive region-of-interest pooling operation is then performed on the new maps, and all outputs are passed uniformly into the detection module, yielding a classification score and a regression score;
the above process is expressed by the following formulas:
y(i, j) = F(z*)   (1)
z* = deconv(z, proposal)   (2)
wherein y(i, j) represents the result obtained after the region-of-interest pooling operation; z is the score map calculated by the feature extraction network; deconv(z, proposal) performs the deconvolution operation on z; proposal is the candidate box generated by the RPN, and the size of the candidate box is the judging condition: the deconvolution operation is performed when a side of the candidate box is smaller than the threshold set by the system, otherwise z* = z directly; z is the input score map of the small target pooling layer (SS pooling), comprising the position score map or the classification score map, and F(·) is the region-of-interest pooling operation.
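A hedged sketch of this size-sensitive branching, assuming a PyTorch backend; the threshold value, the 2× `ConvTranspose2d` deconvolution and the use of `torchvision.ops.ps_roi_pool` for the position-sensitive pooling F are illustrative assumptions rather than the patented implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import ps_roi_pool

class SmallTargetPooling(nn.Module):
    """Formulas (1)-(2): y(i, j) = F(z*), where z* = deconv(z, proposal)
    when a side of the candidate box is below the threshold, else z* = z.
    channels must be divisible by out_size**2 for position-sensitive pooling."""
    def __init__(self, channels, out_size=7, threshold=32.0, scale=1 / 16):
        super().__init__()
        self.threshold = threshold   # side-length threshold for "small" boxes
        self.scale = scale           # feature-map stride relative to the image
        self.out_size = out_size
        # twofold deconvolution expansion of the score map for small boxes
        self.deconv = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2)

    def forward(self, score_map, proposals):
        # proposals: (N, 5) rows of (batch_index, x1, y1, x2, y2), image coords
        w = proposals[:, 3] - proposals[:, 1]
        h = proposals[:, 4] - proposals[:, 2]
        small = (w < self.threshold) | (h < self.threshold)
        expanded = self.deconv(score_map)      # z* for the small-box branch
        outputs = []
        for i in range(proposals.size(0)):
            roi = proposals[i:i + 1]
            if small[i]:                       # small candidate box
                out = ps_roi_pool(expanded, roi, self.out_size, self.scale * 2)
            else:                              # normal candidate box
                out = ps_roi_pool(score_map, roi, self.out_size, self.scale)
            outputs.append(out)
        return torch.cat(outputs, dim=0)       # pooled classification/regression scores
```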
3. The small-target-sensitive vehicle detection system according to claim 1, wherein the voting mechanism of the network structure module is as follows:
average pooling takes the mean of the weights within the region, while maximum pooling takes the maximum within the region; the voting results of average pooling and maximum pooling are fused, adopting a new score calculation:
S = α × S_ave + (1 − α) × S_max   (3)
wherein S is the scoring result, S_ave is the result of average pooling, S_max is the result of maximum pooling, and α is the weight.
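A minimal sketch of formula (3), assuming the pooled k×k score bins arrive as the trailing two tensor dimensions; α is the tunable weight:

```python
import torch

def vote(score_bins: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Fuse average and max pooling over the k x k bins, per formula (3):
    S = alpha * S_ave + (1 - alpha) * S_max."""
    s_ave = score_bins.mean(dim=(-2, -1))  # average pooling: mean within the region
    s_max = score_bins.amax(dim=(-2, -1))  # maximum pooling: max within the region
    return alpha * s_ave + (1 - alpha) * s_max
```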
4. The small-target-sensitive vehicle detection system of claim 1, wherein in the system training module a label is set for each anchor box before the loss function is calculated; the setting criteria are as follows:
1) if an anchor box has the largest intersection-over-union with the ground-truth box of some label, or its intersection-over-union with the ground-truth box of any label is greater than 0.7, the anchor box is considered a positive sample;
2) if the intersection-over-union is smaller than 0.3, it is judged a negative sample;
3) anchor boxes with an intersection-over-union between 0.3 and 0.7 do not participate in training;
the form of the loss function is as follows:
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (4)
wherein N_cls and N_reg are normalization terms; i is the index of the anchor box; p_i is the probability that the i-th anchor box is foreground, and p_i* is the probability that the ground-truth box is foreground; the classification loss function L_cls is a binary loss function; t_i is the 4-dimensional array position representation of the i-th anchor box; λ represents the weight between the classification loss and the regression loss; L_cls takes the following form:
L_cls(p_i, p_i*) = −log[p_i p_i* + (1 − p_i)(1 − p_i*)]   (5)
wherein p_i* indicates whether the predicted class is the true class, taking the value 1 if it is and 0 otherwise; formula (5) is the result obtained by taking the negative logarithm of the predicted value;
L_reg is the regression loss function, expressed as follows:
L_reg(t_i, t_i*) = R(t_i − t_i*)   (6)
wherein R is the smooth L1 function, defined as follows:
smoothL1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise   (7)
in formula (6), t_i is the 4-dimensional array position representation of the i-th anchor box, and t_i* is that of the ground-truth box; the 4-dimensional array position representation comprises the following 4 items:
t_x = (x − x_a)/w_a
t_y = (y − y_a)/h_a
t_w = log(w/w_a)
t_h = log(h/h_a)   (8)
wherein t_x, t_y, t_w and t_h are the offsets of the predicted candidate box relative to the anchor box in the horizontal direction, the vertical direction, the width and the height respectively; x and y are the center coordinates of the predicted box, and w and h are its width and height; x_a and y_a are the center coordinates of the anchor box, and w_a and h_a are its width and height.
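A numpy sketch of labeling rules 1)-3), the offset encoding of formula (8) and the smooth L1 function of formula (7); the IoU matrix layout and box format are assumptions for illustration:

```python
import numpy as np

def label_anchors(ious: np.ndarray) -> np.ndarray:
    """ious: (num_anchors, num_gt) intersection-over-union matrix.
    Returns 1 (positive), 0 (negative) or -1 (ignored) per anchor."""
    labels = np.full(ious.shape[0], -1)   # rule 3): IoU in 0.3..0.7 is ignored
    best = ious.max(axis=1)
    labels[best < 0.3] = 0                # rule 2): negative sample
    labels[best > 0.7] = 1                # rule 1): IoU above 0.7
    labels[ious.argmax(axis=0)] = 1       # rule 1): best anchor per ground-truth box
    return labels

def encode_offsets(pred_box, anchor_box):
    """Formula (8): offsets of the predicted box relative to its anchor.
    Boxes are given as (x_center, y_center, width, height)."""
    x, y, w, h = pred_box
    xa, ya, wa, ha = anchor_box
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

def smooth_l1(x: np.ndarray) -> np.ndarray:
    """R in formula (6), defined piecewise by formula (7)."""
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)
```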
5. The small-target-sensitive vehicle detection system of claim 1, wherein in the network structure module the regression score map and the classification score map are generated by two separate channels: the regression score map is generated from the output of the slightly shallower fourth convolution group in the feature extraction network, and the classification score map is generated from the output of the deeper fifth convolution group.
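A short sketch of this depth split, assuming 1×1 convolutions and illustrative channel counts; per the claim, the regression map taps the shallower fourth convolution group while the classification map taps the deeper fifth:

```python
import torch.nn as nn

class TwoDepthHeads(nn.Module):
    """Regression score map from the shallower conv4 features and the
    classification score map from the deeper conv5 features."""
    def __init__(self, c4_channels=256, c5_channels=512, num_classes=2, k=7):
        super().__init__()
        self.reg_head = nn.Conv2d(c4_channels, 4 * k * k, kernel_size=1)
        self.cls_head = nn.Conv2d(c5_channels, num_classes * k * k, kernel_size=1)

    def forward(self, conv4_out, conv5_out):
        return self.reg_head(conv4_out), self.cls_head(conv5_out)
```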
6. The small-target-sensitive vehicle detection system of claim 1, wherein the specific flow of the training module is as follows:
first, the deep learning framework performs network initialization and configuration preparation for the training model;
second, before training, a pre-trained model is loaded into the base network part of the SS-FCN; the pre-trained model is a classification model trained by the PVANET network on the ImageNet dataset;
third, Xavier initialization is applied to the detection network part;
fourth, the image data and annotation files are loaded, the images and annotations are preprocessed, and the deep learning framework automatically reads the converted data from the data files; the image data are then sent into the network for training;
fifth, network training: the training process is a forward propagation process through convolution layers and pooling layers; the input layer of the system is realized by convolution, so a vehicle picture can directly undergo the forward convolution operation, while the pooling layers adjust the output scale of the previous layer; the deep learning framework generates the loss-function information during training, and back propagation is performed according to the data generated in the previous step; this operation updates the weight of each convolution kernel of the network, iterating until the accuracy of the system meets the requirement;
sixth, the updated weights are saved to the designated location.
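A condensed sketch of this six-step training flow, assuming PyTorch; the `detection_head` attribute, optimizer settings and checkpoint path are placeholders, not the patented configuration:

```python
import torch
import torch.nn as nn

def train(model, loader, loss_fn, epochs=10, ckpt="weights.pth"):
    # step 2 (assumed already done): the base network holds the
    # ImageNet-pretrained PVANET classification weights
    # step 3: Xavier initialization of the detection network part
    for m in model.detection_head.modules():   # 'detection_head' is an assumed name
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    for _ in range(epochs):
        for images, targets in loader:         # step 4: preprocessed images/labels
            predictions = model(images)        # step 5: forward propagation
            loss = loss_fn(predictions, targets)
            optimizer.zero_grad()
            loss.backward()                    # back propagation updates the kernels
            optimizer.step()
    torch.save(model.state_dict(), ckpt)       # step 6: save the updated weights
```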
7. The small-target-sensitive vehicle detection system of claim 1, wherein the specific flow of the test module is as follows: first, the test network and configuration are initialized; the test network adopts the trained PVANET network and a test-parameter configuration file; then the weights obtained from training are loaded into the test network; the test images are preprocessed with format conversion so that the test module can identify them correctly; after being sent into the test network, the final predicted values are generated through the forward propagation of convolution and pooling, and the generated values are saved to a file; all test data are tested in sequence until every picture has been tested.
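A matching inference sketch under the same assumptions; the preprocessing, weight path and output file format are placeholders:

```python
import torch

@torch.no_grad()
def run_tests(model, loader, weights="weights.pth", out_file="results.pt"):
    model.load_state_dict(torch.load(weights))  # load the trained weights
    model.eval()
    results = []
    for images, image_ids in loader:            # format-converted test images
        predictions = model(images)             # forward propagation only
        results.append((image_ids, predictions))
    torch.save(results, out_file)               # persist predicted values to a file
    return results
```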