CN114067564B - Traffic condition comprehensive monitoring method based on YOLO - Google Patents

Traffic condition comprehensive monitoring method based on YOLO

Info

Publication number
CN114067564B
CN114067564B CN202111347583.1A CN202111347583A
Authority
CN
China
Prior art keywords
target
frame
camera
detection
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111347583.1A
Other languages
Chinese (zh)
Other versions
CN114067564A (en)
Inventor
吴婧
王旭冬
薛庆丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202111347583.1A priority Critical patent/CN114067564B/en
Publication of CN114067564A publication Critical patent/CN114067564A/en
Application granted granted Critical
Publication of CN114067564B publication Critical patent/CN114067564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G08 SIGNALLING
    • G08G TRAFFIC CONTROL SYSTEMS
    • G08G1/00 Traffic control systems for road vehicles
    • G08G1/01 Detecting movement of traffic to be counted or controlled
    • G08G1/04 Detecting movement of traffic to be counted or controlled using optical or ultrasonic detectors
    • G08G1/0104 Measuring and analyzing of parameters relative to traffic conditions
    • G08G1/0125 Traffic data processing
    • G08G1/0129 Traffic data processing for creating historical data or processing based on historical data
    • G08G1/052 Detecting movement of traffic to be counted or controlled with provision for determining speed or overspeed
    • G08G1/054 Detecting movement of traffic to be counted or controlled with provision for determining speed or overspeed photographing overspeeding vehicles
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Closed-Circuit Television Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a traffic condition comprehensive monitoring method based on the YOLO algorithm, which comprises: calibrating a camera to obtain its internal and external parameters and correcting the images; pre-labeling the acquired data with a detection algorithm, then manually re-labeling and cleaning the data; generating anchor boxes and writing them into a YOLOv4 configuration file for training. The target detection part calls the YOLOv4 model directly through the OpenCV dnn function interface to improve inference speed; a non-maximum suppression algorithm screens the target frames and low-confidence targets are filtered out. PnP monocular ranging acquires the target's image coordinates and, combined with its world coordinates, calculates the distance between the target and the host vehicle. Sort target tracking estimates motion information through Kalman filtering and then uses the Hungarian algorithm for data association, after which the tracker is encapsulated. UFLD lane detection obtains the lane lines and is likewise encapsulated. The user interface combines Qt with Visual Studio, so that a user can conveniently use these functions through a simple, easily understood operation interface.

Description

Traffic condition comprehensive monitoring method based on YOLO
Technical Field
The invention belongs to the technical field of traffic condition monitoring and intelligent driver assistance, and particularly relates to a traffic condition comprehensive monitoring method based on YOLO that covers target detection, tracking, ranging and lane line detection.
Background
With the continuing growth in the number of vehicles in China and the continuous improvement of infrastructure, road congestion is becoming increasingly serious, and the amount of information a driver needs to attend to keeps increasing. What a driver must watch includes, but is not limited to, maintaining a safe inter-vehicle distance, avoiding pedestrians, heeding road signs and traffic signals, and not changing lanes at will. During actual driving it is difficult for the driver to keep all of this information in view at all times; an elevated-road exit is often missed because a sign went unnoticed, increasing fuel consumption and wasting time, or traffic rules are violated and even traffic accidents caused because traffic signs were overlooked, lanes were changed illegally, or the inter-vehicle distance was not kept.
Current solutions to these problems are lacking. LED display screens are used on some roads to replace the text portion of traditional road signs, but they are costly and adopted only in a few cities or key road sections. Keeping a safe inter-vehicle distance still depends on the driver's experience and essentially has no dedicated solution: laser/millimeter-wave radar detection and binocular ranging are expensive on the one hand and require complex data processing on the other, and are still at the experimental stage. There are many lane line detection methods, but traditional lane line detection has poor robustness, while detection methods based on semantic segmentation run slowly because of their high computational complexity and have a small local receptive field because of the limits of the convolution kernel size. The comprehensive traffic monitoring methods under research also basically adopt radar-vision fusion, which is too costly, so they are used only in autonomous-driving research and development and are difficult to bring to market.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the invention provides a traffic condition comprehensive monitoring method based on YOLO. It uses the target detection model YOLOv4 to detect the positions of traffic targets in a camera stream or video, with the advantages of high detection accuracy and high detection speed; on the basis of the target positions obtained by YOLOv4 inference on the images, the PnP algorithm calculates the distance between each vehicle target and the host vehicle, the Sort target tracking algorithm performs cross-frame target tracking, and the problem of discontinuous tracking IDs is optimized at the same time. Lane line detection is performed with the UFLD algorithm, which avoids the small-local-receptive-field problem and has the advantage of high detection speed. Finally, the ranging, tracking and lane line detection methods are packaged into function interfaces and a software interface is written based on the Qt & VS framework. The whole monitoring process requires only a monocular camera, the cost is extremely low, the monitoring content is comprehensive, and assistance can be provided to drivers or unmanned driving systems.
In order to achieve the above object, the present invention provides a YOLO-based traffic condition comprehensive monitoring method, comprising:
S1: calibrating the camera to obtain the internal and external parameters of the monocular camera, correcting the lens distortion of the monocular camera according to these internal and external parameters, recording and storing video, splitting the video frame by frame into an image data set, detecting the image data set with a pre-trained YOLOv4 weight model as pre-labeling, then manually re-labeling and cleaning the data to obtain a labeled sample data set;
S2: processing the labeled sample data set with the kmeans algorithm to generate anchor frames that fit the characteristics of the data set, writing the anchor frame parameters into the configuration file, training under the darknet framework, generating and saving the YOLOv4 network weight file, reading a video stream or camera with the trained YOLOv4 model, performing target detection frame by frame, screening out overlapping target frames with a non-maximum suppression algorithm and retaining the optimal one, setting a confidence threshold to filter out low-credibility targets, and feeding the target frames of the final detection result back into each frame of image;
S3: based on the target detection result in each frame of image, obtaining the position of a ranging target in the current image as its image coordinates, setting world coordinates according to the real size of the ranging target, combining these with the camera internal and external parameters obtained by camera calibration, obtaining the three-dimensional coordinates of the ranging target through the PnP monocular ranging algorithm, and calculating the distance between each ranging target and the host vehicle;
S4: based on the target detection result in each frame of image, obtaining the coordinate position of the target to be tracked in the image and its time point, predicting the next-frame position of the target to be tracked with a Kalman filter, updating the target information in the next frame, associating the information in the current image with the prediction from the previous frame to realize multi-target tracking, and encapsulating the algorithm;
S5: performing lane line detection with the UFLD lane line detection algorithm, and encapsulating the UFLD lane line detection algorithm.
In some alternative embodiments, step S1 comprises:
s11: calibrating camera parameters by adopting a matlab tool box camera calibrator, shooting a checkerboard through cameras to be calibrated at multiple angles, importing a checkerboard image in camera calibrator, inputting the size of the checkerboard, and generating camera parameters after selecting the camera type, wherein the generated camera parameters comprise a camera internal reference matrix, camera radial parameters and camera tangential parameters;
s12: correcting the camera by adopting a correction conversion mapping function of opencv, creating a distortion parameter matrix and a camera internal parameter matrix according to camera parameters, generating a correction matrix, and generating a corrected image through a remap function;
s13: recording and storing videos by adopting a corrected camera, splitting the videos into image data sets frame by frame, detecting the image data sets by utilizing a pre-trained YOLOv4 weight model to serve as pre-labeling, manually labeling by using labelimg labeling software, and cleaning data by using yolo_mark software to obtain a labeled sample data set.
In some alternative embodiments, step S2 comprises:
S21: calculating anchor frame clusters of the marked sample data set by using a kmeans algorithm in the dark net, comprehensively considering the number of anchor frames and the average IOU, and selecting the optimal number of anchor frames and the corresponding size to adapt to the marked sample data set;
s22: loading a network structure and network parameters by using a dnn interface of opencv, deploying the loaded network structure on a GPU (graphics processing unit) to correlate the operation of a GPU acceleration program, continuously receiving images acquired by a camera, transmitting the received images into the network structure, performing forward reasoning, and obtaining all detection target frames in the images, and category information and confidence information corresponding to the target frames;
s23: screening out approximate target frames by using an Nms non-maximum inhibition method, reserving an optimal target frame, correcting output information subjected to forward reasoning, screening detection frames of each type of targets according to corrected type information and confidence information, sequentially judging downwards from the beginning with the maximum confidence, and if the IOU of the current low-confidence detection frame and the IOU of the high-confidence detection frame exceed a threshold value, erasing the low-confidence detection frame, thereby acquiring the classification and the position with the highest confidence of each possible target, drawing the target frame and writing the position information, and finishing target detection, wherein the IOU is the ratio of the intersection area and the union area of the two detection frames.
In some alternative embodiments, left = d0 × W, top = d1 × H, width = d2 × W and height = d3 × H are used to correct the output information from forward inference, where d0, d1, d2 and d3 are respectively the abscissa of the upper-left corner point of the target frame, the ordinate of the upper-left corner point, the width of the target frame and the height of the target frame as output by the network, W and H are the width and height of the image acquired by the camera, and left, top, width and height are respectively the corrected abscissa of the upper-left corner point of the target frame, the ordinate of the upper-left corner point, the width of the target frame and the height of the target frame.
In some alternative embodiments, step S3 comprises:
According to the real coordinate points and image coordinate points of the object to be measured, the camera parameter matrix and the distortion parameter matrix, the opencv solvePnP function is used to obtain the three-dimensional estimated coordinates x, y and z; the distance between the object to be measured and the host vehicle is then d = √(x² + y² + z²), where x is the relative distance between the object to be measured and the host vehicle in the x direction, y the relative distance in the y direction, and z the relative distance in the z direction.
In some alternative embodiments, step S4 comprises:
S41: establishing a target position model on the time axis when performing cross-frame association, where the modeling state of each target is T = [box, id_m, t_m, h_m, a_m], in which box is the target frame, comprising the abscissa and ordinate of its upper-left corner point and its width and height parameters; id_m is the order in which each tracked object appears on the time axis; a_m is the number of frames in which the tracked object has appeared, counted from its first appearance; h_m is the number of frames for which the target has been continuously tracked; and t_m is the number of frames for which the previously tracked target has continuously disappeared;
S42: judging whether the current state is the initial state, i.e. the number of tracked targets is 0 and the frame number is smaller than a preset frame number threshold; if so, initializing trks, the vector of tracked objects T, with a Kalman filter and waiting for the next-frame detection result;
S43: if not in the initial state, predicting the trks of the previous frame, computing the IOU with the current-frame detection result vector dets, and performing Hungarian maximum matching on the IOU result, where the matching result for each element of trks and dets has three possibilities: if the corresponding entry in the IOU matrix is greater than the preset IOU threshold, the match succeeds, the previous frame and the current frame are associated, and the state T of the target is updated, with box corrected to the current-frame position, id_m unchanged, t_m = 0, h_m + 1, a_m + 1; if a target in dets cannot be matched with trks, it is a new object, which is initialized with a Kalman filter to obtain its state T, and if subsequent frames continue to match it, the target is visualized once h_m exceeds the preset frame number threshold for continuous tracking; if a target in trks cannot be matched with dets, the target was not detected in the current frame, and if t_m exceeds the preset threshold of frames for which a previously tracked target may continuously disappear the target is removed from the tracking list, otherwise its state T is corrected with a_m + 1, t_m + 1;
S44: saving the corrected states of all tracked targets into the tracking list, waiting for the next-frame detection result, and repeating the processes of steps S42 to S43.
In some alternative embodiments, step S5 comprises:
S51: passing the trained pth weight file into a py script, setting the number of detection points per lane, the number of grid cells and the number of network architecture layers according to the network structure, creating a blank adaptive network structure, loading the pth weight file parameters into the adaptive network structure, then creating all-zero weights of the given form, writing the adaptive network structure onto the all-zero weights, and saving the converted weights as a weight file with the suffix .pt;
S52: encapsulating the UFLD detection algorithm as I0 = RunlaneDetection{I1, N}, where I1 is the input image, N is the input network model loaded from the .pt weight file, I0 is the output image, and RunlaneDetection denotes the UFLD detection algorithm.
In some alternative embodiments, the method further comprises:
s6: and designing a user operation interface by adopting a Qt framework, adding a Qt vs tool extension item in the visual studio to realize joint programming, and designing the user interface by adopting a Qt designer, wherein the user interface comprises user login, custom configuration, camera setting and comprehensive detection functions.
In some alternative embodiments, step S6 includes:
S61: creating a stackWidget in the main window of the UI as the window linking the functions, creating a QTYOLO class that publicly inherits from QMainWindow, changing the index of the stackWidget through signals and slot functions, and linking pushButtons to realize jumps between function pages;
S62: the first page of the stackWidget serves as the login interface, with a gif background carried by a label control; the login interface contains several pushButton controls, label controls and lineEdit controls used as the register and login buttons, the account and password text prompts, and the account and password text input boxes; Qt automatically creates member objects in the UI class, each control is declared in Qt style under private slots in the corresponding header file as a member function of the interface, the line-edit display type is changed to Password through the lineEdit member function setEchoMode so that the password is hidden while the user types, and when the user clicks login a prompt dialog box pops up if the input is wrong;
S63: the second page of the stackWidget is the loading page, realizing the user-defined configuration function; several labels are added as text prompts, together with buttons for loading files and buttons for selecting detection targets, and a buttonBox is added below the page to offer the user a choice; if the user selects the camera, the video acquired by the camera is used as input and the camera parameters are set, and if the user does not select it, or chooses not to enable the camera, the video loaded by the user is used as input;
S64: the third page of the stackWidget is the camera setting interface, so that the algorithm can adapt when a user enables a camera; this page is reached if and only if the user chose to enable the camera on the second page, all parameters on the third page are entered manually by the user, obtained with the lineEdit member function text() and assigned to member variables of QTYOLO;
S65: the fourth page of the stackWidget is the comprehensive monitoring page; it contains a tabWidget used to switch between the original picture and the detected effect picture, whose two widgets respectively contain the label-type objects testWindow1 and testWindow2 for refreshing each frame of image; 4 pushButtons are used to start, pause, end and return to the previous step; 4 checkBoxes are used to display the target confidence, target ranging, target tracking and lane lines respectively, and when several checkBoxes are ticked at the same time the detection effects are presented superimposed.
In general, compared with the prior art, the above technical solutions conceived by the present invention provide the following beneficial effects:
(1) The traffic condition comprehensive monitoring method fuses YOLOv4 with the Sort, PnP and UFLD algorithms; Sort and PnP take very little time, and YOLOv4 and UFLD achieve superior detection speed while maintaining high detection accuracy, so the method as a whole has the advantages of high accuracy and high detection speed.
(2) The image and video data are collected during actual driving, and the anchor boxes are clustered with the kmeans algorithm, so the trained model fits the data set better; that is, the algorithm model of the invention is better suited to domestic road conditions.
(3) The method alleviates the discontinuous tracking ID problem of the Sort algorithm and improves the user experience.
(4) The invention designs and develops a user interface that is convenient to use and provides login, custom configuration, camera setting and user-selectable comprehensive detection functions.
Drawings
FIG. 1 is a flow chart of a method provided by an embodiment of the present invention;
FIG. 2 is a camera calibration diagram provided by an embodiment of the present invention;
FIG. 3 is a presentation diagram of part of the traffic data set provided by an embodiment of the present invention;
FIG. 4 is a trend chart of the cluster accuracy of an anchor box provided by the embodiment of the invention;
FIG. 5 is a diagram of a detection effect of a YOLOv4 traffic target provided by an embodiment of the present invention;
FIG. 6 is a graph of a target ranging effect of a YOLOv4+PnP algorithm according to an embodiment of the present invention;
FIG. 7 is a target tracking effect diagram of the YOLOv4+Sort algorithm according to an embodiment of the present invention;
FIG. 8 is a UFLD lane line detection effect diagram provided by an embodiment of the present invention;
FIG. 9 is a diagram showing the comprehensive monitoring effect of the YOLOv4+PnP+Sort+UFLD algorithm according to the embodiment of the present invention;
FIG. 10 is a flowchart of an interface development provided by an embodiment of the present invention;
FIG. 11 is an effect diagram of a login interface according to an embodiment of the present invention;
FIG. 12 is an effect diagram of a loading interface provided by an embodiment of the present invention;
FIG. 13 is an effect diagram of a camera setup interface provided by an embodiment of the present invention;
FIG. 14 is an effect diagram of the comprehensive monitoring page realizing traffic condition monitoring according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the examples of the present invention, "first," "second," etc. are used to distinguish between different objects, and are not used to describe a particular order or sequence.
Fig. 1 is a schematic diagram of a traffic condition comprehensive monitoring method based on YOLO according to an embodiment of the present invention, including the following steps:
(1) Early data preparation stage: the calibration camera obtains the internal and external parameters of the monocular camera, the lens distortion of the monocular camera is corrected in C++ using these internal and external parameters, and video is recorded and stored; urban, elevated-road and expressway videos are recorded while driving, the video is split frame by frame into an image data set with matlab, the image data set is detected with a YOLOv4 weight model trained on the coco data set as pre-labeling, generating a txt file for each frame, which is then loaded into the labelimg labeling software for manual re-labeling and imported into the yolo_mark software for data cleaning;
In this embodiment, the data preparation stage yields a road traffic data set containing 71,000 pictures, 14,200 labeled pictures and their txt label files.
In this embodiment, the step (1) may be specifically implemented by:
(11) Monocular camera calibration, as shown in FIG. 2: the camera parameters are calibrated with the Camera Calibrator app of the matlab toolbox; a 10×7 checkerboard is photographed from multiple angles with the camera to be calibrated, the images are imported into Camera Calibrator, the checkerboard size is entered, the camera type "standard" is selected, and clicking Export generates the camera parameters in the workspace; the generated camera parameters comprise the camera intrinsic matrix and the radial and tangential distortion parameters;
(12) Camera imaging correction: the camera is corrected in a C++ program with the opencv correction (rectification) mapping function initUndistortRectifyMap; according to the camera parameters from step (11), a distortion parameter matrix distCoeffs and a camera intrinsic matrix cameraMatrix of Mat type are created and passed into the initUndistortRectifyMap function to generate the correction maps, and the corrected image is generated with the remap function;
In this embodiment, the correction mapping function initUndistortRectifyMap outputs two correction maps, and the remap function uses them to relocate each pixel of the original image to its correct position, obtaining the corrected picture.
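A minimal C++ sketch of this correction step is given below; the intrinsic and distortion values are placeholders rather than calibrated values, and note that OpenCV expects the distortion coefficients in the order (k1, k2, p1, p2, k3).

```cpp
// Sketch only: undistorting camera frames as in step (12).
// The matrices below are placeholder values, not the calibrated ones.
#include <opencv2/opencv.hpp>

int main() {
    cv::Mat cameraMatrix = (cv::Mat_<double>(3, 3) <<
        1000.0, 0.0,    640.0,
        0.0,    1000.0, 360.0,
        0.0,    0.0,    1.0);                      // fx, fy, cx, cy (placeholders)
    cv::Mat distCoeffs = (cv::Mat_<double>(1, 5) <<
        -0.30, 0.10, 0.001, 0.001, 0.0);           // k1, k2, p1, p2, k3 (placeholders)

    cv::Mat frame = cv::imread("frame.jpg");       // hypothetical input image
    cv::Mat map1, map2, undistorted;

    // Build the two correction maps once ...
    cv::initUndistortRectifyMap(cameraMatrix, distCoeffs, cv::Mat(),
                                cameraMatrix, frame.size(), CV_16SC2, map1, map2);
    // ... then relocate every pixel of each incoming frame to its corrected position.
    cv::remap(frame, undistorted, map1, map2, cv::INTER_LINEAR);

    cv::imwrite("frame_undistorted.jpg", undistorted);
    return 0;
}
```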
In this embodiment, the distortion parameter matrix contains three radial distortion parameters and two tangential distortion parameters: distCoeffs = [k1, k2, k3, p1, p2]. The camera intrinsic matrix contains 3×3 camera parameters (unique to each camera):

[ fx  s   cx ]
[ 0   fy  cy ]
[ 0   0   1  ]

where cx and cy are the camera optical center and s is the tilt (skew) parameter (s = 0 when the x-axis and y-axis are perfectly perpendicular). fx and fy are related to the camera focal length: fx = f×sx, fy = f×sy, where f is the focal length of the camera (in mm) and sx and sy represent the number of pixels per millimeter in the x and y directions of the image taken by the camera.
(13) Labeling and cleaning data: in step (1), the labelimg software is used for labeling and the yolo_mark software for data cleaning; the labeling categories comprise Human, Car, Bicycle, Truck, Traffic Sign, Indicator, Electric mobile, Bus and Van; because some frames contain no targets and labelimg does not automatically generate blank txt label documents for them, the yolo_mark software is used to clean the data, avoiding manual sorting of the data set; part of the data set images and labeled txt documents are shown in FIG. 3.
(2) YOLOv4 target detection: anchor frames that fit the characteristics of the data set are generated with the kmeans algorithm, the anchor frame parameters are written into the configuration file, training is performed under the darknet framework, and the network weight file is generated and saved; after training, the trained YOLOv4 model is called directly through the dnn interface of opencv, a C++ program reads the video stream or camera and performs target detection frame by frame, overlapping target frames are screened out with the NMS non-maximum suppression algorithm and the optimal one retained, a confidence threshold is set to filter low-credibility targets, and the final detection result is fed back into each frame of image;
in this embodiment, step (2) may be implemented by:
(21) Anchor frame clusters are calculated for the data set processed in step (1) with the kmeans algorithm in darknet; the number of anchor frames and the average IOU are considered together, and the optimal number of anchor frames and corresponding sizes are selected to fit the data set; the trend of accuracy with the number of anchor frames is shown in FIG. 4;
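The clustering itself is delegated to darknet in this embodiment. As an illustration only, a hedged C++ sketch of k-means over box widths and heights with a 1-IOU distance, the metric used by darknet's anchor clustering, could look as follows; the naive initialization and all names are assumptions of the sketch.

```cpp
// Sketch only: k-means anchor clustering with a 1-IOU distance.
// Widths/heights are assumed normalized and boxes.size() >= k.
#include <algorithm>
#include <vector>

struct WH { float w, h; };

static float iou(const WH& a, const WH& b) {               // IOU of two boxes sharing a corner
    float inter = std::min(a.w, b.w) * std::min(a.h, b.h);
    return inter / (a.w * a.h + b.w * b.h - inter);
}

std::vector<WH> clusterAnchors(const std::vector<WH>& boxes, int k, int iters = 100) {
    std::vector<WH> centers(boxes.begin(), boxes.begin() + k);   // naive initialization
    std::vector<int> assign(boxes.size(), 0);
    for (int it = 0; it < iters; ++it) {
        for (std::size_t i = 0; i < boxes.size(); ++i) {          // assignment step
            float best = 2.f;
            for (int c = 0; c < k; ++c) {
                float d = 1.f - iou(boxes[i], centers[c]);
                if (d < best) { best = d; assign[i] = c; }
            }
        }
        std::vector<WH> sum(k, WH{0.f, 0.f});                     // update step
        std::vector<int> cnt(k, 0);
        for (std::size_t i = 0; i < boxes.size(); ++i) {
            sum[assign[i]].w += boxes[i].w;
            sum[assign[i]].h += boxes[i].h;
            ++cnt[assign[i]];
        }
        for (int c = 0; c < k; ++c)
            if (cnt[c] > 0) centers[c] = WH{sum[c].w / cnt[c], sum[c].h / cnt[c]};
    }
    return centers;   // the k anchor sizes written into the YOLOv4 cfg file
}
```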
(22) An object net = readNetFromDarknet(cfg, weight) of the network class Net is created with the dnn interface of opencv, GPU acceleration is enabled with the member functions setPreferableTarget and setPreferableBackend of net, an object cap of the VideoCapture class is created, and a loop passes the camera data frame by frame into an object frame of the image class; frame is passed as the actual parameter into the member function setInput of net, and the output information is obtained from the member function forward of net;
readNetFromDarknet is the loading function of the Darknet network architecture, used to generate the corresponding Darknet deep-learning network structure from a cfg configuration file and a weight file; cfg and weight are respectively the paths of the configuration file with suffix .cfg and the weight file with suffix .weights, from which the network structure and network parameters are loaded.
setPreferableTarget and setPreferableBackend respectively set the preferred target and the preferred backend; through these two function interfaces the loaded network architecture is deployed on the GPU, so that network inference can be accelerated by the GPU.
The VideoCapture class is used to create a blank video stream that continuously receives the images acquired by the camera.
The member function forward is the forward inference function; as a member function of net, it passes the received image through the network structure, performs forward inference, and returns as output information all detection target frames in the image together with the category and confidence information corresponding to each frame.
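Putting step (22) together, a minimal C++ sketch of the loading and inference loop could look as follows; the file paths, the 416×416 input size and the blobFromImage preprocessing are assumptions, not values fixed by this embodiment.

```cpp
// Sketch only: loading the trained model through OpenCV's dnn interface and
// running frame-by-frame inference as in step (22).
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <vector>

int main() {
    cv::dnn::Net net = cv::dnn::readNetFromDarknet("yolov4.cfg", "yolov4.weights");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);   // deploy the network on the GPU
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    cv::VideoCapture cap(0);                                // camera index or a video path
    cv::Mat frame;
    while (cap.read(frame)) {
        cv::Mat blob = cv::dnn::blobFromImage(frame, 1.0 / 255.0, cv::Size(416, 416),
                                              cv::Scalar(), true, false);
        net.setInput(blob);
        std::vector<cv::Mat> outs;
        net.forward(outs, net.getUnconnectedOutLayersNames());
        // each output row holds the normalized box plus objectness and class scores,
        // and is handed to the post-processing of step (23)
    }
    return 0;
}
```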
(23) Overlapping target frames are screened out with the NMS non-maximum suppression method and the optimal one retained; the output information of net is corrected and then passed into the NMSBoxes function, whose input parameters include the object boxes of the vector<Rect> class, the object confidences of the vector<float> class, the output threshold of net and the NMS screening threshold, where the correction formulas are:
left = d0 × W
top = d1 × H
width = d2 × W
height = d3 × H
where d0, d1, d2, d3 and the input parameter confidences come from the output information of the member function forward of net, and W and H are the width and height of the image acquired by the camera. left, top, width and height together form boxes; the classification and position with the highest confidence for each possible target are obtained from the NMSBoxes function, and the detection result is drawn into the image with the rectangle and putText functions of opencv, which completes the target detection process; the detection effect is shown in FIG. 5.
The NMSBoxes function is the non-maximum suppression screening function, used to screen the detection frames of each class of target according to the category and confidence information, judging downward in turn from the highest confidence; if the IOU between the current low-confidence detection frame and a high-confidence detection frame exceeds the threshold, the low-confidence detection frame is erased. The IOU is the ratio of the intersection area to the union area of two detection frames.
The object boxes of the vector<Rect> class is the detection frame vector, used to store all detection frame information in the output information.
The object confidences of the vector<float> class is the confidence vector, used to store all confidence information in the output information (in one-to-one correspondence with the detection frames).
The rectangle and putText functions are respectively the rectangle-drawing function and the text-adding function: the rectangle-drawing function draws a rectangular frame at the specified position of the specified image according to the given corner coordinates, width and height parameters, frame color and line-type parameters, and the text-adding function adds text at the specified position of the specified picture according to the given starting coordinates, font, font size, character color and the text content.
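A hedged C++ sketch of the post-processing in step (23) follows: correcting the forward outputs to pixel coordinates, NMSBoxes screening, and drawing with rectangle and putText. The thresholds and the label format are illustrative only.

```cpp
// Sketch only: post-processing of step (23).
#include <opencv2/opencv.hpp>
#include <opencv2/dnn.hpp>
#include <vector>

void postProcess(cv::Mat& frame, const std::vector<cv::Mat>& outs,
                 float confThreshold = 0.5f, float nmsThreshold = 0.4f) {
    const int W = frame.cols, H = frame.rows;
    std::vector<cv::Rect> boxes;
    std::vector<float> confidences;
    std::vector<int> classIds;

    for (const cv::Mat& out : outs) {
        for (int r = 0; r < out.rows; ++r) {
            const float* d = out.ptr<float>(r);
            cv::Mat scores = out.row(r).colRange(5, out.cols);
            cv::Point classIdPoint;
            double conf;
            cv::minMaxLoc(scores, nullptr, &conf, nullptr, &classIdPoint);
            if (conf < confThreshold) continue;                  // drop low-credibility targets
            // correction formulas of the description: left = d0*W, top = d1*H,
            // width = d2*W, height = d3*H (with raw darknet outputs d0, d1 encode
            // the box centre, so a centre-to-corner shift would be applied here)
            int width = int(d[2] * W), height = int(d[3] * H);
            int left = int(d[0] * W), top = int(d[1] * H);
            boxes.emplace_back(left, top, width, height);
            confidences.push_back(static_cast<float>(conf));
            classIds.push_back(classIdPoint.x);
        }
    }
    std::vector<int> keep;
    cv::dnn::NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, keep);
    for (int i : keep) {                                         // draw the surviving detections
        cv::rectangle(frame, boxes[i], cv::Scalar(0, 255, 0), 2);
        cv::putText(frame, cv::format("class %d %.2f", classIds[i], confidences[i]),
                    boxes[i].tl(), cv::FONT_HERSHEY_SIMPLEX, 0.5, cv::Scalar(0, 255, 0), 1);
    }
}
```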
(3) The PnP monocular ranging effect is shown in FIG. 6. Ranging is based on the target detection result: the position of the ranging target in the current image is obtained as its image coordinates, the world coordinates are set according to the real size of the target, the camera internal and external parameters obtained by camera calibration are combined, the solvePnP function is called to return the three-dimensional coordinates of the ranging target, and the distance between each target and the host vehicle is calculated;
the solvePnP function represents a function for solving the multi-point perspective problem, and the function calculates and returns the pose of the camera according to 4 world coordinate points, 4 image coordinate points and a 3 x 3 camera internal reference matrix of the target. The formula is as follows:
the method can be simplified to s i U i =ILP i (i can take 1-4 due to four pairs of coordinates, i.e. there are four equations), where s i U for depth information of the ith point in world coordinate system i For the pixel coordinates of the point in the image, I is a camera internal reference matrix, the matrix is obtained by camera calibration, L is a camera pose matrix, and the matrix comprises a 3×3 rotation matrix R and 3×1 translation vectors T and P i Is the coordinates of the object in the world coordinate system. Where the coordinates are actually transformed from non-homogeneous to homogeneous.
The equation is true under ideal conditions, but due to unknown camera pose and noise at the observation point, there is an error on both sides of the equation, minimizing this error can approximate the true camera pose and thus range. Based on 4 pairs of world-image coordinate points constructed in advance, 4 pairs of point errors are summed to construct a least squares problem, and then the best camera pose is found to minimize it:
epsilon is an error obtained by comparing the projection position (pixel coordinates) of the target in the image acquired by the camera with the position obtained by projecting the 3D coordinate point according to the currently estimated pose, and is therefore called a reprojection error.
And then adopting a Levenberg-Marquardt iterative optimization method to approximate the correct camera pose in the process of minimizing the reprojection error. When epsilon is smaller than the threshold value, the resolving function returns to the pose of the camera, namely returns to the rotation matrix R and the translation vector T, wherein T is the vector of three-dimensional estimated coordinates x, y and z.
In this embodiment, step (3) may be implemented by:
The solvePnP function of opencv is called to return the three-dimensional estimated coordinates; the input parameters of solvePnP comprise the real coordinate points and image coordinate points of the target to be measured, the camera parameter matrix and the distortion parameter matrix, where the real coordinate points are P1(-w/2, -h/2), P2(w/2, -h/2), P3(w/2, h/2) and P4(-w/2, h/2), and w and h are respectively the preset target width and height; after ranging, solvePnP returns the three-dimensional estimated coordinates x, y and z, and because the data set includes situations such as ascending and descending elevated-road ramps, so that real driving may involve height differences, the z-direction distance is also taken into account and the measured distance formula is d = √(x² + y² + z²).
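A minimal C++ sketch of this ranging step is shown below; the per-class physical size (w, h) and the corner correspondences taken from the detected frame are assumptions of the sketch.

```cpp
// Sketch only: PnP ranging of step (3) from a detected target frame.
#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

double rangeTarget(const cv::Rect& box, double w, double h,
                   const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs) {
    // world coordinates of the four target corners P1..P4, centred on the target
    std::vector<cv::Point3d> objectPoints = {
        {-w / 2, -h / 2, 0}, { w / 2, -h / 2, 0},
        { w / 2,  h / 2, 0}, {-w / 2,  h / 2, 0}
    };
    // matching image coordinates taken from the detected target frame
    std::vector<cv::Point2d> imagePoints = {
        {double(box.x),             double(box.y)},
        {double(box.x + box.width), double(box.y)},
        {double(box.x + box.width), double(box.y + box.height)},
        {double(box.x),             double(box.y + box.height)}
    };
    cv::Mat rvec, tvec;                                  // tvec holds the estimated (x, y, z)
    cv::solvePnP(objectPoints, imagePoints, cameraMatrix, distCoeffs, rvec, tvec);
    double x = tvec.at<double>(0), y = tvec.at<double>(1), z = tvec.at<double>(2);
    return std::sqrt(x * x + y * y + z * z);             // distance to the host vehicle
}
```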
(4) Sort target tracking is likewise based on the target detection result, and the detection and tracking effect is shown in FIG. 7. The coordinate position of the target to be tracked in the image and its time point (the specific frame number) are obtained, the position of the target in the next frame is predicted with a Kalman filter, the tracked-target information is updated in the next frame, and the information in the current image is associated with the prediction from the previous frame, realizing multi-target tracking; the Sort algorithm is packaged into a TestSort function so that it can conveniently be called across frames;
In this embodiment, the Sort algorithm estimates motion information through Kalman filtering and then performs data association with the Hungarian algorithm. The Kalman filter predicts the target location at time t+1 based on the current measurement at time t and the target location model, which is an effective way to maintain target localization during short occlusions. The Hungarian algorithm is a combinatorial optimization algorithm that helps assign a unique identification number to a given object across a set of image frames by checking whether an object in the current frame is the same as an object detected in the previous frame.
In this embodiment, step (4) may be implemented by:
(41) The Sort algorithm establishes a target position model on the time axis when performing cross-frame association, and the modeling state of each target is T = [box, id_m, t_m, h_m, a_m], where box is the target frame, comprising the four parameters x, y, w, h, i.e. the abscissa and ordinate of the upper-left corner of the target detection frame and the width and height of the target frame; id_m is the order in which each tracked object appears on the time axis and is also its unique identity information; a_m is the age of the tracked object (the number of frames in which it has appeared), counted from its first appearance; h_m is the number of frames for which the target has been continuously tracked, and while h_m is smaller than the preset frame number threshold for continuous tracking the target is temporarily not displayed, in order to avoid erroneous tracking; t_m is the number of frames for which the previously tracked target has continuously disappeared, and when t_m exceeds the preset threshold of frames for which a previously tracked target may continuously disappear, the target is removed from the tracking list;
(42) The tracking flow is as follows: judge whether the tracker is currently in the initial state (the number of tracked targets is 0 and the frame number is smaller than the preset frame number threshold); if so, initialize the vector trks of tracked objects T with a Kalman filter and wait for the next-frame detection result;
(43) If not in the initial state, for example in the second frame, predict the trks of the previous frame and compute the IOU with the detection result vector dets of the current (second) frame; the resulting IOU matrix is an m×n matrix, where m and n are the numbers of elements of the trks and dets vectors respectively, and Hungarian maximum matching is performed on the IOU result; the matching result for each element of trks and dets has three possibilities: if the corresponding entry in the IOU matrix is greater than iou_threshold, the match succeeds, the frames are associated and the state T of the target is updated, with box corrected to the current-frame position, id_m unchanged, t_m = 0, h_m + 1, a_m + 1; if a target in dets cannot be matched with trks, it is a new object, which is initialized with a Kalman filter to obtain its state T, and if subsequent frames continue to match it, the target is visualized once h_m exceeds the preset frame number threshold for continuous tracking; if a target in trks cannot be matched with dets, the target was not detected in the current frame, and if t_m exceeds the preset threshold of frames for which a previously tracked target may continuously disappear it is removed from the tracking list, otherwise its state T is corrected with a_m + 1, t_m + 1;
(44) The corrected states of all tracked targets are saved in the tracking list, the next-frame detection result is awaited, and processes (42) to (43) are repeated.
In this embodiment, the Sort algorithm is encapsulated as follows: packaging the Sort algorithm into a TestSort function facilitates cross-frame invocation. The implementation defines sth_need_loop TestSort(Mat imag, vector<TrackingBox> mydetdata, sth_need_loop collectors_and_car_id, int maxFrame). The function return type is the custom type sth_need_loop, and the input types are respectively the image class Mat, the custom vector class vector<TrackingBox>, the custom type sth_need_loop, and the integer int. sth_need_loop stores the cross-frame tracking information, imag is the current picture, mydetdata is the target frame information detected in the current picture, and maxFrame is the current frame number. The content of step (4) is implemented inside the TestSort function, which is called in a loop from the main function.
The encapsulated function can be expressed as {objects_new, car_id} = TestSort(img, dets, objects_old, n), where objects_new in the return value represents the tracking object information left at the end of this frame and car_id represents the serial numbers of this tracking object information; among the input parameters, img represents the current frame image, dets represents the detection frame information of the current frame, objects_old represents the tracking object history information left by the previous frame, and n represents the frame number of the current frame on the time axis.
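As an illustration of the association logic wrapped by TestSort, the following C++ sketch uses a greedy IOU matching in place of the Hungarian assignment used by the embodiment; the Track structure and all thresholds are assumptions of the sketch.

```cpp
// Sketch only: per-frame association in the spirit of the Sort step wrapped by TestSort.
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <vector>

struct Track {                        // mirrors the modeling state T = [box, id_m, t_m, h_m, a_m]
    cv::Rect2f box;
    int id = 0;                       // id_m: order of first appearance
    int lost = 0;                     // t_m: frames continuously disappeared
    int hits = 0;                     // h_m: frames continuously tracked
    int age = 0;                      // a_m: frames since first appearance
};

static float iou(const cv::Rect2f& a, const cv::Rect2f& b) {
    float inter = (a & b).area();
    return inter / (a.area() + b.area() - inter + 1e-6f);
}

void associate(std::vector<Track>& trks, const std::vector<cv::Rect2f>& dets,
               int& nextId, float iouThr = 0.3f, int maxLost = 3) {
    std::vector<bool> used(dets.size(), false);
    for (auto& t : trks) {
        int best = -1;
        float bestIou = iouThr;
        for (std::size_t j = 0; j < dets.size(); ++j) {   // best unmatched detection for this track
            float v = iou(t.box, dets[j]);
            if (!used[j] && v > bestIou) { best = static_cast<int>(j); bestIou = v; }
        }
        ++t.age;
        if (best >= 0) { t.box = dets[best]; t.lost = 0; ++t.hits; used[best] = true; }
        else           { ++t.lost; }                      // not detected in the current frame
    }
    for (std::size_t j = 0; j < dets.size(); ++j)         // unmatched detections start new tracks
        if (!used[j]) { Track t; t.box = dets[j]; t.id = nextId++; trks.push_back(t); }
    trks.erase(std::remove_if(trks.begin(), trks.end(),
               [maxLost](const Track& t) { return t.lost > maxLost; }), trks.end());
}
```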
(5) The UFLD lane line detection effect is shown in FIG. 8. The Ultra-Fast-Lane-Detection method is adopted: the weight file with the .pth suffix is converted into a .pt file, the calling script is rewritten as a C++ program, and the C++ program is packaged into a RunlaneDetection function so that it can conveniently be called across frames;
in this embodiment, step (5) may be implemented by:
(51) When the ultra-fast-lane-detection method is used for lane line detection, the weights must be converted: the trained .pth weight file is passed into the trans.py script, the three parameters cls_num_per_lane, griding_num and backbone are set according to the network structure, the network model Net is defined through the parsingNet function, the .pth weight file parameters are loaded into Net, a blank weight of form 1×3×288×800 is defined with the torch.zeros function, Net is traced onto the blank weight through the torch.jit.trace function, and the converted .pt weights are saved with the save function;
where cls_num_per_lane is the number of detection points per lane; griding_num is the number of grid cells; backbone is the number of network architecture layers; the parsingNet function defines the network model Net, creating a blank adaptive network structure and generating a new network structure according to the three parameters cls_num_per_lane, griding_num and backbone; the torch.zeros function is the zeroing function of pytorch, used to create all-zero weights of the given form; the torch.jit.trace function is the tracing function of pytorch, used to record the content of the Net network onto the all-zero weights; and the save function saves the converted weights as a weight file with the suffix .pt.
(52) The UFLD detection method is packaged into the RunlaneDetection function, whose return type is Mat and whose input types are Mat and torch::jit::script::Module, i.e. the road picture and the network model are passed in and the picture with the lane lines drawn on it is returned after inference.
The encapsulated function expression is I0 = RunlaneDetection{I1, N}, where I1 is the input image, N is the input network model loaded from the .pt weight file, and I0 is the output image. The function works as follows: 18 specific pixel rows of the input image are examined by the network model (because cls_num_per_lane = 18 is set, each lane is represented by at most 18 anchor points, so detection is performed on 18 specific pixel rows), and the possible lane line location points in each row are found; these points are then drawn on the input image, which is returned as the output image. Mat is the return type of I0, i.e. the image (matrix) class, and torch::jit::script::Module is the template class, i.e. the type of N.
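A minimal C++ sketch of calling the converted .pt model, in the spirit of RunlaneDetection, is given below; the preprocessing (resize to 288×800, 1/255 scaling) mirrors the 1×3×288×800 form used during conversion but is an assumption about the exact pipeline, and the row-anchor decoding is only indicated by a comment.

```cpp
// Sketch only: running the converted .pt lane model from C++.
#include <torch/script.h>
#include <opencv2/opencv.hpp>

cv::Mat runLaneDetectionSketch(const cv::Mat& image, torch::jit::script::Module& net) {
    cv::Mat resized, rgb;
    cv::resize(image, resized, cv::Size(800, 288));
    cv::cvtColor(resized, rgb, cv::COLOR_BGR2RGB);
    rgb.convertTo(rgb, CV_32FC3, 1.0 / 255.0);

    torch::Tensor input = torch::from_blob(rgb.data, {1, 288, 800, 3}, torch::kFloat)
                              .permute({0, 3, 1, 2}).contiguous();
    torch::Tensor out = net.forward({input}).toTensor();   // row-anchor classification output

    cv::Mat result = image.clone();
    // ... decode the 18 row anchors per lane from `out` and draw the lane points on `result`
    return result;
}

// usage: torch::jit::script::Module net = torch::jit::load("ufld.pt");
//        cv::Mat drawn = runLaneDetectionSketch(frame, net);
```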
In this embodiment, the comprehensive monitoring effects of target detection, ranging, tracking and lane line detection are shown in fig. 9.
(6) The operation interface integrates the above functions and adds functional interfaces such as user login, detection target selection, user-defined camera configuration and comprehensive monitoring. The user operation interface adopts the Qt framework, the Qt VS Tools extension is added in Visual Studio to realize joint programming, and the user interface is designed through Qt Designer; it comprises user login, custom configuration, camera setting and comprehensive detection functions. The implementation and steps are as follows; the interface operation flow is shown in FIG. 10.
(61) A stackWidget is created in the main window of the UI as the window linking the functions, a QTYOLO class that publicly inherits from QMainWindow is created, the index of the stackWidget is changed through signals and slot functions, and pushButtons are linked to realize jumps between function pages;
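A hedged C++ sketch of the page-switching skeleton in step (61) might look as follows; the widget layout and the single button are illustrative, and Q_OBJECT is omitted because the sketch declares no custom signals or slots.

```cpp
// Sketch only: the page-switching skeleton of step (61).
#include <QMainWindow>
#include <QStackedWidget>
#include <QPushButton>

class QTYOLO : public QMainWindow {                 // publicly inherits QMainWindow
public:
    explicit QTYOLO(QWidget* parent = nullptr) : QMainWindow(parent) {
        stackWidget = new QStackedWidget(this);
        setCentralWidget(stackWidget);
        // ... the login, loading, camera-setting and monitoring pages are added here
        auto* nextButton = new QPushButton(QStringLiteral("next"), this);
        // signal/slot link: clicking the pushButton changes the stackWidget index,
        // i.e. jumps to the next function page
        connect(nextButton, &QPushButton::clicked, this, [this]() {
            stackWidget->setCurrentIndex(stackWidget->currentIndex() + 1);
        });
    }
private:
    QStackedWidget* stackWidget = nullptr;
};
```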
(62) The first page, page1, of the stackWidget serves as the login interface, whose effect diagram is shown in FIG. 11. Several pushButton controls, label controls and lineEdit controls are dragged from the control column of Qt Designer to serve respectively as the register and login buttons, the account and password text prompts, and the account and password text input boxes; Qt automatically creates member objects for these controls. The login interface has a gif animated background, also carried by a label control. The controls are declared in Qt style under private slots in the corresponding header file as member functions of the interface, e.g. pushButton_1_clicked(); such a slot function means that the UI contains a pushButton whose clicking triggers the slot, the content of the slot function being added in the corresponding source file. The line-edit display type is changed to Password by the lineEdit member function setEchoMode, so that the password is not visible while the user types it, and when the user clicks login a prompt dialog box pops up if the input is wrong;
(63) The second page, page2, of the stackWidget is the loading page, realizing the user-defined configuration function; its effect diagram is shown in FIG. 12. Several labels are added as text prompts, and 4 toolButtons and 9 checkBoxes are added as the file-loading buttons and the detection target selection buttons; the four loading buttons load the ".mp4 video file", the ".names object class file", the ".weights weight file" and the ".cfg configuration file" respectively. The 9 checkBoxes allow multiple selection of "Human", "Car", "Bicycle", "Truck", "Traffic Sign", "Indicator", "Electric mobile", "Bus" and "Van" as detection objects, and a buttonBox is added under page2 to let the user choose whether to enable the camera instead of using video detection directly. When the user selects the camera, the video acquired by the camera is used as input, and after the next button on page2 is pressed the program jumps to the third page, page3, to set the camera parameters; conversely, if the user does not select the camera, or chooses not to enable it, the video loaded by the user is used as input and page3 is skipped;
(64) Page3 of the stackWidget is the camera setting interface, so that the algorithm can adapt when a user enables a camera; its effect diagram is shown in FIG. 13. A tabWidget is dragged onto page3 and a flat camera or a fisheye camera can be selected by switching pages; the widget for the flat camera contains 12 labels with content such as "camera external parameters:", "camera intrinsic matrix:", "radial distortion parameters" and "tangential distortion parameters", and also contains 13 lineEdits corresponding respectively to the 9 parameters of the camera intrinsic matrix and the 4 external camera parameters;
(65) Page3 is reached if and only if the user chooses to enable the camera on page2; all parameters on this page are entered manually by the user, obtained with the lineEdit member function text() and assigned to member variables of QTYOLO, for example the intrinsic matrix is assigned to inCammatrix; these member variables are subsequently used when the algorithm calls the camera parameters, so that the user's camera is adapted;
(66) The fourth page, page4, of the stackWidget is the comprehensive monitoring page, whose effect diagram is shown in FIG. 14. Page4 contains a tabWidget for switching between the original picture and the detected effect picture, whose two widgets respectively contain the label-type objects testWindow1 and testWindow2 for refreshing each frame of image; 4 pushButtons are used to start, pause, end and return to the previous step; 4 checkBoxes are used to display the target confidence, target ranging, target tracking and lane lines respectively, and when several of them are ticked at the same time the detection effects are presented superimposed; the implementation is similar to that described in step (62).
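Two of the lineEdit behaviours described in steps (62) and (65), masking the password field and reading a user-typed camera parameter, could be sketched as follows; the function and argument names are assumptions of this sketch.

```cpp
// Sketch only: lineEdit handling from steps (62) and (65).
#include <QLineEdit>
#include <QMessageBox>

void setupPasswordField(QLineEdit* passwordEdit) {
    passwordEdit->setEchoMode(QLineEdit::Password);   // typed characters are hidden
}

bool readCameraParameter(QLineEdit* edit, double& value, QWidget* parent) {
    bool ok = false;
    value = edit->text().toDouble(&ok);               // text() fetches the user input
    if (!ok)
        QMessageBox::warning(parent, QStringLiteral("input"),
                             QStringLiteral("invalid camera parameter"));   // prompt dialog
    return ok;                                        // assigned to a QTYOLO member on success
}
```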
It should be noted that each step/component described in the present application may be split into more steps/components, or two or more steps/components or part of operations of the steps/components may be combined into new steps/components, according to the implementation needs, to achieve the object of the present application.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the application and is not intended to limit the application, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the application are intended to be included within the scope of the application.

Claims (6)

1. A traffic condition comprehensive monitoring method based on YOLO is characterized by comprising the following steps:
s1: the method comprises the steps that a calibration camera obtains internal parameters and external parameters of a monocular camera, lens distortion of the monocular camera is corrected according to the internal parameters and the external parameters of the monocular camera, video is recorded and stored, the video is divided into image data sets frame by frame, a pre-trained YOLOv4 weight model is utilized to detect the image data sets to serve as pre-labeling, then manual re-labeling is carried out, and data cleaning is carried out to obtain labeled sample data sets;
s2: processing the marked sample data set by using a kmeans algorithm to generate an anchor frame conforming to the characteristics of the data set, inputting anchor frame parameters into a configuration file, training under a dark net frame, generating and storing a YOLOv4 network weight file, reading a video stream or a camera by adopting a trained YOLOv4 model, performing target detection frame by frame, screening out an approximate target frame by adopting a non-maximum suppression algorithm, reserving an optimal, setting a confidence threshold value to filter a target with low credibility, and feeding back a target frame of a final target detection result into each frame of image;
S3: based on the target detection result in each frame of image, acquiring the position of the ranging target in the current image as image coordinates, setting world coordinates according to the real size of the ranging target, combining the internal and external camera parameters obtained by camera calibration, obtaining the three-dimensional coordinates of the ranging target through a PnP monocular ranging algorithm, and calculating the distance between each ranging target and the host vehicle;
S4: based on the target detection result in each frame of image, acquiring the coordinate position and time point of the target to be tracked in the image, predicting the position of the target to be tracked in the next frame with a Kalman filter, updating the target information of the tracked target at the next frame, and associating the information in the current image with the prediction from the previous frame to realize multi-target tracking, the algorithm then being encapsulated;
s5: carrying out lane line detection by adopting a UFLD lane line detection algorithm, and packaging the UFLD lane line detection algorithm;
the step S4 includes:
S41: when performing cross-frame association, establishing a target position model on the time axis, the modeled state of each target being T = [box, id_m, t_m, h_m, a_m], wherein box is the target frame, comprising the horizontal and vertical coordinates of the upper-left corner point of the target frame obtained after target detection together with the width and height parameters of the target frame; id_m is the identifier of each tracked object, assigned in order of appearance on the time axis; a_m is the number of frames in which the tracked object has appeared, counted from its first appearance; h_m is the number of frames for which the target has been continuously tracked, and when h_m is smaller than a preset frame number threshold the tracked target is temporarily not displayed in order to avoid erroneous tracking; t_m is the number of frames for which a previously tracked target has continuously disappeared, and when t_m is greater than a preset frame number threshold for continuous disappearance the target is removed from the tracking list;
S42: judging whether the system is in the initial state, that is, whether the number of tracked targets is 0 and the frame count is smaller than a preset frame number threshold; if it is in the initial state, initializing the vector trks of tracked objects T with a Kalman filter and waiting for the detection result of the next frame;
S43: if not in the initial state, predicting the trks of the previous frame, performing IOU calculation with the current-frame detection result vector dets, and performing Hungarian maximization matching according to the IOU results, the matching of targets in trks and dets having three possible outcomes: if the corresponding entry in the IOU matrix is greater than a preset IOU threshold, the match succeeds, the previous frame and the current frame are associated, and the state T of the target is updated, wherein box is corrected to the current-frame position, id_m is unchanged, t_m is reset to 0, h_m is incremented by 1 and a_m is incremented by 1; if a target in dets cannot be matched with trks, it is a new object, which is initialized with the Kalman filter to obtain its corresponding state T, and if subsequent frames can still be matched with it and h_m exceeds the preset frame number threshold for continuous tracking, the target is visualized; if a target in trks cannot be matched with dets, the target is not detected in the current frame, and if its t_m exceeds the preset frame number threshold for continuous disappearance it is removed from the tracking list, otherwise its state T is corrected with a_m incremented by 1 and t_m incremented by 1;
S44: storing the corrected states of all tracked targets into the tracking list, waiting for the detection result of the next frame, and repeating the process of steps S42-S43 (a simplified sketch of this association procedure is given after claim 1);
the method further comprises the steps of:
S6: designing a user operation interface with the Qt framework, adding the Qt VS Tools extension in Visual Studio to realize joint programming, and designing the user interface through Qt Designer, including user login, custom configuration, camera setting and comprehensive detection functions;
the step S6 comprises the following steps:
S61: creating a stackWidget in the main window of the UI interface as the association window among functions, creating a QTYOLO class publicly inheriting from the QMainWindow class, changing the index of the stackWidget through signals and slot functions, and linking pushButtons to realize jumps between function pages;
S62: the first page of the stackWidget serves as the login interface, which carries a gif background on a label control and comprises several pushButton, label and lineEdit controls used respectively as login buttons, account and password text prompts, and account and password text input boxes; Qt automatically creates member objects in the UI class, and the slot functions of each control are declared in Qt format under private slots in the corresponding header file and serve as member functions of the interface; the display mode of the line edit is changed to Password with the lineEdit member function setEchoMode so that the password is masked while the user types it, and when the user clicks login, a prompt dialog box pops up if the input is wrong;
S63: the second page of the stackWidget is a loading page realizing the user-defined configuration function; several labels are added as text prompts, together with buttons for loading files and buttons for selecting detection targets, and a buttonBox is added below to offer choices to the user; after the user selects a camera, the video acquired by the camera is used as input and the camera parameters are set, and if the user makes no selection or chooses not to enable the camera, the video loaded by the user is used as input;
S64: the third page of the stackWidget is the camera setting interface, so that the algorithm can be adapted when a user enables a camera; if and only if the user chooses to enable the camera in the second page does the interface jump to the third page, in which all parameters are manually input by the user, acquired with the lineEdit member function text and assigned to member variables of QTYOLO;
S65: the fourth page of the stackWidget is the comprehensive monitoring page, which contains a tabWidget for switching between the original picture and the detection effect picture, whose two widgets respectively contain label-type objects testWindow1 and testWindow2 for refreshing each frame of image; 4 pushButtons are used for starting, pausing, ending and returning to the previous step respectively; 4 checkboxes are used to display the target confidence, target ranging, target tracking and lane lines respectively, and in addition, when several checkboxes are checked at the same time the corresponding detection effects are presented overlaid.
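Steps S41-S44 describe a SORT-style cross-frame association. The following is a rough, non-authoritative sketch of that bookkeeping under stated assumptions: a plain constant-velocity predictor stands in for the Kalman filter, scipy's linear_sum_assignment provides the Hungarian matching, and the thresholds IOU_THRESH, H_MIN and T_MAX are illustrative values, not the patent's.

```python
# Simplified sketch of the S41-S44 association (constant-velocity predictor
# standing in for the Kalman filter; thresholds are illustrative only).
import numpy as np
from scipy.optimize import linear_sum_assignment

IOU_THRESH, H_MIN, T_MAX = 0.3, 3, 5


def iou(a, b):
    """IoU of two boxes given as [left, top, width, height]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


class Track:
    _next_id = 0

    def __init__(self, box):
        self.box = np.asarray(box, dtype=float)   # [left, top, w, h]
        self.vel = np.zeros(2)                    # crude corner velocity
        self.id_m = Track._next_id
        Track._next_id += 1
        self.t_m, self.h_m, self.a_m = 0, 1, 1

    def predict(self):
        self.box[:2] += self.vel                  # constant-velocity prediction
        return self.box

    def update(self, box):
        box = np.asarray(box, dtype=float)
        self.vel = box[:2] - self.box[:2]
        self.box = box
        self.t_m, self.h_m, self.a_m = 0, self.h_m + 1, self.a_m + 1


def step(trks, dets):
    """One S42/S43 update: associate current detections `dets` with tracks `trks`."""
    if trks:
        preds = [t.predict() for t in trks]
        cost = np.array([[1.0 - iou(p, d) for d in dets] for p in preds])
        rows, cols = linear_sum_assignment(cost) if len(dets) else ([], [])
        matched_t, matched_d = set(), set()
        for r, c in zip(rows, cols):
            if 1.0 - cost[r, c] > IOU_THRESH:     # IoU above threshold -> match
                trks[r].update(dets[c])
                matched_t.add(r)
                matched_d.add(c)
        for i, t in enumerate(trks):              # unmatched tracks: disappeared this frame
            if i not in matched_t:
                t.t_m += 1
                t.a_m += 1
        trks = [t for t in trks if t.t_m <= T_MAX]            # drop long-gone targets
        dets = [d for j, d in enumerate(dets) if j not in matched_d]
    trks += [Track(d) for d in dets]              # unmatched detections become new tracks
    return trks, [t for t in trks if t.h_m > H_MIN]           # only mature tracks displayed
```

Each frame one would call `trks, visible = step(trks, dets)` with `dets` taken from the S2 detection result and draw only the `visible` tracks, mirroring the h_m display rule of S41.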
2. The method according to claim 1, wherein step S1 comprises:
S11: calibrating the camera parameters with the matlab toolbox camera calibrator: shooting a checkerboard at multiple angles with the camera to be calibrated, importing the checkerboard images into camera calibrator, inputting the checkerboard size, and, after selecting the camera type, generating the camera parameters, which comprise the camera intrinsic matrix and the radial and tangential distortion parameters of the camera;
S12: correcting the camera with the correction mapping functions of opencv: creating a distortion parameter matrix and a camera intrinsic parameter matrix from the camera parameters, generating the correction maps, and generating the corrected image through the remap function (an opencv sketch follows this claim);
s13: recording and storing videos by adopting a corrected camera, splitting the videos into image data sets frame by frame, detecting the image data sets by utilizing a pre-trained YOLOv4 weight model to serve as pre-labeling, manually labeling by using labelimg labeling software, and cleaning data by using yolo_mark software to obtain a labeled sample data set.
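As an illustration of the S12 correction step, the following is a minimal sketch assuming the opencv functions involved are initUndistortRectifyMap and remap; the numeric intrinsic and distortion values and the file names are placeholders, not calibration results from S11.

```python
# Illustrative sketch of S12: undistorting a camera frame with OpenCV.
import cv2
import numpy as np

camera_matrix = np.array([[1000.0, 0.0, 640.0],      # placeholder intrinsic matrix
                          [0.0, 1000.0, 360.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.3, 0.1, 0.001, 0.0005])   # placeholder k1, k2, p1, p2

img = cv2.imread("frame.jpg")                        # hypothetical input frame
h, w = img.shape[:2]

# Build the correction maps once, then remap every subsequent frame.
map1, map2 = cv2.initUndistortRectifyMap(
    camera_matrix, dist_coeffs, None, camera_matrix, (w, h), cv2.CV_32FC1)
undistorted = cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
cv2.imwrite("frame_undistorted.jpg", undistorted)
```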
3. The method according to claim 2, wherein step S2 comprises:
S21: calculating anchor frame clusters for the labeled sample data set with the kmeans algorithm in darknet, comprehensively considering the number of anchor frames and the average IOU, and selecting the optimal number of anchor frames and the corresponding sizes to fit the labeled sample data set (a NumPy sketch of this clustering is given after this claim);
S22: loading the network structure and network parameters with the dnn interface of opencv, deploying the loaded network on the GPU (graphics processing unit) so that the GPU accelerates the program's operations, continuously receiving images acquired by the camera, feeding the received images into the network and performing forward reasoning to obtain all detected target frames in the image together with the category information and confidence information corresponding to each target frame;
S23: screening out overlapping approximate target frames with the NMS non-maximum suppression method and retaining the optimal target frame: correcting the output information produced by forward reasoning, screening the detection frames of each category of target according to the corrected category and confidence information, judging sequentially downward starting from the detection frame with the maximum confidence, and erasing a low-confidence detection frame if its IOU with a higher-confidence detection frame exceeds a threshold, thereby acquiring the most confident classification and position of each possible target, drawing the target frame and writing the position information, and completing target detection, wherein the IOU is the ratio of the intersection area to the union area of two detection frames (an opencv-based sketch of this detection and suppression step is given after claim 4).
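S21 relies on darknet's IoU-based k-means anchor clustering. The following small NumPy re-implementation is a sketch of that idea; the width/height samples and the choice k = 3 are illustrative, whereas in the method the inputs would be the labelled box sizes of the sample data set and k would be chosen by weighing the anchor count against the resulting average IOU.

```python
# Sketch of S21: IoU-based k-means on labelled box widths/heights (NumPy).
import numpy as np


def iou_wh(boxes, anchors):
    """IoU between boxes and anchors compared at a shared top-left corner."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union


def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]   # random initial anchors
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)      # max IoU == min (1 - IoU)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    avg_iou = iou_wh(boxes, anchors).max(axis=1).mean()         # metric used to pick k
    return anchors, avg_iou


# Illustrative widths/heights of labelled boxes (pixels); avg_iou guides the choice of k.
boxes = np.array([[35, 28], [60, 45], [120, 90], [40, 30], [200, 160]], dtype=float)
anchors, avg_iou = kmeans_anchors(boxes, k=3)
```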
4. A method according to claim 3, characterized in that the output information produced by forward reasoning is corrected by left = d0×W, top = d1×H, width = d2×W, height = d3×H, wherein d0, d1, d2 and d3 are respectively the abscissa of the upper-left corner point of the target frame, the ordinate of the upper-left corner point of the target frame, the width of the target frame and the height of the target frame output by forward reasoning, W and H are the width and height of the image acquired by the camera, and left, top, width and height are respectively the abscissa of the upper-left corner point, the ordinate of the upper-left corner point, the width and the height of the corrected target frame.
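S22/S23 and the correction of claim 4 can be illustrated with opencv's dnn module. One assumption is worth flagging: for darknet models cv2.dnn returns normalized centre coordinates, so this sketch shifts by half the width/height when forming left/top, and it applies a class-agnostic NMSBoxes for brevity where S23 screens per class; the file names, the 608x608 input size and the thresholds are placeholders.

```python
# Sketch of forward reasoning plus non-maximum suppression with cv2.dnn.
import cv2
import numpy as np

CONF_THRESH, NMS_THRESH = 0.5, 0.4

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)   # deploy the network on the GPU (S22)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

img = cv2.imread("frame.jpg")
H, W = img.shape[:2]
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (608, 608), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())   # forward reasoning

boxes, confidences, class_ids = [], [], []
for out in outs:
    for det in out:                      # det = [cx, cy, w, h, objectness, class scores...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(scores[class_id])
        if conf < CONF_THRESH:           # confidence threshold filters low-credibility targets
            continue
        width, height = det[2] * W, det[3] * H
        left, top = det[0] * W - width / 2, det[1] * H - height / 2
        boxes.append([int(left), int(top), int(width), int(height)])
        confidences.append(conf)
        class_ids.append(class_id)

# NMS keeps the highest-confidence box among heavily overlapping candidates (S23).
keep = cv2.dnn.NMSBoxes(boxes, confidences, CONF_THRESH, NMS_THRESH)
for i in np.array(keep).flatten():
    x, y, w, h = boxes[i]
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```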
5. The method according to claim 4, wherein step S3 comprises:
According to the real coordinate points and image coordinate points of the object to be measured, the camera parameter matrix and the distortion parameter matrix, the opencv PnP function is adopted to obtain the three-dimensional estimated coordinates x, y and z, and the distance between the object to be measured and the host vehicle is then d = √(x² + y² + z²), wherein x represents the relative distance between the object to be measured and the host vehicle in the x direction, y represents the relative distance in the y direction, and z represents the relative distance in the z direction.
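A sketch of the PnP ranging of claim 5 follows, assuming four known corner correspondences of the measured object; all numeric values below are placeholders, whereas in the method the image points come from the detected target frame and the object points from the target's known real size.

```python
# Sketch of claim 5: monocular ranging via solvePnP and Euclidean distance.
import cv2
import numpy as np

camera_matrix = np.array([[1000.0, 0.0, 640.0],       # placeholder intrinsics
                          [0.0, 1000.0, 360.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.3, 0.1, 0.001, 0.0005])    # placeholder distortion

# World coordinates of the target corners (metres), set from its real size.
object_points = np.array([[0.0, 0.0, 0.0],
                          [1.8, 0.0, 0.0],
                          [1.8, 1.5, 0.0],
                          [0.0, 1.5, 0.0]])
# Matching pixel coordinates of the same corners taken from the target frame.
image_points = np.array([[700.0, 400.0],
                         [760.0, 400.0],
                         [760.0, 450.0],
                         [700.0, 450.0]])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
x, y, z = tvec.flatten()                              # three-dimensional estimated coordinates
distance = float(np.sqrt(x ** 2 + y ** 2 + z ** 2))   # distance to the host vehicle
```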
6. The method of claim 4, wherein step S5 comprises:
S51: the trained pth weight file is passed into a py script; the number of detection points per lane, the number of grid cells and the number of network architecture layers are set according to the network structure, a blank self-adaptive network structure is created, the parameters of the pth weight file are loaded into the self-adaptive network structure, all-zero weights of the given form are then created, the self-adaptive network structure is written into the all-zero weights, and the converted weights are stored into a weight file with the suffix .pt;
S52: the UFLD detection algorithm is packaged as I0 = RunLaneDetection{I1, N}, wherein I1 is the input image, N is the input network model, loaded from the .pt weight file, I0 is the output image, and RunLaneDetection represents the UFLD detection algorithm.
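A rough PyTorch sketch of S51/S52 follows, with the caveats stated in the comments: build_ufld_net is a trivial stand-in for the real UFLD network definition, the file names and the 288x800 input size are assumptions, and TorchScript tracing is used here as just one way of producing a .pt file that a C++ front end could later load.

```python
# Sketch of S51/S52: converting trained .pth weights into a .pt file and
# wrapping detection as RunLaneDetection.  Sizes and file names are placeholders.
import torch
import torch.nn as nn


def build_ufld_net(num_grid=200, num_rows=18, num_lanes=4):
    """Trivial stand-in for the UFLD network, sized by grid cells, row anchors and lanes."""
    out_dim = (num_grid + 1) * num_rows * num_lanes
    return nn.Sequential(nn.Conv2d(3, 8, 3, stride=4), nn.AdaptiveAvgPool2d(1),
                         nn.Flatten(), nn.Linear(8, out_dim))


net = build_ufld_net()                                    # blank self-adaptive structure
state = torch.load("ufld.pth", map_location="cpu")        # trained pth weights
net.load_state_dict(state.get("model", state), strict=False)
net.eval()

traced = torch.jit.trace(net, torch.zeros(1, 3, 288, 800))   # assumed input resolution
traced.save("ufld.pt")                                       # weight file with suffix .pt


def run_lane_detection(i1, n):
    """I0 = RunLaneDetection{I1, N}: run the packaged model N on input image I1."""
    with torch.no_grad():
        return n(i1)
```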
CN202111347583.1A 2021-11-15 2021-11-15 Traffic condition comprehensive monitoring method based on YOLO Active CN114067564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111347583.1A CN114067564B (en) 2021-11-15 2021-11-15 Traffic condition comprehensive monitoring method based on YOLO

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111347583.1A CN114067564B (en) 2021-11-15 2021-11-15 Traffic condition comprehensive monitoring method based on YOLO

Publications (2)

Publication Number Publication Date
CN114067564A CN114067564A (en) 2022-02-18
CN114067564B true CN114067564B (en) 2023-08-29

Family

ID=80271974

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111347583.1A Active CN114067564B (en) 2021-11-15 2021-11-15 Traffic condition comprehensive monitoring method based on YOLO

Country Status (1)

Country Link
CN (1) CN114067564B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758511B (en) * 2022-06-14 2022-11-25 深圳市城市交通规划设计研究中心股份有限公司 Sports car overspeed detection system, method, electronic equipment and storage medium
CN115497303A (en) * 2022-08-19 2022-12-20 招商新智科技有限公司 Expressway vehicle speed detection method and system under complex detection condition
CN115240435A (en) * 2022-09-21 2022-10-25 广州市德赛西威智慧交通技术有限公司 AI technology-based vehicle illegal driving detection method and device
CN116580286B (en) * 2023-07-12 2023-11-03 宁德时代新能源科技股份有限公司 Image labeling method, device, equipment and storage medium
CN117011821A (en) * 2023-10-08 2023-11-07 东风悦享科技有限公司 Automatic driving visual perception method and system based on multitask learning

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105847766A (en) * 2016-05-30 2016-08-10 福州大学 Zynq-7000 based moving object detecting and tracking system
CN109117794A (en) * 2018-08-16 2019-01-01 广东工业大学 A kind of moving target behavior tracking method, apparatus, equipment and readable storage medium storing program for executing
CN110031829A (en) * 2019-04-18 2019-07-19 北京联合大学 A kind of targeting accuracy distance measuring method based on monocular vision
CN110378210A (en) * 2019-06-11 2019-10-25 江苏大学 A kind of vehicle and car plate detection based on lightweight YOLOv3 and long short focus merge distance measuring method
CN110910428A (en) * 2019-12-05 2020-03-24 江苏中云智慧数据科技有限公司 Real-time multi-target tracking method based on neural network
CN111460919A (en) * 2020-03-13 2020-07-28 华南理工大学 Monocular vision road target detection and distance estimation method based on improved YOLOv3
CN111652910A (en) * 2020-05-22 2020-09-11 重庆理工大学 Target tracking algorithm based on object space relationship
CN111950329A (en) * 2019-05-16 2020-11-17 长沙智能驾驶研究院有限公司 Target detection and model training method and device, computer equipment and storage medium
CN112580565A (en) * 2020-12-25 2021-03-30 广州亚美智造科技有限公司 Lane line detection method, lane line detection device, computer device, and storage medium
CN112734794A (en) * 2021-01-14 2021-04-30 北京航空航天大学 Moving target tracking and positioning method based on deep learning
CN112836640A (en) * 2021-02-04 2021-05-25 浙江工业大学 Single-camera multi-target pedestrian tracking method
CN113093726A (en) * 2021-03-05 2021-07-09 华南理工大学 Target detection and tracking method based on Yolo _ v4 algorithm
CN113160274A (en) * 2021-04-19 2021-07-23 桂林电子科技大学 Improved deep sort target detection tracking method based on YOLOv4

Also Published As

Publication number Publication date
CN114067564A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
CN114067564B (en) Traffic condition comprehensive monitoring method based on YOLO
CN110622213B (en) System and method for depth localization and segmentation using 3D semantic maps
CN106919908B (en) Obstacle identification method and device, computer equipment and readable medium
CN109931939A (en) Localization method, device, equipment and the computer readable storage medium of vehicle
CN106934347B (en) Obstacle identification method and device, computer equipment and readable medium
CN108419446A (en) System and method for the sampling of laser depth map
US11776277B2 (en) Apparatus, method, and computer program for identifying state of object, and controller
US11748998B1 (en) Three-dimensional object estimation using two-dimensional annotations
CN111860072A (en) Parking control method and device, computer equipment and computer readable storage medium
US11682297B2 (en) Real-time scene mapping to GPS coordinates in traffic sensing or monitoring systems and methods
CN111105695B (en) Map making method and device, electronic equipment and computer readable storage medium
CN111695497B (en) Pedestrian recognition method, medium, terminal and device based on motion information
CN114463713A (en) Information detection method and device of vehicle in 3D space and electronic equipment
CN116778458B (en) Parking space detection model construction method, parking space detection method, equipment and storage medium
EP3732878A1 (en) Dynamic image region selection for visual inference
CN116643291A (en) SLAM method for removing dynamic targets by combining vision and laser radar
CN116642490A (en) Visual positioning navigation method based on hybrid map, robot and storage medium
CN116597413A (en) Real-time traffic sign detection method based on improved YOLOv5
CN114581748B (en) Multi-agent perception fusion system based on machine learning and implementation method thereof
CN115953744A (en) Vehicle identification tracking method based on deep learning
CN113902047B (en) Image element matching method, device, equipment and storage medium
JP2020149086A (en) Training data generation apparatus, training data generation method, and training data generation program
CN113808186B (en) Training data generation method and device and electronic equipment
CN115565072A (en) Road garbage recognition and positioning method and device, electronic equipment and medium
CN114972945A (en) Multi-machine-position information fusion vehicle identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant