CN113706584A - Streetscape flow information acquisition method based on computer vision - Google Patents

Streetscape flow information acquisition method based on computer vision

Info

Publication number
CN113706584A
CN113706584A (application number CN202111026783.7A)
Authority
CN
China
Prior art keywords
frame
detection
video
target
tracking
Prior art date
Legal status
Pending
Application number
CN202111026783.7A
Other languages
Chinese (zh)
Inventor
王峥
吴东鹏
黄秀君
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN202111026783.7A
Publication of CN113706584A
Legal status: Pending

Classifications

    • G06T 7/248: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06T 7/277: Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 7/74: Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • G06T 2207/10016: Video; image sequence
    • G06T 2207/10024: Color image
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Abstract

The invention discloses a street view flow information acquisition method based on computer vision, which comprises: detecting and identifying objects in the street view with the target detection algorithm YOLOv5 for each frame of the video; extracting the appearance features of every detected object to assist the matching between detection frames and prediction frames; predicting the position of each detected target in the next frame with a Kalman filtering algorithm; computing a cost matrix from the extracted appearance features and motion features and using the Hungarian algorithm to perform cascade matching of the detection frames so that a tracking serial number is allocated to every identified object; maintaining the appearance features and tracking serial numbers of detected objects with data structures such as a prototype library, and judging whether a detected object appears in the video for the first time; and cropping a small image of every object detected for the first time, saving it to a specified path, counting the number of distinct objects appearing in each category, and displaying the motion track of each object in the video. The invention achieves real-time acquisition of image information of common street view objects together with counting statistics for the different categories of objects.

Description

Streetscape flow information acquisition method based on computer vision
Technical Field
The invention belongs to the interdisciplinary technical field of multi-target tracking, and particularly relates to a street view flow information acquisition method based on computer vision.
Background
Video target tracking is an important task in computer vision. It is the process of continuously inferring the state of a target throughout a video sequence: the target is located in every frame of the video to generate its motion track, and the complete target region in which the tracked target appears is provided at every moment. Video tracking technology is very widely used in the field of computer vision. The invention combines and improves target tracking and target detection techniques, alleviating the slow inference and tracking-loss problems of common street view flow information collection techniques.
The target detection algorithms commonly used in street view traffic information collection are usually based on the R-CNN family (R-CNN, Fast R-CNN, etc.): target candidate frames are first generated by the algorithm, and the generated candidate frames are then classified, regressed, and screened. The resulting problems are that inference is slow, the real-time requirements of video detection are difficult to meet, and frame-skipping processing of the video is often required. To address this, the latest one-stage algorithm YOLOv5 is adopted, which offers higher inference speed, higher precision, and better robustness, and also improves the accuracy of detecting small target objects.
Meanwhile, the tracking algorithm commonly adopted in street view flow information collection is the SORT algorithm; because it cannot make full use of image information, track-id switching can occur after a tracked target is temporarily occluded. To address this phenomenon, the invention extracts image information from the tracked target's detection frame with a simple CNN convolutional network and adds the extracted image information to the data-association cascade, thereby improving the precision of the whole algorithm.
Disclosure of Invention
The invention aims to provide a street view flow information acquisition method based on computer vision, which solves the problems in the prior art.
To realize these functions, the invention adopts the following technical scheme:
the street view flow information acquisition method based on computer vision is characterized by comprising the following steps:
S1: identifying more than ten kinds of common street view objects appearing in each frame of the video with the YOLOv5 algorithm, marking each detected object in the video with frames of different colors according to its category, and displaying the detected category and confidence at the upper left corner of the object;
S2: extracting the appearance features of the detected object, storing them as a low-dimensional vector, and providing a basis for data association;
S3: predicting the position of each object in the next frame with a Kalman filtering algorithm to generate a prediction frame;
S4: performing cascade matching between the prediction frames and the detection frames with the Hungarian algorithm, and allocating a tracking serial number to each detection frame;
S5: cropping a small image of each object appearing in the video for the first time, saving it to a specified path, and counting the number of objects of each category that appear.
As a further optimization, the specific process of step S1 is as follows:
s11: the input end adopts three methods of Mosaic data enhancement, self-adaptive anchor frame calculation and self-adaptive picture scaling to preprocess the input image data:
(1) Mosaic data enhancement: training images are spliced by random scaling, random cropping, and random arrangement, which enriches the background of detected objects; because the data of four pictures is processed in a single batch-normalization calculation, the mini-batch size does not need to be large and a single GPU can achieve a good effect, which both enriches the data set and enhances the detection accuracy of small target objects;
(2) Adaptive anchor frame calculation: in YOLOv3 and YOLOv4, the initial anchor box values for different training sets are computed by a separate program; YOLOv5 embeds this function into the code and adaptively computes the optimal anchor frame values for each training set at every training run;
(3) Adaptive picture scaling: the picture-scaling idea of earlier YOLO algorithms is changed: the fewest possible black edges are added adaptively to the original image, reducing the black borders at the two ends of the image height, which reduces the amount of computation during inference and improves the target detection speed, as in the sketch below.
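As referenced above, the following is a minimal Python sketch of such adaptive (letterbox-style) scaling using OpenCV; the target size of 640, the padding value of 114, and the stride of 32 are illustrative assumptions, not values prescribed by this document.

```python
import cv2

def letterbox(img, new_size=640, pad_value=114, stride=32):
    """Resize with the aspect ratio unchanged, then pad only to the next
    multiple of the network stride: the 'fewest black edges' idea."""
    h, w = img.shape[:2]
    scale = min(new_size / h, new_size / w)            # uniform scale factor
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)

    pad_h = (stride - new_h % stride) % stride         # minimal vertical padding
    pad_w = (stride - new_w % stride) % stride         # minimal horizontal padding
    top, bottom = pad_h // 2, pad_h - pad_h // 2
    left, right = pad_w // 2, pad_w - pad_w // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=(pad_value,) * 3)
```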
S12: the Backbone adopts a Focus structure and a CSP structure:
(1) Focus structure: the input picture is sliced: a value is taken from every other pixel, yielding four complementary sub-pictures with no information loss; the two-dimensional image information is thereby concentrated into the channel space, widening the input channels by a factor of 4 (the spliced picture has 12 channels instead of the original RGB three channels); a convolution is then applied to the new picture, finally producing a 2x-downsampled feature map with no information loss (see the sketch after this list);
(2) CSP structure: unlike the YOLOv4 algorithm, YOLOv5 designs two CSP structures: the CSP1_X structure is applied to the Backbone network, and the CSP2_X structure is applied to the Neck.
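The slicing in the Focus structure, referenced in (1) above, can be written compactly in PyTorch: every other pixel is taken in both directions, the four complementary sub-images are concatenated on the channel axis (3 to 12 channels), and a convolution follows. The kernel size and output channel count in this sketch are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """2x spatial downsampling with no information loss: four pixel-wise
    slices are stacked on the channel axis before a single convolution."""
    def __init__(self, in_ch=3, out_ch=64, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, k, stride=1, padding=k // 2)

    def forward(self, x):                                # x: (N, 3, H, W)
        patches = [x[..., ::2, ::2], x[..., 1::2, ::2],
                   x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(patches, dim=1))      # (N, out_ch, H/2, W/2)
```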
S13: the Neck adopts an FPN + PAN structure, i.e., a bottom-up feature pyramid containing two PAN layers is added after the FPN layer. In combination, the FPN layer conveys strong semantic features from top to bottom while the feature pyramid conveys strong localization features from bottom to top; the two structures cooperate to aggregate parameters from different backbone layers into the different detection layers.
S14: the output end adopts GIOU_Loss as the loss function of the Bounding box:

GIOU_Loss = 1 - GIOU = 1 - ( IOU - (Area(C) - Area(A ∪ B)) / Area(C) )

where A and B are the two boxes being compared, C is their minimum circumscribed rectangle, and IOU is the intersection-over-union ratio, whose value equals the overlap area divided by the union area and which is the standard criterion for evaluating the precision of a target detection algorithm.
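As an illustration of the loss above, the following sketch computes GIOU_Loss for two axis-aligned boxes; the (x1, y1, x2, y2) box format and the epsilon guard are assumptions made for the example, not requirements of the method.

```python
def giou_loss(box_a, box_b, eps=1e-7):
    """GIOU_Loss = 1 - GIoU, with GIoU = IoU - (area(C) - area(A U B)) / area(C),
    where C is the minimum rectangle enclosing both boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    iou = inter / (union + eps)

    # area of the minimum circumscribed rectangle C
    c_area = ((max(ax2, bx2) - min(ax1, bx1))
              * (max(ay2, by2) - min(ay1, by1)))
    giou = iou - (c_area - union) / (c_area + eps)
    return 1.0 - giou
```

Two identical boxes give a loss of 0, and the loss grows toward 2 as the boxes move apart, which is what makes GIOU_Loss usable even when the boxes do not overlap.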
As a further optimization, the specific process of step S2 is as follows: a relatively simple CNN with a small amount of computation is adopted to extract the appearance features of the detected object (the area covered by its detection frame), represented by a 128-dimensional vector; after each frame is detected and tracked, the appearance features of the object are extracted and stored. The appearance features are kept in the data structure gallery, i.e.

R_i = { r_k^(i) , k = 1, …, L_k },  L_k = 100

where i indicates the tracking serial number and the gallery stores, for each tracked target i, only the appearance features extracted in at most the 100 frames before the current time.
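The gallery above can be kept as one fixed-length buffer of appearance vectors per tracking serial number. Below is a minimal Python sketch of this bookkeeping, assuming NumPy arrays for the 128-dimensional features; the helper name and the L2 normalization are illustrative assumptions rather than part of the patent.

```python
from collections import defaultdict, deque

import numpy as np

L_K = 100  # at most the features of the last 100 frames are kept per target

# tracking serial number i -> deque of recent 128-d appearance vectors (R_i)
gallery = defaultdict(lambda: deque(maxlen=L_K))

def update_gallery(track_id, feature):
    """Store the appearance vector of track_id, discarding the oldest entry
    once more than L_K features have accumulated (hypothetical helper)."""
    feature = np.asarray(feature, dtype=np.float32)
    gallery[track_id].append(feature / (np.linalg.norm(feature) + 1e-12))
```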
Specifically, step S3 includes:
(1) the state of the track at time t is predicted based on its state at time t-1:
x' = Fx (1)
P' = FPF^T + Q (2)
(2) The predicted position is updated based on detection.
y = z - Hx' (3)
S = HP'H^T + R (4)
K = P'H^T S^(-1) (5)
x = x' + Ky (6)
P = (I - KH)P' (7)
In formulas 1 and 2, F is the state transition matrix, x is the mean of the track at time t-1, Q is the system noise matrix, F^T is the transpose of the state transition matrix, y is the mean error between the detection and the track, S is the innovation (noise) covariance, and I is the identity matrix.
In formula 3, z is the mean vector of the detection and contains no velocity components, i.e., z = [cx, cy, r, h]; H is the measurement matrix that maps the track's mean vector x' into the detection space; the formula computes the mean error between the detection and the track;
in formula 4, R is the noise matrix of the detector, a 4 × 4 diagonal matrix whose diagonal entries are the noise of the two center-point coordinates and of the width and height; it is initialized with arbitrary values, and the width-height noise is generally set larger than the center-point noise.
Formula 5 computes the Kalman gain K, which estimates how much weight to give to the error;
equations 6 and 7 obtain the updated mean vector x and covariance matrix P.
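A NumPy sketch of the predict/update recursion in equations (1)-(7) follows; the function names are illustrative, and the concrete F, H, Q, R matrices (built from the 8-dimensional state and 4-dimensional measurement described above) are left to the caller.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Equations (1)-(2): propagate the track's mean and covariance."""
    x_pred = F @ x                                   # x' = F x
    P_pred = F @ P @ F.T + Q                         # P' = F P F^T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """Equations (3)-(7): correct the prediction with the matched detection z."""
    y = z - H @ x_pred                               # innovation (eq. 3)
    S = H @ P_pred @ H.T + R                         # innovation covariance (eq. 4)
    K = P_pred @ H.T @ np.linalg.inv(S)              # Kalman gain (eq. 5)
    x = x_pred + K @ y                               # updated mean (eq. 6)
    P = (np.eye(x_pred.shape[0]) - K @ H) @ P_pred   # updated covariance (eq. 7)
    return x, P
```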
As a further optimization, the specific process of step S4 is as follows:
(1) The Mahalanobis distance of the motion features is computed as the cost function

d^(1)(i, j) = (d_j - y_i)^T S_i^(-1) (d_j - y_i)

where i denotes the tracking serial number and j the detection frame number; (y_i, S_i) denotes the projection of the i-th Kalman filter distribution into the measurement space, y_i being the mean and S_i the covariance. Because the distance is measured against the measured values (detection frames), the measurement must be performed in the same spatial distribution. The Mahalanobis distance expresses the uncertainty between state estimates by measuring the deviation, in standard deviations, between the Kalman filter's mean tracked position and the detection box, i.e., d^(1)(i, j) is the Mahalanobis distance (uncertainty) between the i-th track distribution and the j-th detection box; d_j denotes the motion feature vector of detection j, and S_i^(-1) denotes the inverse of its projected covariance.
(2) The minimum cosine distance between the 128-dimensional vector (appearance feature) extracted by the small CNN and the appearance features of the previous 100 frames stored in the gallery is computed to obtain the cost matrix

d^(2)(i, j) = min{ 1 - r_j^T r_k^(i) | r_k^(i) ∈ R_i }

and cascade matching of the prediction frames and detection frames is realized with the Hungarian algorithm; r_j^T is the transposed appearance vector of the detection frame with index j, and r_k^(i) ∈ R_i denotes one of the appearance features stored in the gallery of the tracked target i.
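To make the association step concrete, the sketch below combines the squared Mahalanobis distance d^(1) and the minimum cosine distance d^(2) into one cost matrix and solves it with SciPy's Hungarian solver. The dictionary fields, the weighting factor lam, and the assumption that appearance vectors are L2-normalized are illustrative choices, not specified by the patent.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def association_cost(tracks, detections, lam=0.5):
    """Cost matrix C[i, j] = lam * d1(i, j) + (1 - lam) * d2(i, j)."""
    C = np.zeros((len(tracks), len(detections)), dtype=np.float32)
    for i, trk in enumerate(tracks):
        S_inv = np.linalg.inv(trk["S"])            # projected covariance S_i^-1
        feats = np.stack(trk["gallery"])           # stored 128-d vectors, unit norm
        for j, det in enumerate(detections):
            diff = det["z"] - trk["y"]             # offset in measurement space
            d1 = float(diff @ S_inv @ diff)        # squared Mahalanobis distance
            d2 = float(np.min(1.0 - feats @ det["feature"]))  # min cosine distance
            C[i, j] = lam * d1 + (1.0 - lam) * d2
    return C

def cascade_match(tracks, detections):
    """Hungarian assignment of detections to tracks over the cost matrix."""
    rows, cols = linear_sum_assignment(association_cost(tracks, detections))
    return list(zip(rows.tolist(), cols.tolist()))
```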
As a further optimization, the specific process of step S5 is as follows: a queue stores 'category-tracking serial number' pairs; after a tracking serial number has been matched to a detection frame, the data stored in the queue is checked to see whether the detected object is already present; if the object appears in the video sequence for the first time, a small image of that object's frame is cropped with opencv and saved to a specified location.
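A small sketch of this first-appearance bookkeeping is given below; the output directory layout, the file naming, and the use of a plain queue of (category, tracking serial number) pairs are assumptions for illustration.

```python
import os
from collections import deque

import cv2

seen = deque()   # queue of (category, tracking serial number) pairs already reported

def report_if_new(frame, box, category, track_id, out_dir="snapshots"):
    """Crop and save the detection the first time a (category, id) pair appears."""
    if (category, track_id) in seen:
        return False                                # object was seen before
    seen.append((category, track_id))
    x1, y1, x2, y2 = map(int, box)
    os.makedirs(os.path.join(out_dir, category), exist_ok=True)
    cv2.imwrite(os.path.join(out_dir, category, f"{track_id}.jpg"),
                frame[y1:y2, x1:x2])                # small image of the object
    return True
```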
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the YOLOv5 with high performance in reasoning speed and precision is combined with a multi-target tracking algorithm, so that the monocular camera collects image information of common objects in the street view in real time, classifies and counts the identified objects, and reports the small pictures appearing in the video for the first time. The method has higher reasoning speed, higher precision and robustness, and simultaneously improves the accuracy of detecting the small target object.
Meanwhile, the invention extracts the image information of the tracking target detection frame by using a simple CNN convolutional network, and the extracted image information is added in the data cascade, thereby improving the precision of the whole algorithm.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the idea of step five tracking;
FIG. 3 is a frame occurring in a video;
FIG. 4 is the result of video annotation after the present invention has been run;
FIG. 5 shows the result of the image information of a common object in the street view video collected by the present invention;
fig. 6 shows the statistics of the number of objects in each category.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As shown in fig. 1, each unprocessed video frame is first passed through the YOLOv5 algorithm for object recognition, which yields the detection frame of every object in the current frame. A Kalman filtering algorithm predicts, from the information of the previous frame's detection frame, the position of the object's detection frame in the current frame, producing a prediction frame. The image region inside each detection frame is processed by a lightweight CNN to obtain a 128-dimensional vector representing the object's appearance features. The Mahalanobis distance of the motion features represented by the prediction frames and the minimum cosine distance of the appearance features are computed to obtain a cost matrix, the Hungarian algorithm computes the best match, and each detection frame is assigned a tracking serial number and a detection category, which are displayed at its upper-left corner. The small detection-frame image of an object that appears in the video for the first time is saved to the corresponding path, the number of objects appearing in each category is counted, and the statistics are displayed on the left side of the image in real time.
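The per-frame loop of fig. 1 might be organized as in the sketch below. The detector, extractor, tracker, and counter objects, along with the attribute names on the returned tracks, are placeholders standing in for the YOLOv5 detector, the lightweight CNN, the Kalman/Hungarian tracker, and the statistics module described above; none of them is a real library API.

```python
import cv2

def run(video_path, detector, extractor, tracker, counter):
    """One pass over the video: detect, embed, predict, associate, report."""
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        dets = detector(frame)                            # boxes, classes, scores
        feats = [extractor(frame, d.box) for d in dets]   # 128-d appearance vectors
        tracks = tracker.step(dets, feats)                # Kalman predict + cascade match
        for trk in tracks:
            counter.report_if_new(frame, trk)             # crop + per-category count
            x1, y1, x2, y2 = map(int, trk.box)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.putText(frame, f"{trk.category} #{trk.track_id}", (x1, y1 - 5),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        cv2.imshow("streetscape", frame)
        if cv2.waitKey(1) == 27:                          # Esc stops playback
            break
    cap.release()
    cv2.destroyAllWindows()
```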
The invention relates to a street view flow information acquisition method based on computer vision, which comprises the following steps:
the method comprises the following steps: and performing target recognition on each frame of the original video by using a YOLOv5 algorithm to obtain a prediction frame, distinguishing different types of objects by using prediction frames with different colors, and displaying the category information and the confidence coefficient of the detected object at the upper left corner of the detection frame. Compared with the previous generation YOLO algorithm, the YOLO 5 algorithm improves the network structure and training skills so as to obtain higher inference speed and detection accuracy, and meanwhile, the detection accuracy can further enhance the tracking accuracy of the tracking algorithm.
Step two: an 8-dimensional vector represents the motion state of each detected object's detection frame, and the Kalman algorithm predicts the position of the object's detection frame in the next frame from the change of its motion state in the previous frame.
Step three: a simple CNN network extracts the appearance features of the detected object, which are stored in the data structure gallery. These appearance features effectively alleviate the ID_Switch phenomenon among tracked objects and greatly improve the accuracy of the tracking algorithm.
Step four: the Mahalanobis distance of the motion states and the minimum cosine distance of the appearance features are computed to obtain a cost matrix, the Hungarian algorithm performs cascade matching on the cost matrix, and each detection frame is assigned a corresponding tracking serial number.
Step five: by recording 'category-tracking serial number' data pairs, as shown in fig. 2, it is determined whether each detected object appears in the video for the first time; if so, the corresponding region picture is cropped with opencv and saved to the corresponding path, as shown in fig. 5, and the number of distinct objects of each category appearing in the video is recorded, as shown in figs. 4 and 6. If not, the object is associated with the center point of its position in the previous frame, and opencv is used to present the object's motion track in the video.
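The trajectory display in step five can be approximated by keeping a short history of box centers per tracking serial number and joining them with line segments, as in the sketch below; the history length of 30 points and the drawing color are assumptions.

```python
from collections import defaultdict, deque

import cv2

trails = defaultdict(lambda: deque(maxlen=30))   # tracking serial number -> centers

def draw_trail(frame, track_id, box, color=(0, 0, 255)):
    """Append the current box center and draw the polyline of recent centers."""
    x1, y1, x2, y2 = map(int, box)
    trails[track_id].append(((x1 + x2) // 2, (y1 + y2) // 2))
    pts = list(trails[track_id])
    for p, q in zip(pts, pts[1:]):
        cv2.line(frame, p, q, color, 2)          # link consecutive frame centers
    return frame
```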
The above embodiments are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the protection scope of the present invention.

Claims (6)

1. The street view flow information acquisition method based on computer vision is characterized by comprising the following steps:
S1: identifying more than ten kinds of common street view objects appearing in each frame of the video, marking each detected object in the video with frames of different colors according to its category to obtain detection frames, and displaying the detected category and confidence at the upper left corner of the object;
s2: extracting appearance characteristics of the detected object and providing a basis for the associated data;
s3: predicting the position of the next frame of object to generate a prediction frame;
S4: performing cascade matching between the prediction frames and the detection frames, and allocating a tracking serial number to each detection frame;
s5: and intercepting the small image of the object appearing in the video for the first time, storing the small image to a specified path, and counting the number of various objects appearing.
2. The street view traffic information collection method based on computer vision as claimed in claim 1, wherein the step S1 is specifically performed by:
s11: the input end adopts three methods of Mosaic data enhancement, self-adaptive anchor frame calculation and self-adaptive picture scaling to preprocess the input image data:
(1) Mosaic data enhancement: training images are spliced by random scaling, random cropping, and random arrangement, which enriches the data set and enhances the detection precision of small target objects;
(2) Adaptive anchor frame calculation: in YOLOv3 and YOLOv4, the initial anchor box values for different training sets are computed by a separate program; YOLOv5 embeds this function into the code and adaptively computes the optimal anchor frame values for each training set at every training run;
(3) Adaptive picture scaling: the fewest possible black edges are added to the original image in a self-adaptive manner, so that the black borders at the two ends of the image height are reduced and the amount of computation during inference is reduced, i.e., the target detection speed is improved;
s12: backbone: focus structure, CSP structure:
(1) focus structure: clipping an input picture through a slicing operation;
(2) CSP structure: unlike the YOLOv4 algorithm, YOLOv5 designs two CSP structures: the CSP1_X structure is applied to the Backbone network, and the CSP2_X structure is applied to the Neck;
S13: the Neck adopts an FPN + PAN structure, i.e., a bottom-up feature pyramid containing two PAN layers is added after the FPN layer. In combination, the FPN layer conveys strong semantic features from top to bottom while the feature pyramid conveys strong localization features from bottom to top; the two structures cooperate to aggregate parameters from different backbone layers into the different detection layers.
S14: and the output end adopts GIOU _ Loss as a Loss function of the Bounding box.
3. The street view traffic information collection method based on computer vision as claimed in claim 1, wherein the step S2 is specifically performed by: adopting a CNN to extract the appearance features of the area covered by the detection frame of the detected object and representing them by a 128-dimensional vector; after each frame is detected and tracked, extracting and storing the appearance features of the object once, using the data structure gallery, i.e.

R_i = { r_k^(i) , k = 1, …, L_k },  L_k = 100

wherein L_k indicates that at most the appearance features of the target from the 100 frames before the current time can be stored in the gallery, i represents the tracking serial number, and r_k^(i) represents a stored appearance feature of the target, i.e., a 128-dimensional vector of the target box extracted by the CNN.
4. The street view traffic information collection method based on computer vision as claimed in claim 1, wherein the step S3 is specifically performed by:
(1) predicting the state of the track at time t based on its state at time t-1;
(2) the predicted position is updated based on detection.
5. The street view traffic information collection method based on computer vision as claimed in claim 1, wherein the step S4 is specifically performed by: computing a cost matrix from the motion features obtained through Kalman filtering and the 128-dimensional appearance feature vectors extracted by the small CNN, and realizing cascade matching of the prediction frames and detection frames with the Hungarian algorithm.
6. The street view traffic information collection method based on computer vision as claimed in claim 1, wherein the step S5 is specifically performed by: and storing a 'category-tracking sequence number' pair by using a queue, checking whether the detected object exists in the data stored in the queue after the tracking sequence number is matched for the detection frame, and if the object appears in the video sequence for the first time, intercepting a small image of the frame of the object by using opencv and storing the small image to a specified position.
CN202111026783.7A 2021-09-02 2021-09-02 Streetscape flow information acquisition method based on computer vision Pending CN113706584A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111026783.7A CN113706584A (en) 2021-09-02 2021-09-02 Streetscape flow information acquisition method based on computer vision

Publications (1)

Publication Number Publication Date
CN113706584A true CN113706584A (en) 2021-11-26

Family

ID=78657419

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111026783.7A Pending CN113706584A (en) 2021-09-02 2021-09-02 Streetscape flow information acquisition method based on computer vision

Country Status (1)

Country Link
CN (1) CN113706584A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114445393A (en) * 2022-02-07 2022-05-06 无锡雪浪数制科技有限公司 Bolt assembly process detection method based on multi-vision sensor
CN114724359A (en) * 2022-03-07 2022-07-08 重庆亲禾智千科技有限公司 Deepstream-based road event detection method
CN116563769A (en) * 2023-07-07 2023-08-08 南昌工程学院 Video target identification tracking method, system, computer and storage medium
CN116563769B (en) * 2023-07-07 2023-10-20 南昌工程学院 Video target identification tracking method, system, computer and storage medium

Similar Documents

Publication Publication Date Title
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN113706584A (en) Streetscape flow information acquisition method based on computer vision
Srivatsa et al. Salient object detection via objectness measure
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113763427B (en) Multi-target tracking method based on coarse-to-fine shielding processing
CN111079739A (en) Multi-scale attention feature detection method
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN115841649A (en) Multi-scale people counting method for urban complex scene
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN110084284A (en) Target detection and secondary classification algorithm and device based on region convolutional neural networks
CN111553337A (en) Hyperspectral multi-target detection method based on improved anchor frame
CN115393635A (en) Infrared small target detection method based on super-pixel segmentation and data enhancement
CN110738076A (en) People counting method and system in images
CN101610412B (en) Visual tracking method based on multi-cue fusion
LU500512B1 (en) Crowd distribution form detection method based on unmanned aerial vehicle and artificial intelligence
CN113408550B (en) Intelligent weighing management system based on image processing
CN114463724A (en) Lane extraction and recognition method based on machine vision
CN114422720A (en) Video concentration method, system, device and storage medium
CN114155278A (en) Target tracking and related model training method, related device, equipment and medium
CN112926426A (en) Ship identification method, system, equipment and storage medium based on monitoring video
CN112668662A (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN116052090A (en) Image quality evaluation method, model training method, device, equipment and medium
CN113723181B (en) Unmanned aerial vehicle aerial photographing target detection method and device
CN115482256A (en) Lightweight target detection and automatic tracking method based on semantic segmentation
CN115512263A (en) Dynamic visual monitoring method and device for falling object

Legal Events

Date Code Title Description
PB01 Publication