CN112487920B - Convolution neural network-based crossing behavior identification method - Google Patents

Convolution neural network-based crossing behavior identification method

Info

Publication number
CN112487920B
CN112487920B (application CN202011338744.6A)
Authority
CN
China
Prior art keywords
frame
target
network
video
behavior
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011338744.6A
Other languages
Chinese (zh)
Other versions
CN112487920A (en)
Inventor
詹瑾瑜
周巧瑜
江维
范翥峰
周星志
孙若旭
温翔宇
宋子微
廖炘可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202011338744.6A
Publication of CN112487920A
Application granted
Publication of CN112487920B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a crossing behavior identification method based on a convolutional neural network, applied to the field of target identification, and aims at the problem of low detection precision when recognizing pedestrians climbing over a railing in the prior art. By drawing a bounding box of the same size as the person, the invention overcomes the drawbacks of traditional target detection methods, namely poor real-time performance and bounding boxes whose size cannot be enlarged or reduced to fit the target. A Yolo target detection network is adopted to predict the category of image features, and a GOTURN network is adopted to track the target. Finally, a priori-knowledge method quickly judges, from the relative position relation between the railing and the set of track points, whether the behavior is a crossing behavior; if so, a crossing label is output and a warning is issued.

Description

Convolution neural network-based crossing behavior identification method
Technical Field
The invention belongs to the field of target detection, and particularly relates to a behavior recognition technology.
Background
For target scenes containing multiple object categories, target detection aims to accurately determine the category and position of each target in an image, and two-stage methods can address this problem. Researchers mainly generate candidate boxes with a Region Proposal method and then perform coordinate regression prediction on those candidates. Ross Girshick et al. adopted a CNN to extract image features, moving feature representation from the experience-driven hand-crafted paradigms of HOG and SIFT to data-driven representation learning and thereby improving how well the features represent the samples; by supervised pre-training on large sample sets followed by fine-tuning on small sample sets, they mitigated the difficulty of training on small samples, including overfitting, and improved target detection accuracy to a certain extent. Ross Girshick et al. then proposed Fast R-CNN, a fast region-proposal-based convolutional network method for target detection. Fast R-CNN builds on the earlier work with deep convolutional networks and classifies objects more efficiently; compared with the previous work it introduces several innovations that improve detection precision as well as training and testing speed.
In general, two-stage methods have high network complexity and low processing speed, so their real-time performance is limited and people in surveillance video cannot be predicted in real time; one-stage methods address this shortcoming. In the one-stage approach, researchers obtain coordinate predictions by performing regression directly. Joseph Redmon et al. proposed a novel target detection method, Yolo, whose core idea is to take the whole image as the network input and directly regress the positions of bounding boxes and their categories at the output layer. Yolo is much faster than two-stage methods; the basic Yolo model processes images in real time at 45 frames per second. Wei Liu et al. proposed SSD, a method for detecting objects in an image with a single deep neural network. SSD combines the regression idea of Yolo with the anchor-box mechanism of Faster R-CNN and, compared with earlier methods, makes two main improvements: first, it extracts feature maps at different scales for detection, using large-scale feature maps (closer to the input) to detect small objects and small-scale feature maps (closer to the output) to detect large objects; second, it employs prior boxes with different scales and aspect ratios. Moreover, SSD is trained end to end and achieves higher precision. For the various one-stage convolutional neural network methods, researchers have improved the network models starting from speed, raising target detection accuracy while preserving real-time performance.
Visual target tracking refers to detecting, extracting, identifying and tracking a moving target in an image sequence to obtain its motion parameters, such as position, velocity, acceleration and motion trajectory, so that further processing and analysis can be performed, the behavior of the moving target can be understood, and higher-level detection tasks can be completed.
Researchers in the field of target tracking divide tracking algorithms into generative and discriminative methods. Generative methods describe the appearance of the target with a feature model and then confirm the target by minimizing the reconstruction error between the tracked target and candidate targets; they focus on feature extraction of the target, ignore the background information around it, and are prone to target drift or target loss when the target's appearance changes drastically or the target is occluded. Discriminative methods treat target tracking as a binary classification problem and determine the target among candidates by training classifiers on the target and the background. Most current deep-learning-based target tracking algorithms belong to the discriminative family.
Against this background, it has become a mainstream trend to first draw person bounding boxes with a target detection method and then locate the persons frame by frame with a target tracking method, thereby making a preliminary judgement of each person's motion trajectory. Crossing behavior usually occurs on roads, where pedestrians climb over a railing to cross the road, or around residential communities and school fences, where pedestrians climb over the railing to enter and exit. With deep-learning target detection and target tracking, whether a pedestrian is crossing a railing can be predicted more quickly and a warning can be issued in time, which protects pedestrian safety, standardizes the traffic system and improves community management.
Disclosure of Invention
In order to solve the technical problem, the invention discloses a crossing behavior identification method based on a convolutional neural network.
The technical scheme adopted by the invention is as follows: a crossing behavior identification method based on a convolutional neural network comprises the following steps:
S1, processing the video data: screening and cutting, namely screening out videos containing railing-crossing behavior and other behaviors near the railing, and cutting the videos into video-frame pictures;
S2, detecting the video frames obtained in step S1 with a Yolo target detection network; specifically: the Yolo target detection network comprises at least three parts: a Backbone part, a Neck part and a Head part; image features are aggregated and formed by the Backbone part, combined and transmitted to the prediction layer by the Neck part, and predicted by the Head part, which generates bounding boxes and predicts categories;
S3, transmitting the bounding box, the prediction category person and the current video frame to a GOTURN network for target tracking; the bounding-box coordinates from the current frame and the target from the previous frame are input to the GOTURN network, which learns to compare them and find the target object in the current image, drawing track points frame by frame and connecting them into a trajectory line;
S4, judging, with prior knowledge, whether the behavior is a railing-crossing behavior from the relative position relation between the set of track points and the railing position.
Further, the step S1 includes the following sub-steps:
S11, searching for and downloading a plurality of video data sets containing various person actions;
S12, screening, from the video data sets, person videos containing railing-crossing behavior and videos of other objects similar in shape to a person;
S13, cutting the screened videos, continuously cutting each video into video frames at 25 fps, obtaining a series of continuous video frames and storing them.
Further, aggregating and forming image features by the Backbone part is specifically: the input video frames are aggregated by a CSPResNeXt50 neural network to form image features, realizing image feature extraction.
Further, combining the image features and transmitting them to the prediction layer by the Neck part is specifically: the image features are combined through an SPP block and PANet and transmitted to the prediction layer.
Further, the prediction categories include at least the category person.
Further, the step S3 includes the following sub-steps:
S31, transmitting the bounding box generated in step S24 with prediction category person, together with the current video frame, to the GOTURN network;
S32, cropping the current video frame with the bounding box to obtain a central area containing the target, and cropping the previous video frame to obtain a search area containing the target;
S33, passing the previous-frame target and the current-frame search area obtained in S32 through the CNN convolutional layers at the same time, then passing the output of the convolutional layers through the fully connected layers to regress the position of the bounding box of the current-frame target, and drawing the center point of the current-frame coordinate box as a track point for subsequent trajectory analysis;
S34, repeating steps S31 to S33 until all video frames have entered the GOTURN network.
Further, the step S4 includes the following sub-steps:
S41, recording the track points generated in step S33 to generate a set of track points;
S42, manually marking the position coordinates of the railing line for subsequent trajectory analysis;
S43, judging, with prior knowledge, whether the trajectory is a crossing behavior from the relative position relation between the set of track points and the railing position; if so, outputting a cross label, otherwise outputting no cross label.
The invention has the following beneficial effects: the one-stage target detection method can detect persons in real time and accurately draw bounding boxes of the same size as the person, overcoming the drawbacks of traditional target detection methods, namely poor real-time performance and bounding boxes whose size cannot adapt to the target. Meanwhile, the discriminative target tracking method has a simple network structure and tracks persons quickly and accurately, so track points can be effectively drawn for subsequent video frames; finally, a priori-knowledge method quickly judges whether the behavior is a crossing behavior from the relative position relation between the railing and the set of track points.
Drawings
FIG. 1 is a flow diagram of the convolutional neural network-based crossing behavior identification method of the present invention;
FIG. 2 is an overall design diagram of the convolutional neural network-based crossing behavior identification method of the present invention;
FIG. 3 is a diagram of the one-stage network architecture according to the present invention;
FIG. 4 is a diagram of the architecture of the GOTURN network according to the present invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
As shown in fig. 1, the method for identifying a crossing behavior based on a convolutional neural network of the present invention includes the following steps:
S1, processing the video data: screening and cutting, namely screening out videos containing crossing behavior and some other behaviors, and cutting the videos into video-frame pictures; as shown in fig. 2, this step specifically includes the following sub-steps:
S11, searching for and downloading a plurality of video data sets containing various person actions for subsequent work. Because video data sets require a large amount of storage, the number of downloaded data sets is limited.
S12, screening, from the video data sets, person videos containing crossing behavior and videos of other objects similar in shape to a person. According to the action classification in the video data sets, videos in which persons perform crossing behavior, videos in which persons perform other behaviors, and the like can be screened out.
S13, cutting the screened videos, continuously cutting each video into video frames at 25 fps, obtaining a series of continuous video frames and storing them, as sketched below.
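Although the patent gives no code for step S13, the 25 fps frame-cutting step can be illustrated with a short OpenCV sketch; the file paths, output naming and the frame-skipping strategy are illustrative assumptions rather than part of the invention.

```python
# Hedged sketch: extracting frames from a screened video at roughly 25 fps with OpenCV.
import os
import cv2

def video_to_frames(video_path: str, out_dir: str, target_fps: float = 25.0) -> int:
    """Decode a video, save frames at about target_fps, and return the number saved."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps   # fall back if fps is unknown
    step = max(int(round(src_fps / target_fps)), 1)      # keep every step-th frame
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```

For a source video already recorded at 25 fps, step is 1 and every decoded frame is kept, which matches the continuous cutting described in step S13.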
In general, video cropping (segmentation) methods are mainly divided into time-domain-based video object segmentation, motion-based video object segmentation and interactive video object segmentation. Time-domain segmentation mainly exploits the continuity and correlation between adjacent video images: one specific approach obtains a difference image by subtracting a background frame from the current frame, and another obtains the difference image from the difference between two or more frames. Motion-based video object segmentation mainly estimates motion parameters with methods such as the optical flow field, solves for the pixel regions that fit the motion model, and then merges these regions into a moving object to segment the video. In interactive segmentation, the user first segments a video image through a graphical user interface, and subsequent frames are then segmented with motion-based and spatial information. A frame-difference sketch of the time-domain idea follows.
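As a minimal illustration of the time-domain (frame-difference) idea mentioned above, the following sketch thresholds the absolute difference between two consecutive frames to obtain a moving-object mask; the threshold value is an arbitrary assumption.

```python
# Hedged sketch of frame-difference segmentation: pixels that change between
# consecutive frames are marked as moving.
import cv2

def motion_mask(prev_frame, curr_frame, thresh: int = 25):
    """Return a binary mask of pixels that changed between two BGR frames."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return mask
```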
S2, passing the video frames through the Yolo target detection network: image features are aggregated and formed by the Backbone, combined and transmitted to the prediction layer by the Neck, and predicted by the Head, which generates bounding boxes and predicts categories; as shown in fig. 3, this step specifically includes the following sub-steps:
S21, inputting the processed video frames;
S22, Backbone part: the input video frames are aggregated through the CSPResNeXt50 neural network to form image features, realizing image feature extraction;
S23, Neck part: the image features are combined through the SPP block and PANet and transmitted to the prediction layer;
S24, Head part: the image features of the prediction layer are predicted by the Head, bounding boxes are generated and a category is predicted; if the prediction category is person, the process proceeds to step S3, otherwise it returns to step S21. A sketch of this control flow is given below.
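The decision made in step S24 can be sketched as follows; run_yolo is a hypothetical placeholder for the Backbone/Neck/Head network described above, not an API defined by the patent, and the detection tuple layout is an assumption made purely for illustration.

```python
# Hedged sketch of step S24's control flow: keep only confident "person"
# detections and hand them to the tracker.
from typing import List, Tuple

Detection = Tuple[float, float, float, float, str, float]  # x, y, w, h, label, confidence

def select_person_boxes(detections: List[Detection],
                        conf_thresh: float = 0.5) -> List[Detection]:
    """Filter raw detections down to confident 'person' boxes (step S24)."""
    return [d for d in detections if d[4] == "person" and d[5] >= conf_thresh]

# Outline of the per-frame loop (run_yolo is hypothetical):
#   detections = run_yolo(frame)               # Backbone -> Neck -> Head
#   persons = select_person_boxes(detections)
#   if persons: pass frame and person boxes to the GOTURN tracker (step S3)
#   else:       read the next frame (back to step S21)
```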
The Yolo detection network comprises 24 convolutional layers and 2 fully connected layers and borrows the GoogLeNet classification network structure. Unlike GoogLeNet, Yolo does not use inception modules; instead it simply uses 1×1 convolutional layers (for cross-channel information integration) followed by 3×3 convolutional layers. The fully connected output layer of Yolo divides the input image into S×S grid cells, each of which is responsible for detecting objects that fall into it; S denotes the number of cells per side, e.g. when S = 7, S×S means the image is divided into 7×7 cells, 7 in the horizontal direction and 7 in the vertical direction. If the center of an object falls into a certain cell, that cell is responsible for detecting the object. Each cell outputs B bounding boxes and C probabilities that the object belongs to each class. Each bounding box carries 5 values: x, y, w, h and confidence, where x and y are the coordinates of the center of the bounding box predicted by the current cell, and w and h are its width and height. Note that in actual training, w and h are normalized to the interval [0, 1] by the image width and height, and x and y are offsets of the bounding-box center relative to the current cell position, likewise normalized to [0, 1]. The confidence reflects whether the current bounding box contains an object and how accurate its position is, and is computed as follows:
confidence=P(object)*IOU
If the bounding box contains an object, P(object) = 1; otherwise P(object) = 0. IOU (intersection over union) is the overlap ratio between the predicted bounding box and the ground-truth box.
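For clarity, the confidence term defined above can be computed as in the following sketch; boxes are assumed to be given as (x_center, y_center, w, h) in a common coordinate frame, an assumption consistent with the text.

```python
# Minimal sketch of confidence = P(object) * IOU for center-format boxes.
def iou(box_a, box_b) -> float:
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # convert center format to corner coordinates
    ax1, ay1, ax2, ay2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    bx1, by1, bx2, by2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def confidence(p_object: float, pred_box, gt_box) -> float:
    """confidence = P(object) * IOU, as defined above."""
    return p_object * iou(pred_box, gt_box)
```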
Yolo optimizes the model parameters with a sum-of-squared-errors loss, i.e. the sum of squared errors between the S×S×(B×5+C)-dimensional vector output by the detection network and the corresponding S×S×(B×5+C)-dimensional vector of the real image:
loss = Σ_{i=0}^{S×S} (coordError + iouError + classError)
where coordError, iouError and classError represent the coordinate error, IOU error and classification error, respectively, between the predicted data and the labeled data.
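A simplified sketch of this loss decomposition is given below; the output layout (boxes grouped as (x, y, w, h, confidence) per cell, class probabilities stored separately) and the omission of the responsibility masks and weighting factors used in the original Yolo loss are simplifying assumptions.

```python
# Hedged sketch: sum-of-squared-errors loss split into coordinate, IOU
# (confidence) and classification terms over the S x S grid.
import numpy as np

def yolo_loss(pred_boxes, tgt_boxes, pred_cls, tgt_cls) -> float:
    """pred_boxes/tgt_boxes: (S, S, B, 5) arrays of (x, y, w, h, confidence);
    pred_cls/tgt_cls: (S, S, C) class-probability arrays."""
    coord_err = np.sum((pred_boxes[..., :4] - tgt_boxes[..., :4]) ** 2)  # x, y, w, h
    iou_err   = np.sum((pred_boxes[..., 4]  - tgt_boxes[..., 4])  ** 2)  # confidences
    class_err = np.sum((pred_cls - tgt_cls) ** 2)                        # class probabilities
    return float(coord_err + iou_err + class_err)
```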
S3, transmitting the bounding box with prediction category person, together with the current video frame, to the GOTURN network for target tracking; the bounding-box coordinates from the current frame and the target from the previous frame are input to the network, which learns to compare them and find the target object in the current image, drawing track points frame by frame and connecting them into a trajectory line; as shown in fig. 4, this step specifically includes the following sub-steps:
S31, transmitting the bounding box generated in step S24 with prediction category person, together with the current video frame, to the GOTURN network;
S32, cropping the current video frame with the bounding box to obtain a central area containing the target, and cropping the previous video frame to obtain a search area containing the target;
S33, passing the previous-frame target and the current-frame search area obtained in S32 through the CNN convolutional layers at the same time, then passing the output of the convolutional layers through the fully connected layers to regress the position of the bounding box of the current-frame target, and drawing the center point of the current-frame coordinate box as a track point for subsequent trajectory analysis;
the convolution layer of the GOTURN network adopts a 5-layer structure, the structure refers to a structure in CaffeNet, excitation functions of the convolution layers all adopt relu excitation functions, a pooling layer is added behind part of the convolution layers, a full connection layer is composed of 3 layers, 4096 nodes are arranged in each layer, and dropout and relu excitation functions are adopted among the layers to prevent overfitting and gradient disappearance. And simultaneously passing the target of the previous frame and the search area of the current frame through the CNN convolution layer, and then passing the output of the convolution layer through the full-connection layer for returning the position of the target of the current frame.
The loss function is L1-loss, expressed as follows:
L = Σ_{i=1}^{n} |y_i - d_i|
where n denotes the total number of predicted targets, y_i denotes the actual network output, and d_i denotes the ground-truth label.
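The reconstructed L1 loss above corresponds to the following one-line sketch; summing, rather than averaging, over the n predicted values is an assumption consistent with the formula as written.

```python
# Minimal sketch of the L1 loss between network outputs y and labels d.
import torch

def l1_loss(y: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Sum of absolute differences |y_i - d_i| over all predicted values."""
    return torch.sum(torch.abs(y - d))
```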
S34, repeating steps S31 to S33 until all video frames have entered the GOTURN network.
S4, judging, with prior knowledge, whether the behavior is a crossing behavior from the relative position relation between the set of track points and the railing position. As shown in fig. 2, this step specifically includes the following sub-steps:
S41, recording the track points generated in step S33 to generate a set of track points;
S42, manually marking the position coordinates of the railing line for subsequent trajectory analysis;
S43, judging, with prior knowledge, whether the trajectory is a crossing behavior from the relative position relation between the set of track points and the railing position; if so, outputting a cross label, otherwise outputting no cross label. A sketch of one way to realize this rule is given below.
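One simple way to realize the prior-knowledge rule of steps S41 to S43 is a side-change test against the annotated railing line, as sketched below; treating the railing as a single straight line and using a signed-area test are illustrative assumptions about how the relative-position rule can be implemented.

```python
# Hedged sketch of the crossing judgement: output the "cross" label when
# consecutive track points fall on opposite sides of the railing line.
from typing import List, Tuple

Point = Tuple[float, float]

def side_of_line(p: Point, a: Point, b: Point) -> float:
    """Signed area: > 0 if p lies left of the railing line a->b, < 0 if right."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def is_crossing(track: List[Point], rail_a: Point, rail_b: Point) -> bool:
    """Return True (i.e. output the 'cross' label) if the track changes sides."""
    sides = [side_of_line(p, rail_a, rail_b) for p in track]
    return any(s1 * s2 < 0 for s1, s2 in zip(sides, sides[1:]))
```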
The English terms in fig. 4 mean: Current frame is the current frame, Previous frame is the previous frame, Search Region is the search region, What to track is the target to be tracked, Conv Layers are the convolutional layers, Fully-Connected Layers are the fully connected layers, and Predicted location of target within search region is the predicted position of the target in the search region.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (7)

1. A crossing behavior identification method based on a convolutional neural network, characterized by comprising the following steps:
S1, processing the video data: screening and cutting, namely screening out videos containing railing-crossing behavior and other behaviors near the railing, and cutting the videos into video-frame pictures;
S2, detecting the video frames obtained in step S1 with a Yolo target detection network; specifically: the Yolo target detection network comprises at least three parts: a Backbone part, a Neck part and a Head part; image features are aggregated and formed by the Backbone part, combined and transmitted to the prediction layer by the Neck part, and predicted by the Head part, which generates bounding boxes and predicts categories;
S3, transmitting the bounding box, the prediction category person and the current video frame to a GOTURN network for target tracking; the bounding-box coordinates from the current frame and the target from the previous frame are input to the GOTURN network, which learns to compare them and find the target object in the current image, drawing track points frame by frame and connecting them into a trajectory line;
S4, judging, with prior knowledge, whether the behavior is a railing-crossing behavior from the relative position relation between the set of track points and the railing position.
2. The crossing behavior identification method based on a convolutional neural network as claimed in claim 1, wherein the step S1 comprises the following sub-steps:
S11, searching for and downloading a plurality of video data sets containing various person actions;
S12, screening, from the video data sets, person videos containing railing-crossing behavior and videos of other objects similar in shape to a person;
S13, cutting the screened videos, continuously cutting each video into video frames at 25 fps, obtaining a series of continuous video frames and storing them.
3. The crossing behavior identification method based on a convolutional neural network as claimed in claim 1, wherein aggregating and forming image features by the Backbone part is specifically: the input video frames are aggregated by a CSPResNeXt50 neural network to form image features, realizing image feature extraction.
4. The crossing behavior identification method based on a convolutional neural network as claimed in claim 3, wherein combining the image features and transmitting them to the prediction layer by the Neck part is specifically: the image features are combined through an SPP block and PANet and transmitted to the prediction layer.
5. The crossing behavior identification method based on a convolutional neural network as claimed in claim 4, wherein the prediction categories include at least the category person.
6. The crossing behavior identification method based on a convolutional neural network as claimed in claim 1, wherein the step S3 comprises the following sub-steps:
S31, transmitting the bounding box generated in step S24 with prediction category person, together with the current video frame, to the GOTURN network;
S32, cropping the current video frame with the bounding box to obtain a central area containing the target, and cropping the previous video frame to obtain a search area containing the target;
S33, passing the previous-frame target and the current-frame search area obtained in S32 through the CNN convolutional layers at the same time, then passing the output of the convolutional layers through the fully connected layers to regress the position of the bounding box of the current-frame target, and drawing the center point of the current-frame coordinate box as a track point for subsequent trajectory analysis;
S34, repeating steps S31 to S33 until all video frames have entered the GOTURN network.
7. The crossing behavior identification method based on a convolutional neural network as claimed in claim 6, wherein the step S4 comprises the following sub-steps:
S41, recording the track points generated in step S33 to generate a set of track points;
S42, manually marking the position coordinates of the railing line for subsequent trajectory analysis;
S43, judging, with prior knowledge, whether the trajectory is a crossing behavior from the relative position relation between the set of track points and the railing position; if so, outputting a cross label, otherwise outputting no cross label.
CN202011338744.6A 2020-11-25 2020-11-25 Convolution neural network-based crossing behavior identification method Active CN112487920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011338744.6A CN112487920B (en) 2020-11-25 2020-11-25 Convolution neural network-based crossing behavior identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011338744.6A CN112487920B (en) 2020-11-25 2020-11-25 Convolution neural network-based crossing behavior identification method

Publications (2)

Publication Number Publication Date
CN112487920A CN112487920A (en) 2021-03-12
CN112487920B (en) 2022-03-15

Family

ID=74934564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011338744.6A Active CN112487920B (en) 2020-11-25 2020-11-25 Convolution neural network-based crossing behavior identification method

Country Status (1)

Country Link
CN (1) CN112487920B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808162B (en) * 2021-08-26 2024-01-23 中国人民解放军军事科学院军事医学研究院 Target tracking method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109035304A (en) * 2018-08-07 2018-12-18 北京清瑞维航技术发展有限公司 Method for tracking target, calculates equipment and device at medium
CN109887281A (en) * 2019-03-01 2019-06-14 北京云星宇交通科技股份有限公司 A kind of method and system monitoring traffic events
CN110781806A (en) * 2019-10-23 2020-02-11 浙江工业大学 Pedestrian detection tracking method based on YOLO

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11030476B2 (en) * 2018-11-29 2021-06-08 Element Ai Inc. System and method for detecting and tracking objects

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109035304A (en) * 2018-08-07 2018-12-18 北京清瑞维航技术发展有限公司 Method for tracking target, calculates equipment and device at medium
CN109887281A (en) * 2019-03-01 2019-06-14 北京云星宇交通科技股份有限公司 A kind of method and system monitoring traffic events
CN110781806A (en) * 2019-10-23 2020-02-11 浙江工业大学 Pedestrian detection tracking method based on YOLO

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A novel yolo-based real-time people counting approach;Peiming Ren;《IEEE》;20171102;第1-6页 *
Recurrent yolo and LSTM-based IR single pedestrian tracking;Sungmin Yun;《IEEE》;20200130;第1-7页 *
Detection algorithm for person climbing-over behavior in perimeter video surveillance; 张泰; China Master's Theses Full-text Database; 2018-06-30; I136-781 *
Research and implementation of an abnormal behavior recognition system for tourist attractions; 周巧瑜; China Master's Theses Full-text Database; 2022-01-31; I138-2756 *

Also Published As

Publication number Publication date
CN112487920A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
Wang et al. Weakly supervised adversarial domain adaptation for semantic segmentation in urban scenes
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
Sirohi et al. Efficientlps: Efficient lidar panoptic segmentation
CN109766830A (en) A kind of ship seakeeping system and method based on artificial intelligence image procossing
Andrews Sobral et al. Highway traffic congestion classification using holistic properties
Li et al. A method of cross-layer fusion multi-object detection and recognition based on improved faster R-CNN model in complex traffic environment
CN109902806A (en) Method is determined based on the noise image object boundary frame of convolutional neural networks
CN108171112A (en) Vehicle identification and tracking based on convolutional neural networks
Yao et al. When, where, and what? A new dataset for anomaly detection in driving videos
CN108304798A (en) The event video detecting method of order in the street based on deep learning and Movement consistency
Rasouli et al. Multi-modal hybrid architecture for pedestrian action prediction
CN110569843B (en) Intelligent detection and identification method for mine target
CN105809718B (en) A kind of method for tracing object of track entropy minimization
CN114155527A (en) Scene text recognition method and device
CN108108688B (en) Limb conflict behavior detection method based on low-dimensional space-time feature extraction and topic modeling
Varior et al. Multi-scale attention network for crowd counting
Dewangan et al. Towards the design of vision-based intelligent vehicle system: methodologies and challenges
CN109242019A (en) A kind of water surface optics Small object quickly detects and tracking
Farahnakian et al. Object detection based on multi-sensor proposal fusion in maritime environment
Al-Heety Moving vehicle detection from video sequences for traffic surveillance system
Azimjonov et al. Vision-based vehicle tracking on highway traffic using bounding-box features to extract statistical information
CN112487920B (en) Convolution neural network-based crossing behavior identification method
Bourja et al. Real time vehicle detection, tracking, and inter-vehicle distance estimation based on stereovision and deep learning using YOLOv3
CN113657414A (en) Object identification method
CN111275733A (en) Method for realizing rapid tracking processing of multiple ships based on deep learning target detection technology

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant