CN112132089A

CN112132089A - Excavator behavior analysis method based on 3D convolution and optical flow

Info

Publication number: CN112132089A
Application number: CN202011054904.4A
Authority: CN
Inventors: 张钦海; 米松; 王思俊; 魏云; 曹帅兵
Original assignee: Tianjin Tiandi Weiye Intelligent Security Technology Co ltd
Current assignee: Tianjin Tiandi Weiye Intelligent Security Technology Co ltd
Priority date: 2020-09-28
Filing date: 2020-09-28
Publication date: 2020-12-25

Abstract

The invention provides an excavator behavior analysis method based on 3D convolution and optical flow, which comprises the following steps: s1, firstly, acquiring a data set, and performing enhanced processing on the data in the data set; s2, training a sample by using a caffe, and using a python layer as a data input layer; s3, building a network architecture of target classification based on resnet18-3D deep learning; and S4, configuring parameters of the training model and training the detection model. The excavator behavior analysis method based on the 3D convolution and the optical flow is combined with a deep learning method, the accuracy rate is improved by utilizing the time sequence of the video, and great application value can be generated.

Description

Excavator behavior analysis method based on 3D convolution and optical flow

Technical Field

The invention belongs to the technical field of video monitoring analysis, and particularly relates to an excavator behavior analysis method based on 3D convolution and optical flow.

Background

At present, a monitoring camera is visible everywhere, but is artificially monitored in the monitoring process, so that a lot of manpower, material resources and financial resources are wasted, and the efficiency is low. Secondly, recent deep learning compares fire and heat, but the feasibility technology for video behavior analysis is few and few. The existing technology for analyzing the current video has the following defects:

1) most of the current production and life use traditional video monitoring systems, the video results are artificially stored and played back, and sometimes alarms cannot be generated in real time.

2) Research aiming at the traditional video analysis method is based on artificial feature extraction, wherein the process of feature extraction is complicated, and particularly the video analysis in a complex scene is difficult to realize.

3) In the research aiming at video analysis deep learning, currently, judgment is mostly made aiming at a single-frame picture in a video, time dimension in the video is difficult to fuse, and a behavior analysis technology mainly aims at analysis on actions such as human-to-human, human-to-object and the like. The result of the behavior is difficult to judge only by a single picture, resulting in low accuracy. And the accuracy rate is effectively improved by utilizing the time information in the video. In addition, an effective and fast classification method is provided for the limitation of the current computer hardware. And the running speed is improved while the accuracy is ensured.

4) Although the speed is high, the accuracy is not high, the training time is long, and convergence is difficult. TSN networks, while highly accurate, are slow. The current popular 3D convolution is fast, integrates the space and time information of the video, and balances the speed and the accuracy.

Disclosure of Invention

In view of the above, in order to overcome the above-mentioned drawbacks, the present invention aims to provide a method for analyzing excavator behavior based on 3D convolution and optical flow,

in order to achieve the purpose, the technical scheme of the invention is realized as follows:

an excavator behavior analysis method based on 3D convolution and optical flow comprises the following steps:

s1, firstly, acquiring a data set, and performing enhanced processing on the data in the data set;

s2, training a sample by using a caffe, and using a python layer as a data input layer;

s3, building a network architecture of target classification based on resnet18-3D deep learning;

and S4, configuring parameters of the training model and training the detection model.

Further, in step S1, the data set is obtained from the monitoring device, and the data set is obtained from the UCF101 and the HMDB 51.

Further, the data enhancement processing in step S1 includes brightness processing of the image, sharpening processing of the image, random salt and pepper processing, and fine adjustment of the rotation angle of the image ([ -10 °, 10 ° ]).

Further, the specific method of step S3 is as follows:

the data sets are sent to two branches, one 3D network receiving RGB format and one 3D network receiving optical flow;

the basic backbone network adopts a resnet18-3D residual 3D network, 16 frames of pictures are input into the network each time, the size of a convolution kernel is [ N multiplied by H multiplied by W multiplied by C multiplied by T ], wherein T is a time channel, and the time sequence of 16 pictures is subjected to fusion convolution;

the final loss function is the sum of the losses of the two networks, and then the total loss is sent to a SoftmaxWithLoss layer for classification.

Further, the specific method of step S4 is as follows:

the learning rate of the network is set to 0.01 at the beginning, the learning rate decreases by 0.1 per 10000 iterations, and 40000 iterations are performed totally; the Dropout values are set to 0.7 and 0.9, respectively; and selecting an SGD algorithm by an optimization function, checking the Loss value and the accuracy of the test set in the training process, and obtaining a proper classification model.

Further, the loss value is calculated as follows:

the loss in the training process is divided into two, one is a loss function of a space network, the other is a loss function of an optical flow network, and the loss function utilizes the cross entropy loss of standard classification;

Loss_{general assembly}＝Loss_RGB+Loss_{Optical flow}

Wherein, in Loss_{Optical flow}Wherein C is the number of classes, N is the number of samples, P is_iIs the probability value, P, of sample i_i ^*Is the probability value of the true sample i, G_iScore value, y, for each category_iIs a predicted value of the ith category, Loss_{Optical flow}Loss value of optical flow, Loss_RGBFor RGB loss values, the final computed score probability value is calculated as follows, where k is the number of classes:

y_t＝argmax(output(x_j))

wherein, [ x ]₁、x₂...x_i]Solving a maximum function for the eigenvalues of each class argmax ();

the final output result of the network is the sum x of the scores of the space network and the time network_iThen solving probability values of all categories through a SoftMax layer, wherein the maximum probability is the final prediction result y_t。

Compared with the prior art, the excavator behavior analysis method based on the 3D convolution and the optical flow has the following advantages:

the excavator behavior analysis method based on the 3D convolution and the optical flow solves the problem that most of the existing video behavior analysis is manual monitoring or the traditional method is used for feature extraction. Not only is manpower wasted, but also the accuracy of the traditional method is low. If a deep learning method is combined, the accuracy is improved by utilizing the time sequence of the video, great application value can be generated, and the benefits are as follows:

1) the application field is wide. Behavioral information for many apparent actions can be monitored.

2) Aiming at the problem of cost, the camera is used for acquiring video information and recognizing video content through deep learning, extra manpower is not needed, and manpower, material resources and financial resources are greatly saved.

3) By applying the deep learning method, the convolution network has accurate extraction characteristics, high recognition rate, high 3D convolution speed and fusion time sequence, and the optical flow information ensures that the result is more accurate.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

fig. 1 is a schematic view of an application scenario of a monitoring device according to an embodiment of the present invention;

FIG. 2 is a network structure diagram of an analysis method according to an embodiment of the present invention;

fig. 3 is a flowchart of an analysis method according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.

In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art through specific situations.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

The method is suitable for most scenes of monitoring equipment installation, and can automatically switch parameter configuration of the monitoring equipment according to real-time monitored image information.

The method can be applied to various fields, such as a campus corner of a school, places where few people go and go, and fighting events can be monitored. For example, the cable safety monitoring device can be applied to outdoor cable safety monitoring, and the cable is prevented from being damaged or stolen by misoperation of an excavator. Such as in prisons to prevent fighting events. For example, the early warning can be realized by monitoring the actions of falling of the old and the children.

The specific implementation method comprises the following steps:

fig. 1 is a schematic diagram of an application scenario: the indoor installation of the monitoring equipment is generally not less than 2.5 meters, and the outdoor installation is not less than 3 meters.

The camera can be placed in a crowd area or a target area and cannot be shielded.

Fig. 2 is a schematic diagram of a network structure: the network adopts a resnet18-3D structure, extracts a fixed frame number (default 16 frames) each time according to information acquired by video streaming, then respectively sends the fixed frame number to different networks for classification, and finally judges a classification result.

FIG. 3 is an algorithm flow chart: the algorithm combines 3D and optical flow network processing videos, 3D extracts spatial information and time information in the videos, then optical flow extracts somatic action information, and the two are fused, so that the result is greatly improved.

(i.) first, the dataset is obtained, the actual dataset is obtained from the monitoring device, and the data is obtained from an open source dataset on the network, such as UCF101, HMDB51, etc. And then carrying out data enhancement processing on the acquired data set, wherein the data enhancement processing comprises brightness processing of an image, sharpening processing of the image, random salt and pepper processing and fine adjustment of the rotation angle ([ -10 degrees, 10 degrees ]) of the image. Training a sample: the test sample was 8: 1. And (4) utilizing the cafe training and using the python layer as a data input layer.

(ii.) build a network architecture based on the object classification of resnet18-3D deep learning, the data sets are fed into two branches, one receiving the 3D network in RGB format and one receiving the 3D network of optical flow. The basic backbone network adopts a resnet18-3D residual 3D network, the network inputs 16 frames of pictures at each time, the size of a convolution kernel is [ N multiplied by H multiplied by W multiplied by C multiplied by T ], wherein T is a time channel, and the time sequence of 16 pictures is subjected to fusion convolution. The final loss function is the sum of the losses of the two networks, and then the total loss is sent to a SoftmaxWithLoss layer for classification.

(iii.) parameters of the training model were configured, the learning rate of the net was initially set to 0.01, with 10000 learning rate drops by 0.1 per iteration, 40000 total iterations. The Dropout values are set to 0.7 (spatial network) and 0.9 (temporal network), respectively. And (3) selecting an SGD (random gradient descent) algorithm by an optimization function, checking the Loss value and the accuracy of the test set in the training process, and obtaining a proper classification model.

(iv) initially, 16 pictures are taken from the acquisition device.

And (v.) sending the pictures into a network, sending the pictures into a trained model according to the pictures obtained in the step (iv), wherein the pictures are divided into two branches, firstly sending 16 frames of pictures into a space network for 3D convolution to obtain a probability value, secondly extracting the light stream of the 16 frames of pictures and sending the light stream of the 16 frames of pictures into a light stream network to obtain a probability value, and then adding the probability values obtained by the two branches and sending the sum to a SoftMax layer to obtain a final classification result.

(vi.) judging the target action type according to the result obtained in (v.).

The calculation method of the loss value in the step (iii) is shown as the following formula. The loss value is calculated as follows:

Loss_{general assembly}＝Loss_RGB+Loss_{Optical flow}

y_t＝argmax(output(x_j))

the final output result of the network is the sum x of the scores of the space network and the time network_iThen solving probability values of all categories through a SoftMax layer, wherein the maximum probability isFinal predicted result y_t。

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. An excavator behavior analysis method based on 3D convolution and optical flow is characterized by comprising the following steps:

2. The excavator behavior analysis method based on 3D convolution and optical flow according to claim 1, characterized in that: in step S1, a data set is obtained from the monitoring device, and data is obtained from the UCF101 and the HMDB51 network open source data sets.

3. The excavator behavior analysis method based on 3D convolution and optical flow according to claim 1, wherein the data enhancement processing in step S1 includes brightness processing of an image, sharpening processing of an image, random salt and pepper processing, and fine adjustment of a rotation angle ([ -10 °, 10 ° ]) of an image.

4. The excavator behavior analysis method based on 3D convolution and optical flow according to claim 1, wherein the specific method of step S3 is as follows:

5. The excavator behavior analysis method based on 3D convolution and optical flow according to claim 1, wherein the specific method of step S4 is as follows:

6. The excavator behavior analysis method based on 3D convolution and optical flow according to claim 5, wherein: the loss value is calculated as follows:

Loss_{general assembly}＝Loss_RGB+Loss_{Optical flow}

Wherein, in Loss_{Optical flow}Wherein C is the number of classes, N is the number of samples, P is_iIs the probability value, P, of sample i_i ^*Is the probability value of the true sample i, G_iScore value, y, for each category_iIs a predicted value of the ith category, Loss_{Optical flow}Loss value of optical flow, Loss_RGBLoss value for RGBThe final computed score probability value is calculated as follows, where k is the number of categories:

y_t＝argmax(output(x_j))