CN112800934A - Behavior identification method and device for multi-class engineering vehicle - Google Patents


Info

Publication number
CN112800934A
CN112800934A (Application No. CN202110098578.5A)
Authority
CN
China
Prior art keywords
frame
prediction
behavior recognition
detection model
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110098578.5A
Other languages
Chinese (zh)
Other versions
CN112800934B (en)
Inventor
汪霖
李一荻
曹世闯
汪照阳
胡莎
刘成
陈晓璇
姜博
李艳艳
周延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN202110098578.5A priority Critical patent/CN112800934B/en
Publication of CN112800934A publication Critical patent/CN112800934A/en
Application granted granted Critical
Publication of CN112800934B publication Critical patent/CN112800934B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 — Pattern recognition: non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/084 — Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection
    • G06V 2201/08 — Indexing scheme relating to image or video recognition or understanding: detecting or categorising vehicles
    • Y02T 10/40 — Climate change mitigation technologies related to transportation: engine management systems


Abstract

The invention provides a behavior recognition method and device for multi-class engineering vehicles. A video to be recognized is input into a trained target detection model, which recognizes the video and outputs prediction frames containing the engineering vehicle targets in the video; each prediction frame corresponds to the position coordinates and category of the engineering vehicle target it contains. The images within the prediction frame range are then input, in the form of continuous frames, into a trained behavior recognition network, which extracts key frames of the video and recognizes the behavior of the engineering vehicle targets, yielding the categories to which their behaviors belong. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and the different behaviors of multiple engineering vehicles can be recognized in real time.

Description

Behavior identification method and device for multi-class engineering vehicle
Technical Field
The invention belongs to the technical field of video image recognition, and particularly relates to a behavior recognition method and device for multi-class engineering vehicles.
Background
In the field of video behavior recognition, existing methods fall mainly into two categories. The first is behavior recognition based on the image information of video frames, such as the two-stream method and the three-dimensional convolution method. The two-stream method feeds the optical flow map and the video frames into a Convolutional Neural Network (CNN) for joint training to obtain the behavior class; the three-dimensional convolution method adds time-dimension information to the video frame sequence and applies three-dimensional convolution directly to the sequence to obtain the behavior class. The second is skeleton-based behavior recognition, which first estimates key nodes from RGB images and then predicts behavior with a Recurrent Neural Network (RNN) or a Long Short-Term Memory network (LSTM); however, this approach is mostly suited to human behavior recognition and other scenes with a fixed skeleton.
When a video segment is input for recognition, the existing behavior recognition methods based on video frame image information can identify only one object and one action type for that object. The skeleton-based method, for its part, needs to encode a fixed skeleton structure into a vector and input it into the network for action classification, so it is difficult to apply when the motion of the object to be recognized varies greatly.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a device for identifying the behaviors of multi-class engineering vehicles. The technical problem to be solved by the invention is realized by the following technical scheme:
in a first aspect, the invention provides a method for identifying behaviors of a multi-class engineering truck, which comprises the following steps:
acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
inputting the video to be recognized into a trained target detection model so that the trained target detection model recognizes the video to be recognized and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target, and the category of the behavior of the engineering truck target in the video to be recognized is obtained;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
Optionally, the trained target detection model is obtained through the following steps:
step 1: acquiring original image data;
step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using real boxes;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
each prior frame corresponds to prior frame information, the prior frame information comprises a scale of the prior frame, and the scale comprises a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s × s grids;
each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence coefficient and c category probabilities;
step 7: inputting the prior frame information and the object center position coordinate into a preset target detection model so that the preset target detection model determines a prior frame with the maximum intersection with the real frame, adjusting parameters in the preset target detection model by using a back propagation algorithm based on the prior frame with the maximum intersection with the real frame and the confidence of the grid where the object center position is located, calculating the offset between a prediction frame and the prior frame, and outputting the prediction frame;
step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating steps 7 to 8 until a first training cutoff condition is reached;
wherein the first training cutoff condition comprises: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
step 10: and determining the preset target detection model with the loss function reaching the minimum as the trained target detection model.
Optionally, step 7 includes:
inputting the prior frame information and the object center position coordinates into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1), based on the prior frame with the largest intersection with the real frame and the confidence of the grid where the object center position is located, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x and b_y are the abscissa and ordinate of the prediction frame; b_w and b_h are the width and height offsets of the prediction frame predicted by the preset target detection model relative to the prior frame with the largest intersection with the real frame; p_w and p_h are the width and height of the current prior frame; c_x and c_y are the coordinates of the upper-left corner of the grid cell containing the center point; σ(t_x) and σ(t_y) are the distances between the center point of the prediction frame and the upper-left corner of that grid cell; t_w and t_h are the width and height offsets, predicted by the preset target detection model, of the prior frame relative to the real frame; and σ denotes the Sigmoid function, used to quantize the coordinate offsets to the (0, 1) interval.
Wherein the loss function is:
loss = lbox + lcls + lobj
lbox = λ_coord · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
lcls = −λ_class · Σ_{i=0}^{s²} 1_i^{obj} Σ_{c∈classes} [ p_i(c)·log(p̂_i(c)) + (1 − p_i(c))·log(1 − p̂_i(c)) ]
lobj = λ_obj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} (c_i − ĉ_i)² + λ_noobj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{noobj} (c_i − ĉ_i)²
where lbox denotes the position loss between the prediction frame and the real frame, λ_coord is the weight of the position loss, s² is the number of grid cells, and B is the number of prior frames set in each grid; 1_{ij}^{obj} is an indicator that equals 1 if the j-th prediction frame of grid i contains an object and 0 otherwise (1_i^{obj} is the corresponding indicator for grid i), and 1_{ij}^{noobj} equals 1 if that prediction frame contains no engineering vehicle target and 0 otherwise; x_i, y_i denote the coordinates of the real frame and w_i, h_i its width and height, while x̂_i, ŷ_i, ŵ_i, ĥ_i denote the coordinates, width and height of the prediction frame; lcls denotes the class loss, weighted by λ_class and computed with a cross-entropy loss function; p_i(c) equals 1 if the class c predicted by the prediction frame is the same as the true class and 0 if it differs, and p̂_i(c) is the predicted probability of class c; lobj denotes the confidence loss, λ_noobj is the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does; c_i denotes the confidence of the prediction frame and ĉ_i the confidence predicted by the prediction frame.
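For illustration, a minimal NumPy sketch of how the three terms of such a loss can be assembled for one image is given below; the array layout, the helper name yolo_style_loss and the specific weight values are assumptions made for this example rather than part of the claimed method.

```python
import numpy as np

def yolo_style_loss(pred, target, obj_mask,
                    lambda_coord=5.0, lambda_class=1.0,
                    lambda_obj=1.0, lambda_noobj=0.5):
    """Illustrative loss for one image.

    pred, target: arrays of shape (s*s, B, 5 + c) holding
                  (x, y, w, h, confidence, c class probabilities).
    obj_mask:     boolean array of shape (s*s, B), True where the prior frame
                  of that grid cell is responsible for a real object.
    """
    noobj_mask = ~obj_mask
    eps = 1e-7

    # lbox: squared position error, only for frames responsible for an object
    box_err = np.sum((pred[..., :4] - target[..., :4]) ** 2, axis=-1)
    lbox = lambda_coord * np.sum(box_err * obj_mask)

    # lcls: cross-entropy over the c class probabilities
    p_true = target[..., 5:]
    p_hat = np.clip(pred[..., 5:], eps, 1 - eps)
    bce = -(p_true * np.log(p_hat) + (1 - p_true) * np.log(1 - p_hat))
    lcls = lambda_class * np.sum(np.sum(bce, axis=-1) * obj_mask)

    # lobj: squared confidence error, weighted differently with / without an object
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    lobj = lambda_obj * np.sum(conf_err * obj_mask) \
         + lambda_noobj * np.sum(conf_err * noobj_mask)

    return lbox + lcls + lobj
```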
Optionally, the trained behavior recognition network is obtained through the following steps:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain the behavior category recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: for each sample, comparing the behavior class of the sample identified by the preset behavior recognition network with the real behavior class of the sample, and calculating a loss function of the preset behavior recognition network;
step 5: repeating the step 2 to the step 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
step 6: and determining the preset behavior recognition network reaching the second training cutoff condition as the trained behavior recognition network.
Optionally, the preset behavior recognition network is a TSN (temporal segment network) based on time-sequence segmentation, and a TSM time shift module is connected between the residual layers of the TSN network; the TSM time shift module of each layer shifts, at the corresponding positions, the dimensional feature maps output by the preceding residual layer according to their group numbers, and fills the vacancies in the feature vectors corresponding to the shifted feature maps with 0.
Optionally, the TSM time shift module of each layer shifting the feature dimension maps output by the preceding residual layer according to their group numbers, and filling the vacancies in the feature vectors corresponding to the shifted feature maps with 0, comprises:
the TSM time shift module of each layer divides the feature dimension maps output by the preceding residual layer into 3 groups according to the time sequence of the video frames;
the first group of feature maps is shifted left by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0;
the second group of feature maps is shifted right by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0; the remaining group is not shifted.
Optionally, before inputting the prediction frame into the trained behavior recognition network in the form of continuous frames, the behavior recognition method further includes:
performing equal inter-frame division on the image within the range of the prediction frame according to the image time sequence, randomly extracting a frame from each sub-frame segment as a key frame, and stacking all the key frames to obtain divided image data;
and inputting the image data into the trained behavior recognition network.
Optionally, the recognition result output by the trained behavior recognition model is as follows:
OutPut = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
where (T_1, T_2, ..., T_k) denotes the sequence of video key frames, each key frame T_k being randomly sampled from its corresponding video segment S_k; F(T_k, w) denotes the convolutional network with parameters w acting on frame T_k, the function F returning the scores of T_k for all categories; G is the segment consensus function, which combines the category scores of the multiple T_k; and H is the softmax prediction function, used to predict the probability that the entire video segment belongs to each behavior category.
In a second aspect, the present invention provides a multi-category engineering vehicle behavior recognition apparatus, including:
the acquisition module is used for acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
the detection module is used for inputting the video to be recognized into a trained target detection model so as to enable the trained target detection model to recognize the video to be recognized and output a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
the recognition module is used for inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target to obtain the category of the behavior of the engineering truck target in the video to be recognized;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The invention provides a behavior recognition method for multi-class engineering vehicles. A video to be recognized is input into a trained target detection model, which recognizes the video and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it contains. The images within the prediction frame range are then input, in the form of continuous frames, into a trained behavior recognition network, which extracts key frames of the video and recognizes the behavior of the engineering vehicle targets, yielding the categories to which the behaviors of the engineering vehicle targets in the video belong. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimensional feature maps output by each layer of the network are grouped according to the time sequence of the input images with the number of feature maps in each group differing as little as possible, shifting each group of feature maps according to its group number, filling the vacancies in the feature vectors corresponding to the shifted feature maps with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a flowchart of a behavior recognition method for a multi-class engineering vehicle according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a training process for providing a target detection model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a DarkNet53 network architecture;
FIG. 4 is a schematic diagram of the calculation of the prior frame and prediction frame offsets;
FIG. 5 is a schematic diagram of a TSN architecture;
FIG. 6 is a schematic diagram of an insertion TSN architecture of a time shift module;
fig. 7 is a structural diagram of a multi-category engineering vehicle behavior recognition device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
As shown in fig. 1, the method for identifying the behavior of a multi-class engineering vehicle provided by the present invention includes:
s1, acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
s2, inputting the video to be recognized into a trained target detection model so that the trained target detection model recognizes the video to be recognized and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
s3, inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target, and the category of the behavior of the engineering truck target in the video to be recognized is obtained;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The invention provides a behavior recognition method for multi-class engineering vehicles. A video to be recognized is input into a trained target detection model, which recognizes the video and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it contains. The images within the prediction frame range are then input, in the form of continuous frames, into a trained behavior recognition network, which extracts key frames of the video and recognizes the behavior of the engineering vehicle targets, yielding the categories to which the behaviors of the engineering vehicle targets in the video belong. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimensional feature maps output by each layer of the network are grouped according to the time sequence of the input images with the number of feature maps in each group differing as little as possible, shifting each group of feature maps according to its group number, filling the vacancies in the feature vectors corresponding to the shifted feature maps with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached.
Example two
As an optional embodiment of the present invention, the trained target detection model is obtained by the following steps:
step 1: acquiring original image data;
Engineering vehicles include different categories, such as excavators, muck trucks and bulldozers, whose skeleton structures and motion patterns differ, and which perform various actions such as bulldozing, excavating and dumping; video data containing these different categories of engineering vehicles is therefore used as the original data. First, multiple frames are extracted from the original video data as target detection data, the training set, test set and verification set are divided, and the video frames are labeled with a labeling tool. In order to prevent overfitting and improve detection accuracy, Gaussian noise is added before target detection and the data are randomly mirrored and rotated to obtain a data enhancement effect (a sketch of this enhancement is given at the end of this example).
Step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using real boxes;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
each prior frame corresponds to prior frame information, the prior frame information comprises a scale of the prior frame, and the scale comprises a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s × s grids;
each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence coefficient and c category probabilities;
step 7: inputting the prior frame information and the object center position coordinate into a preset target detection model so that the preset target detection model determines a prior frame with the maximum intersection with the real frame, adjusting parameters in the preset target detection model by using a back propagation algorithm based on the prior frame with the maximum intersection with the real frame and the confidence of the grid where the object center position is located, calculating the offset between a prediction frame and the prior frame, and outputting the prediction frame;
step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating steps 7 to 8 until a first training cutoff condition is reached;
wherein the first training cutoff condition comprises: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
wherein the first threshold value can be preset according to practical experience.
Step 10: and determining the preset target detection model with the loss function reaching the minimum as the trained target detection model.
Wherein the loss function is:
loss = lbox + lcls + lobj
lbox = λ_coord · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
lcls = −λ_class · Σ_{i=0}^{s²} 1_i^{obj} Σ_{c∈classes} [ p_i(c)·log(p̂_i(c)) + (1 − p_i(c))·log(1 − p̂_i(c)) ]
lobj = λ_obj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} (c_i − ĉ_i)² + λ_noobj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{noobj} (c_i − ĉ_i)²
where lbox denotes the position loss between the prediction frame and the real frame, λ_coord is the weight of the position loss, s² is the number of grid cells, and B is the number of prior frames set in each grid; 1_{ij}^{obj} is an indicator that equals 1 if the j-th prediction frame of grid i contains an object and 0 otherwise (1_i^{obj} is the corresponding indicator for grid i), and 1_{ij}^{noobj} equals 1 if that prediction frame contains no engineering vehicle target and 0 otherwise; x_i, y_i denote the coordinates of the real frame and w_i, h_i its width and height, while x̂_i, ŷ_i, ŵ_i, ĥ_i denote the coordinates, width and height of the prediction frame; lcls denotes the class loss, weighted by λ_class and computed with a cross-entropy loss function; p_i(c) equals 1 if the class c predicted by the prediction frame is the same as the true class and 0 if it differs, and p̂_i(c) is the predicted probability of class c; lobj denotes the confidence loss, λ_noobj is the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does; c_i denotes the confidence of the prediction frame and ĉ_i the confidence predicted by the prediction frame.
Referring to fig. 2, an embodiment of the present invention may use the YOLO algorithm for the target detection part, with DarkNet53 as the backbone network, and obtain the prior frame scales by clustering on the training set. The prior frames are the several shapes and sizes that occur most often in the training set, obtained by clustering all of the real labeled frames in the training set. Adding this statistical prior experience to the model in advance helps the model converge quickly.
The number of preselected frames is set to k, the most appropriate k prior frame scale values are obtained with a k-means clustering algorithm, and the k scale values are normalized relative to the width and height of the image, so that the k frames represent the shapes of the real objects in the data set as well as possible. During clustering, the evaluation criterion is the distance between two borders, d(box, centroid) = 1 − IoU(box, centroid); that is, the Intersection over Union (IoU) of the prior frame and the real frame is used as the standard to measure the quality of a group of preselected frames.
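As a concrete illustration of this clustering step, the following is a minimal sketch in Python; it assumes the real labeled frames have already been reduced to normalized (width, height) pairs, and the function names are chosen here only for illustration.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (N, 2) box sizes and (k, 2) centroid sizes, both anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) \
          * np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, None, 0] * boxes[:, None, 1] \
          + centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_priors(boxes, k, iters=100):
    """Cluster the normalized (w, h) pairs of the real frames into k prior-frame scales."""
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)          # d(box, centroid) = 1 - IoU(box, centroid)
        assign = dist.argmin(axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```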
The offsets between the prior frames and the real objects are then predicted. The video frame, resized after data enhancement, is divided into s × s grids, prior frames are set based on the different scales obtained by clustering, and the position of an object is predicted based on the prior frames. The prior frame information (x, y, w, h) consists of the coordinates of the object center position and the width and height of the prior frame, with the values normalized to the width and height of the image. A confidence score and c class probabilities are predicted for each prior frame of each grid through the DarkNet53 network. The confidence is expressed as
confidence = Pr(Object) × IOU_pred^truth
where Pr(Object) indicates whether the cell contains the center point of a real object: if the center position coordinates of an object fall into a certain grid, Pr(Object) of that grid is 1, indicating that the object is detected; IOU_pred^truth represents the intersection ratio (IoU) of the prediction frame and the real object.
The network structure of YOLOv3 is shown in fig. 3. Darknet53 up-samples feature maps to different layers and performs a channel splicing (Concat) operation on the deep-layer and shallow-layer feature maps, fusing them at the output end, and finally outputs feature maps of three sizes: 13 × 13, 26 × 26 and 52 × 52. The deep feature map is small in size and has a large receptive field, which benefits the detection of large-scale objects; the shallow feature map is the opposite and is more suitable for detecting small-scale objects.
The target detection network is trained in this way, continuously reducing the value of the loss function until convergence, and the loss is verified on the test set data. The network structure and parameters are optimized continuously until the output is optimal; the final optimized model is the model responsible for the target detection part of the system. Inputting video data into this model yields the position coordinates and category information of the engineering vehicles of the various categories.
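As a small illustration of the data enhancement described at the beginning of this example (adding Gaussian noise and randomly mirroring and rotating the frames), a minimal sketch might look like this; the noise level and the restriction to 90-degree rotations are assumptions for the example, and in a full pipeline the real-frame labels would have to be transformed together with the image.

```python
import random
import numpy as np

def enhance(image):
    """Randomly augment one video frame (H x W x 3, uint8) for detector training."""
    img = image.astype(np.float32)

    # add Gaussian noise
    img += np.random.normal(0.0, 5.0, img.shape)

    # random horizontal mirror
    if random.random() < 0.5:
        img = img[:, ::-1, :]

    # random rotation by a multiple of 90 degrees
    img = np.rot90(img, k=random.randint(0, 3))

    return np.clip(img, 0, 255).astype(np.uint8)
```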
EXAMPLE III
As an alternative embodiment of the present invention, the step 7 includes:
inputting the prior frame information and the object center position coordinates into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1), based on the prior frame with the largest intersection with the real frame and the confidence of the grid where the object center position is located, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x and b_y are the abscissa and ordinate of the prediction frame; b_w and b_h are the width and height offsets of the prediction frame predicted by the preset target detection model relative to the prior frame with the largest intersection with the real frame; p_w and p_h are the width and height of the current prior frame; c_x and c_y are the coordinates of the upper-left corner of the grid cell containing the center point; σ(t_x) and σ(t_y) are the distances between the center point of the prediction frame and the upper-left corner of that grid cell; t_w and t_h are the width and height offsets, predicted by the preset target detection model, of the prior frame relative to the real frame; and σ denotes the Sigmoid function, used to quantize the coordinate offsets to the (0, 1) interval. The prediction frame center coordinates b_x, b_y obtained in this way are confined to the current region, which ensures that each region only predicts objects whose center point lies within it and helps the model converge. The whole prediction process consists of inputting the prior frames into the target detection model and obtaining t_x, t_y, t_w, t_h through model calculation.
Referring to fig. 4, a video frame and the prior frame information are input into the DarkNet53 network; the grid containing the center point of the real object is found first, then, among all prior frames generated by that grid, the one with the largest IoU with the real frame is selected, the offset between the prior frame and the real frame is predicted by the network, and the prediction frame is obtained from the offset, so that the model itself computes the final output prediction frame.
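To make formula (1) concrete, the following minimal sketch decodes one set of predicted offsets into a prediction frame; the function and argument names are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply formula (1): map the predicted offsets to a prediction frame.

    (cx, cy): upper-left corner of the grid cell containing the object center.
    (pw, ph): width and height of the current prior frame.
    Returns the prediction frame center (bx, by) and size (bw, bh).
    """
    bx = sigmoid(tx) + cx          # sigma keeps the center inside the current cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # prior frame width scaled by the predicted offset
    bh = ph * math.exp(th)         # prior frame height scaled by the predicted offset
    return bx, by, bw, bh
```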
Example four
As an optional embodiment of the present invention, the trained behavior recognition network is obtained by the following steps:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain the behavior category recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: for each sample, comparing the behavior class of the sample identified by the preset behavior recognition network with the real behavior class of the sample, and calculating a loss function of the preset behavior recognition network;
step 5: repeating the step 2 to the step 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
the second threshold is a preset value and can be obtained according to industry experience.
Step 6: and determining the preset behavior recognition network reaching the second training cutoff condition as the trained behavior recognition network.
EXAMPLE five
As an optional embodiment of the present invention, the preset behavior recognition network is a TSN (temporal segment network) based on time-sequence segmentation, and a TSM time shift module is connected between the residual layers of the TSN network; the TSM time shift module of each layer shifts, at the corresponding positions, the dimensional feature maps output by the preceding residual layer according to their group numbers, and fills the vacancies in the feature vectors corresponding to the shifted feature maps with 0.
Referring to fig. 5, behavior recognition is based on a Temporal Segment Networks (TSN) network. Video stream data passes through a target detection model, then position information of various engineering vehicles is sequentially input into a behavior recognition network in a bounding box form, and a TSN framework is adopted for extracting key frames and recognizing behaviors.
EXAMPLE six
As an optional embodiment of the present invention, the shifting, by the TSM time shift module of each layer, of the feature dimension maps output by the preceding residual layer according to their group numbers, and the filling of the vacancies in the feature vectors corresponding to the shifted feature maps with 0, comprises:
the TSM time shift module of each layer divides the feature dimension maps output by the preceding residual layer into 3 groups according to the time sequence of the video frames;
the first group of feature maps is shifted left by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0;
the second group of feature maps is shifted right by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0; the remaining group is not shifted.
Because behavior recognition depends on temporal modeling, a TSM (Temporal Shift Module) is added to the TSN architecture for temporal modeling. Each time shift module divides the batch_size × segment × channel × h × w feature map generated by an intermediate layer of the network into 3 groups according to the number of channels, and simulates time-domain information through the left and right shifts of the different groups of feature vectors in the channel dimension. If the shifted proportion is too large, the spatial feature modeling capability is weakened and the image information of the original frames may be damaged; if it is too small, the temporal modeling capability of the model is affected. Therefore, among the 3 groups of feature maps, one group is shifted by one position to the left and one group by one position to the right, the feature vectors that are not moved keep the original information, and the vacated positions are filled with 0 to simulate the temporal receptive field. This operation moves some channels between frames in the time dimension, exchanges inter-frame information and further fuses time-domain information, making the model more effective for behavior recognition.
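A minimal PyTorch-style sketch of this grouped time shift is given below; the (batch, segments, channels, h, w) tensor layout and the equal three-way channel split are assumptions made for the illustration.

```python
import torch

def temporal_shift(x):
    """Shift feature channels across the time dimension, TSM-style.

    x: tensor of shape (batch, segments, channels, h, w). The channels are split
    into 3 groups: the first group is shifted one step towards earlier frames,
    the second one step towards later frames, and the rest stay in place.
    Vacated positions are filled with 0.
    """
    b, t, c, h, w = x.shape
    g = c // 3
    out = torch.zeros_like(x)
    out[:, :-1, :g] = x[:, 1:, :g]             # group 1: shift left along time, last frame padded with 0
    out[:, 1:, g:2 * g] = x[:, :-1, g:2 * g]   # group 2: shift right along time, first frame padded with 0
    out[:, :, 2 * g:] = x[:, :, 2 * g:]        # remaining channels keep the original frame information
    return out
```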
The 2D ConvNet in FIG. 5 uses a conventional image classification network, such as ResNet50, ResNet101 or BN-Inception; the network used in the present invention is ResNet50, a stack of 50 layers of residual networks. The TSM time shift module is inserted into each residual block of ResNet50 in the manner shown in fig. 6: the time shift operation is performed at the first layer of branch 1 of each residual structure, while the remaining structure and computation of the residual block are unchanged. Thus, the original frame information on branch 2 is retained, inter-frame information is exchanged on branch 1, and each residual block fuses the two kinds of information, making the network more suitable for behavior recognition. The 50 layers of time-shifted residual blocks are connected as the basic structure of the behavior recognition network, and a fully connected layer is finally added for classification, so as to recognize the behaviors of the multi-class targets.
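Under the same assumptions, inserting the shift on branch 1 of a residual block while keeping branch 2 as the identity could be sketched as follows; conv_branch stands in for the block's unchanged convolution layers and is assumed to preserve the feature map shape.

```python
import torch.nn as nn

class ShiftedResidualBlock(nn.Module):
    """Residual block with the temporal shift applied only on branch 1."""

    def __init__(self, conv_branch, shift_fn):
        super().__init__()
        self.conv_branch = conv_branch  # the block's original convolution layers, unchanged
        self.shift_fn = shift_fn        # e.g. the temporal_shift sketched above

    def forward(self, x):
        # x: (batch, segments, channels, h, w); conv_branch is assumed shape-preserving
        b, t, c, h, w = x.shape
        shifted = self.shift_fn(x)                             # branch 1: exchange information between frames
        out = self.conv_branch(shifted.reshape(b * t, c, h, w))
        out = out.reshape(b, t, c, h, w)
        return x + out                                         # branch 2: identity keeps the original frames
```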
EXAMPLE seven
As an alternative embodiment of the present invention, before inputting the prediction block into the trained behavior recognition network in the form of consecutive frames, the behavior recognition method further includes:
step 1: performing equal inter-frame division on the image within the range of the prediction frame according to the image time sequence, randomly extracting a frame from each sub-frame segment as a key frame, and stacking all the key frames to obtain divided image data;
step 2: and inputting the image data into the trained behavior recognition network.
Wherein, the recognition result output by the trained behavior recognition model is as follows:
OutPut = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
where (T_1, T_2, ..., T_k) denotes the sequence of video key frames, each key frame T_k being randomly sampled from its corresponding video segment S_k; F(T_k, w) denotes the convolutional network with parameters w acting on frame T_k, the function F returning the scores of T_k for all categories; G is the segment consensus function, which combines the category scores of the multiple T_k; and H is the softmax prediction function, used to predict the probability that the entire video segment belongs to each behavior category.
The TSN is a behavior recognition network architecture whose core lies in the division of the time domain. Given a video V containing m objects to be detected, the m objects are extracted by the method of step S2 and then sequentially input into the TSN network in the form of continuous frames. Taking a certain engineering vehicle target as an example, it is divided into k segments {S_1, S_2, ..., S_k} at equal frame intervals, so the output of behavior recognition is:
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
OutPut = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)}
where (T_1, T_2, ..., T_k) denotes the sequence of video key frames, each key frame T_k being randomly sampled from its corresponding video segment S_k; F(T_k, w) denotes the convolutional network with parameters w acting on frame T_k, the function F returning the scores of T_k for all categories; G is the segment consensus function, which combines the category scores of the multiple T_k and outputs the total category prediction, generally taking the maximum of the k prediction results; and H is the softmax prediction function, used to predict the probability that the entire video belongs to each behavior category.
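A minimal sketch of this sampling-and-consensus pipeline for a single detected target is given below; frame_scores stands in for the per-frame category scores F(T_k, w) produced by the 2D ConvNet, and taking the element-wise maximum as the consensus follows the description above.

```python
import random
import numpy as np

def sample_key_frames(frames, k):
    """Divide the cropped frames into k equal segments S_1..S_k and draw one random key frame per segment.

    Assumes len(frames) >= k; the trailing remainder frames are ignored.
    """
    seg_len = len(frames) // k
    return [frames[i * seg_len + random.randrange(seg_len)] for i in range(k)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tsn_predict(frames, k, frame_scores):
    """TSN(T_1..T_k) = H(G(F(T_1, w), ..., F(T_k, w))) for one engineering-vehicle target."""
    key_frames = sample_key_frames(frames, k)
    scores = np.stack([frame_scores(t) for t in key_frames])  # F(T_k, w): per-frame category scores
    consensus = scores.max(axis=0)                             # G: take the maximum of the k results
    return softmax(consensus)                                  # H: probability of each behavior category
```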
The network is trained in this way, the network structure and model parameters are optimized, and the results on the test data are improved, finally yielding the behavior recognition network. The engineering vehicle targets of the various categories in the video frames are input into this network to finally obtain the behaviors of the various categories of engineering vehicle targets.
Example eight
As shown in fig. 7, the present invention provides a multi-category engineering vehicle behavior recognition apparatus, including:
an acquisition module 71, configured to acquire a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
the detection module 72 is configured to input the video to be recognized into a trained target detection model, so that the trained target detection model recognizes the video to be recognized, and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
the recognition module 73 is configured to input the images within the prediction frame range into a trained behavior recognition network in the form of continuous frames, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target, and obtains a category to which the behavior of the engineering truck target in the video to be recognized belongs;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A behavior identification method of a multi-class engineering vehicle is characterized by comprising the following steps:
acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
inputting the video to be recognized into a trained target detection model so that the trained target detection model recognizes the video to be recognized and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
inputting the images within the prediction frame range, in the form of consecutive frames, into a trained behavior recognition network, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering vehicle target, and obtaining the category to which the behavior of the engineering vehicle target in the video to be recognized belongs;
wherein the trained behavior recognition network is obtained by: obtaining a second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time order of the input images, with the number of dimensional feature maps in each group differing as little as possible; shifting each group of dimensional feature maps according to its group number and filling vacated positions in the feature vectors corresponding to the shifted dimensional feature maps with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
2. The behavior recognition method according to claim 1, wherein the trained target detection model is obtained by:
Step 1: acquiring original image data;
Step 2: dividing the original image data into a training set, a test set and a verification set;
Step 3: marking the engineering vehicle targets in the training set, the test set and the verification set with real frames;
Step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising the scale of the prior frame, and the scale comprising a width and a height;
Step 5: performing data enhancement on each sample in the training set;
Step 6: dividing each enhanced sample into s × s grids;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
Step 7: inputting the prior frame information and the object center position coordinates into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union with the real frame, adjusting the parameters of the preset target detection model by a back-propagation algorithm based on that prior frame and the confidence of the grid in which the object center is located, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
Step 9: repeating Step 7 to Step 8 until a first training cut-off condition is reached;
wherein the first training cut-off condition comprises: the loss function value of the preset target detection model no longer changes or is lower than a first threshold;
Step 10: determining the preset target detection model whose loss function reaches the minimum as the trained target detection model.
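A minimal sketch of Step 4 (clustering the labelled real frames into k prior-frame scales) is given below. The claim only states that k-means is used; the 1 − IoU distance between box shapes used here is the convention popularised by YOLO and is an assumption, as are the function names.

```python
# k-means clustering of ground-truth (width, height) pairs into k prior-frame scales.
import numpy as np


def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between boxes (N, 2) and cluster centers (k, 2), using width/height only."""
    inter_w = np.minimum(boxes[:, None, 0], centers[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centers[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union


def kmeans_priors(boxes: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster real-frame (width, height) pairs into k prior-frame scales."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the center with the highest IoU (smallest 1 - IoU distance)
        assign = np.argmax(iou_wh(boxes, centers), axis=1)
        new_centers = np.array([
            boxes[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers  # k prior-frame (width, height) scales
```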
3. The behavior recognition method according to claim 2, wherein the step 7 comprises:
inputting the prior frame information and the object center position coordinates into the preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union with the real frame, calculating the offset between the prediction frame and that prior frame by using the following formula (1), based on that prior frame and the confidence of the grid in which the object center is located, and outputting the prediction frame;
the formula (1) is:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)
wherein bx represents the abscissa of the prediction frame, by represents the ordinate of the prediction frame, bw represents the width offset of the prediction frame predicted by the preset target detection model relative to the prior frame with the largest intersection-over-union with the real frame, and bh represents the height offset of that prediction frame relative to that prior frame; pw represents the width of the current prior frame and ph represents the height of the current prior frame; cx and cy represent the coordinates of the upper-left corner of the grid in which the center point is located, and σ(tx) and σ(ty) represent the distance between the center point of the prediction frame and the upper-left corner of that grid; tw represents the width offset of the prior frame predicted by the preset target detection model relative to the real frame, and th represents the height offset of the prior frame predicted by the preset target detection model relative to the real frame; σ denotes the Sigmoid function, used to quantize the coordinate offsets to the (0, 1) interval.
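A small sketch of formula (1) is given below: it decodes the network outputs (tx, ty, tw, th) into a prediction frame, given the grid cell's upper-left corner (cx, cy) and the matched prior frame's width and height (pw, ph). The function name is illustrative; the variable names follow the claim.

```python
# Decode predicted offsets into a prediction frame (formula (1)).
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_prediction(tx: float, ty: float, tw: float, th: float,
                      cx: float, cy: float, pw: float, ph: float):
    """Return the prediction-frame center (bx, by) and size (bw, bh)."""
    bx = sigmoid(tx) + cx    # center x: offset quantized to (0, 1) within the grid cell
    by = sigmoid(ty) + cy    # center y
    bw = pw * math.exp(tw)   # width scaled from the prior frame
    bh = ph * math.exp(th)   # height scaled from the prior frame
    return bx, by, bw, bh
```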
4. The behavior recognition method according to claim 2, wherein the loss function is:
loss = lbox + lcls + lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λcoord represents the weight of the position loss, and B represents the number of prior frames arranged in each grid; the indicator 1ij^obj takes the value 1 if the prediction frame at position (i, j) contains an object and 0 otherwise; xi and yi represent the coordinates of the real frame, wi and hi represent the width and height of the real frame, and x̂i, ŷi, ŵi, ĥi represent the coordinates and the width and height of the prediction frame;
lcls denotes the class loss, λclass represents the weight of the class loss, and the class loss is calculated by a cross-entropy loss function; pi(c) indicates whether the class c predicted by the prediction frame is the same as the true class, being 1 if the same and 0 if different, and p̂i(c) represents the probability of predicting class c;
lobj denotes the confidence loss, λnoobj represents the weight applied when the prediction frame does not contain an actual engineering vehicle target, and λobj represents the weight applied when the prediction frame contains an actual engineering vehicle target; the indicator 1ij^noobj takes the value 1 if the prediction frame at position (i, j) contains no engineering vehicle target and 0 if it contains an engineering vehicle target; ci represents the confidence of the prediction frame, and ĉi represents the confidence predicted by the prediction frame.
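An illustrative PyTorch-style sketch of the composite loss of claim 4 (loss = lbox + lcls + lobj) follows. The detailed component formulas in the original filing are reproduced as figures; the squared-error and cross-entropy forms below follow the textual symbol descriptions and common YOLO practice, and the default weights are assumptions.

```python
# Sketch of the detection loss: position loss + class loss + confidence loss.
import torch
import torch.nn.functional as F


def detection_loss(pred_box, true_box, pred_cls, true_cls, pred_conf, obj_mask,
                   lambda_coord=5.0, lambda_class=1.0, lambda_obj=1.0, lambda_noobj=0.5):
    """pred_box/true_box: (N, 4) as (x, y, w, h); pred_cls: (N, C) logits;
    true_cls: (N,) class indices; pred_conf: (N,) logits; obj_mask: (N,) bool."""
    noobj_mask = ~obj_mask

    # lbox: position loss, only for prediction frames responsible for an object
    lbox = lambda_coord * ((pred_box[obj_mask] - true_box[obj_mask]) ** 2).sum()

    # lcls: cross-entropy class loss for object-containing prediction frames
    lcls = lambda_class * F.cross_entropy(pred_cls[obj_mask], true_cls[obj_mask],
                                          reduction="sum")

    # lobj: confidence loss, weighted differently with and without a target
    conf = torch.sigmoid(pred_conf)
    lobj = (lambda_obj * ((conf[obj_mask] - 1.0) ** 2).sum()
            + lambda_noobj * (conf[noobj_mask] ** 2).sum())

    return lbox + lcls + lobj
```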
5. The behavior recognition method according to claim 1, wherein the trained behavior recognition network is obtained by:
Step 1: acquiring a second data set;
Step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain the behavior category recognized by the preset behavior recognition network;
Step 3: adjusting the parameters of the preset behavior recognition network;
Step 4: for each sample, comparing the behavior category of the sample recognized by the preset behavior recognition network with the real behavior category of the sample, and calculating a loss function of the preset behavior recognition network;
Step 5: repeating Step 2 to Step 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cut-off condition comprises: the loss function value of the preset behavior recognition network no longer changes or is lower than a second threshold;
Step 6: determining the preset behavior recognition network that reaches the second training cut-off condition as the trained behavior recognition network.
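A minimal training-loop sketch for Steps 2 to 5 of claim 5 is shown below: iterate over the second data set, compare the recognized behavior category with the real category via a loss, update the network, and stop once the loss no longer changes or falls below the second threshold. The data loader, optimizer choice and hyperparameters are assumptions, not the patented procedure.

```python
# Sketch of the behavior recognition training loop with the second cut-off condition.
import torch
import torch.nn.functional as F


def train_behavior_network(network, loader, epochs=50, second_threshold=1e-3, lr=1e-3):
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    prev_epoch_loss = float("inf")
    for epoch in range(epochs):
        epoch_loss = 0.0
        for clips, true_classes in loader:   # each sample: stacked key frames + real category
            logits = network(clips)          # recognized behavior scores
            loss = F.cross_entropy(logits, true_classes)
            optimizer.zero_grad()
            loss.backward()                  # adjust the network parameters
            optimizer.step()
            epoch_loss += loss.item()
        # second training cut-off condition: loss below threshold or no longer changing
        if epoch_loss < second_threshold or abs(prev_epoch_loss - epoch_loss) < 1e-6:
            break
        prev_epoch_loss = epoch_loss
    return network
```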
6. The behavior recognition method according to claim 5, wherein the preset behavior recognition network is a temporal segment network (TSN), TSM temporal shift modules are connected between the residual layers of the TSN, and each TSM temporal shift module shifts the dimensional feature maps output by the preceding residual layer according to their group numbers and fills vacated positions in the feature vectors corresponding to the shifted dimensional feature maps with 0.
7. The behavior recognition method according to claim 6, wherein the step in which the TSM temporal shift module of each layer shifts the dimensional feature maps output by the preceding residual layer according to the group numbers and fills the vacated positions in the corresponding feature vectors with 0 comprises:
the TSM temporal shift module of each layer dividing the dimensional feature maps output by the preceding residual layer into 3 groups according to the time order of the video frames;
shifting the first group of dimensional feature maps one position to the left according to the time order of the images, and filling the vacated positions in the feature vectors corresponding to the shifted group with 0;
and shifting the second group of dimensional feature maps one position to the right according to the time order of the images, and filling the vacated positions in the feature vectors corresponding to the shifted group with 0.
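The sketch below illustrates the temporal shift of claims 6 and 7: the channels of the feature maps are split into three groups, one group is shifted one step backward along the frame (time) axis, one group one step forward, the third is left unshifted, and the vacated positions are filled with 0. It is written as a PyTorch module for illustration and is not the patented implementation.

```python
# Temporal shift over grouped feature maps, with zero-filled vacancies.
import torch
import torch.nn as nn


class TemporalShift(nn.Module):
    def __init__(self, num_segments: int, shift_div: int = 3):
        super().__init__()
        self.num_segments = num_segments
        self.shift_div = shift_div

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_segments, channels, height, width)
        nt, c, h, w = x.shape
        n = nt // self.num_segments
        x = x.view(n, self.num_segments, c, h, w)
        fold = c // self.shift_div
        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                    # group 1: shift left, vacancy = 0
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # group 2: shift right, vacancy = 0
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # group 3: unshifted
        return out.view(nt, c, h, w)
```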
8. The behavior recognition method according to claim 1, wherein before the images within the prediction frame range are input into the trained behavior recognition network in the form of consecutive frames, the behavior recognition method further comprises:
dividing the images within the prediction frame range into equal segments according to the image time order, randomly extracting one frame from each segment as a key frame, and stacking all the key frames to obtain divided image data;
and inputting the image data into the trained behavior recognition network.
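A sketch of the sampling step in claim 8 follows: the frames are divided into equal segments in time order, one frame is drawn at random from each segment as a key frame, and the key frames are stacked before being input to the network. The function name and the assumption that the frames are already resized to a common shape are illustrative.

```python
# Segment-wise random key-frame sampling and stacking.
import random
from typing import List

import numpy as np


def sample_key_frames(frames: List[np.ndarray], num_segments: int = 8, seed=None) -> np.ndarray:
    """Divide `frames` into `num_segments` equal segments in time order and randomly
    pick one key frame per segment; frames are assumed to share the same shape.
    Returns the stacked key frames with shape (num_segments, H, W, C)."""
    rng = random.Random(seed)
    bounds = np.linspace(0, len(frames), num_segments + 1, dtype=int)
    key_frames = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        idx = rng.randrange(start, max(end, start + 1))   # guard against empty segments
        key_frames.append(frames[min(idx, len(frames) - 1)])
    return np.stack(key_frames)
```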
9. The behavior recognition method according to claim 8, wherein the trained behavior recognition network outputs the recognition result as:
Output = {TSN1(T1, T2, ..., Tk), TSN2(T1, T2, ..., Tk), ..., TSNm(T1, T2, ..., Tk)}
TSN(T1, T2, ..., Tk) = H(G(F(T1, w), F(T2, w), ..., F(Tk, w)))
wherein (T1, T2, ..., Tk) represents a sequence of video key frames, each key frame Tk being obtained by random sampling from its corresponding video segment Sk; F(Tk, w) denotes the convolutional network with parameters w acting on frame Tk, the function F returning the scores of Tk against all categories; G is the segment consensus function, which combines the scores of the multiple key frames Tk; and H is the softmax prediction function, used to predict the probability that the whole video segment belongs to each behavior category.
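The short sketch below mirrors the output formula of claim 9: F scores every key frame, the segment consensus function G combines the per-frame scores (averaging is assumed here, since the claim does not fix G), and the softmax H yields per-behavior probabilities. The backbone is a placeholder.

```python
# TSN-style forward pass: per-frame scores -> segment consensus -> softmax.
import torch
import torch.nn.functional as F


def tsn_forward(backbone, key_frames: torch.Tensor) -> torch.Tensor:
    """key_frames: (num_segments, C, H, W) for one clip of one target.
    Returns the probability of the clip belonging to each behavior category."""
    per_frame_scores = backbone(key_frames)     # F(T_k, w): scores of every key frame
    consensus = per_frame_scores.mean(dim=0)    # G: segment consensus (average, assumed)
    return F.softmax(consensus, dim=-1)         # H: softmax prediction over categories
```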
10. A behavior recognition device for a multi-class engineering vehicle, comprising:
the acquisition module is used for acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
the detection module is used for inputting the video to be recognized into a trained target detection model so as to enable the trained target detection model to recognize the video to be recognized and output a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which the engineering vehicle target is located corresponds to the position coordinates and the category of that target; the trained target detection model is obtained through a first training set, the first training set comprising a plurality of first samples in which each engineering vehicle target is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is iteratively trained until a first training cut-off condition is reached;
the recognition module is used for inputting the images within the prediction frame range, in the form of consecutive frames, into a trained behavior recognition network so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering vehicle target, obtaining the category to which the behavior of the engineering vehicle target in the video to be recognized belongs;
wherein the trained behavior recognition network is obtained by: obtaining a second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time order of the input images, with the number of dimensional feature maps in each group differing as little as possible; shifting each group of dimensional feature maps according to its group number and filling vacated positions in the feature vectors corresponding to the shifted dimensional feature maps with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
CN202110098578.5A 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle Active CN112800934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098578.5A CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110098578.5A CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle

Publications (2)

Publication Number Publication Date
CN112800934A true CN112800934A (en) 2021-05-14
CN112800934B CN112800934B (en) 2023-08-08

Family

ID=75811658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098578.5A Active CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle

Country Status (1)

Country Link
CN (1) CN112800934B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
WO2020206861A1 (en) * 2019-04-08 2020-10-15 江西理工大学 Yolo v3-based detection method for key object at transportation junction
CN111950583A (en) * 2020-06-05 2020-11-17 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM clustering
CN112084890A (en) * 2020-08-21 2020-12-15 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM and CQFL

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG Jianlin; FU Xuesong; HUANG Zhanchao; GUO Yongqi; WANG Rutong; ZHAO Liqiang: "Multi-type cooperative target detection based on improved YOLOv2 convolutional neural network", Optics and Precision Engineering, no. 01 *
ZHAO Yuhang; ZUO Chenyu; ZHU Junjie; QIAN Cheng: "Vehicle detection method for UAV aerial photography based on YOLO V3", Electronics World, no. 13 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361519A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113361519B (en) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113255616A (en) * 2021-07-07 2021-08-13 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN113255616B (en) * 2021-07-07 2021-09-21 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114419508A (en) * 2022-01-19 2022-04-29 北京百度网讯科技有限公司 Recognition method, training method, device, equipment and storage medium
CN115131606A (en) * 2022-06-15 2022-09-30 重庆文理学院 Two-stage process action detection method based on YOLO-TSM

Also Published As

Publication number Publication date
CN112800934B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN109118479B (en) Capsule network-based insulator defect identification and positioning device and method
CN111091105B (en) Remote sensing image target detection method based on new frame regression loss function
CN112233097B (en) Road scene other vehicle detection system and method based on space-time domain multi-dimensional fusion
CN112800934A (en) Behavior identification method and device for multi-class engineering vehicle
CN109784203B (en) Method for inspecting contraband in weak supervision X-ray image based on layered propagation and activation
CN112183414A (en) Weak supervision remote sensing target detection method based on mixed hole convolution
CN111476302A (en) fast-RCNN target object detection method based on deep reinforcement learning
CN110991444B (en) License plate recognition method and device for complex scene
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN114841972A (en) Power transmission line defect identification method based on saliency map and semantic embedded feature pyramid
CN112633149B (en) Domain-adaptive foggy-day image target detection method and device
CN111832615A (en) Sample expansion method and system based on foreground and background feature fusion
CN111681259B (en) Vehicle tracking model building method based on Anchor mechanism-free detection network
CN108171119B (en) SAR image change detection method based on residual error network
CN111833353B (en) Hyperspectral target detection method based on image segmentation
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN115984537A (en) Image processing method and device and related equipment
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN112906816A (en) Target detection method and device based on optical differential and two-channel neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113609895A (en) Road traffic information acquisition method based on improved Yolov3
CN114972759A (en) Remote sensing image semantic segmentation method based on hierarchical contour cost function
CN115937659A (en) Mask-RCNN-based multi-target detection method in indoor complex environment
CN117274774A (en) Yolov 7-based X-ray security inspection image dangerous goods detection algorithm
CN112418358A (en) Vehicle multi-attribute classification method for strengthening deep fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant