CN112800934A - Behavior identification method and device for multi-class engineering vehicle - Google Patents


Info

Publication number
CN112800934A
CN112800934A (Application No. CN202110098578.5A)
Authority
CN
China
Prior art keywords
frame
prediction
behavior recognition
detection model
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110098578.5A
Other languages
Chinese (zh)
Other versions
CN112800934B (en)
Inventor
汪霖
李一荻
曹世闯
汪照阳
胡莎
刘成
陈晓璇
姜博
李艳艳
周延
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern University filed Critical Northwestern University
Priority to CN202110098578.5A priority Critical patent/CN112800934B/en
Publication of CN112800934A publication Critical patent/CN112800934A/en
Application granted granted Critical
Publication of CN112800934B publication Critical patent/CN112800934B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06V 20/46 — Scenes; scene-specific elements in video content: extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F 18/214 — Pattern recognition: generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23213 — Pattern recognition: non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/084 — Neural network learning methods: backpropagation, e.g. using gradient descent
    • G06V 2201/07 — Indexing scheme relating to image or video recognition or understanding: target detection
    • G06V 2201/08 — Indexing scheme relating to image or video recognition or understanding: detecting or categorising vehicles
    • Y02T 10/40 — Climate change mitigation technologies related to transportation: engine management systems


Abstract

The invention provides a behavior recognition method and device for multi-class engineering vehicles. A video to be recognized is input into a trained target detection model, which recognizes the video and outputs prediction frames containing the engineering vehicle targets in the video; each prediction frame corresponds to the position coordinates and category of the engineering vehicle target it contains. The images within the prediction frame range are then input, in the form of continuous frames, into a trained behavior recognition network, which extracts key frames of the video and recognizes the behavior of the engineering vehicle targets, yielding the categories to which their behaviors belong. Because the behavior recognition network simulates time-domain information through the displacement of different groups of feature vectors in the channel dimension, the speed of the behavior recognition process is greatly improved, and the different behaviors of multiple engineering vehicles can be recognized in real time.

Description

Behavior identification method and device for multi-class engineering vehicle
Technical Field
The invention belongs to the technical field of video image recognition, and particularly relates to a behavior recognition method and device for multi-class engineering vehicles.
Background
In the field of video behavior recognition, existing methods fall mainly into two categories. The first is behavior recognition based on the image information of video frames, such as the two-stream method and the three-dimensional convolution method. The two-stream method feeds the optical flow map and the video frames into a Convolutional Neural Network (CNN) for joint training to obtain the behavior class; the three-dimensional convolution method adds time-dimension information to the video frame sequence and applies three-dimensional convolution directly to the sequence to obtain the behavior class. The second is skeleton-based behavior recognition, which first estimates key nodes from RGB images and then predicts behavior with a Recurrent Neural Network (RNN) or a Long Short-Term Memory network (LSTM); however, this approach is mostly suited to human behavior recognition and other scenes with a fixed skeleton.
When a video segment is input for recognition, the existing behavior recognition methods based on video frame image information can identify only one object and one action type for that object. The skeleton-based method, for its part, needs to encode a fixed skeleton structure into a vector and input it into the network for action classification, so it is difficult to apply when the motion of the object to be recognized varies greatly.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method and a device for identifying the behaviors of multi-class engineering vehicles. The technical problem to be solved by the invention is realized by the following technical scheme:
in a first aspect, the invention provides a method for identifying behaviors of a multi-class engineering truck, which comprises the following steps:
acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
inputting the video to be recognized into a trained target detection model so that the trained target detection model recognizes the video to be recognized and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target, and the category of the behavior of the engineering truck target in the video to be recognized is obtained;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
Optionally, the trained target detection model is obtained through the following steps:
step 1: acquiring original image data;
step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using real boxes;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
each prior frame corresponds to prior frame information, the prior frame information comprises a scale of the prior frame, and the scale comprises a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s × s grids;
each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence coefficient and c category probabilities;
step 7: inputting the prior frame information and the object center position coordinate into a preset target detection model so that the preset target detection model determines a prior frame with the maximum intersection with the real frame, adjusting parameters in the preset target detection model by using a back propagation algorithm based on the prior frame with the maximum intersection with the real frame and the confidence of the grid where the object center position is located, calculating the offset between a prediction frame and the prior frame, and outputting the prediction frame;
step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating steps 7 to 8 until a first training cutoff condition is reached;
wherein the first training cutoff condition comprises: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
step 10: and determining the preset target detection model with the loss function reaching the minimum as the trained target detection model.
Optionally, step 7 includes:
inputting the prior frame information and the object center position coordinates into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1), based on the prior frame with the largest intersection with the real frame and the confidence of the grid where the object center position is located, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x and b_y are the abscissa and ordinate of the prediction frame; b_w and b_h are the width and height offsets of the prediction frame predicted by the preset target detection model relative to the prior frame with the largest intersection with the real frame; p_w and p_h are the width and height of the current prior frame; c_x and c_y are the coordinates of the upper-left corner of the grid cell containing the center point; σ(t_x) and σ(t_y) are the distances between the center point of the prediction frame and the upper-left corner of that grid cell; t_w and t_h are the width and height offsets, predicted by the preset target detection model, of the prior frame relative to the real frame; and σ denotes the Sigmoid function, used to quantize the coordinate offsets to the (0, 1) interval.
Wherein the loss function is:
loss = lbox + lcls + lobj
lbox = λ_coord · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
lcls = −λ_class · Σ_{i=0}^{s²} 1_i^{obj} Σ_{c∈classes} [ p_i(c)·log(p̂_i(c)) + (1 − p_i(c))·log(1 − p̂_i(c)) ]
lobj = λ_obj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} (c_i − ĉ_i)² + λ_noobj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{noobj} (c_i − ĉ_i)²
where lbox denotes the position loss between the prediction frame and the real frame, λ_coord is the weight of the position loss, s² is the number of grid cells, and B is the number of prior frames set in each grid; 1_{ij}^{obj} is an indicator that equals 1 if the j-th prediction frame of grid i contains an object and 0 otherwise (1_i^{obj} is the corresponding indicator for grid i), and 1_{ij}^{noobj} equals 1 if that prediction frame contains no engineering vehicle target and 0 otherwise; x_i, y_i denote the coordinates of the real frame and w_i, h_i its width and height, while x̂_i, ŷ_i, ŵ_i, ĥ_i denote the coordinates, width and height of the prediction frame; lcls denotes the class loss, weighted by λ_class and computed with a cross-entropy loss function; p_i(c) equals 1 if the class c predicted by the prediction frame is the same as the true class and 0 if it differs, and p̂_i(c) is the predicted probability of class c; lobj denotes the confidence loss, λ_noobj is the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does; c_i denotes the confidence of the prediction frame and ĉ_i the confidence predicted by the prediction frame.
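For illustration, a minimal NumPy sketch of how the three terms of such a loss can be assembled for one image is given below; the array layout, the helper name yolo_style_loss and the specific weight values are assumptions made for this example rather than part of the claimed method.

```python
import numpy as np

def yolo_style_loss(pred, target, obj_mask,
                    lambda_coord=5.0, lambda_class=1.0,
                    lambda_obj=1.0, lambda_noobj=0.5):
    """Illustrative loss for one image.

    pred, target: arrays of shape (s*s, B, 5 + c) holding
                  (x, y, w, h, confidence, c class probabilities).
    obj_mask:     boolean array of shape (s*s, B), True where the prior frame
                  of that grid cell is responsible for a real object.
    """
    noobj_mask = ~obj_mask
    eps = 1e-7

    # lbox: squared position error, only for frames responsible for an object
    box_err = np.sum((pred[..., :4] - target[..., :4]) ** 2, axis=-1)
    lbox = lambda_coord * np.sum(box_err * obj_mask)

    # lcls: cross-entropy over the c class probabilities
    p_true = target[..., 5:]
    p_hat = np.clip(pred[..., 5:], eps, 1 - eps)
    bce = -(p_true * np.log(p_hat) + (1 - p_true) * np.log(1 - p_hat))
    lcls = lambda_class * np.sum(np.sum(bce, axis=-1) * obj_mask)

    # lobj: squared confidence error, weighted differently with / without an object
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    lobj = lambda_obj * np.sum(conf_err * obj_mask) \
         + lambda_noobj * np.sum(conf_err * noobj_mask)

    return lbox + lcls + lobj
```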
Optionally, the trained behavior recognition network is obtained through the following steps:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain the behavior category recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: for each sample, comparing the behavior class of the sample identified by the preset behavior recognition network with the real behavior class of the sample, and calculating a loss function of the preset behavior recognition network;
step 5: repeating the step 2 to the step 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
step 6: and determining the preset behavior recognition network reaching the second training cutoff condition as the trained behavior recognition network.
Optionally, the preset behavior recognition network is a TSN (temporal segment network) based on time-sequence segmentation, and a TSM time shift module is connected between the residual layers of the TSN network; the TSM time shift module of each layer shifts, at the corresponding positions, the dimensional feature maps output by the preceding residual layer according to their group numbers, and fills the vacancies in the feature vectors corresponding to the shifted feature maps with 0.
Optionally, the TSM time shift module of each layer shifting the feature dimension maps output by the preceding residual layer according to their group numbers, and filling the vacancies in the feature vectors corresponding to the shifted feature maps with 0, comprises:
the TSM time shift module of each layer divides the feature dimension maps output by the preceding residual layer into 3 groups according to the time sequence of the video frames;
the first group of feature maps is shifted left by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0;
the second group of feature maps is shifted right by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0; the remaining group is not shifted.
Optionally, before inputting the prediction frame into the trained behavior recognition network in the form of continuous frames, the behavior recognition method further includes:
performing equal inter-frame division on the image within the range of the prediction frame according to the image time sequence, randomly extracting a frame from each sub-frame segment as a key frame, and stacking all the key frames to obtain divided image data;
and inputting the image data into the trained behavior recognition network.
Optionally, the recognition result output by the trained behavior recognition model is as follows:
OutPut = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
where (T_1, T_2, ..., T_k) denotes the sequence of video key frames, each key frame T_k being randomly sampled from its corresponding video segment S_k; F(T_k, w) denotes the convolutional network with parameters w acting on frame T_k, the function F returning the scores of T_k for all categories; G is the segment consensus function, which combines the category scores of the multiple T_k; and H is the softmax prediction function, used to predict the probability that the entire video segment belongs to each behavior category.
In a second aspect, the present invention provides a multi-category engineering vehicle behavior recognition apparatus, including:
the acquisition module is used for acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
the detection module is used for inputting the video to be recognized into a trained target detection model so as to enable the trained target detection model to recognize the video to be recognized and output a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
the recognition module is used for inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target to obtain the category of the behavior of the engineering truck target in the video to be recognized;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The invention provides a behavior recognition method for multi-class engineering vehicles. A video to be recognized is input into a trained target detection model, which recognizes the video and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it contains. The images within the prediction frame range are then input, in the form of continuous frames, into a trained behavior recognition network, which extracts key frames of the video and recognizes the behavior of the engineering vehicle targets, yielding the categories to which the behaviors of the engineering vehicle targets in the video belong. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimensional feature maps output by each layer of the network are grouped according to the time sequence of the input images with the number of feature maps in each group differing as little as possible, shifting each group of feature maps according to its group number, filling the vacancies in the feature vectors corresponding to the shifted feature maps with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached.
The present invention will be described in further detail with reference to the accompanying drawings and examples.
Drawings
Fig. 1 is a flowchart of a behavior recognition method for a multi-class engineering vehicle according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a training process for providing a target detection model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a DarkNet53 network architecture;
FIG. 4 is a schematic diagram of the calculation of the prior frame and prediction frame offsets;
FIG. 5 is a schematic diagram of a TSN architecture;
FIG. 6 is a schematic diagram of an insertion TSN architecture of a time shift module;
fig. 7 is a structural diagram of a multi-category engineering vehicle behavior recognition device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
As shown in fig. 1, the method for identifying the behavior of a multi-class engineering vehicle provided by the present invention includes:
s1, acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
s2, inputting the video to be recognized into a trained target detection model so that the trained target detection model recognizes the video to be recognized and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
s3, inputting the images in the prediction frame range into a trained behavior recognition network in a continuous frame mode, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target, and the category of the behavior of the engineering truck target in the video to be recognized is obtained;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The invention provides a behavior recognition method for multi-class engineering vehicles. A video to be recognized is input into a trained target detection model, which recognizes the video and outputs prediction frames containing the engineering vehicle targets in the video, each prediction frame corresponding to the position coordinates and category of the engineering vehicle target it contains. The images within the prediction frame range are then input, in the form of continuous frames, into a trained behavior recognition network, which extracts key frames of the video and recognizes the behavior of the engineering vehicle targets, yielding the categories to which the behaviors of the engineering vehicle targets in the video belong. The trained behavior recognition network is obtained by acquiring a second training set, inputting the second samples of the second training set into a preset behavior recognition network so that the dimensional feature maps output by each layer of the network are grouped according to the time sequence of the input images with the number of feature maps in each group differing as little as possible, shifting each group of feature maps according to its group number, filling the vacancies in the feature vectors corresponding to the shifted feature maps with 0, and iteratively training the preset behavior recognition network until a second training cut-off condition is reached.
Example two
As an optional embodiment of the present invention, the trained target detection model is obtained by the following steps:
step 1: acquiring original image data;
Engineering vehicles include different categories, such as excavators, muck trucks and bulldozers, whose skeleton structures and motion patterns differ, and which perform various actions such as bulldozing, excavating and dumping; video data containing these different categories of engineering vehicles is therefore used as the original data. First, multiple frames are extracted from the original video data as target detection data, the training set, test set and verification set are divided, and the video frames are labeled with a labeling tool. In order to prevent overfitting and improve detection accuracy, Gaussian noise is added before target detection and the data are randomly mirrored and rotated to obtain a data enhancement effect (a sketch of this enhancement is given at the end of this example).
Step 2: dividing the original data into a training set, a testing set and a verification set;
step 3: marking the engineering truck targets in the training set, the testing set and the verification set by using real boxes;
step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
each prior frame corresponds to prior frame information, the prior frame information comprises a scale of the prior frame, and the scale comprises a width and a height;
step 5: performing data enhancement on each sample in the training set;
step 6: dividing each sample after enhancement into s × s grids;
each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence coefficient and c category probabilities;
step 7: inputting the prior frame information and the object center position coordinate into a preset target detection model so that the preset target detection model determines a prior frame with the maximum intersection with the real frame, adjusting parameters in the preset target detection model by using a back propagation algorithm based on the prior frame with the maximum intersection with the real frame and the confidence of the grid where the object center position is located, calculating the offset between a prediction frame and the prior frame, and outputting the prediction frame;
step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
step 9: repeating steps 7 to 8 until a first training cutoff condition is reached;
wherein the first training cutoff condition comprises: the loss function value of the preset target detection model is not changed any more or is lower than a first threshold value;
wherein the first threshold value can be preset according to practical experience.
Step 10: and determining the preset target detection model with the loss function reaching the minimum as the trained target detection model.
Wherein the loss function is:
loss = lbox + lcls + lobj
lbox = λ_coord · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} [ (x_i − x̂_i)² + (y_i − ŷ_i)² + (w_i − ŵ_i)² + (h_i − ĥ_i)² ]
lcls = −λ_class · Σ_{i=0}^{s²} 1_i^{obj} Σ_{c∈classes} [ p_i(c)·log(p̂_i(c)) + (1 − p_i(c))·log(1 − p̂_i(c)) ]
lobj = λ_obj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{obj} (c_i − ĉ_i)² + λ_noobj · Σ_{i=0}^{s²} Σ_{j=0}^{B} 1_{ij}^{noobj} (c_i − ĉ_i)²
where lbox denotes the position loss between the prediction frame and the real frame, λ_coord is the weight of the position loss, s² is the number of grid cells, and B is the number of prior frames set in each grid; 1_{ij}^{obj} is an indicator that equals 1 if the j-th prediction frame of grid i contains an object and 0 otherwise (1_i^{obj} is the corresponding indicator for grid i), and 1_{ij}^{noobj} equals 1 if that prediction frame contains no engineering vehicle target and 0 otherwise; x_i, y_i denote the coordinates of the real frame and w_i, h_i its width and height, while x̂_i, ŷ_i, ŵ_i, ĥ_i denote the coordinates, width and height of the prediction frame; lcls denotes the class loss, weighted by λ_class and computed with a cross-entropy loss function; p_i(c) equals 1 if the class c predicted by the prediction frame is the same as the true class and 0 if it differs, and p̂_i(c) is the predicted probability of class c; lobj denotes the confidence loss, λ_noobj is the weight applied when the prediction frame does not contain an actual engineering vehicle target and λ_obj the weight applied when it does; c_i denotes the confidence of the prediction frame and ĉ_i the confidence predicted by the prediction frame.
Referring to fig. 2, an embodiment of the present invention may use the YOLO algorithm for the target detection part, with DarkNet53 as the backbone network, and obtain the prior frame scales by clustering on the training set. The prior frames are the several shapes and sizes that occur most often in the training set, obtained by clustering all of the real labeled frames in the training set. Adding this statistical prior experience to the model in advance helps the model converge quickly.
The number of preselected frames is set to k, the most appropriate k prior frame scale values are obtained with a k-means clustering algorithm, and the k scale values are normalized relative to the width and height of the image, so that the k frames represent the shapes of the real objects in the data set as well as possible. During clustering, the evaluation criterion is the distance between two borders, d(box, centroid) = 1 − IoU(box, centroid); that is, the Intersection over Union (IoU) of the prior frame and the real frame is used as the standard to measure the quality of a group of preselected frames.
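As a concrete illustration of this clustering step, the following is a minimal sketch in Python; it assumes the real labeled frames have already been reduced to normalized (width, height) pairs, and the function names are chosen here only for illustration.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (N, 2) box sizes and (k, 2) centroid sizes, both anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) \
          * np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, None, 0] * boxes[:, None, 1] \
          + centroids[None, :, 0] * centroids[None, :, 1] - inter
    return inter / union

def kmeans_priors(boxes, k, iters=100):
    """Cluster the normalized (w, h) pairs of the real frames into k prior-frame scales."""
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        dist = 1.0 - iou_wh(boxes, centroids)          # d(box, centroid) = 1 - IoU(box, centroid)
        assign = dist.argmin(axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centroids[i] for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids
```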
The offsets between the prior frames and the real objects are then predicted. The video frame, resized after data enhancement, is divided into s × s grids, prior frames are set based on the different scales obtained by clustering, and the position of an object is predicted based on the prior frames. The prior frame information (x, y, w, h) consists of the coordinates of the object center position and the width and height of the prior frame, with the values normalized to the width and height of the image. A confidence score and c class probabilities are predicted for each prior frame of each grid through the DarkNet53 network. The confidence is expressed as
confidence = Pr(Object) × IOU_pred^truth
where Pr(Object) indicates whether the cell contains the center point of a real object: if the center position coordinates of an object fall into a certain grid, Pr(Object) of that grid is 1, indicating that the object is detected; IOU_pred^truth represents the intersection ratio (IoU) of the prediction frame and the real object.
The network structure of YOLOv3 is shown in fig. 3. Darknet53 up-samples feature maps to different layers and performs a channel splicing (Concat) operation on the deep-layer and shallow-layer feature maps, fusing them at the output end, and finally outputs feature maps of three sizes: 13 × 13, 26 × 26 and 52 × 52. The deep feature map is small in size and has a large receptive field, which benefits the detection of large-scale objects; the shallow feature map is the opposite and is more suitable for detecting small-scale objects.
The target detection network is trained in this way, continuously reducing the value of the loss function until convergence, and the loss is verified on the test set data. The network structure and parameters are optimized continuously until the output is optimal; the final optimized model is the model responsible for the target detection part of the system. Inputting video data into this model yields the position coordinates and category information of the engineering vehicles of the various categories.
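As a small illustration of the data enhancement described at the beginning of this example (adding Gaussian noise and randomly mirroring and rotating the frames), a minimal sketch might look like this; the noise level and the restriction to 90-degree rotations are assumptions for the example, and in a full pipeline the real-frame labels would have to be transformed together with the image.

```python
import random
import numpy as np

def enhance(image):
    """Randomly augment one video frame (H x W x 3, uint8) for detector training."""
    img = image.astype(np.float32)

    # add Gaussian noise
    img += np.random.normal(0.0, 5.0, img.shape)

    # random horizontal mirror
    if random.random() < 0.5:
        img = img[:, ::-1, :]

    # random rotation by a multiple of 90 degrees
    img = np.rot90(img, k=random.randint(0, 3))

    return np.clip(img, 0, 255).astype(np.uint8)
```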
EXAMPLE III
As an alternative embodiment of the present invention, the step 7 includes:
inputting the prior frame information and the object center position coordinates into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection with the real frame, calculating the offset between the prediction frame and the prior frame with the following formula (1), based on the prior frame with the largest intersection with the real frame and the confidence of the grid where the object center position is located, and outputting the prediction frame;
the formula (1) is:
b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
where b_x and b_y are the abscissa and ordinate of the prediction frame; b_w and b_h are the width and height offsets of the prediction frame predicted by the preset target detection model relative to the prior frame with the largest intersection with the real frame; p_w and p_h are the width and height of the current prior frame; c_x and c_y are the coordinates of the upper-left corner of the grid cell containing the center point; σ(t_x) and σ(t_y) are the distances between the center point of the prediction frame and the upper-left corner of that grid cell; t_w and t_h are the width and height offsets, predicted by the preset target detection model, of the prior frame relative to the real frame; and σ denotes the Sigmoid function, used to quantize the coordinate offsets to the (0, 1) interval. The prediction frame center coordinates b_x, b_y obtained in this way are confined to the current region, which ensures that each region only predicts objects whose center point lies within it and helps the model converge. The whole prediction process consists of inputting the prior frames into the target detection model and obtaining t_x, t_y, t_w, t_h through model calculation.
Referring to fig. 4, a video frame and the prior frame information are input into the DarkNet53 network; the grid containing the center point of the real object is found first, then, among all prior frames generated by that grid, the one with the largest IoU with the real frame is selected, the offset between the prior frame and the real frame is predicted by the network, and the prediction frame is obtained from the offset, so that the model itself computes the final output prediction frame.
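To make formula (1) concrete, the following minimal sketch decodes one set of predicted offsets into a prediction frame; the function and argument names are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Apply formula (1): map the predicted offsets to a prediction frame.

    (cx, cy): upper-left corner of the grid cell containing the object center.
    (pw, ph): width and height of the current prior frame.
    Returns the prediction frame center (bx, by) and size (bw, bh).
    """
    bx = sigmoid(tx) + cx          # sigma keeps the center inside the current cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)         # prior frame width scaled by the predicted offset
    bh = ph * math.exp(th)         # prior frame height scaled by the predicted offset
    return bx, by, bw, bh
```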
Example four
As an optional embodiment of the present invention, the trained behavior recognition network is obtained by the following steps:
step 1: acquiring a second data set;
step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain the behavior category recognized by the preset behavior recognition network;
step 3: adjusting parameters of a preset behavior recognition network;
step 4: for each sample, comparing the behavior class of the sample identified by the preset behavior recognition network with the real behavior class of the sample, and calculating a loss function of the preset behavior recognition network;
step 5: repeating the step 2 to the step 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cutoff condition comprises: the loss function value of the preset behavior recognition network is not changed any more or is lower than a second threshold value;
the second threshold is a preset value and can be obtained according to industry experience.
Step 6: and determining the preset behavior recognition network reaching the second training cutoff condition as the trained behavior recognition network.
EXAMPLE five
As an optional embodiment of the present invention, the preset behavior recognition network is a TSN (temporal segment network) based on time-sequence segmentation, and a TSM time shift module is connected between the residual layers of the TSN network; the TSM time shift module of each layer shifts, at the corresponding positions, the dimensional feature maps output by the preceding residual layer according to their group numbers, and fills the vacancies in the feature vectors corresponding to the shifted feature maps with 0.
Referring to fig. 5, behavior recognition is based on a Temporal Segment Networks (TSN) network. Video stream data passes through a target detection model, then position information of various engineering vehicles is sequentially input into a behavior recognition network in a bounding box form, and a TSN framework is adopted for extracting key frames and recognizing behaviors.
EXAMPLE six
As an optional embodiment of the present invention, the shifting, by the TSM time shift module of each layer, of the feature dimension maps output by the preceding residual layer according to their group numbers, and the filling of the vacancies in the feature vectors corresponding to the shifted feature maps with 0, comprises:
the TSM time shift module of each layer divides the feature dimension maps output by the preceding residual layer into 3 groups according to the time sequence of the video frames;
the first group of feature maps is shifted left by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0;
the second group of feature maps is shifted right by one position according to the time sequence of the images, and the vacancies in the feature vectors corresponding to the shifted group are filled with 0; the remaining group is not shifted.
Because behavior recognition depends on temporal modeling, a TSM (Temporal Shift Module) is added to the TSN architecture for temporal modeling. Each time shift module divides the batch_size × segment × channel × h × w feature map generated by an intermediate layer of the network into 3 groups according to the number of channels, and simulates time-domain information through the left and right shifts of the different groups of feature vectors in the channel dimension. If the shifted proportion is too large, the spatial feature modeling capability is weakened and the image information of the original frames may be damaged; if it is too small, the temporal modeling capability of the model is affected. Therefore, among the 3 groups of feature maps, one group is shifted by one position to the left and one group by one position to the right, the feature vectors that are not moved keep the original information, and the vacated positions are filled with 0 to simulate the temporal receptive field. This operation moves some channels between frames in the time dimension, exchanges inter-frame information and further fuses time-domain information, making the model more effective for behavior recognition.
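A minimal PyTorch-style sketch of this grouped time shift is given below; the (batch, segments, channels, h, w) tensor layout and the equal three-way channel split are assumptions made for the illustration.

```python
import torch

def temporal_shift(x):
    """Shift feature channels across the time dimension, TSM-style.

    x: tensor of shape (batch, segments, channels, h, w). The channels are split
    into 3 groups: the first group is shifted one step towards earlier frames,
    the second one step towards later frames, and the rest stay in place.
    Vacated positions are filled with 0.
    """
    b, t, c, h, w = x.shape
    g = c // 3
    out = torch.zeros_like(x)
    out[:, :-1, :g] = x[:, 1:, :g]             # group 1: shift left along time, last frame padded with 0
    out[:, 1:, g:2 * g] = x[:, :-1, g:2 * g]   # group 2: shift right along time, first frame padded with 0
    out[:, :, 2 * g:] = x[:, :, 2 * g:]        # remaining channels keep the original frame information
    return out
```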
The 2D ConvNet in FIG. 5 uses a conventional image classification network, such as ResNet50, ResNet101 or BN-Inception; the network used in the present invention is ResNet50, a stack of 50 layers of residual networks. The TSM time shift module is inserted into each residual block of ResNet50 in the manner shown in fig. 6: the time shift operation is performed at the first layer of branch 1 of each residual structure, while the remaining structure and computation of the residual block are unchanged. Thus, the original frame information on branch 2 is retained, inter-frame information is exchanged on branch 1, and each residual block fuses the two kinds of information, making the network more suitable for behavior recognition. The 50 layers of time-shifted residual blocks are connected as the basic structure of the behavior recognition network, and a fully connected layer is finally added for classification, so as to recognize the behaviors of the multi-class targets.
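Under the same assumptions, inserting the shift on branch 1 of a residual block while keeping branch 2 as the identity could be sketched as follows; conv_branch stands in for the block's unchanged convolution layers and is assumed to preserve the feature map shape.

```python
import torch.nn as nn

class ShiftedResidualBlock(nn.Module):
    """Residual block with the temporal shift applied only on branch 1."""

    def __init__(self, conv_branch, shift_fn):
        super().__init__()
        self.conv_branch = conv_branch  # the block's original convolution layers, unchanged
        self.shift_fn = shift_fn        # e.g. the temporal_shift sketched above

    def forward(self, x):
        # x: (batch, segments, channels, h, w); conv_branch is assumed shape-preserving
        b, t, c, h, w = x.shape
        shifted = self.shift_fn(x)                             # branch 1: exchange information between frames
        out = self.conv_branch(shifted.reshape(b * t, c, h, w))
        out = out.reshape(b, t, c, h, w)
        return x + out                                         # branch 2: identity keeps the original frames
```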
EXAMPLE seven
As an alternative embodiment of the present invention, before inputting the prediction block into the trained behavior recognition network in the form of consecutive frames, the behavior recognition method further includes:
step 1: performing equal inter-frame division on the image within the range of the prediction frame according to the image time sequence, randomly extracting a frame from each sub-frame segment as a key frame, and stacking all the key frames to obtain divided image data;
step 2: and inputting the image data into the trained behavior recognition network.
Wherein, the recognition result output by the trained behavior recognition model is as follows:
OutPut = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)};
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
where (T_1, T_2, ..., T_k) denotes the sequence of video key frames, each key frame T_k being randomly sampled from its corresponding video segment S_k; F(T_k, w) denotes the convolutional network with parameters w acting on frame T_k, the function F returning the scores of T_k for all categories; G is the segment consensus function, which combines the category scores of the multiple T_k; and H is the softmax prediction function, used to predict the probability that the entire video segment belongs to each behavior category.
The TSN is a behavior recognition network architecture whose core lies in the division of the time domain. Given a video V containing m objects to be detected, the m objects are extracted by the method of step S2 and then sequentially input into the TSN network in the form of continuous frames. Taking a certain engineering vehicle target as an example, it is divided into k segments {S_1, S_2, ..., S_k} at equal frame intervals, so the output of behavior recognition is:
TSN(T_1, T_2, ..., T_k) = H(G(F(T_1, w), F(T_2, w), ..., F(T_k, w)))
OutPut = {TSN_1(T_1, T_2, ..., T_k), TSN_2(T_1, T_2, ..., T_k), ..., TSN_m(T_1, T_2, ..., T_k)}
where (T_1, T_2, ..., T_k) denotes the sequence of video key frames, each key frame T_k being randomly sampled from its corresponding video segment S_k; F(T_k, w) denotes the convolutional network with parameters w acting on frame T_k, the function F returning the scores of T_k for all categories; G is the segment consensus function, which combines the category scores of the multiple T_k and outputs the total category prediction, generally taking the maximum of the k prediction results; and H is the softmax prediction function, used to predict the probability that the entire video belongs to each behavior category.
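A minimal sketch of this sampling-and-consensus pipeline for a single detected target is given below; frame_scores stands in for the per-frame category scores F(T_k, w) produced by the 2D ConvNet, and taking the element-wise maximum as the consensus follows the description above.

```python
import random
import numpy as np

def sample_key_frames(frames, k):
    """Divide the cropped frames into k equal segments S_1..S_k and draw one random key frame per segment.

    Assumes len(frames) >= k; the trailing remainder frames are ignored.
    """
    seg_len = len(frames) // k
    return [frames[i * seg_len + random.randrange(seg_len)] for i in range(k)]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def tsn_predict(frames, k, frame_scores):
    """TSN(T_1..T_k) = H(G(F(T_1, w), ..., F(T_k, w))) for one engineering-vehicle target."""
    key_frames = sample_key_frames(frames, k)
    scores = np.stack([frame_scores(t) for t in key_frames])  # F(T_k, w): per-frame category scores
    consensus = scores.max(axis=0)                             # G: take the maximum of the k results
    return softmax(consensus)                                  # H: probability of each behavior category
```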
The network is trained in this way, the network structure and model parameters are optimized, and the results on the test data are improved, finally yielding the behavior recognition network. The engineering vehicle targets of the various categories in the video frames are input into this network to finally obtain the behaviors of the various categories of engineering vehicle targets.
Example eight
As shown in fig. 7, the present invention provides a multi-category engineering vehicle behavior recognition apparatus, including:
an acquisition module 71, configured to acquire a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
the detection module 72 is configured to input the video to be recognized into a trained target detection model, so that the trained target detection model recognizes the video to be recognized, and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
the recognition module 73 is configured to input the images within the prediction frame range into a trained behavior recognition network in the form of continuous frames, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering truck target, and obtains a category to which the behavior of the engineering truck target in the video to be recognized belongs;
the trained behavior recognition network is obtained by: acquiring a second training set, wherein the second training set comprises a plurality of second samples and each second sample contains the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network, so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time sequence of the input images, with the number of feature maps in each group differing as little as possible, each group of feature maps is shifted according to its group number, and the vacancies in the feature vectors corresponding to the shifted feature maps are filled with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A behavior identification method of a multi-class engineering vehicle is characterized by comprising the following steps:
acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
inputting the video to be recognized into a trained target detection model so that the trained target detection model recognizes the video to be recognized and outputs a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame where the engineering vehicle target is located corresponds to the position coordinates and category of that target; the trained target detection model is obtained through a first training set, the first training set comprises a plurality of first samples, and the engineering vehicle target in each first sample is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame having the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is trained iteratively in this way until a first training cut-off condition is reached;
inputting the images within the prediction frame range, in the form of consecutive frames, into a trained behavior recognition network, so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering vehicle target, and obtaining the category to which the behavior of the engineering vehicle target in the video to be recognized belongs;
wherein the trained behavior recognition network is obtained by: obtaining a second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time order of the input images, with the number of dimensional feature maps in each group differing as little as possible; shifting each group of dimensional feature maps according to its group number and filling vacated positions in the feature vectors corresponding to the shifted dimensional feature maps with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
2. The behavior recognition method according to claim 1, wherein the trained target detection model is obtained by:
Step 1: acquiring original image data;
Step 2: dividing the original image data into a training set, a test set and a verification set;
Step 3: marking the engineering vehicle targets in the training set, the test set and the verification set with real frames;
Step 4: clustering the training set by using a k-means clustering algorithm to obtain k prior frame scales;
wherein each prior frame corresponds to prior frame information, the prior frame information comprising the scale of the prior frame, and the scale comprising a width and a height;
Step 5: performing data enhancement on each sample in the training set;
Step 6: dividing each enhanced sample into s × s grids;
wherein each grid corresponds to a plurality of prior frames, and each prior frame of each grid predicts a confidence and c category probabilities;
Step 7: inputting the prior frame information and the object center position coordinates into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union with the real frame, adjusting the parameters of the preset target detection model by a back-propagation algorithm based on that prior frame and the confidence of the grid in which the object center is located, calculating the offset between the prediction frame and the prior frame, and outputting the prediction frame;
Step 8: calculating a loss function of the preset target detection model based on the prediction frame and the real frame;
Step 9: repeating Step 7 to Step 8 until a first training cut-off condition is reached;
wherein the first training cut-off condition comprises: the loss function value of the preset target detection model no longer changes or is lower than a first threshold;
Step 10: determining the preset target detection model whose loss function reaches the minimum as the trained target detection model.
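A minimal sketch of Step 4 (clustering the labelled real frames into k prior-frame scales) is given below. The claim only states that k-means is used; the 1 − IoU distance between box shapes used here is the convention popularised by YOLO and is an assumption, as are the function names.

```python
# k-means clustering of ground-truth (width, height) pairs into k prior-frame scales.
import numpy as np


def iou_wh(boxes: np.ndarray, centers: np.ndarray) -> np.ndarray:
    """IoU between boxes (N, 2) and cluster centers (k, 2), using width/height only."""
    inter_w = np.minimum(boxes[:, None, 0], centers[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], centers[None, :, 1])
    inter = inter_w * inter_h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union


def kmeans_priors(boxes: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster real-frame (width, height) pairs into k prior-frame scales."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the center with the highest IoU (smallest 1 - IoU distance)
        assign = np.argmax(iou_wh(boxes, centers), axis=1)
        new_centers = np.array([
            boxes[assign == i].mean(axis=0) if np.any(assign == i) else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers  # k prior-frame (width, height) scales
```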
3. The behavior recognition method according to claim 2, wherein the step 7 comprises:
inputting the prior frame information and the object center position coordinates into the preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union with the real frame, calculating the offset between the prediction frame and that prior frame by using the following formula (1), based on that prior frame and the confidence of the grid in which the object center is located, and outputting the prediction frame;
the formula (1) is:
bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^(tw)
bh = ph · e^(th)
wherein bx represents the abscissa of the prediction frame, by represents the ordinate of the prediction frame, bw represents the width offset of the prediction frame predicted by the preset target detection model relative to the prior frame with the largest intersection-over-union with the real frame, and bh represents the height offset of that prediction frame relative to that prior frame; pw represents the width of the current prior frame and ph represents the height of the current prior frame; cx and cy represent the coordinates of the upper-left corner of the grid in which the center point is located, and σ(tx) and σ(ty) represent the distance between the center point of the prediction frame and the upper-left corner of that grid; tw represents the width offset of the prior frame predicted by the preset target detection model relative to the real frame, and th represents the height offset of the prior frame predicted by the preset target detection model relative to the real frame; σ denotes the Sigmoid function, used to quantize the coordinate offsets to the (0, 1) interval.
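A small sketch of formula (1) is given below: it decodes the network outputs (tx, ty, tw, th) into a prediction frame, given the grid cell's upper-left corner (cx, cy) and the matched prior frame's width and height (pw, ph). The function name is illustrative; the variable names follow the claim.

```python
# Decode predicted offsets into a prediction frame (formula (1)).
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_prediction(tx: float, ty: float, tw: float, th: float,
                      cx: float, cy: float, pw: float, ph: float):
    """Return the prediction-frame center (bx, by) and size (bw, bh)."""
    bx = sigmoid(tx) + cx    # center x: offset quantized to (0, 1) within the grid cell
    by = sigmoid(ty) + cy    # center y
    bw = pw * math.exp(tw)   # width scaled from the prior frame
    bh = ph * math.exp(th)   # height scaled from the prior frame
    return bx, by, bw, bh
```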
4. The behavior recognition method according to claim 2, wherein the loss function is:
loss = lbox + lcls + lobj
wherein lbox represents the position loss between the prediction frame and the real frame, λcoord represents the weight of the position loss, and B represents the number of prior frames arranged in each grid; the indicator 1ij^obj takes the value 1 if the prediction frame at position (i, j) contains an object and 0 otherwise; xi and yi represent the coordinates of the real frame, wi and hi represent the width and height of the real frame, and x̂i, ŷi, ŵi, ĥi represent the coordinates and the width and height of the prediction frame;
lcls denotes the class loss, λclass represents the weight of the class loss, and the class loss is calculated by a cross-entropy loss function; pi(c) indicates whether the class c predicted by the prediction frame is the same as the true class, being 1 if the same and 0 if different, and p̂i(c) represents the probability of predicting class c;
lobj denotes the confidence loss, λnoobj represents the weight applied when the prediction frame does not contain an actual engineering vehicle target, and λobj represents the weight applied when the prediction frame contains an actual engineering vehicle target; the indicator 1ij^noobj takes the value 1 if the prediction frame at position (i, j) contains no engineering vehicle target and 0 if it contains an engineering vehicle target; ci represents the confidence of the prediction frame, and ĉi represents the confidence predicted by the prediction frame.
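An illustrative PyTorch-style sketch of the composite loss of claim 4 (loss = lbox + lcls + lobj) follows. The detailed component formulas in the original filing are reproduced as figures; the squared-error and cross-entropy forms below follow the textual symbol descriptions and common YOLO practice, and the default weights are assumptions.

```python
# Sketch of the detection loss: position loss + class loss + confidence loss.
import torch
import torch.nn.functional as F


def detection_loss(pred_box, true_box, pred_cls, true_cls, pred_conf, obj_mask,
                   lambda_coord=5.0, lambda_class=1.0, lambda_obj=1.0, lambda_noobj=0.5):
    """pred_box/true_box: (N, 4) as (x, y, w, h); pred_cls: (N, C) logits;
    true_cls: (N,) class indices; pred_conf: (N,) logits; obj_mask: (N,) bool."""
    noobj_mask = ~obj_mask

    # lbox: position loss, only for prediction frames responsible for an object
    lbox = lambda_coord * ((pred_box[obj_mask] - true_box[obj_mask]) ** 2).sum()

    # lcls: cross-entropy class loss for object-containing prediction frames
    lcls = lambda_class * F.cross_entropy(pred_cls[obj_mask], true_cls[obj_mask],
                                          reduction="sum")

    # lobj: confidence loss, weighted differently with and without a target
    conf = torch.sigmoid(pred_conf)
    lobj = (lambda_obj * ((conf[obj_mask] - 1.0) ** 2).sum()
            + lambda_noobj * (conf[noobj_mask] ** 2).sum())

    return lbox + lcls + lobj
```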
5. The behavior recognition method according to claim 1, wherein the trained behavior recognition network is obtained by:
Step 1: acquiring a second data set;
Step 2: sequentially inputting each sample in the second data set into a preset behavior recognition network to obtain the behavior category recognized by the preset behavior recognition network;
Step 3: adjusting the parameters of the preset behavior recognition network;
Step 4: for each sample, comparing the behavior category of the sample recognized by the preset behavior recognition network with the real behavior category of the sample, and calculating a loss function of the preset behavior recognition network;
Step 5: repeating Step 2 to Step 4 until the preset behavior recognition network reaches a second training cut-off condition;
wherein the second training cut-off condition comprises: the loss function value of the preset behavior recognition network no longer changes or is lower than a second threshold;
Step 6: determining the preset behavior recognition network that reaches the second training cut-off condition as the trained behavior recognition network.
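A minimal training-loop sketch for Steps 2 to 5 of claim 5 is shown below: iterate over the second data set, compare the recognized behavior category with the real category via a loss, update the network, and stop once the loss no longer changes or falls below the second threshold. The data loader, optimizer choice and hyperparameters are assumptions, not the patented procedure.

```python
# Sketch of the behavior recognition training loop with the second cut-off condition.
import torch
import torch.nn.functional as F


def train_behavior_network(network, loader, epochs=50, second_threshold=1e-3, lr=1e-3):
    optimizer = torch.optim.SGD(network.parameters(), lr=lr, momentum=0.9)
    prev_epoch_loss = float("inf")
    for epoch in range(epochs):
        epoch_loss = 0.0
        for clips, true_classes in loader:   # each sample: stacked key frames + real category
            logits = network(clips)          # recognized behavior scores
            loss = F.cross_entropy(logits, true_classes)
            optimizer.zero_grad()
            loss.backward()                  # adjust the network parameters
            optimizer.step()
            epoch_loss += loss.item()
        # second training cut-off condition: loss below threshold or no longer changing
        if epoch_loss < second_threshold or abs(prev_epoch_loss - epoch_loss) < 1e-6:
            break
        prev_epoch_loss = epoch_loss
    return network
```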
6. The behavior recognition method according to claim 5, wherein the preset behavior recognition network is a temporal segment network (TSN), TSM temporal shift modules are connected between the residual layers of the TSN, and each TSM temporal shift module shifts the dimensional feature maps output by the preceding residual layer according to their group numbers and fills vacated positions in the feature vectors corresponding to the shifted dimensional feature maps with 0.
7. The behavior recognition method according to claim 6, wherein the step in which the TSM temporal shift module of each layer shifts the dimensional feature maps output by the preceding residual layer according to the group numbers and fills the vacated positions in the corresponding feature vectors with 0 comprises:
the TSM temporal shift module of each layer dividing the dimensional feature maps output by the preceding residual layer into 3 groups according to the time order of the video frames;
shifting the first group of dimensional feature maps one position to the left according to the time order of the images, and filling the vacated positions in the feature vectors corresponding to the shifted group with 0;
and shifting the second group of dimensional feature maps one position to the right according to the time order of the images, and filling the vacated positions in the feature vectors corresponding to the shifted group with 0.
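The sketch below illustrates the temporal shift of claims 6 and 7: the channels of the feature maps are split into three groups, one group is shifted one step backward along the frame (time) axis, one group one step forward, the third is left unshifted, and the vacated positions are filled with 0. It is written as a PyTorch module for illustration and is not the patented implementation.

```python
# Temporal shift over grouped feature maps, with zero-filled vacancies.
import torch
import torch.nn as nn


class TemporalShift(nn.Module):
    def __init__(self, num_segments: int, shift_div: int = 3):
        super().__init__()
        self.num_segments = num_segments
        self.shift_div = shift_div

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_segments, channels, height, width)
        nt, c, h, w = x.shape
        n = nt // self.num_segments
        x = x.view(n, self.num_segments, c, h, w)
        fold = c // self.shift_div
        out = torch.zeros_like(x)
        out[:, :-1, :fold] = x[:, 1:, :fold]                    # group 1: shift left, vacancy = 0
        out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]    # group 2: shift right, vacancy = 0
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]               # group 3: unshifted
        return out.view(nt, c, h, w)
```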
8. The behavior recognition method according to claim 1, wherein before the images within the prediction frame range are input into the trained behavior recognition network in the form of consecutive frames, the behavior recognition method further comprises:
dividing the images within the prediction frame range into equal segments according to the image time order, randomly extracting one frame from each segment as a key frame, and stacking all the key frames to obtain divided image data;
and inputting the image data into the trained behavior recognition network.
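A sketch of the sampling step in claim 8 follows: the frames are divided into equal segments in time order, one frame is drawn at random from each segment as a key frame, and the key frames are stacked before being input to the network. The function name and the assumption that the frames are already resized to a common shape are illustrative.

```python
# Segment-wise random key-frame sampling and stacking.
import random
from typing import List

import numpy as np


def sample_key_frames(frames: List[np.ndarray], num_segments: int = 8, seed=None) -> np.ndarray:
    """Divide `frames` into `num_segments` equal segments in time order and randomly
    pick one key frame per segment; frames are assumed to share the same shape.
    Returns the stacked key frames with shape (num_segments, H, W, C)."""
    rng = random.Random(seed)
    bounds = np.linspace(0, len(frames), num_segments + 1, dtype=int)
    key_frames = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        idx = rng.randrange(start, max(end, start + 1))   # guard against empty segments
        key_frames.append(frames[min(idx, len(frames) - 1)])
    return np.stack(key_frames)
```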
9. The behavior recognition method according to claim 8, wherein the trained behavior recognition network outputs the recognition result as:
Output = {TSN1(T1, T2, ..., Tk), TSN2(T1, T2, ..., Tk), ..., TSNm(T1, T2, ..., Tk)}
TSN(T1, T2, ..., Tk) = H(G(F(T1, w), F(T2, w), ..., F(Tk, w)))
wherein (T1, T2, ..., Tk) represents a sequence of video key frames, each key frame Tk being obtained by random sampling from its corresponding video segment Sk; F(Tk, w) denotes the convolutional network with parameters w acting on frame Tk, the function F returning the scores of Tk against all categories; G is the segment consensus function, which combines the scores of the multiple key frames Tk; and H is the softmax prediction function, used to predict the probability that the whole video segment belongs to each behavior category.
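The short sketch below mirrors the output formula of claim 9: F scores every key frame, the segment consensus function G combines the per-frame scores (averaging is assumed here, since the claim does not fix G), and the softmax H yields per-behavior probabilities. The backbone is a placeholder.

```python
# TSN-style forward pass: per-frame scores -> segment consensus -> softmax.
import torch
import torch.nn.functional as F


def tsn_forward(backbone, key_frames: torch.Tensor) -> torch.Tensor:
    """key_frames: (num_segments, C, H, W) for one clip of one target.
    Returns the probability of the clip belonging to each behavior category."""
    per_frame_scores = backbone(key_frames)     # F(T_k, w): scores of every key frame
    consensus = per_frame_scores.mean(dim=0)    # G: segment consensus (average, assumed)
    return F.softmax(consensus, dim=-1)         # H: softmax prediction over categories
```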
10. A behavior recognition device for a multi-class engineering vehicle, comprising:
the acquisition module is used for acquiring a video to be identified;
wherein the video to be identified comprises a plurality of frames of images, and each frame of image contains a plurality of engineering vehicle targets;
the detection module is used for inputting the video to be recognized into a trained target detection model so as to enable the trained target detection model to recognize the video to be recognized and output a prediction frame;
wherein the prediction frame contains an engineering vehicle target in the video to be recognized, and the prediction frame in which the engineering vehicle target is located corresponds to the position coordinates and the category of that target; the trained target detection model is obtained through a first training set, the first training set comprising a plurality of first samples in which each engineering vehicle target is marked with a real frame; the first training set is clustered to obtain k prior frames, and the prior frames are input into a preset target detection model so that the preset target detection model determines the prior frame with the largest intersection-over-union with the real frame, calculates the offset between the prediction frame and that prior frame, and outputs a prediction frame containing the target; the preset target detection model is iteratively trained until a first training cut-off condition is reached;
the recognition module is used for inputting the images within the prediction frame range, in the form of consecutive frames, into a trained behavior recognition network so that the behavior recognition network extracts key frames of the video to be recognized and recognizes the behavior of the engineering vehicle target, obtaining the category to which the behavior of the engineering vehicle target in the video to be recognized belongs;
wherein the trained behavior recognition network is obtained by: obtaining a second training set comprising a plurality of second samples, each second sample containing the real behavior category of an engineering vehicle target; inputting the second samples into a preset behavior recognition network so that the dimensional feature maps output by each layer of the preset behavior recognition network are grouped according to the time order of the input images, with the number of dimensional feature maps in each group differing as little as possible; shifting each group of dimensional feature maps according to its group number and filling vacated positions in the feature vectors corresponding to the shifted dimensional feature maps with 0; and iteratively training the preset behavior recognition network until a second training cut-off condition is reached, thereby obtaining the trained behavior recognition network.
CN202110098578.5A 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle Active CN112800934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110098578.5A CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110098578.5A CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle

Publications (2)

Publication Number Publication Date
CN112800934A true CN112800934A (en) 2021-05-14
CN112800934B CN112800934B (en) 2023-08-08

Family

ID=75811658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110098578.5A Active CN112800934B (en) 2021-01-25 2021-01-25 Behavior recognition method and device for multi-class engineering vehicle

Country Status (1)

Country Link
CN (1) CN112800934B (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086792A (en) * 2018-06-26 2018-12-25 上海理工大学 Based on the fine granularity image classification method for detecting and identifying the network architecture
WO2020206861A1 (en) * 2019-04-08 2020-10-15 江西理工大学 Yolo v3-based detection method for key object at transportation junction
CN111950583A (en) * 2020-06-05 2020-11-17 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM clustering
CN112084890A (en) * 2020-08-21 2020-12-15 杭州电子科技大学 Multi-scale traffic signal sign identification method based on GMM and CQFL

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WANG Jianlin; FU Xuesong; HUANG Zhanchao; GUO Yongqi; WANG Rutong; ZHAO Liqiang: "Multi-type cooperative target detection based on improved YOLOv2 convolutional neural network", Optics and Precision Engineering, no. 01 *
ZHAO Yuhang; ZUO Chenyu; ZHU Junjie; QIAN Cheng: "Vehicle detection method for UAV aerial photography based on YOLO V3", Electronics World, no. 13 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361519A (en) * 2021-05-21 2021-09-07 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113361519B (en) * 2021-05-21 2023-07-28 北京百度网讯科技有限公司 Target processing method, training method of target processing model and device thereof
CN113255616A (en) * 2021-07-07 2021-08-13 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN113255616B (en) * 2021-07-07 2021-09-21 中国人民解放军国防科技大学 Video behavior identification method based on deep learning
CN114419508A (en) * 2022-01-19 2022-04-29 北京百度网讯科技有限公司 Recognition method, training method, device, equipment and storage medium
CN115131606A (en) * 2022-06-15 2022-09-30 重庆文理学院 Two-stage process action detection method based on YOLO-TSM

Also Published As

Publication number Publication date
CN112800934B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN109118479B (en) Capsule network-based insulator defect identification and positioning device and method
CN111091105B (en) Remote sensing image target detection method based on new frame regression loss function
CN112233097B (en) Road scene other vehicle detection system and method based on space-time domain multi-dimensional fusion
CN112800934A (en) Behavior identification method and device for multi-class engineering vehicle
CN109784203B (en) Method for inspecting contraband in weak supervision X-ray image based on layered propagation and activation
CN112183414A (en) Weak supervision remote sensing target detection method based on mixed hole convolution
CN111476302A (en) fast-RCNN target object detection method based on deep reinforcement learning
CN110991444B (en) License plate recognition method and device for complex scene
CN113920107A (en) Insulator damage detection method based on improved yolov5 algorithm
CN114841972A (en) Power transmission line defect identification method based on saliency map and semantic embedded feature pyramid
CN112633149B (en) Domain-adaptive foggy-day image target detection method and device
CN111832615A (en) Sample expansion method and system based on foreground and background feature fusion
CN111681259B (en) Vehicle tracking model building method based on Anchor mechanism-free detection network
CN108171119B (en) SAR image change detection method based on residual error network
CN111833353B (en) Hyperspectral target detection method based on image segmentation
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN115984537A (en) Image processing method and device and related equipment
CN116310850B (en) Remote sensing image target detection method based on improved RetinaNet
CN112906816A (en) Target detection method and device based on optical differential and two-channel neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113609895A (en) Road traffic information acquisition method based on improved Yolov3
CN114972759A (en) Remote sensing image semantic segmentation method based on hierarchical contour cost function
CN115937659A (en) Mask-RCNN-based multi-target detection method in indoor complex environment
CN117274774A (en) Yolov 7-based X-ray security inspection image dangerous goods detection algorithm
CN112418358A (en) Vehicle multi-attribute classification method for strengthening deep fusion network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant