CN113435430B - Video behavior identification method, system and equipment based on self-adaptive space-time entanglement - Google Patents

Video behavior identification method, system and equipment based on self-adaptive space-time entanglement

Info

Publication number
CN113435430B
Authority
CN
China
Prior art keywords
feature
time
behavior
weight
weighting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110992358.7A
Other languages
Chinese (zh)
Other versions
CN113435430A (en)
Inventor
陈盈盈
周鲁
胡益珲
王金桥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110992358.7A priority Critical patent/CN113435430B/en
Publication of CN113435430A publication Critical patent/CN113435430A/en
Application granted granted Critical
Publication of CN113435430B publication Critical patent/CN113435430B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer vision, and particularly relates to a video behavior identification method, system and device based on adaptive space-time entanglement, aiming at solving the problem that existing behavior identification methods do not account for the differing influence of spatio-temporal cues on different action classes, which results in poor robustness of behavior category identification. The method comprises: obtaining an image on which behavior recognition is to be performed from an input video stream as an input image; and acquiring the behavior category of the input image through a trained behavior recognition model, wherein the behavior recognition model is constructed based on a convolutional neural network. The invention improves the robustness of behavior category identification.

Description

Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
Technical Field
The invention belongs to the field of computer vision, and particularly relates to a video behavior identification method, system and device based on adaptive space-time entanglement.
Background
Spatio-temporal information modeling is one of the core problems of video behavior recognition. It is a fundamental task for analyzing the behaviors in a video and provides a basis for tasks such as video behavior detection and segmentation. Its purpose is to identify a specific category of behavior from a given video sequence.
In recent years, some mainstream methods include a dual-stream network and a 3D convolution network, in which the former extracts RGB and optical flow features through two parallel networks, respectively, and the latter models temporal and spatial information simultaneously through 3D convolution. However, a large number of model parameters and calculations limit efficiency, and therefore some improvements are made to model temporal and spatial information separately by decomposing a three-dimensional convolution into a two-dimensional spatial convolution and a one-dimensional temporal convolution.
The mainstream approach is to extract better spatio-temporal features by designing different network structures, but it overlooks the differing influence of spatio-temporal cues on different action classes. For example, some actions can easily be distinguished from a single picture, even without the help of temporal information, because they have significant spatial information in different scenes and can therefore be predicted as action classes with a high degree of confidence. However, temporal information is essential for fine-grained motion recognition, such as discriminating the pushing and pulling of a string in a 'violin' action. Because video usually contains rich and interrelated content, modeling the spatio-temporal features of such multidimensional information by independent decomposition is not enough: the correlation of spatio-temporal information differs greatly between action categories, and the contribution of spatio-temporal information to the identification process differs as well. In addition, the temporal boundary of an action is not clear, i.e. the start time and end time of the action are uncertain and its duration is not fixed. Based on this, the invention provides a video behavior identification method based on adaptive space-time entanglement.
Disclosure of Invention
In order to solve the above problems in the prior art, that is, to solve the problem that the existing behavior identification method does not notice the differentiation influence of spatio-temporal clues on different action classes, which results in poor robustness of behavior category identification, a first aspect of the present invention provides a video behavior identification method based on adaptive spatio-temporal entanglement, the method comprising:
s10, acquiring an image to be behavior recognized from the input video stream as an input image;
s20, acquiring the behavior category of the input image through the trained behavior recognition model;
the behavior recognition model is constructed based on a convolutional neural network, and the training method comprises the following steps:
a10, acquiring training sample images and behavior category truth labels corresponding to the training samples; the training sample image is an image to be identified by behaviors in a video data set acquired according to time sequence information;
a20, extracting the characteristics of the training sample image as first characteristics;
a30, acquiring the current training times T, and initializing a weight parameter if T is 1; otherwise, acquiring the updated weight parameter after T-1 times of training; the weight parameters comprise a first time characteristic weight, a second time characteristic weight, a third time characteristic weight and a first space characteristic weight;
2D convolution processing is carried out on the first features, and weighting processing is carried out on the first features after 2D convolution by combining the first spatial feature weight to obtain second features; carrying out batch normalization and activation processing on the second characteristics to obtain third characteristics;
a40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight by combining the feature similarity; weighting the third feature through the adjusted first time feature weight, inputting the weighted third feature into a 1D time convolution module, weighting the third feature through the adjusted second time feature weight, and inputting the weighted third feature into a high cohesion time expression module; splicing the processed features of the 1D time convolution module and the high cohesion time expression module to serve as a fourth feature; the 1D time convolution module and the high cohesion time expression module are t×1×1 convolutional layers;
a50, weighting the fourth feature by combining the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
a60, updating the weight parameters by a preset weight parameter updating method based on the predicted behavior types and the behavior type truth value labels, and jumping to the step A10 after updating until a trained behavior recognition model is obtained.
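To make the order of these steps concrete, the following is a minimal sketch of one training iteration covering steps A20 to A60; the module and weight names (extract, spatial_2d, high_cohesion, alpha_s, alpha_t3 and so on) are illustrative assumptions rather than the patented implementation.
```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, frames, labels, weights):
    """One pass of steps A20-A60 for a batch of frames shaped (B, T, C, H, W)."""
    feats = model.extract(frames)                              # A20: first features
    x = model.spatial_2d(feats) * weights["alpha_s"]           # A30: 2D convolution + first spatial weight -> second features
    x = F.relu(model.bn(x))                                    # A30: batch normalization + activation -> third features
    w_t1, w_t2 = model.adjust_temporal_weights(x, weights)     # A40: similarity-based weight adjustment
    fused = torch.cat([model.temporal_1d(x * w_t1),            # A40: 1D temporal convolution branch
                       model.high_cohesion(x * w_t2)], dim=1)  # A40: high cohesion branch -> fourth features
    logits = model.classifier(fused * weights["alpha_t3"])     # A50: predicted behavior category
    loss = F.cross_entropy(logits, labels)                     # A60: loss driving the weight parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```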
In some preferred embodiments, the method of "calculating feature similarity between adjacent frames based on the third feature, and adjusting the first temporal feature weight and the second temporal feature weight in combination with the feature similarity" includes:
calculating the cosine similarity between frames in a set dimension;
if the cosine similarity is greater than the set similarity threshold, setting the corresponding cosine similarity as a positive value, otherwise, setting the corresponding cosine similarity as a negative value;
weighting the first time feature weight and the second time feature weight based on the cosine similarity after setting the positive and negative values; and adding the weighted first time feature weight to the unweighted first time feature weight, and the weighted second time feature weight to the unweighted second time feature weight, to serve as the adjusted time feature weights.
In some preferred embodiments, the processing procedure of the weighted third feature by the high cohesion time expression module is as follows:
performing convolution and down-sampling processing on the third feature after the weighting processing of the second time feature weight, and taking the processed third feature as an original feature;
multiplying a preset basis vector by the original feature, reconstructing the multiplied feature through softmax, and taking the reconstructed feature as an attention feature;
multiplying the attention feature by the original feature to serve as an updated base vector, and further updating the attention feature;
updating the basis vectors and the attention characteristics in a circulating manner until the set circulating step number is reached;
and multiplying the finally obtained attention feature by the finally updated basis vector, and adding the multiplied attention feature and the third feature subjected to the weighting processing of the second time feature weight.
In some preferred embodiments, during the updating process of the basis vectors, L2 regularization and moving average updating are performed on the basis vectors:
the moving average updating method is:
$$\mu_{mean} \leftarrow momentum \cdot \mu_{mean} + (1 - momentum) \cdot \mu$$
wherein $\mu$ is a basis vector, $\mu_{mean}$ is the mean value of the basis vectors, and $momentum$ is the momentum, i.e. a preset weight.
In some preferred embodiments, the "updating the weight parameter by a preset weight parameter updating method" includes:
$$\min_{\alpha,\, w}\; \mathcal{L}(\alpha, w)$$
wherein $\mathcal{L}$ represents the loss function, $\alpha$ represents the structural parameters of the network, and $w$ represents the network parameters; the structural parameters refer to the parameters used to weight space and time, and the network parameters refer to the parameters other than the structural parameters.
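Viewed this way, the structural parameters and the network parameters are ordinary learnable parameters that can be trained jointly by gradient descent. The sketch below illustrates that idea for one (2+1)D block, with the linear mapping of the weight vector realised as a fully connected layer as elaborated in the detailed description below; the class name, the zero initialisation and the parallel-sum fusion are assumptions, and the two candidate operations are presumed to return tensors of the same shape.
```python
import torch
import torch.nn as nn

class StructuralFusion(nn.Module):
    """Fuses candidate operations with coefficients derived from the structural parameters (a sketch)."""
    def __init__(self, spatial_op, temporal_op):
        super().__init__()
        self.ops = nn.ModuleList([spatial_op, temporal_op])     # the 2D / 1D candidate operations
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))   # structural parameters of this block
        self.f = nn.Linear(len(self.ops), len(self.ops))        # linear mapping set as one fully connected layer

    def forward(self, x):
        coeff = self.f(self.alpha)                               # per-operation fusion coefficients
        return sum(c * op(x) for c, op in zip(coeff, self.ops))  # weighted fusion of the decoupled features

# Because alpha and the convolution weights are both ordinary parameters of the module,
# a single optimizer can minimise the loss over (alpha, w) end to end.
```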
In some preferred embodiments, the "updating the weight parameter by a preset weight parameter updating method" includes:
$$\mathcal{L}_{CE}(y, \hat{y}) = -\sum y \log \hat{y}$$
$$SCE = -(1-\varepsilon)\log \hat{y}_i - \sum_{j \neq i} \frac{\varepsilon}{K-1} \log \hat{y}_j$$
$$R = \mathrm{sign}\big(SCE_n - SCE_m\big)$$
$$\max_{\theta}\; J(\theta) = \mathbb{E}_{\pi_\theta(a \mid s)}\big[R(s, a)\big]$$
wherein $y$ represents the behavior class truth label, $\hat{y}$ represents the predicted behavior class, $\mathcal{L}_{CE}$ represents the cross-entropy loss, $\varepsilon$ represents the preset smoothing weight, $i$, $j$ and $K$ denote the correct behavior class, the other behavior classes and the total number of behavior classes (the correct class being the behavior class truth label, and the other behavior classes being the classes other than the correct one), $SCE$ represents the cross-entropy loss after smoothing, $R$ represents the reward value, $n$ and $m$ represent time steps, $\mathrm{sign}$ represents the sign function, $a$ represents the current action, $s$ represents the current state, $\theta$ represents the parameters of the convolutional neural network, and $R(s, a)$ represents the reward.
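As a small worked illustration of the smoothing above (the numbers are illustrative and assume the standard label-smoothing form written out in the formula, not values taken from the patent): with $K = 5$ behavior classes and a smoothing weight $\varepsilon = 0.1$, the correct class keeps a target weight of $1 - \varepsilon = 0.9$, while each of the $K - 1 = 4$ other classes receives $\varepsilon/(K-1) = 0.025$, so the smoothed cross-entropy becomes $SCE = -0.9\log\hat{y}_i - 0.025\sum_{j\neq i}\log\hat{y}_j$.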
In a second aspect of the present invention, a video behavior recognition system based on adaptive space-time entanglement is provided, the system comprising: the device comprises an image acquisition module and a behavior category identification module;
the image acquisition module is configured to acquire an image to be identified by a behavior from an input video stream as an input image;
the behavior category identification module is configured to acquire a behavior category of the input image through a trained behavior identification model;
the behavior recognition model is constructed based on a convolutional neural network, and the training method comprises the following steps:
a10, acquiring training sample images and behavior category truth labels corresponding to the training samples; the training sample image is an image to be identified by behaviors in a video data set acquired according to time sequence information;
a20, extracting the characteristics of the training sample image as first characteristics;
a30, acquiring the current training times T, and initializing a weight parameter if T is 1; otherwise, acquiring the updated weight parameter after T-1 times of training; the weight parameters comprise a first time characteristic weight, a second time characteristic weight, a third time characteristic weight and a first space characteristic weight;
2D convolution processing is carried out on the first features, and weighting processing is carried out on the first features after 2D convolution by combining the first spatial feature weight to obtain second features; carrying out batch normalization and activation processing on the second characteristics to obtain third characteristics;
a40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight by combining the feature similarity; weighting the third feature through the adjusted first time feature weight, inputting the weighted third feature into a 1D time convolution module, weighting the third feature through the adjusted second time feature weight, and inputting the weighted third feature into a high cohesion time expression module; splicing the processed features of the 1D time convolution module and the high cohesion time expression module to serve as a fourth feature; the 1D time convolution module and the high cohesion time expression module are t×1×1 convolutional layers;
a50, weighting the fourth feature by combining the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
a60, updating the weight parameters by a preset weight parameter updating method based on the predicted behavior types and the behavior type truth value labels, and jumping to the step A10 after updating until a trained behavior recognition model is obtained.
In a third aspect of the invention, an electronic device is proposed, at least one processor; and a memory communicatively coupled to at least one of the processors; wherein the memory stores instructions executable by the processor for execution by the processor to implement the adaptive spatiotemporal entanglement-based video behavior recognition method described above.
In a fourth aspect of the present invention, a computer-readable storage medium is provided, where the computer-readable storage medium stores computer instructions for being executed by the computer to implement the above-mentioned adaptive spatiotemporal entanglement-based video behavior identification method.
The invention has the beneficial effects that:
the invention improves the robustness of behavior category identification.
1) The invention adopts a network structure search strategy to adaptively adjust the weights of temporal and spatial information, mines the deep association between spatial and temporal information according to the difference in their contributions during behavior identification, and co-learns the interaction of time and space; meanwhile, according to the prior information of the action rhythm and the structural parameters of the temporal convolution, a highly cohesive expression of the temporal information is obtained, so that actions with different rhythms are adjusted and the robustness of behavior category identification is improved.
2) According to the invention, structural parameters are generated according to the current network state, and Auto(2+1)D adjusts the temporal and spatial information according to the current structural parameters, so that the decoupled features are fused and the spatio-temporal modeling capability is enhanced. The rhythm information of the action is adjusted according to the similarity of the extracted features in the time dimension and the structural parameters, thereby alleviating the differences in feature expression caused by the same action being performed at different rhythms. Finally, the adjusted features are used to predict the behavior category, so that the robustness of behavior identification is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a flow chart of a video behavior recognition method based on adaptive space-time entanglement according to an embodiment of the invention;
FIG. 2 is a block diagram of a video behavior recognition system based on adaptive spatiotemporal entanglement according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a convolution block structure according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an Auto (2+1) D-based adaptive fusion process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of adjusting temporal feature weights according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a high cohesion time expression module optimization process according to one embodiment of the present invention;
fig. 7 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The invention discloses a video behavior identification method based on adaptive space-time entanglement, which comprises the following steps of:
s10, acquiring an image to be behavior recognized from the input video stream as an input image;
s20, acquiring the behavior category of the input image through the trained behavior recognition model;
the behavior recognition model is constructed based on a convolutional neural network, and the training method comprises the following steps:
a10, acquiring training sample images and behavior category truth labels corresponding to the training samples; the training sample image is an image to be identified by behaviors in a video data set acquired according to time sequence information;
a20, extracting the characteristics of the training sample image as first characteristics;
a30, acquiring the current training times T, and initializing a weight parameter if T is 1; otherwise, acquiring the updated weight parameter after T-1 times of training; the weight parameters comprise a first time characteristic weight, a second time characteristic weight, a third time characteristic weight and a first space characteristic weight;
2D convolution processing is carried out on the first features, and weighting processing is carried out on the first features after 2D convolution by combining the first spatial feature weight to obtain second features; carrying out batch normalization and activation processing on the second characteristics to obtain third characteristics;
a40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight by combining the feature similarity; weighting the third feature through the adjusted first time feature weight, inputting the weighted third feature into a 1D time convolution module, weighting the third feature through the adjusted second time feature weight, and inputting the weighted third feature into a high cohesion time expression module; splicing the processed features of the 1D time convolution module and the high cohesion time expression module to serve as a fourth feature; the 1D time convolution module and the high cohesion time expression module are t×1×1 convolutional layers;
a50, weighting the fourth feature by combining the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
a60, updating the weight parameters by a preset weight parameter updating method based on the predicted behavior types and the behavior type truth value labels, and jumping to the step A10 after updating until a trained behavior recognition model is obtained.
In order to more clearly describe the video behavior recognition method based on adaptive space-time entanglement, the following describes in detail the steps of an embodiment of the method in conjunction with the accompanying drawings.
In the following embodiments, a training process of a behavior recognition model is explained first, and then a process of obtaining a behavior recognition result by a video behavior recognition method based on adaptive space-time entanglement is described in detail.
1. Training process for behavior recognition model
A10, acquiring training sample images and behavior category truth labels corresponding to the training samples; the training sample image is an image to be identified by behaviors in a video data set acquired according to time sequence information;
in this embodiment, an image to be behavior-recognized in a video data set is obtained according to timing sequence information, and is used as a training sample image, and a behavior class true value label corresponding to the training sample image, that is, a correct behavior class is obtained.
A20, extracting the characteristics of the training sample image as first characteristics;
in this embodiment, the features of the training sample image are extracted by a convolutional neural network.
A30, acquiring the current training times T, and initializing a weight parameter if T is 1; otherwise, acquiring the updated weight parameter after T-1 times of training; the weight parameters comprise a first time characteristic weight, a second time characteristic weight, a third time characteristic weight and a first space characteristic weight;
2D convolution processing is carried out on the first features, and weighting processing is carried out on the first features after 2D convolution by combining the first spatial feature weight to obtain second features; carrying out batch normalization and activation processing on the second characteristics to obtain third characteristics;
in the invention, the time and space information in the characteristics is decoupled by convolution of 2D and 1D (called Auto (2+1) D for short), and independent modeling is carried out. And adaptively fusing the decoupled information through structural parameters (weight parameters). The nonlinear expression capability of the model is increased by the activation function. They constitute a basic volume Block that can be used to replace the Block structure in networks such as the ResNet network. Auto (2+1) D is composed of a 2D convolution and a 1D convolution connected in sequence, their corresponding structural parameters, and an activation function, as shown in fig. 3.
In this embodiment, the weight parameters are first obtained according to the current number of training iterations of the behavior recognition model. The weight parameters comprise the first time feature weight, the second time feature weight, the third time feature weight and the first spatial feature weight.
If the training is the first training, initializing the weight parameter. Namely, a multidimensional structure parameter vector (weight parameter vector) is predefined, wherein the dimensions respectively represent the structure parameters corresponding to the space convolution and the time convolution, and the structure parameters are applied to the space convolution and the time convolution to fuse the characteristics of the space convolution and the time convolution. Otherwise, acquiring the updated weight parameters after T-1 times of training.
After the weight parameters are obtained, 2D convolution processing is carried out on the first features through a 2D convolution module, and the first features after 2D convolution are weighted in combination with the first spatial feature weight to obtain the second features; batch normalization and activation processing are then carried out on the second features to obtain the third features. As shown in fig. 3 and fig. 5, the 2D convolution module is a 1×d×d convolutional layer, and d is preferably 3 in the present invention.
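To make the block in fig. 3 concrete, the following is a minimal PyTorch sketch of an Auto(2+1)D-style block: a 1×d×d spatial convolution and a t×1×1 temporal convolution connected in sequence, each scaled by a learnable structural parameter. The use of Conv3d, the layer names and the scalar form of the structural parameters are illustrative assumptions, not the patented implementation.
```python
import torch
import torch.nn as nn

class Auto2Plus1D(nn.Module):
    def __init__(self, channels, d=3, t=3):
        super().__init__()
        # 1 x d x d spatial convolution (d = 3 preferred in the text)
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2), bias=False)
        # t x 1 x 1 temporal convolution (t = 3 preferred in the text)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(t, 1, 1),
                                  padding=(t // 2, 0, 0), bias=False)
        self.bn = nn.BatchNorm3d(channels)
        # structural parameters weighting the spatial and temporal paths
        self.alpha_s = nn.Parameter(torch.ones(1))
        self.alpha_t = nn.Parameter(torch.ones(1))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                                        # x: (B, C, T, H, W)
        s = self.act(self.bn(self.spatial(x) * self.alpha_s))    # weighted 2D features
        return self.act(self.temporal(s) * self.alpha_t)         # weighted 1D features
```
A block of this kind can be dropped in place of a ResNet block operating on clips shaped (B, C, T, H, W).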
A40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the time feature weight by combining the feature similarity; weighting the third feature through the adjusted first time feature weight, inputting the weighted third feature into a 1D time convolution module, weighting the third feature through the adjusted second time feature weight, and inputting the weighted third feature into a high-cohesion time expression module; splicing the processed features of the 1D time convolution module and the high cohesion time expression module to serve as a fourth feature; the 1D time convolution module and the high cohesion time expression module are convolution layers of tx1x 1;
in this embodiment, the feature similarity between adjacent frames, that is, the cosine distance similarity between the third features between adjacent frames, is calculated in a set dimension to measure the variation degree of the training sample image in the time dimension, and the current structural parameters are divided into positive and negative parameters based on a variation degree threshold, which act on the two structures in fig. 4 and 5. Specifically, the structure parameters after excitation correction and the originally input structure parameters are combined in a residual connection mode to serve as final structure parameters (namely, the cosine similarity between frames is calculated on a set dimension, if the cosine similarity is larger than a set similarity threshold, the corresponding cosine similarity is set to be a positive value, otherwise, the corresponding cosine similarity is set to be a negative value, and the time characteristic weight is weighted based on the cosine similarity after setting positive and negative values
Figure 405288DEST_PATH_IMAGE035
Carrying out weighting; will be weighted
Figure 355926DEST_PATH_IMAGE030
And not weighted
Figure 769590DEST_PATH_IMAGE030
Adding, weighted
Figure 918812DEST_PATH_IMAGE031
And not weighted
Figure 330201DEST_PATH_IMAGE031
Added as adjusted temporal feature weights).
Once the network has reached a certain degree of optimization, the element values of the tensor no longer show large variance and their differences become small, so the method dynamically adjusts the threshold by setting a margin to encode the current prior similarity information, i.e. a rhythm regulator, which adjusts the rhythms of different actions so as to enlarge the cohesiveness of the information in the time dimension.
The third feature is weighted by the adjusted first time feature weight and input into the 1D time convolution module, and the third feature is weighted by the adjusted second time feature weight and input into the high cohesion time expression module; the features processed by the 1D time convolution module and the high cohesion time expression module are spliced to serve as the fourth feature. The 1D time convolution module and the high cohesion time expression module are t×1×1 convolutional layers; in the present invention, t is preferably set to 3.
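The weight adjustment described above can be sketched as follows; the pooling choice, the margin value and the scalar treatment of the weights are illustrative assumptions rather than the exact rhythm regulator of the patent.
```python
import torch
import torch.nn.functional as F

def adjust_temporal_weights(x, w_t1, w_t2, margin=0.5):
    """x: third features shaped (B, C, T, H, W); w_t1, w_t2: first/second time feature weights."""
    frame_feats = x.mean(dim=(3, 4))                             # (B, C, T) per-frame descriptors
    sim = F.cosine_similarity(frame_feats[..., 1:],              # cosine similarity between
                              frame_feats[..., :-1], dim=1)      # adjacent frames, shape (B, T-1)
    signed = torch.where(sim > margin, sim.abs(), -sim.abs())    # positive above the margin, negative below
    factor = signed.mean()                                       # overall degree of temporal change
    # residual combination: weighted weight added back to the unweighted weight
    return w_t1 + factor * w_t1, w_t2 + factor * w_t2
```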
The 1D time convolution module is used for performing convolution processing on the weighted third features;
the high-cohesion time expression module obtains high-cohesion time expression by using an attention mechanism based on EM algorithm optimization, and for each training sample image, the characteristics are reconstructed through fixed-round iterative optimization. This process can be divided into E and M steps, first assuming a base vector B
Figure 118980DEST_PATH_IMAGE034
C
Figure 575369DEST_PATH_IMAGE034
And K, wherein B is the size of batch, C is the number of channels corresponding to the original input features, and K is the dimensionality of the basis vector. In step E, by using the base vectors and B
Figure 347147DEST_PATH_IMAGE034
(H
Figure 408644DEST_PATH_IMAGE034
W)
Figure 224153DEST_PATH_IMAGE036
Performing matrix multiplication on the original vector of C, and then reconstructing the original characteristics by softmax to obtain the size B
Figure 167838DEST_PATH_IMAGE034
(H
Figure 726996DEST_PATH_IMAGE034
W)
Figure 642999DEST_PATH_IMAGE036
And (K) a characteristic diagram. In step M, the dimension is B
Figure 391861DEST_PATH_IMAGE034
(H
Figure 88421DEST_PATH_IMAGE034
W)
Figure 388953DEST_PATH_IMAGE036
Reconstructed feature map of K and B
Figure 956200DEST_PATH_IMAGE034
(H
Figure 113512DEST_PATH_IMAGE034
W)
Figure 782522DEST_PATH_IMAGE036
C is multiplied to obtain a new base vector B
Figure 683482DEST_PATH_IMAGE034
C
Figure 574078DEST_PATH_IMAGE034
K. Namely the weighted third character of the expression module pair with high cohesion timeThe specific processing procedure of characterization is as shown in fig. 6:
performing convolution and down-sampling processing on the third feature after the weighting processing of the second time feature weight, and taking the processed third feature as an original feature;
multiplying a preset base vector by the original features, and taking the multiplied features as attention features;
multiplying the attention feature by the original feature to serve as an updated base vector, and further updating the attention feature;
updating the basis vectors and the attention characteristics in a circulating manner until the set circulating step number is reached;
and multiplying the finally obtained attention feature by the finally updated basis vector, and adding the multiplied attention feature and the third feature subjected to the weighting processing of the second time feature weight to obtain a finally reconstructed feature map with global information, which is obtained by the high-cohesion time expression module.
Furthermore, to ensure the stability of the basis vector update, the basis vectors are regularized by L2, and a moving average update of the basis vectors is added during training:
the moving average updating method is:
$$\mu_{mean} \leftarrow momentum \cdot \mu_{mean} + (1 - momentum) \cdot \mu \qquad (1)$$
wherein $\mu$ is a basis vector, $\mu_{mean}$ is the mean value of the basis vectors, and $momentum$ is the momentum, i.e. a preset weight.
And finally, splicing the features processed by the 1D time convolution module and the high cohesion time expression module to serve as a fourth feature.
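A compact sketch of this high cohesion branch, including the L2 normalization and the moving-average update of the basis vectors from formula (1), is given below. It operates on (B, C, H, W) features for brevity; the 1×1 convolution, the omission of the down-sampling step, the number of bases K and the buffer handling are illustrative assumptions, not the patented module.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HighCohesionTemporal(nn.Module):
    def __init__(self, channels, k=64, steps=3, momentum=0.9):
        super().__init__()
        self.conv_in = nn.Conv2d(channels, channels, kernel_size=1)   # convolution before the E/M iterations
        self.register_buffer("mu_mean", F.normalize(torch.randn(1, channels, k), dim=1))
        self.steps, self.momentum = steps, momentum

    def forward(self, x):                                   # x: weighted third features, (B, C, H, W)
        feat = self.conv_in(x)
        b, c, h, w = feat.shape
        f = feat.view(b, c, h * w).permute(0, 2, 1)         # B x (H*W) x C original features
        mu = self.mu_mean.expand(b, -1, -1)                 # B x C x K basis vectors
        attn = None
        for _ in range(self.steps):                         # fixed number of E/M rounds
            attn = F.softmax(torch.bmm(f, mu), dim=2)       # E step: B x (H*W) x K attention map
            mu = torch.bmm(f.permute(0, 2, 1), attn)        # M step: B x C x K updated bases
            mu = F.normalize(mu, dim=1)                     # L2 regularization of the bases
        if self.training:                                   # moving-average update of the basis mean
            new_mean = mu.detach().mean(dim=0, keepdim=True)
            self.mu_mean = self.momentum * self.mu_mean + (1 - self.momentum) * new_mean
        recon = torch.bmm(attn, mu.permute(0, 2, 1))        # B x (H*W) x C reconstructed features
        return x + recon.permute(0, 2, 1).view(b, c, h, w)  # residual connection back to the input
```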
A50, weighting the fourth feature by combining the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
In the present embodiment, the fourth feature is weighted in combination with the third time feature weight, and the behavior category is predicted by the classifier based on the weighted fourth feature, so as to obtain the predicted behavior category.
A60, updating the weight parameters by a preset weight parameter updating method based on the predicted behavior types and the behavior type truth value labels, and jumping to the step A10 after updating until a trained behavior recognition model is obtained.
In this embodiment, the weight parameters in the convolutional neural network (i.e. the first time feature weight, the second time feature weight, the third time feature weight and the first spatial feature weight) are updated by a preset weight parameter updating method.
In the invention, the structural parameters corresponding to the temporal and spatial convolutions to be fused are optimized and updated in two ways: differentiable updating and policy-gradient updating.
The specific process of the differentiable updating is as follows:
Denote the operation space as $\mathcal{O}$ and let $o$ be a specific operation in it. A node refers to a basic operation unit set in the network structure search method; $i$ and $j$ are two sequentially adjacent nodes, the weight of the set of candidate operations between them is denoted $\alpha^{(i,j)}$, and $P$ is the corresponding probability distribution. The candidate operation with the maximum probability between nodes $i$ and $j$ is obtained by the max function, and the final network structure is formed by stacking the operations obtained by searching between different nodes, as shown in the following formula:
$$o^{(i,j)} = \arg\max_{o \in \mathcal{O}} P_{o}^{(i,j)} \qquad (2)$$
Looking at the method laterally, it is equivalent to learning which specific operation is selected: the operation space is limited to the cascaded 2D convolution and 1D convolution, which are optimized directly by the gradient, as shown in the following formula:
$$w \leftarrow w - \nabla_{w}\,\mathcal{L}_{train}(w) \qquad (3)$$
wherein $w$ denotes the network parameters, $\nabla$ denotes the gradient, and $\mathcal{L}_{train}$ denotes the training loss.
Viewed longitudinally, the method of the present invention is equivalent to enhancing or reducing the importance of the 2D and 1D features in feature learning through the structural parameters. As shown in fig. 3, the blocks of the present invention are defined between two nodes which, as in the ResNet structure, represent the output of the previous block and the input of the next block. The sequentially connected 1×d×d and t×1×1 convolutions are defined inside the block, and structural parameters are used on top of these two convolutions to adjust the strength of their filtering ability. Denote by $o^{(i,j)}$ an operation defined in the search space $\mathcal{O}$ and acting on the input $x$, and denote the weight vector between node $i$ and node $j$ by $\alpha^{(i,j)}$. The following formula is thus obtained:
$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} f\big(\alpha^{(i,j)}\big)_{o} \cdot o(x) \qquad (4)$$
wherein $f(\cdot)$ is a linear mapping of the weight vector and $o(x)$ represents an operation acting on the input $x$; in this step, $f(\cdot)$ is set as one fully connected layer. In particular, a cell (a neuron in a neural network) is defined herein as one (2+1)D block, so $i$ and $j$ are fixed. The learning objective can therefore be simplified as:
$$y = \sum_{o \in \mathcal{O}} f(\alpha)_{o} \cdot o(x; w) \qquad (5)$$
wherein $\alpha$ denotes the structural parameters and $w$ denotes the network parameters; the structural parameters refer to the parameters used to weight space and time, the network parameters refer to the parameters other than the structural parameters, and $y$ is the output of the (2+1)D module, i.e. the predicted behavior category. Thanks to the lightweight search space, these two sets of parameters are trained end-to-end simultaneously, and a set of structural parameters is learned for each (2+1)D block. In this way, the optimization method proposed by this sub-step is as follows:
$$\min_{\alpha,\, w}\; \mathcal{L}(\alpha, w) \qquad (6)$$
wherein $\mathcal{L}$ represents the loss function.
The specific process of the policy-gradient updating is as follows:
Policy Gradient is a reinforcement learning method in which the policy refers to the actions taken in different states; the goal is to perform gradient-based optimization of the policy so that the trained agent takes better actions according to the current state and obtains a higher reward. The method uses a Multilayer Perceptron (MLP) as the agent, the parameters of the current agent network as the state, the structural parameters output by the network as the action, and the loss of the current backbone network together with a reward constant as the components of the reward function.
In the specific forward flow, initial structural parameters are first input into the agent network, and the network then predicts the next structural parameters, i.e. the action, according to the current agent network parameters and the input structural parameters. In the back-propagation process, the currently available reward is maximized. Let s be the current state, a the current action, and θ the network parameters. The cross-entropy loss is defined as follows:
$$\mathcal{L}_{CE}(y, \hat{y}) = -\sum y \log \hat{y} \qquad (7)$$
wherein $y$ represents the behavior class truth label, $\hat{y}$ represents the predicted behavior class, and $\mathcal{L}_{CE}$ represents the cross-entropy loss. In order to ensure that the influence of the structural parameter search on the learning of the whole network is positive, the invention designs the reward function based on the smoothed CE value, so that the searched structural parameters and the learning of the backbone network assist each other. The smoothed CE formula is as follows:
$$SCE = -(1-\varepsilon)\log\big(\hat{y}_i + \delta\big) - \sum_{j \neq i} \frac{\varepsilon}{K-1} \log\big(\hat{y}_j + \delta\big) \qquad (8)$$
wherein $i$, $j$ and $K$ denote the correct behavior class, the other behavior classes and the total number of behavior classes (the correct class being the behavior class truth label, and the other behavior classes being the classes other than the correct one), $\delta$ is a very small constant, and $\varepsilon$ denotes the preset smoothing weight. Then:
$$R = \mathrm{sign}\big(SCE_n - SCE_m\big) \qquad (9)$$
wherein $\mathrm{sign}$ represents the sign function. If the SCE value obtained at the next time step $n$ is larger than the value $SCE_m$ obtained at the previous time step $m$, a positive reward value is given; otherwise the reward is negative. The overall objective function is defined as:
$$\max_{\theta}\; J(\theta) = \mathbb{E}_{\pi_\theta(a \mid s)}\big[R(s, a)\big] \qquad (10)$$
wherein $R(s, a)$ represents the reward.
Specifically, the MLPs (multilayer perceptrons) corresponding to the structural parameters of the two parts, namely the prior excitation module for the importance of spatio-temporal information and the part for reducing intra-class differences, are 3-layer neural networks with 6 and 4 hidden neurons respectively; ReLU activation functions are added between the layers, and the last layer uses a softplus activation function. Since the policy-gradient mechanism needs a complete state-behavior sequence, feedback is lacking in intermediate states and the overall training effect can be unsatisfactory. Regarding the length of the state sequence, one approach is to set the state sequence to 1 epoch, i.e. to compute the reward of the latest epoch every 2 epochs; the other is to treat the state sequence as an optimization within one iteration, which is more beneficial to the optimization. During optimization, the parameters of the network and the parameters of the agent are separated and optimized individually, with different optimizers for the two sets of parameters: Adam is adopted for the agent and SGD for the network parameters, and the two are updated alternately during optimization.
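A minimal sketch of this policy-gradient search is given below. The MLP sizes follow the text (6 and 4 hidden neurons, ReLU between the layers, softplus at the end); the Gaussian policy, the smoothing constants and the exact reward wiring are illustrative assumptions rather than the patented procedure.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuralAgent(nn.Module):
    """MLP agent: 3 layers with 6 and 4 hidden neurons, ReLU between layers, softplus at the end."""
    def __init__(self, n_params):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, 6), nn.ReLU(),
            nn.Linear(6, 4), nn.ReLU(),
            nn.Linear(4, n_params), nn.Softplus())

    def forward(self, state):
        return self.net(state)                          # action: the next structural parameters

def smoothed_ce(logits, target, eps=0.1, delta=1e-8):
    """Label-smoothed cross entropy (SCE) used to build the reward."""
    k = logits.size(-1)
    probs = F.softmax(logits, dim=-1)
    smooth = torch.full_like(probs, eps / (k - 1))
    smooth.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)
    return -(smooth * torch.log(probs + delta)).sum(-1).mean()

def agent_loss(agent, state, sce_prev, sce_next):
    reward = torch.sign(sce_next - sce_prev)            # R = sign(SCE_n - SCE_m)
    mean = agent(state)                                 # predicted structural parameters
    dist = torch.distributions.Normal(mean, 1.0)        # stochastic policy around the prediction
    action = dist.sample()
    return -(reward * dist.log_prob(action).sum())      # REINFORCE-style loss: maximise the expected reward

# The agent parameters are optimised with Adam and the backbone parameters with SGD,
# and the two optimisers are stepped alternately, as described above.
```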
Steps A10-A60 are repeated in a loop until a trained behavior recognition model is obtained.
2. Video behavior identification method based on self-adaptive space-time entanglement
S10, acquiring an image to be behavior recognized from the input video stream as an input image;
in the present embodiment, an image to be behavior-recognized is acquired from an input video.
And S20, acquiring the behavior category of the input image through the trained behavior recognition model.
In this embodiment, the behavior category of the image to be behavior recognized is obtained through the trained behavior recognition model.
A video behavior recognition system based on adaptive space-time entanglement according to a second embodiment of the present invention, as shown in fig. 2, specifically includes: the image acquisition module 100 and the behavior category identification module 200;
the image obtaining module 100 is configured to obtain an image to be behavior recognized from an input video stream as an input image;
the behavior category identification module 200 is configured to obtain a behavior category of the input image through a trained behavior identification model;
the behavior recognition model is constructed based on a convolutional neural network, and the training method comprises the following steps:
a10, acquiring training sample images and behavior category truth labels corresponding to the training samples; the training sample image is an image to be identified by behaviors in a video data set acquired according to time sequence information;
a20, extracting the characteristics of the training sample image as first characteristics;
a30, acquiring the current training times T, and initializing a weight parameter if T is 1; otherwise, acquiring the updated weight parameter after T-1 times of training; the weight parameters comprise a first time characteristic weight, a second time characteristic weight, a third time characteristic weight and a first space characteristic weight;
2D convolution processing is carried out on the first features, and weighting processing is carried out on the first features after 2D convolution by combining the first spatial feature weight to obtain second features; carrying out batch normalization and activation processing on the second characteristics to obtain third characteristics;
a40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight by combining the feature similarity; weighting the third feature through the adjusted first time feature weight, inputting the weighted third feature into a 1D time convolution module, weighting the third feature through the adjusted second time feature weight, and inputting the weighted third feature into a high cohesion time expression module; splicing the processed features of the 1D time convolution module and the high cohesion time expression module to serve as a fourth feature; the 1D time convolution module and the high cohesion time expression module are t×1×1 convolutional layers;
a50, weighting the fourth feature by combining the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
a60, updating the weight parameters by a preset weight parameter updating method based on the predicted behavior types and the behavior type truth value labels, and jumping to the step A10 after updating until a trained behavior recognition model is obtained.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the video behavior recognition system based on adaptive space-time entanglement provided in the foregoing embodiment is only illustrated by the division of the foregoing functional modules, and in practical applications, the foregoing functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiments of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiments may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
An electronic device according to a third embodiment of the present invention includes at least one processor; and a memory communicatively coupled to at least one of the processors; wherein the memory stores instructions executable by the processor for execution by the processor to implement the adaptive spatiotemporal entanglement-based video behavior recognition method described above.
A computer-readable storage medium of a fourth embodiment of the present invention stores computer instructions for execution by the computer to implement the above-mentioned adaptive spatiotemporal entanglement-based video behavior identification method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes and related descriptions of the above-described apparatuses and computer-readable storage media may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Referring now to FIG. 7, there is illustrated a block diagram of a computer system suitable for use as a server in implementing embodiments of the method, system, and apparatus of the present application. The server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM703, various programs and data necessary for system operation are also stored. The CPU701, the ROM 702, and the RAM703 are connected to each other via a bus 704. An Input/Output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + + or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings, but it will be readily understood by those skilled in the art that the scope of the present invention is not limited to these specific embodiments. Those skilled in the art may make equivalent changes or substitutions to the related technical features without departing from the principle of the invention, and the technical solutions resulting from such changes or substitutions fall within the protection scope of the invention.

Claims (9)

1. A video behavior identification method based on adaptive space-time entanglement is characterized by comprising the following steps:
S10, acquiring an image to be subjected to behavior recognition from an input video stream as an input image;
S20, acquiring the behavior category of the input image through the trained behavior recognition model;
the behavior recognition model is constructed based on a convolutional neural network, and the training method comprises the following steps:
A10, acquiring training sample images and the behavior category truth labels corresponding to the training samples; the training sample images are images to be subjected to behavior recognition in a video data set, acquired according to time sequence information;
A20, extracting the feature of each training sample image as a first feature;
A30, acquiring the current training iteration number T, and initializing the weight parameters if T = 1; otherwise, acquiring the weight parameters updated after T-1 training iterations; the weight parameters comprise a first time feature weight, a second time feature weight, a third time feature weight and a first spatial feature weight;
performing 2D convolution processing on the first feature, and weighting the 2D-convolved first feature in combination with the first spatial feature weight to obtain a second feature; performing batch normalization and activation processing on the second feature to obtain a third feature;
A40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight in combination with the feature similarity; weighting the third feature by the adjusted first time feature weight and inputting the result into a 1D time convolution module, and weighting the third feature by the adjusted second time feature weight and inputting the result into a high-cohesion time expression module; splicing the features output by the 1D time convolution module and the high-cohesion time expression module to obtain a fourth feature; the 1D time convolution module and the high-cohesion time expression module are convolutional layers of [formula];
A50, weighting the fourth feature in combination with the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
A60, based on the predicted behavior category and the behavior category truth label, updating the weight parameters by a preset weight parameter updating method, and returning to step A10 after the update, until a trained behavior recognition model is obtained.
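Purely as an aid to reading claim 1, the following is a minimal PyTorch-style sketch of how steps A30-A50 could be wired together. All module names, kernel sizes and the spatial pooling before the temporal branches are assumptions (the kernel size of the two temporal modules appears in the claim only as a formula image), and the high-cohesion time expression module is replaced here by a simple softmax stand-in; claim 3 describes the actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpatioTemporalBlock(nn.Module):
    """Hypothetical sketch of steps A30-A50 of claim 1, not the patented implementation."""

    def __init__(self, channels):
        super().__init__()
        self.spatial_conv = nn.Conv2d(channels, channels, 3, padding=1)   # 2D spatial convolution (kernel size assumed)
        self.bn = nn.BatchNorm2d(channels)
        self.temporal_conv = nn.Conv1d(channels, channels, 3, padding=1)  # 1D time convolution module (kernel size assumed)
        # weight parameters: first/second/third time feature weights and first spatial feature weight
        self.w_t1 = nn.Parameter(torch.ones(1))
        self.w_t2 = nn.Parameter(torch.ones(1))
        self.w_t3 = nn.Parameter(torch.ones(1))
        self.w_s1 = nn.Parameter(torch.ones(1))

    def forward(self, x):
        # x: first feature, shape (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        frames = x.reshape(b * t, c, h, w)
        second = self.w_s1 * self.spatial_conv(frames)             # 2D conv weighted by the first spatial feature weight
        third = F.relu(self.bn(second)).reshape(b, t, c, h, w)     # batch normalization + activation -> third feature

        # adjust the first and second time feature weights with inter-frame similarity (detailed in claim 2)
        flat = third.reshape(b, t, -1)
        sim = F.cosine_similarity(flat[:, 1:], flat[:, :-1], dim=-1).mean()
        w_t1 = self.w_t1 + sim * self.w_t1
        w_t2 = self.w_t2 + sim * self.w_t2

        seq = third.mean(dim=(3, 4)).transpose(1, 2)               # (batch, channels, time); spatial pooling is a simplification
        branch1 = self.temporal_conv(w_t1 * seq)                   # 1D time convolution branch
        branch2 = torch.softmax(w_t2 * seq, dim=-1) * seq          # stand-in for the high-cohesion time expression module

        fourth = torch.cat([branch1, branch2], dim=1)              # spliced fourth feature
        return self.w_t3 * fourth                                  # weighted by the third time feature weight before classification
```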
2. The adaptive spatio-temporal entanglement-based video behavior recognition method according to claim 1, wherein "calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight in combination with the feature similarity" comprises:
calculating the cosine similarity between frames in a set dimension;
if the cosine similarity is greater than the set similarity threshold, setting the corresponding cosine similarity as a positive value, otherwise, setting the corresponding cosine similarity as a negative value;
based on the cosine similarity after its positive or negative sign is set, weighting the time feature weights [formula] accordingly; adding the weighted [formula] to the unweighted [formula] and the weighted [formula] to the unweighted [formula], and taking the sums as the adjusted time feature weights.
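A short sketch of the adjustment in claim 2, assuming a similarity threshold of 0.5 and averaging the sign-adjusted similarity over frames; both choices, and the function name, are illustrative rather than taken from the claim, which identifies the two weights only by formula images.

```python
import torch
import torch.nn.functional as F

def adjust_time_feature_weights(third, w_t1, w_t2, threshold=0.5):
    """Hypothetical sketch of claim 2: sign-threshold the inter-frame cosine
    similarity, weight the two time feature weights with it, and add each
    weighted value back to its unweighted counterpart."""
    # third: (batch, time, features); w_t1 and w_t2 are scalar tensors
    sim = F.cosine_similarity(third[:, 1:], third[:, :-1], dim=-1)   # similarity of adjacent frames
    sim = torch.where(sim > threshold, sim.abs(), -sim.abs())        # positive above the threshold, negative otherwise
    scale = sim.mean()                                                # collapsed to one scalar for brevity
    return w_t1 + scale * w_t1, w_t2 + scale * w_t2                   # weighted + unweighted = adjusted weights
```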
3. The adaptive space-time entanglement based video behavior recognition method according to claim 1, wherein the processing procedure of the weighted third feature by the high-cohesion time expression module is as follows:
performing convolution and down-sampling processing on the third feature after the weighting processing of the second time feature weight, and taking the processed third feature as an original feature;
multiplying a preset basis vector by the original feature, reconstructing the multiplied feature through softmax, and taking the reconstructed feature as an attention feature;
multiplying the attention feature by the original feature to serve as an updated base vector, and further updating the attention feature;
updating the basis vectors and the attention characteristics in a circulating manner until the set circulating step number is reached;
and multiplying the finally obtained attention feature by the finally updated basis vector, and adding the multiplied attention feature and the third feature subjected to the weighting processing of the second time feature weight.
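Claim 3 reads like an expectation-maximization style attention over a small set of basis vectors. The sketch below follows that reading; the number of basis vectors, the number of refinement steps, and the use of a linear layer in place of the convolution and down-sampling are all assumptions.

```python
import torch
import torch.nn as nn

class HighCohesionTimeExpression(nn.Module):
    """Hypothetical sketch of the high-cohesion time expression module of claim 3."""

    def __init__(self, dim, num_bases=8, steps=3):
        super().__init__()
        self.reduce = nn.Linear(dim, dim)                      # stand-in for the convolution + down-sampling
        self.mu = nn.Parameter(torch.randn(num_bases, dim))    # preset basis vectors
        self.steps = steps

    def forward(self, x):
        # x: third feature already weighted by the second time feature weight, shape (batch, time, dim)
        feat = self.reduce(x)                                   # "original feature" of the claim
        mu = self.mu.expand(x.size(0), -1, -1)                  # (batch, bases, dim)
        for _ in range(self.steps):                             # cyclic update up to the set number of steps
            attn = torch.softmax(feat @ mu.transpose(1, 2), dim=-1)  # attention feature via softmax reconstruction
            mu = attn.transpose(1, 2) @ feat                    # attention * original feature -> updated basis vectors
        out = attn @ mu                                         # final attention multiplied by the final basis vectors
        return out + x                                          # added back to the weighted third feature
```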
4. The adaptive space-time entanglement based video behavior recognition method according to claim 3, wherein in the updating process of the basis vectors, L2 regularization and moving average updating are carried out on the basis vectors;
the moving average updating method comprises the following steps:
[formula]
wherein mu is a basis vector, mu_mean is the mean value of the basis vector, and momentum is the momentum, i.e., a preset weight.
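The update formula of claim 4 survives only as an image, so the sketch below assumes the usual exponential moving average form that matches the variable names the claim gives (mu, mu_mean, momentum); the L2 step is an ordinary L2 normalization of the basis vectors.

```python
import torch
import torch.nn.functional as F

def update_basis(mu, mu_mean, momentum=0.9):
    """Hypothetical sketch of claim 4; the exact update formula is assumed, not quoted."""
    mu = F.normalize(mu, p=2, dim=-1)                                   # L2 regularization of the basis vectors
    mu_mean = momentum * mu_mean + (1.0 - momentum) * mu.mean(dim=0)    # moving average update (assumed form)
    return mu, mu_mean
```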
5. The adaptive space-time entanglement-based video behavior recognition method according to claim 1, wherein the weight parameters are updated by a preset weight parameter updating method, and the method comprises the following steps:
[formula]
wherein [formula] represents the loss function, [formula] represents the structural parameters of the network, and [formula] represents the network parameters; the structural parameters refer to the parameters that weight space and time, and the network parameters refer to the parameters other than the structural parameters.
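Claim 5 separates the structural parameters (the space and time weighting parameters) from the remaining network parameters but gives the objective only as a formula image. The sketch below therefore assumes a DARTS-style alternating first-order scheme with two optimizers; the split between a training and a validation loss and all names are assumptions.

```python
import torch

def alternating_update(train_loss_fn, val_loss_fn, opt_network, opt_structural):
    """Hypothetical one-step alternating update in the spirit of claim 5 (assumed scheme)."""
    # lower level: fit the network parameters (everything except the weighting parameters)
    opt_network.zero_grad()
    train_loss_fn().backward()
    opt_network.step()

    # upper level: fit the structural parameters that weight space and time
    opt_structural.zero_grad()
    val_loss_fn().backward()
    opt_structural.step()
```

In such a scheme, opt_structural would hold only the four weight parameters of claim 1 and opt_network the rest of the convolutional network.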
6. The adaptive space-time entanglement-based video behavior recognition method according to claim 5, wherein the weight parameters are updated by a preset weight parameter updating method, and the method comprises the following steps:
[formula]
[formula]
[formula]
[formula]
wherein [formula] represents the behavior category truth label, [formula] represents the predicted behavior category, [formula] represents the cross-entropy loss, [formula] represents the preset smoothing weight, [formula], [formula] and [formula] represent the number of correct behavior classes, the number of other behavior classes and the total number of behavior classes, respectively, the correct class being the behavior category truth label and the other behavior classes being the classes other than the correct behavior class, [formula] represents the cross-entropy loss after smoothing, [formula] represents the reward value, [formula] and [formula] represent the time step, [formula] represents the sign function, [formula] represents the current action, [formula] represents the current state, [formula] represents the parameters of the convolutional neural network, and [formula] represents the reward.
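Of the quantities listed in claim 6, the smoothed cross-entropy is the part that can be sketched with reasonable confidence; the reward term depends on a sign function, time steps, actions and states whose exact combination appears only as formula images and is therefore not reconstructed here. The smoothing value and the uniform-smoothing form below are assumptions.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, epsilon=0.1):
    """Hypothetical label-smoothing loss in the spirit of claim 6; epsilon stands in
    for the preset smoothing weight."""
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)  # loss on the correct class
    uniform = -log_probs.mean(dim=-1)                                        # loss spread evenly over all classes
    return ((1.0 - epsilon) * nll + epsilon * uniform).mean()
```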
7. A video behavior recognition system based on adaptive spatiotemporal entanglement, the system comprising: an image acquisition module and a behavior category identification module;
the image acquisition module is configured to acquire an image to be subjected to behavior recognition from an input video stream as an input image;
the behavior category identification module is configured to acquire a behavior category of the input image through a trained behavior identification model;
the behavior recognition model is constructed based on a convolutional neural network, and the training method comprises the following steps:
A10, acquiring training sample images and the behavior category truth labels corresponding to the training samples; the training sample images are images to be subjected to behavior recognition in a video data set, acquired according to time sequence information;
A20, extracting the feature of each training sample image as a first feature;
A30, acquiring the current training iteration number T, and initializing the weight parameters if T = 1; otherwise, acquiring the weight parameters updated after T-1 training iterations; the weight parameters comprise a first time feature weight, a second time feature weight, a third time feature weight and a first spatial feature weight;
performing 2D convolution processing on the first feature, and weighting the 2D-convolved first feature in combination with the first spatial feature weight to obtain a second feature; performing batch normalization and activation processing on the second feature to obtain a third feature;
A40, calculating the feature similarity between adjacent frames based on the third feature, and adjusting the first time feature weight and the second time feature weight in combination with the feature similarity; weighting the third feature by the adjusted first time feature weight and inputting the result into a 1D time convolution module, and weighting the third feature by the adjusted second time feature weight and inputting the result into a high-cohesion time expression module; splicing the features output by the 1D time convolution module and the high-cohesion time expression module to obtain a fourth feature; the 1D time convolution module and the high-cohesion time expression module are convolutional layers of [formula];
A50, weighting the fourth feature in combination with the third time feature weight, and predicting the behavior category based on the weighted fourth feature to obtain a predicted behavior category;
A60, based on the predicted behavior category and the behavior category truth label, updating the weight parameters by a preset weight parameter updating method, and returning to step A10 after the update, until a trained behavior recognition model is obtained.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the processor for performing the method for adaptive spatiotemporal entanglement based video behavior recognition according to any one of claims 1-6.
9. A computer-readable storage medium storing computer instructions for execution by a computer to implement the adaptive spatio-temporal entanglement based video behavior recognition method according to any one of claims 1-6.
CN202110992358.7A 2021-08-27 2021-08-27 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement Active CN113435430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110992358.7A CN113435430B (en) 2021-08-27 2021-08-27 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110992358.7A CN113435430B (en) 2021-08-27 2021-08-27 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement

Publications (2)

Publication Number Publication Date
CN113435430A CN113435430A (en) 2021-09-24
CN113435430B true CN113435430B (en) 2021-11-09

Family

ID=77798196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110992358.7A Active CN113435430B (en) 2021-08-27 2021-08-27 Video behavior identification method, system and equipment based on self-adaptive space-time entanglement

Country Status (1)

Country Link
CN (1) CN113435430B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114863356B (en) * 2022-03-10 2023-02-03 西南交通大学 Group activity identification method and system based on residual aggregation graph network
CN116189281B (en) * 2022-12-13 2024-04-02 北京交通大学 End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN116109630B (en) * 2023-04-10 2023-06-16 创域智能(常熟)网联科技有限公司 Image analysis method and system based on sensor acquisition and artificial intelligence

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200064444A1 (en) * 2015-07-17 2020-02-27 Origin Wireless, Inc. Method, apparatus, and system for human identification based on human radio biometric information
CN109460702B (en) * 2018-09-14 2022-02-15 华南理工大学 Passenger abnormal behavior identification method based on human body skeleton sequence
CN110084228A (en) * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 A kind of hazardous act automatic identifying method based on double-current convolutional neural networks
CN110909658A (en) * 2019-11-19 2020-03-24 北京工商大学 Method for recognizing human body behaviors in video based on double-current convolutional network
CN112183240B (en) * 2020-09-11 2022-07-22 山东大学 Double-current convolution behavior identification method based on 3D time stream and parallel space stream
CN112597883B (en) * 2020-12-22 2024-02-09 武汉大学 Human skeleton action recognition method based on generalized graph convolution and reinforcement learning

Also Published As

Publication number Publication date
CN113435430A (en) 2021-09-24

Similar Documents

Publication Publication Date Title
CN113435430B (en) Video behavior identification method, system and equipment based on self-adaptive space-time entanglement
JP7127120B2 (en) Video classification method, information processing method and server, and computer readable storage medium and computer program
CN111079601A (en) Video content description method, system and device based on multi-mode attention mechanism
CN110503192A (en) The effective neural framework of resource
CN111382555B (en) Data processing method, medium, device and computing equipment
WO2023061102A1 (en) Video behavior recognition method and apparatus, and computer device and storage medium
US20200134455A1 (en) Apparatus and method for training deep learning model
CN112232355B (en) Image segmentation network processing method, image segmentation device and computer equipment
CN113705811B (en) Model training method, device, computer program product and equipment
CN113469186B (en) Cross-domain migration image segmentation method based on small number of point labels
CN112580728B (en) Dynamic link prediction model robustness enhancement method based on reinforcement learning
CN114298851A (en) Network user social behavior analysis method and device based on graph sign learning and storage medium
CN111708871A (en) Dialog state tracking method and device and dialog state tracking model training method
CN114072809A (en) Small and fast video processing network via neural architectural search
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
Kang et al. Autoencoder-based graph construction for semi-supervised learning
CN110990630B (en) Video question-answering method based on graph modeling visual information and guided by using questions
Hao et al. Deep collaborative online learning resource recommendation based on attention mechanism
JP2022088341A (en) Apparatus learning device and method
Lu et al. Siamese Graph Attention Networks for robust visual object tracking
CN113822291A (en) Image processing method, device, equipment and storage medium
CN113095328A (en) Self-training-based semantic segmentation method guided by Gini index
Germi et al. Enhanced data-recalibration: utilizing validation data to mitigate instance-dependent noise in classification
Zhang et al. Style classification of media painting images by integrating ResNet and attention mechanism
Guo et al. An unsupervised optical flow estimation for LiDAR image sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant