CN115661596A - Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer


Info

Publication number: CN115661596A
Application number: CN202211334609.3A
Authority: CN (China)
Prior art keywords: model, video, convolution, vector, feature vectors
Priority date: 2022-10-28
Filing date: 2022-10-28
Publication date: 2023-01-31
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 刘绍辉, 米亚纯, 姜峰, 张伟
Current Assignee: Harbin Institute of Technology
Original Assignee: Harbin Institute of Technology
Application filed by Harbin Institute of Technology, priority to CN202211334609.3A


Classifications

    • Y02P 90/82 — Energy audits or management systems therefor (Y: general cross-sectional tagging; Y02P: climate change mitigation technologies in the production or processing of goods; Y02P 90/00: enabling technologies with a potential contribution to greenhouse gas emissions mitigation; Y02P 90/80: management or planning)


Abstract

The invention discloses a short video positive energy evaluation method, device and equipment based on 3D convolution and a Transformer, relating to the technical field of video violent behavior analysis and addressing the technical problem of how to better perform positive energy evaluation on videos containing a large number of frames. The method comprises the following steps: acquiring a video clip whose frame count equals a preset number of frames; extracting features from the video clip with a pre-trained 3D convolution model to obtain a plurality of feature vectors; performing position coding on the feature vectors; inputting the position-coded feature vectors into a pre-trained Transformer model to obtain an output vector; and inputting the output vector into a multilayer perceptron model to calculate the positive energy score of the video clip. Because the method evaluates the positive energy of short videos with a 3D convolution model together with a Transformer model, it achieves a good temporal modeling effect and can process long videos containing a large number of video frames; it is also applicable to the field of computer vision.

Description

Short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer
Technical Field
The invention relates to the technical field of video violent behavior analysis.
Background
In recent years, user-generated video content has proliferated explosively on platforms such as Kuaishou, Douyin, and Weishi, each of which serves user bases ranging from tens of millions to billions. Video capture devices have become cheaper and cheaper while their capture quality keeps improving, which greatly reduces the cost of producing short videos. As a result, the huge number of short video users are not only consumers of short videos but also their creators, so the number of short videos on the network has grown rapidly, and short video has quickly become the most dominant source of information for people.
Because different users create videos for different purposes, there are always some short videos that are vulgar, negative, or unhealthy, so the information spread by such short video content does not conform to mainstream social values.
Existing violent behavior analysis data sets are few, and much of the research is carried out on a single data set. Algorithmically, existing techniques tend to combine 2D convolution with an RNN (GRU or LSTM), use 3D convolution alone, or combine 3D convolution with an RNN (GRU or LSTM). Extensive experiments show that these approaches do not perform well enough on video: RNNs have been shown to be far weaker than Transformer models at temporal modeling, and processing video frames with 3D convolution alone is too limited to handle long videos containing a large number of frames.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a short video positive energy evaluation method, device and equipment based on 3D convolution and a Transformer.
A short video positive energy evaluation method based on 3D convolution and a Transformer comprises the following steps:
acquiring a video clip, wherein the frame number of the video clip is a preset frame number;
performing feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
performing position coding on the feature vector;
inputting a plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and inputting the output vector to a multilayer perceptron model, and calculating to obtain a positive energy score of the video clip.
Further, the 3D convolution model is an R3D model, and the preset frame number is a multiple of the number of frames that can be input by the R3D model each time.
Further, the 3D convolution model includes a plurality of fully connected layers, and the last fully connected layer in the 3D convolution model is marked as a first fully connected layer;
carrying out feature extraction on each video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors, wherein the method comprises the following steps:
performing center cropping on the video clip;
dividing the video clips subjected to center cutting according to a frame sequence to obtain a plurality of input frame groups;
inputting the plurality of input frame groups into the 3D convolution model in sequence, and recording the resulting input vector of the first fully connected layer as an intermediate feature vector;
and inputting the plurality of intermediate feature vectors into a second fully connected layer for dimension raising to obtain a plurality of feature vectors.
Further, the 3D convolution model is trained on the Kinetics data set with the following parameters:
the 3D convolution model is an 18-layer-deep R3D model, the number of iterations is 1M, the initial learning rate is 1e-2, the whole training process is divided into 45 stages, the first 10 stages are used for warm-up, and the learning rate is reduced to one tenth of its previous value every ten stages.
Further, the Transformer model comprises at least one Encoder Block, wherein the Encoder Block comprises a multi-head attention structure and a multi-layer perceptron structure;
before the feature vectors are input into each multi-head attention structure, a normalization operation is performed, and a residual connection is applied after each multi-head attention structure;
before the feature vectors are input into each multi-layer perceptron structure, a normalization operation is performed, and a residual connection is applied after each multi-layer perceptron structure;
the multi-layer perceptron structure comprises a third fully connected layer and a fourth fully connected layer, wherein the third fully connected layer expands the dimensionality of the feature vector to four times its original size, and the fourth fully connected layer restores the dimensionality of the feature vector to its original size.
Further, the Transformer model is trained based on the following parameters:
the loss function adopts MSE, the optimizer adopts an AdamW optimizer, the number of layers of the Encoder Block is 24, and the number of heads of a multi-head attention structure is 16;
warming up the learning rate with a Warm up method, selecting Linear Warm up as the specific warm-up strategy, setting the initial learning rate to 1e-5, the number of warm-up epochs to 15, and the total number of training epochs to 60.
Further, inputting a plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector includes:
inputting the plurality of position-coded feature vectors together with a classification head vector cls-token into the Transformer model;
and inputting the output corresponding to the classification head vector cls-token in the Transformer model as the classification feature to a final fully connected layer, which converts the 1024-dimensional feature into a one-dimensional vector to obtain the output vector.
Further, the preset frame number is 96, the number of frames that the R3D model can input each time is 16, and the number of the feature vectors is 6.
A short video positive energy evaluation device based on 3D convolution and Transformer comprises:
the video acquisition module is used for acquiring video clips, and the frame number of the video clips is a preset frame number;
the feature extraction module is used for extracting features of the video clips based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
the position coding module is used for carrying out position coding on the feature vector;
the output calculation module is used for inputting the plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and the score calculating module is used for inputting the output vector to the multilayer perceptron model and calculating to obtain the positive energy score of the video clip.
An electronic device comprises a processor and a storage device, wherein a plurality of instructions are stored in the storage device, and the processor is used for reading the plurality of instructions in the storage device and executing the method.
The short video positive energy evaluation method, device and equipment based on 3D convolution and Transformer provided by the invention at least have the following beneficial effects:
(1) The method performs positive energy evaluation of short videos based on a 3D convolution model and a Transformer model: the 3D convolution model extracts features from the video and the Transformer model fuses the temporal features, so the method has a good temporal modeling effect and can process long videos containing a large number of video frames;
(2) The 3D convolution model and the Transformer model are trained with specific parameters, which effectively improves training efficiency while improving the training effect;
(3) When features of the video clips are extracted with the 3D convolution model, a dimension-raising operation is applied, which increases model complexity and gives the model better fitting capability;
(4) The feature vectors extracted by the 3D convolution model are position-coded, which effectively preserves the temporal information between frames and yields a better temporal feature fusion effect.
Drawings
FIG. 1 is a flowchart of an embodiment of the short video positive energy evaluation method based on 3D convolution and a Transformer provided by the present invention;
FIG. 2 is a schematic structural diagram of an embodiment of an Encoder Block in the Transformer model provided by the present invention.
Detailed Description
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
It should be noted that the concept of "positive energy" in this embodiment describes the emotional tendency of a short video; specifically, it means that the short video content does not involve violent behavior and does not involve vulgar, negative, or unhealthy content.
Referring to fig. 1, in some embodiments, a short video positive energy evaluation method based on 3D convolution and a Transformer includes:
s1, acquiring a video clip, wherein the frame number of the video clip is a preset frame number;
s2, extracting features of the video clips based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
s3, carrying out position coding on the feature vector;
s4, inputting the plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and S5, inputting the output vector to a multilayer perceptron model, and calculating to obtain a positive energy score of the video clip.
In fig. 1, R3D is the 3D version of the ResNet model, FC is a fully connected layer, Position Embedding denotes position coding, and MLP Head denotes the multi-layer perceptron.
As a preferred embodiment, the 3D convolution model is an R3D model, and the preset frame number is a multiple of the number of frames the R3D model can take per input. The R3D model keeps the overall structure of the original 2D ResNet unchanged, expands the original 3 × 3 2D convolutions into 3 × 3 × 3 3D convolutions, and replaces the pooling layers with 3D pooling. The input dimensions of the R3D model are 3 × 16 × 112 × 112. Because the R3D model has a relatively small number of parameters, using it for the 3D convolution not only obtains good results but also does not slow the model down.
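A minimal sketch of such a backbone, assuming torchvision's r3d_18 (an 18-layer R3D pretrained on Kinetics-400) stands in for the 3D convolution model; the variable names are illustrative and this is not the patented implementation:
```python
import torch
from torchvision.models.video import r3d_18

# 18-layer R3D backbone; replacing its last FC layer with Identity exposes the
# 512-dimensional vector that originally fed that final fully connected layer.
backbone = r3d_18(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

clip = torch.randn(1, 3, 16, 112, 112)   # N x C x T x H x W: 16 frames of 112 x 112
with torch.no_grad():
    feat = backbone(clip)                 # -> (1, 512) intermediate feature vector
print(feat.shape)
```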
In the method provided by this embodiment, a data set containing multiple negative energy behaviors is used for training, including fighting, gore, gunshots, explosions, smoking, and the like.
Specifically, in step S2, extracting features of the video segment based on a pre-trained 3D convolution model to obtain a plurality of feature vectors, including:
s21, performing center cutting on the video clip;
s22, segmenting the video clips subjected to center cutting according to a frame sequence to obtain a plurality of input frame groups;
s23, inputting the plurality of input frame groups into the 3D convolution model in sequence, and recording the obtained input vector of the first full-connection layer as a middle characteristic vector;
and S24, inputting the plurality of intermediate characteristic vectors into a second full-connection layer for dimensionality increase to obtain a plurality of characteristic vectors.
In a specific application scenario, an R3D model is used to extract features from the video clip. The R3D model accepts 16 frames per input. When obtaining video clips, a random sampling strategy is applied to the full frame sequence: each time, a 96-frame clip is randomly extracted from the whole video, 96 being a multiple of the 16 frames the R3D model can take per input. The 96-frame clip is then center-cropped to obtain 96 frames of 112 × 112 as the network input. Because the R3D model can only take 16 frames at a time, the clip is divided in frame order into 6 input frame groups of 16 frames each; the R3D model then yields 6 intermediate feature vectors of 512 dimensions, and the second fully connected layer raises their dimensionality from 512 to 1024.
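A sketch of this clip pipeline under the stated sizes (96 frames, 112 × 112 centre crop, groups of 16 frames, 512 → 1024 up-projection); `backbone` is the R3D-18 extractor from the previous snippet, and the helper name is illustrative:
```python
import torch
import torch.nn as nn

up_proj = nn.Linear(512, 1024)   # the "second fully connected layer" used for dimension raising

def extract_clip_features(clip_96, backbone):
    """clip_96: (3, 96, 112, 112) tensor holding a centre-cropped 96-frame clip."""
    groups = clip_96.split(16, dim=1)                    # 6 groups of 16 frames along the time axis
    feats = [backbone(g.unsqueeze(0)) for g in groups]   # each 16-frame group -> (1, 512)
    feats = torch.cat(feats, dim=0)                      # (6, 512) intermediate feature vectors
    return up_proj(feats)                                # (6, 1024) feature vectors

clip = torch.randn(3, 96, 112, 112)
tokens = extract_clip_features(clip, backbone)           # (6, 1024)
```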
The dimension-raising operation yields the feature vectors that are fed into the Transformer model in the subsequent step. In step S24, the intermediate feature vectors are raised in dimensionality by the second fully connected layer, which increases model complexity and gives the model better fitting capability.
As a preferred embodiment, the 3D convolution model is trained on the Kinetics data set with the following parameters: the 3D convolution model is an 18-layer-deep R3D model, the number of iterations is 1M, the initial learning rate is 1e-2, the whole training process is divided into 45 stages, the first 10 stages are used for warm-up, and the learning rate is reduced to one tenth of its previous value every ten stages.
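One plausible reading of this schedule as a hedged sketch (the optimizer type is not stated in the text, so SGD with momentum is assumed, and `backbone` is the R3D-18 model from the earlier snippet):
```python
import torch

optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-2, momentum=0.9)

def r3d_lr_factor(epoch):
    # Linear warm-up over the first 10 stages, then divide the rate by 10 every ten stages.
    if epoch < 10:
        return (epoch + 1) / 10
    return 0.1 ** ((epoch - 10) // 10)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=r3d_lr_factor)

for epoch in range(45):
    # ... one training pass over Kinetics ...
    scheduler.step()
```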
Referring to FIG. 2, in some embodiments, the Transformer model used includes at least one Encoder Block, which comprises a multi-head attention structure and a multi-layer perceptron structure. Before the feature vectors are input into each Multi-Head Attention structure, a normalization (Norm) operation is performed, and a residual connection is applied after each multi-head attention structure; likewise, a normalization (Norm) operation is performed before the feature vectors are input into each multi-layer perceptron structure (MLP), and a residual connection is applied after each multi-layer perceptron structure. The multi-layer perceptron structure comprises a third fully connected layer and a fourth fully connected layer: the third fully connected layer expands the dimensionality of the feature vector to four times its original size, and the fourth fully connected layer restores it to the original size. In FIG. 2, Norm denotes a normalization layer, MLP denotes the multi-layer perceptron structure, and Multi-Head Attention denotes the multi-head attention structure.
In some embodiments, the Transformer model is obtained by stacking multiple Encoder Blocks. An Encoder Block consists mainly of a Multi-Head Self-Attention (MSA) structure and an MLP structure, together with residual connection and layer normalization operations. Note that layer normalization processes all features of each sample, whereas batch normalization processes all samples of each channel. Before data is fed into either the MSA or the MLP structure, it is normalized with a Layer Normalization operation, and a residual connection is applied after each MSA and MLP structure. The MLP structure follows the MSA and comprises two fully connected layers: the first enlarges the feature dimension four times, the second restores it to the original size, and GELU (Gaussian Error Linear Unit) is used as the activation function. After the position-coded feature vectors are input into the Transformer model, they are propagated forward through the successive Encoder Blocks; the sequence dimensions are unchanged after each Encoder Block, and finally the feature vector corresponding to the cls-token is taken as the output.
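A minimal sketch of such a pre-norm Encoder Block (not the patented code; the dimensions follow the 1024-d features and 16 heads described below):
```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm encoder block: Norm -> multi-head attention -> residual,
    then Norm -> MLP (4x expansion, GELU) -> residual."""
    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),   # "third" FC layer: expand dimensionality 4x
            nn.GELU(),
            nn.Linear(4 * dim, dim),   # "fourth" FC layer: restore original size
        )

    def forward(self, x):              # x: (batch, seq_len, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```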
In step S3, the obtained feature vectors are position-coded, which effectively preserves the temporal information between frames.
In a specific application scenario, the 6 1024-dimensional feature vectors are position-coded to preserve the temporal information between video blocks. A learnable position-coding vector is used here: the position coding corresponds to a table with N rows, where N equals the length of the input sequence and each row corresponds to one video block; the sequence dimensions remain unchanged after the position coding is added.
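A small sketch of this learnable position coding, using the sizes from the example (6 tokens of 1024 dimensions); the variable names are illustrative:
```python
import torch
import torch.nn as nn

seq_len, dim = 6, 1024
pos_embed = nn.Parameter(torch.zeros(1, seq_len, dim))   # learnable table with N = 6 rows

tokens = torch.randn(1, seq_len, dim)                     # features from the 3D convolution stage
tokens = tokens + pos_embed                               # (1, 6, 1024): shape unchanged, order information added
```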
In step S4, inputting the plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector includes:
s41, inputting the plurality of feature vectors subjected to position coding and the classification head vector cls-token into a Transformer model;
and S42, taking the vector corresponding to the classification head vector cls-token in the Transformer model as an output vector.
As a preferred embodiment, the Transformer model is trained based on the following parameters: the loss function is MSE, the optimizer is AdamW, the number of Encoder Block layers is 24, and the number of heads in the multi-head attention structure is 16. The output corresponding to the classification head vector cls-token is fed as the feature into the last fully connected layer, which transforms the 1024-dimensional feature into a one-dimensional score. The AdamW optimizer is a variant of the Adam optimizer that introduces weight decay and L2 regularization.
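A hedged sketch of this scoring stage, reusing the EncoderBlock sketch above and assuming the feature vectors are already position-coded; the class name is illustrative:
```python
import torch
import torch.nn as nn

class PositiveEnergyScorer(nn.Module):
    """Prepend a learnable cls-token, run 24 pre-norm encoder blocks with 16 heads,
    and map the cls-token output through a final FC layer to a one-dimensional score."""
    def __init__(self, dim=1024, depth=24, num_heads=16):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([EncoderBlock(dim, num_heads) for _ in range(depth)])
        self.head = nn.Linear(dim, 1)                   # 1024-d feature -> scalar score

    def forward(self, tokens):                           # tokens: (batch, 6, 1024), position-coded
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, 0])                        # read the cls-token position

model = PositiveEnergyScorer()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.MSELoss()

scores = model(torch.randn(2, 6, 1024))                  # (2, 1) predicted positive energy scores
loss = criterion(scores, torch.rand(2, 1))
loss.backward(); optimizer.step()
```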
As a preferred implementation, the learning rate is warmed up with a Warm up method, with Linear Warm up selected as the specific warm-up strategy; the initial learning rate is set to 1e-5, the number of warm-up epochs to 15, and the total number of training epochs to 60. That is, the learning rate rises uniformly from 6.67e-7 to the initial learning rate of 1e-5 over the first 15 epochs and then decreases uniformly to 2.22e-7. The principle of the warm-up operation is to let the learning rate grow gradually from a small value and to train with a larger learning rate only once the model changes little between updates. In this way, the convergence of the model can be effectively accelerated, improving the efficiency of model training.
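A sketch of this Linear Warm up schedule, applied to the AdamW optimizer from the previous snippet (the 6.67e-7 and 2.22e-7 endpoints fall out of the 15-epoch warm-up and 45-epoch linear decay):
```python
import torch

warmup_epochs, total_epochs = 15, 60

def linear_warmup(epoch):
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs                          # 1/15 of 1e-5 = 6.67e-7 at epoch 0
    return (total_epochs - epoch) / (total_epochs - warmup_epochs)  # ends at 1/45 of 1e-5 = 2.22e-7

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup)

for epoch in range(total_epochs):
    # ... one training epoch with the MSE loss and AdamW optimizer above ...
    scheduler.step()
```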
In some embodiments, when the video is long, steps S1-S5 are repeated: multiple clips are extracted and analyzed, as sketched below.
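A hedged sketch of this repetition for long videos, reusing the helpers above; averaging the per-clip scores is an assumption, since the aggregation rule is not specified in the text:
```python
import torch

def score_long_video(frames, backbone, model, num_clips=5):
    """frames: (3, T, 112, 112) centre-cropped video with T >= 96 frames."""
    scores = []
    for _ in range(num_clips):
        start = torch.randint(0, frames.size(1) - 96 + 1, (1,)).item()
        clip = frames[:, start:start + 96]                 # step S1: random 96-frame clip
        tokens = extract_clip_features(clip, backbone)     # step S2 (position coding omitted here)
        scores.append(model(tokens.unsqueeze(0)))          # steps S4-S5: per-clip score
    return torch.cat(scores).mean()                        # assumed aggregation: average score
```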
In some embodiments, there is provided a short video positive energy evaluation apparatus based on 3D convolution and transform, comprising:
the video acquisition module is used for acquiring a video clip, and the frame number of the video clip is a preset frame number;
the feature extraction module is used for extracting features of the video clips based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
the position coding module is used for carrying out position coding on the feature vector;
the output calculation module is used for inputting the plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and the score calculating module is used for inputting the output vectors into the multilayer perceptron model and calculating to obtain the positive energy score of the video clip.
In some embodiments, an electronic device is provided, which includes a processor and a storage device, wherein the storage device stores a plurality of instructions, and the processor is configured to read the plurality of instructions in the storage device and execute the method.
In a specific application scenario, the above method is run on an NVIDIA GTX 3090 GPU, with PyTorch 1.12 as the machine learning framework.
Existing violent behavior video analysis methods mostly combine 2D convolution with an RNN model (GRU or LSTM), use 3D convolution alone, or combine 3D convolution with an RNN model. When such methods are used for short video positive energy scoring, RNNs have been shown to be much weaker than Transformer models at temporal modeling, and processing video frames with 3D convolution alone is too limited to handle long videos containing a large number of frames.
However, when a 3D convolution model and a Transformer model are used in combination, the Transformer model's relatively large number of parameters becomes a problem. The short video positive energy evaluation method based on 3D convolution and a Transformer provided in this embodiment therefore performs temporal modeling on the features of the video blocks extracted by R3D, so that the combined 3D convolution and Transformer models can be applied to short video positive energy evaluation: a good temporal modeling effect is achieved, and long videos containing a large number of video frames can be processed. To highlight the advantages of this technique, it was compared with three models, namely C3D + GRU, C3D + Transformer, and R3D + GRU; its effect is clearly better than that of the other three, demonstrating the effectiveness of combining the R3D feature extractor with the Transformer model.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A short video positive energy evaluation method based on 3D convolution and a Transformer is characterized by comprising the following steps:
acquiring a video clip, wherein the frame number of the video clip is a preset frame number;
performing feature extraction on the video clip based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
performing position coding on the feature vector;
inputting a plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and inputting the output vector to a multilayer perceptron model, and calculating to obtain a positive energy score of the video clip.
2. The method of claim 1, wherein the 3D convolution model is an R3D model, and the preset frame number is a multiple of a number of frames that the R3D model can input at a time.
3. The method of claim 2, wherein the 3D convolution model includes a plurality of fully connected layers, and the last fully connected layer in the 3D convolution model is denoted as a first fully connected layer;
carrying out feature extraction on each video segment based on a pre-trained 3D convolution model to obtain a plurality of feature vectors comprises:
performing center cropping on the video clip;
dividing the video clips subjected to center cutting according to a frame sequence to obtain a plurality of input frame groups;
inputting the plurality of input frame groups into the 3D convolution model in sequence, and recording the resulting input vector of the first fully connected layer as an intermediate feature vector;
and inputting the plurality of intermediate feature vectors into a second fully connected layer for dimension raising to obtain a plurality of feature vectors.
4. The method of claim 2, wherein the 3D convolution model is trained on the Kinetics data set with the following parameters:
the 3D convolution model is an 18-layer-deep R3D model, the number of iterations is 1M, the initial learning rate is 1e-2, the whole training process is divided into 45 stages, the first 10 stages are used for warm-up, and the learning rate is reduced to one tenth of its previous value every ten stages.
5. The method of claim 4, wherein the Transformer model comprises at least one Encoder Block, wherein the Encoder Block comprises a multi-head attention structure and a multi-layer perceptron structure;
before the feature vectors are input into each multi-head attention structure, a normalization operation is performed, and a residual connection is applied after each multi-head attention structure;
before the feature vectors are input into each multi-layer perceptron structure, a normalization operation is performed, and a residual connection is applied after each multi-layer perceptron structure;
the multi-layer perceptron structure comprises a third fully connected layer and a fourth fully connected layer, wherein the third fully connected layer expands the dimensionality of the feature vector to four times its original size, and the fourth fully connected layer restores the dimensionality of the feature vector to its original size.
6. The method of claim 5, wherein the Transformer model is trained based on the following parameters:
the loss function adopts MSE, the optimizer adopts an AdamW optimizer, the number of layers of the Encoder Block is 24, and the number of heads of a multi-head attention structure is 16;
warming up the learning rate with a Warm up method, selecting Linear Warm up as the specific warm-up strategy, setting the initial learning rate to 1e-5, the number of warm-up epochs to 15, and the total number of training epochs to 60.
7. The method of claim 5, wherein inputting a plurality of the position-coded feature vectors into a pre-trained Transformer model to obtain an output vector comprises:
inputting a plurality of feature vectors subjected to position coding and a classification head vector cls-token into a Transformer model;
and inputting the output corresponding to the classification head vector cls-token in the Transformer model as the classification feature to a final fully connected layer, which converts the 1024-dimensional feature into a one-dimensional vector to obtain the output vector.
8. The method according to claim 2, wherein the preset frame number is 96, the number of frames that the R3D model can input at a time is 16, and the number of feature vectors is 6.
9. A short video positive energy evaluation device based on 3D convolution and a Transformer is characterized by comprising:
the video acquisition module is used for acquiring a video clip, and the frame number of the video clip is a preset frame number;
the feature extraction module is used for extracting features of the video clips based on a pre-trained 3D convolution model to obtain a plurality of feature vectors;
the position coding module is used for carrying out position coding on the feature vector;
the output calculation module is used for inputting the plurality of feature vectors subjected to position coding into a pre-trained Transformer model to obtain an output vector;
and the score calculating module is used for inputting the output vector to the multilayer perceptron model and calculating to obtain the positive energy score of the video clip.
10. An electronic device comprising a processor and a memory means, wherein a plurality of instructions are stored in the memory means, and wherein the processor is configured to read the plurality of instructions from the memory means and to perform the method according to any one of claims 1 to 8.

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN116402811A * | 2023-06-05 | 2023-07-07 | 长沙海信智能系统研究院有限公司 | Fighting behavior identification method and electronic equipment
CN116402811B * | 2023-06-05 | 2023-08-18 | 长沙海信智能系统研究院有限公司 | Fighting behavior identification method and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination