CN113254713B - Multi-source emotion calculation system and method for generating emotion curve based on video content - Google Patents


Info

Publication number
CN113254713B
Authority
CN
China
Prior art keywords
emotion
video
feature
auditory
input
Prior art date
Legal status
Active
Application number
CN202110533941.1A
Other languages
Chinese (zh)
Other versions
CN113254713A (en)
Inventor
牛建伟
杨森
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202110533941.1A
Publication of CN113254713A
Application granted
Publication of CN113254713B
Active legal-status
Anticipated expiration legal-status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-source emotion calculation system and method for generating an emotion curve based on video content, relating to deep learning, computer vision, emotion calculation, audio processing, image processing and related technologies. The system comprises a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module and a long-video segmentation and processing module. The method converts the video and audio data of each short video segment into visual features and auditory features using the corresponding feature extraction convolutional neural networks; the visual and auditory features are then fused and regressed to obtain the emotion value of each video segment; finally the emotion values are combined into the emotion sequence of the long video, and a smooth emotion curve is generated with an interpolation algorithm. The invention realizes an automatic method and system for computing the emotion change curve of a video on a computer; it retains the characteristics of viewers' manual emotion annotation, and its output is smooth and natural, visually appealing, and valuable for subsequent analysis and use.

Description

Multi-source emotion calculation system and method for generating emotion curve based on video content
Technical Field
The invention relates to deep learning, computer vision and video processing technologies, and in particular to a multi-source emotion calculation system and method for generating an emotion curve based on video content, i.e. a technique for generating an emotion curve from video content.
Background
The video referred to in the present invention is specifically long video with a duration of more than 1 minute, which usually contains video content and corresponding audio data. The emotion curve refers to the change over time of the emotional feedback that the video evokes in the viewer. Emotion is represented as a 2-dimensional value of Valence and Arousal: valence indicates how positive or negative the emotion is, and arousal indicates its intensity. Computing the emotion curve of a video is a video-to-emotion-curve conversion task whose purpose is to convert the input video into an emotion curve. In recent years, video content understanding based on deep learning has advanced significantly, and recent research has proposed a series of systematic approaches, for example methods based on Convolutional Neural Networks (CNNs) and methods based on Recurrent Neural Networks (RNNs).
However, existing methods for computing a video emotion curve usually operate on either the video content or the audio content alone, which makes it difficult to exploit the combined information and characteristics. The resulting emotion curve does not match the emotional fluctuation the video evokes in the audience and cannot be used directly as the emotion representation of the video for further processing.
Disclosure of Invention
The invention aims to provide an automatic method and system that generates an emotion curve from the visual and auditory content of a video based on two-dimensional and three-dimensional convolutional neural networks, so as to solve the problem that emotion representations generated from video in the prior art perform poorly overall.
The invention discloses a multi-source emotion calculation system for generating an emotion curve based on video content, comprising a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module and a long-video segmentation and processing module. The video content feature extraction module extracts visual features from the input video. The audio content feature extraction module computes auditory features of the input video. The feature fusion regression module fuses the visual and auditory features and regresses the emotion value of each short video. The long-video segmentation and processing module segments the input long video into short videos of equal length; the emotion value of each short video is computed by the video content feature extraction module, the audio content feature extraction module and the feature fusion regression module, the values are spliced into the emotion sequence of the whole long video, and the spliced sequence is smoothed to obtain the emotion curve of the original input video.
For the above multi-source emotion calculation system for generating an emotion curve based on video content, the multi-source emotion calculation method comprises the following steps:
Step 1: cut the long video V into short video segments of equal length with a video cutting tool.
Step 2: extract video sample frames from each short video segment, then extract the visual feature Feature_visual of the short video content from the consecutive sample frames with a three-dimensional residual network.
Step 3: compute the Mel-frequency cepstral coefficients (MFCCs) of the audio in each short video segment, and extract the auditory feature Feature_auditory of the short video using the MFCCs as input.
Step 4: for each short video segment, fuse the extracted Feature_visual and Feature_auditory into a unified input vector Feature, and feed it into a regressor to obtain the emotion value of each short video segment.
Step 5: splice the emotion values of the short video segments obtained in steps 2-4 into the emotion sequence of the long video V and smooth it.
Compared with the prior art, the method and system of the invention have the following advantages and positive effects:
1. The multi-source emotion calculation system and method for generating an emotion curve based on video content compute the spatio-temporal characteristics of the video from video data of different modalities (visual and auditory), then fuse the features of the two modalities and train a regressor to obtain the emotion value of each short video. The long video is then automatically segmented and emotion is computed for each segment to obtain an emotion sequence. Because the emotion sequence is discontinuous in time, the method interpolates the sequence with third-order (cubic) spline interpolation and outputs the resulting smooth emotion curve.
2. In the design of the video visual feature extraction network, the invention uses a three-dimensional deep convolutional network, which effectively extracts spatio-temporal features that capture the context of video frames.
3. In the design of the video auditory feature extraction network, the invention provides a preprocessing method based on Mel-frequency cepstral coefficients, so that the extracted auditory features better match the characteristics of the human ear.
4. During network parameter training, a large-scale manually annotated video emotion dataset is used, so the generated video emotion curve is closer to the real experience of human viewers and more useful for further processing in subsequent video analysis.
Drawings
FIG. 1 is a schematic diagram of a multi-source emotion calculation system for generating an emotion curve based on video content according to the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The multi-source emotion calculation system for generating an emotion curve based on video content, as shown in FIG. 1, comprises the following functional modules: a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module, and a long-video segmentation and processing module.
The video content feature extraction module extracts sample frames from the input video, then extracts spatio-temporal features (visual features) of the video content from the consecutive sample frames with a three-dimensional residual network, and passes the features to the feature fusion regression module.
The audio content feature extraction module takes the audio of the input video, computes the Mel-Frequency Cepstral Coefficients (MFCC), feeds them into a deep residual network to extract audio features (auditory features), and passes them to the feature fusion regression module.
The feature fusion regression module fuses the visual features extracted by the video content feature extraction module with the auditory features extracted by the audio content feature extraction module, and regresses the emotion value of the corresponding short video with a fully connected network.
The long-video segmentation and processing module segments the input long video into equal-length short videos; the emotion values of the short videos computed by the video content feature extraction module, the audio content feature extraction module and the feature fusion regression module are spliced into the emotion sequence of the whole long video, and the spliced sequence is then smoothed with third-order spline interpolation to obtain the emotion curve of the original input video.
For the above multi-source emotion calculation system for generating an emotion curve based on video content, as shown in FIG. 1, the multi-source emotion calculation method is as follows:
Step 1: cut the long video V into short video segments of equal length with a video cutting tool (FFmpeg). In the design of the invention, the long video V is divided equally into 8-second short video segments, and the leftover remainder is ignored.
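As an illustration only (not the patent's exact command), the 8-second equal-length segmentation of step 1 can be reproduced with FFmpeg's segment muxer invoked from Python; the function name, file names and output pattern below are placeholders.

```python
# Sketch (assumption): split a long video V into 8-second segments with FFmpeg.
import subprocess

def split_into_segments(long_video_path: str,
                        out_pattern: str = "segment_%04d.mp4",
                        segment_seconds: int = 8) -> None:
    """Cut the long video into equal-length short segments; the trailing
    remainder shorter than segment_seconds ends up in the last file."""
    subprocess.run(
        [
            "ffmpeg", "-i", long_video_path,
            "-c", "copy",                 # stream copy: no re-encoding (cuts snap to keyframes;
                                          # drop this to re-encode for exact 8 s boundaries)
            "-map", "0",                  # keep both the video and the audio streams
            "-f", "segment",              # FFmpeg segment muxer
            "-segment_time", str(segment_seconds),
            "-reset_timestamps", "1",
            out_pattern,
        ],
        check=True,
    )

# Example: split_into_segments("long_video.mp4")
```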
Step 2: obtaining spatio-temporal Feature representing visual information in each short video segmentvisual. The visual feature extraction method is not limited, and includes, but is not limited to, artificially designed features, convolutional neural networks, cyclic neural networks, long-short term memory networks, and attention mechanism.
In the embodiment of the invention, one frame is taken as a key frame every 4 frames by frame sampling. Because video is continuous and changes dynamically, a conventional convolutional neural network can only process single frames and cannot effectively use the context of consecutive frames. Feature_visual is therefore extracted mainly with an 18-layer three-dimensional deep residual network (3-Dimensional ResNet, R3D); the 3-dimensional convolutional neural network processes spatial and temporal information jointly and propagates it through the network. The input tensor z_i is 4-dimensional with size 3 × T × H × W, where 3 is the number of channels per video frame (typically RGB), T is the number of frames in a video segment, and H and W are the height and width of the frames. In the embodiment of the invention, each frame is scaled to 112 × 112 (height and width). The receptive field of the deep residual network moves over the input tensor along space (height H and width W) and time (T frames); after the convolution operation and the ReLU activation function, the output tensor is produced. The three-dimensional deep residual network adopts the R3D structure, which generally performs best. The output of the i-th 3D convolution block is:
z_i = z_{i-1} + F(z_{i-1}; θ_i)
where F(z_{i-1}; θ_i) implements the convolution operations with weights θ_i followed by the ReLU function, z_{i-1} is the output of the previous 3D convolution block, and z_i is the output of the i-th 3D convolution block. The output of the 18 3D convolution blocks passes through a spatio-temporal pooling layer and one fully connected layer to produce the 128-dimensional Feature_visual representing the visual information. R3D is a 3-dimensional spatio-temporal convolutional network whose basic component is the 3D convolution block; for implementation details see Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri: A Closer Look at Spatiotemporal Convolutions for Action Recognition. CVPR 2018: 6450-6459. The ReLU function is a neural network activation function; see Xavier Glorot, Antoine Bordes, Yoshua Bengio: Deep Sparse Rectifier Neural Networks. AISTATS 2011: 315-323.
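A minimal sketch of the visual branch follows, assuming PyTorch and torchvision's r3d_18 as the 18-layer R3D backbone; replacing its classification head with a 128-dimensional fully connected layer mirrors the description above, but the class name is hypothetical and the training procedure is not reproduced.

```python
# Sketch (assumption): 18-layer R3D backbone producing the 128-dimensional Feature_visual.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VisualFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        backbone = r3d_18()                           # 18-layer 3D ResNet (R3D)
        in_dim = backbone.fc.in_features              # 512 after spatio-temporal pooling
        backbone.fc = nn.Linear(in_dim, feature_dim)  # final FC layer -> 128-d Feature_visual
        self.backbone = backbone

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, T, 112, 112) tensor of RGB key frames (one frame every 4 frames)
        return self.backbone(clip)                    # (batch, 128)

# Example: feat = VisualFeatureExtractor()(torch.randn(2, 3, 16, 112, 112))
```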
And step 3: feature representing auditory information in each short video segment is obtainedauditory. Firstly, the Mel frequency cepstrum coefficient of the audio is calculated, and then the Mel frequency cepstrum coefficient of the audio is used as input to extract auditory characteristics. The determination method of the auditory feature extraction is not limited, and includes but is not limited to artificial design features, neural networks and other machine learning methods.
In the embodiment of the invention, to reduce the input size and the model size, the sampling rate of the audio signal is reduced to 2000 Hz using sine interpolation. Feature_auditory is extracted from the Mel-frequency cepstral coefficients of the audio mainly on the principle of the deep residual network (ResNet). The invention trains an 18-layer ResNet initialized with parameters pre-trained on ImageNet, and changes the input of the first convolutional layer to 2 × 64, i.e. from the three color channels of natural images to the two channels of stereo audio. The parameters of the 18-layer ResNet model are then trained and fine-tuned on a video emotion analysis dataset; the fine-tuned model is better suited to the emotion analysis task. The network finally outputs the 128-dimensional Feature_auditory representing the auditory information. ResNet is a convolutional neural network; for implementation details see Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition. CVPR 2016: 770-778.
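A comparable sketch of the auditory branch, assuming librosa for resampling to 2000 Hz and MFCC computation and torchvision's resnet18 as the 18-layer ResNet whose first convolution is widened to 2 input channels; the function and class names are hypothetical, and the MFCC parameters (n_mfcc and hop-length defaults) are illustrative choices not specified by the patent.

```python
# Sketch (assumption): MFCC front end + 2-channel ResNet-18 producing the 128-d Feature_auditory.
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18

def stereo_mfcc(audio_path: str, target_sr: int = 2000, n_mfcc: int = 64) -> torch.Tensor:
    """Load the segment's audio track (e.g. pre-extracted with FFmpeg), downsample it
    to 2000 Hz and return a (2, n_mfcc, frames) MFCC tensor; n_mfcc=64 is illustrative."""
    wav, sr = librosa.load(audio_path, sr=None, mono=False)      # keep both channels
    if wav.ndim == 1:                                            # mono input: duplicate channel
        wav = np.stack([wav, wav])
    wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    mfcc = np.stack([librosa.feature.mfcc(y=ch, sr=target_sr, n_mfcc=n_mfcc) for ch in wav])
    return torch.from_numpy(mfcc).float()

class AuditoryFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        backbone = resnet18()                                    # 18-layer ResNet
        # first convolution takes 2 audio channels instead of 3 RGB channels (2 x 64 filters)
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.backbone = backbone

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, 2, n_mfcc, frames) -> (batch, 128) Feature_auditory
        return self.backbone(mfcc)
```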
Step 4: for each short video segment, as shown in FIG. 1, fuse the obtained Feature_visual and Feature_auditory into a unified input vector Feature and feed it into a regressor to obtain the emotion value of each short video segment. The fusion method includes, but is not limited to, machine learning methods such as neural networks and support vector machines and data fusion techniques; the regressor includes, but is not limited to, machine learning methods such as support vector regression and neural networks; the emotion value includes, but is not limited to, 2-dimensional valence-arousal emotion, discrete emotion classes and other video emotion representations.
In the embodiment of the invention, the 128-dimensional feature vector Feature_visual and the 128-dimensional feature vector Feature_auditory are first normalized to unify their scale and distribution. The normalized Feature_visual and Feature_auditory are then concatenated into a unified 256-dimensional input feature vector Feature. This feature vector is fed into a 2-layer fully connected network: the input is 256-dimensional, the output of the first layer is 64-dimensional, and the output is a 2-dimensional vector representing the arousal value and the valence value, i.e. [Arousal, Valence]. The fully connected network uses the ReLU activation function.
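A sketch of the feature fusion regression module as described: each 128-dimensional feature is normalized (L2 normalization is an assumption; the patent only states that the features are normalized), the two are concatenated into a 256-dimensional vector, and a 2-layer fully connected network (256 to 64 to 2) with ReLU outputs [Arousal, Valence]. The class name is hypothetical.

```python
# Sketch (assumption): fuse Feature_visual and Feature_auditory and regress [Arousal, Valence].
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionRegressor(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(2 * feat_dim, hidden_dim)   # 256 -> 64
        self.fc2 = nn.Linear(hidden_dim, 2)              # 64 -> [Arousal, Valence]

    def forward(self, feat_visual: torch.Tensor, feat_auditory: torch.Tensor) -> torch.Tensor:
        # normalize both features to unify their scale and distribution (L2 norm assumed)
        v = F.normalize(feat_visual, dim=-1)
        a = F.normalize(feat_auditory, dim=-1)
        fused = torch.cat([v, a], dim=-1)                # unified 256-d Feature vector
        return self.fc2(F.relu(self.fc1(fused)))         # (batch, 2) emotion value

# Example: FusionRegressor()(torch.randn(4, 128), torch.randn(4, 128))
```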
Step 5: splice the emotion values of the short video segments obtained in steps 2-4 into the emotion sequence of the long video V, and smooth it.
The emotion value of each short video obtained with steps 2-4 is a point in the 2-dimensional emotion space. These discrete points are connected into a polyline by conventional splicing. The polyline represents the emotion sequences of the long video V; each emotion sequence is a sequence of two-dimensional points. Compared with the prior art, the emotion sequence predicted by the deep learning model preserves the independence between the valence and arousal dimensions of the 2-dimensional emotion. Third-order spline interpolation is then applied to the emotion sequence to form a smooth emotion curve, which is output.
Step 6: output the smoothed emotion sequence, obtained with the interpolation algorithm, as the emotion curve.
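Steps 5-6 can be sketched with SciPy's CubicSpline, i.e. third-order spline interpolation over the per-segment emotion values; placing each value at the center of its 8-second interval and sampling the curve once per second are illustrative choices, and the function name is hypothetical.

```python
# Sketch (assumption): turn the per-segment emotion values into a smooth emotion curve.
import numpy as np
from scipy.interpolate import CubicSpline

def emotion_curve(segment_emotions: np.ndarray, segment_seconds: float = 8.0,
                  step: float = 1.0) -> tuple[np.ndarray, np.ndarray]:
    """segment_emotions: (n_segments, 2) array of [arousal, valence] per 8-second clip.
    Returns (timestamps, curve) where the curve is sampled every `step` seconds."""
    n = segment_emotions.shape[0]
    # place each segment's emotion value at the center of its time interval
    t = (np.arange(n) + 0.5) * segment_seconds
    spline = CubicSpline(t, segment_emotions, axis=0)   # third-order spline per dimension
    t_dense = np.arange(t[0], t[-1] + 1e-9, step)
    return t_dense, spline(t_dense)                     # smooth [arousal, valence] curve
```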
To verify the effectiveness of the generated emotion curve, the invention also provides a quantitative emotion curve verification method. Correlation analysis is performed between the generated emotion curve and the emotion sequence computed from viewer annotations, and Spearman's Rank Correlation Coefficient (SRCC) between the two is calculated to quantitatively measure the effectiveness of the emotion curve. The calculation is as follows:
SRCC = 1 - (6 Σ d_i^2) / (n(n^2 - 1))
where d_i = rg(X_i) - rg(Y_i) is the difference between the ranks in the 2 sequences, n is the length of the sequences, X is the emotion sequence computed by the invention and X_i its i-th value, Y is the corresponding viewer-annotated emotion sequence and Y_i its i-th value, and rg(X_i) is the rank of X_i in X, i.e. its position when sorted by size; rg(Y_i) is defined likewise. The 2 sequences are the emotion (Valence and Arousal) sequences output by the long-video segmentation and processing module and the corresponding viewer-annotated emotion sequences.
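For the quantitative check, scipy.stats.spearmanr implements the same rank-difference formula; computing the SRCC separately for the arousal and valence dimensions is shown below with placeholder data, and the function name is hypothetical.

```python
# Sketch: SRCC between the generated emotion sequence X and the viewer-annotated sequence Y.
import numpy as np
from scipy.stats import spearmanr

def srcc_per_dimension(X: np.ndarray, Y: np.ndarray) -> dict:
    """X, Y: (n, 2) sequences of [arousal, valence]; returns the SRCC of each dimension."""
    rho_arousal, _ = spearmanr(X[:, 0], Y[:, 0])
    rho_valence, _ = spearmanr(X[:, 1], Y[:, 1])
    return {"arousal": rho_arousal, "valence": rho_valence}

# Example with placeholder data:
# X = np.random.rand(100, 2); Y = np.random.rand(100, 2); print(srcc_per_dimension(X, Y))
```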

Claims (3)

1. A multi-source emotion calculation system for generating an emotion curve based on video content, characterized in that: the system comprises a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module and a long-video segmentation and processing module;
the video content feature extraction module is used for extracting visual features from an input video;
the audio content feature extraction module is used for calculating auditory features of input video;
the feature fusion regression module is used for performing fusion regression on the visual features and the auditory features and performing regression prediction on emotion values corresponding to the short videos;
the long video segmentation and processing module segments an input original long video into short videos with equal length, the emotion values of each short video calculated by the video content feature extraction module, the audio content feature extraction module and the feature fusion regression module are spliced to form a whole long video emotion sequence, and the spliced long video emotion sequence is smoothed to obtain an emotion curve of the original input video;
the multi-source emotion calculation method comprises the following steps:
step 1: cutting the long video V into short video segments of equal length with a video cutting tool;
step 2: extracting video sample frames from each short video segment, and then extracting the visual feature Feature_visual of the short video content from the consecutive sample frames with a three-dimensional residual network; the three-dimensional deep residual network adopts the R3D structure, and each frame is scaled to 112 × 112; the receptive field of the deep residual network moves over the input tensor along space and time, and the output tensor is produced after the convolution operation and the ReLU activation function;
step 3: calculating the Mel-frequency cepstral coefficients of the audio in each short video segment, and extracting the auditory feature Feature_auditory of the short video using the Mel-frequency cepstral coefficients as input; specifically, the sampling rate of the audio signal is reduced to 2000 Hz by sine interpolation, an 18-layer deep residual network is trained with parameters pre-trained on ImageNet, and the input of the first convolutional layer is changed to 2 × 64, i.e. from the three color channels of natural images to the two channels of stereo audio; the parameters of the 18-layer ResNet model are then trained and fine-tuned on a video emotion analysis dataset to obtain new parameters, and the 128-dimensional Feature_auditory representing the auditory information is output;
step 4: for each short video segment, fusing the extracted Feature_visual and Feature_auditory into a unified input vector Feature and inputting it into a regressor to obtain the emotion value of each short video segment;
step 5: splicing the emotion values of the short video segments obtained in steps 2-4 into the emotion sequence of the long video V, and smoothing it;
step 6: outputting the smoothed emotion sequence as an emotion curve with an interpolation algorithm.
2. The multi-source emotion calculation system for generating an emotion curve based on video content of claim 1, wherein: in step 4, the 128-dimensional feature vector Feature_visual and the 128-dimensional feature vector Feature_auditory are first normalized to unify their scale and distribution; the normalized Feature_visual and Feature_auditory are then concatenated into a unified 256-dimensional input feature vector Feature; this feature vector is fed into a 2-layer fully connected network, whose input is 256-dimensional and whose first-layer output is 64-dimensional; the output is a 2-dimensional vector representing the arousal value and the valence value, i.e. [Arousal, Valence]; the fully connected network uses the ReLU activation function.
3. The multi-source emotion calculation system for generating an emotion curve based on video content of claim 1, wherein: correlation analysis is performed between the generated emotion curve and the emotion sequence computed from viewer annotations, and the Spearman rank correlation coefficient between them is calculated to quantitatively measure the effectiveness of the emotion curve.
CN202110533941.1A 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content Active CN113254713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110533941.1A CN113254713B (en) 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110533941.1A CN113254713B (en) 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content

Publications (2)

Publication Number Publication Date
CN113254713A CN113254713A (en) 2021-08-13
CN113254713B (en) 2022-05-24

Family

ID=77183212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110533941.1A Active CN113254713B (en) 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content

Country Status (1)

Country Link
CN (1) CN113254713B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116456262B (en) * 2023-03-30 2024-01-23 青岛城市轨道交通科技有限公司 Dual-channel audio generation method based on multi-modal sensing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN110852215A (en) * 2019-10-30 2020-02-28 国网江苏省电力有限公司电力科学研究院 Multi-mode emotion recognition method and system and storage medium
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102660124B1 (en) * 2018-03-08 2024-04-23 한국전자통신연구원 Method for generating data for learning emotion in video, method for determining emotion in video, and apparatus using the methods

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN110852215A (en) * 2019-10-30 2020-02-28 国网江苏省电力有限公司电力科学研究院 Multi-mode emotion recognition method and system and storage medium
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism

Also Published As

Publication number Publication date
CN113254713A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
CN110909658A (en) Method for recognizing human body behaviors in video based on double-current convolutional network
CN109635676B (en) Method for positioning sound source from video
CN105787458A (en) Infrared behavior identification method based on adaptive fusion of artificial design feature and depth learning feature
CN112183240B (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
WO2022199215A1 (en) Crowd-information-fused speech emotion recognition method and system
CN108921032B (en) Novel video semantic extraction method based on deep learning model
CN113591770A (en) Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding
WO2022262098A1 (en) Video emotion semantic analysis method based on graph neural network
CN110929762B (en) Limb language detection and behavior analysis method and system based on deep learning
CN110688927A (en) Video action detection method based on time sequence convolution modeling
CN113254713B (en) Multi-source emotion calculation system and method for generating emotion curve based on video content
CN111625661A (en) Audio and video segment classification method and device
Bulzomi et al. End-to-end neuromorphic lip-reading
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN113269068B (en) Gesture recognition method based on multi-modal feature adjustment and embedded representation enhancement
Zhang et al. Modeling temporal information using discrete fourier transform for recognizing emotions in user-generated videos
CN114329070A (en) Video feature extraction method and device, computer equipment and storage medium
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
CN112183727A (en) Countermeasure generation network model, and shot effect rendering method and system based on countermeasure generation network model
KR20210035535A (en) Method of learning brain connectivity and system threrfor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant