CN112613486A - Professional stereoscopic video comfort classification method based on multilayer attention and BiGRU - Google Patents

Professional stereoscopic video comfort classification method based on multilayer attention and BiGRU

Info

Publication number
CN112613486A
CN112613486A
Authority
CN
China
Prior art keywords
video
frame
layer
attention
stereoscopic video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110016985.7A
Other languages
Chinese (zh)
Other versions
CN112613486B (en
Inventor
牛玉贞
郑愈明
彭丹泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110016985.7A priority Critical patent/CN112613486B/en
Publication of CN112613486A publication Critical patent/CN112613486A/en
Application granted granted Critical
Publication of CN112613486B publication Critical patent/CN112613486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention relates to a professional stereoscopic video comfort classification method based on multi-layer attention and BiGRU. The method comprises the following steps: 1. perform scene segmentation on the training video set and the video set to be predicted and obtain disparity maps through preprocessing; 2. perform frame-level processing to obtain preliminary frame-level features; 3. perform frame-level attention processing to obtain the final frame-level features; 4. perform shot-level processing to obtain preliminary shot-level features; 5. perform shot-level attention processing to obtain the final shot-level features; 6. two-stream fusion: fuse the outputs of the previous step using channel attention to obtain the final hidden state; 7. pass the final hidden state through a classification network to output classification probabilities and classify the professional stereoscopic video as suitable for children to watch or only suitable for adults to watch; 8. input the left views of the stereoscopic videos in the set to be tested and the corresponding disparity maps into the trained model for classification. The method can effectively distinguish whether a professional stereoscopic video is suitable for children to watch.

Description

Professional stereoscopic video comfort classification method based on multilayer attention and BiGRU
Technical Field
The invention relates to the field of image and video processing and computer vision, in particular to a professional stereoscopic video comfort classification method based on multilayer attention and BiGRU.
Background
Stereoscopic video is also called 3D video. Unlike 2D video, its most important feature is depth information, so the scenes presented in the video are no longer confined to the screen plane. The rapid development of stereoscopic technology gives people a better viewing experience but also brings problems: watching uncomfortable stereoscopic video for a long time can cause dizziness, dry eyes, nausea and similar symptoms, and these adverse reactions dampen viewers' enthusiasm for watching and can even affect their physiological health. Therefore, how to evaluate the visual comfort of stereoscopic content has become a concern. One of the main factors affecting the visual comfort of stereoscopic video is parallax, including excessive horizontal parallax, vertical parallax and rapidly changing parallax; the other main factor is the video content, including the salient objects in the video, the way the video is presented, and the motion of objects.
Although current comfort evaluation methods have achieved good results, this work generally does not consider children's interpupillary distance. Children's interpupillary distance is narrower than that of adults, their binocular fusion mechanism is less mature, and the magnitude of the parallax imaged on their retinas differs from that of adults, so children's stereoscopic perception differs from adults'. A stereoscopic video that is comfortable for adults may not be suitable for children to watch, and for children who have had eye disease, visually uncomfortable stereoscopic movies can cause headaches, eye strain and an inability to see the images clearly.
Disclosure of Invention
The invention aims to provide a professional stereoscopic video comfort classification method based on multi-layer attention and BiGRU, which solves the problem that current stereoscopic video comfort evaluation algorithms do not consider children as part of the audience and can effectively distinguish whether a professional stereoscopic video is suitable for children to watch.
In order to achieve the purpose, the technical scheme of the invention is as follows: a professional stereoscopic video comfort classification method based on multilayer attention and BiGRU comprises the following steps:
Step S1, perform scene segmentation on the training video set and the video set to be predicted, and obtain disparity maps through preprocessing;
Step S2, frame-level processing: the left views of the stereoscopic videos in the training video set and the corresponding disparity maps are taken as two-stream input for frame-level processing, and a temporal inference network is used to perceive the temporal relations between frames within each shot at multiple time scales;
Step S3, frame-level attention processing: the temporal relations between frames within each shot are weighted and summed to obtain the final frame-level features;
Step S4, shot-level processing: a bidirectional gated recurrent unit (a recurrent neural network) is used to perceive the frame-level features of several consecutive shots and output a set of hidden states;
Step S5, shot-level attention processing: the set of hidden states output in step S4 is weighted and summed to obtain the final shot-level features;
Step S6, two-stream fusion: the shot-level features output in step S5 are fused using a channel attention network to obtain the final hidden state;
Step S7, the final hidden state is passed through a classification network to output classification probabilities, and the professional stereoscopic video is classified as suitable for children to watch or only suitable for adults to watch; steps S2 to S7 constitute the constructed professional stereoscopic video visual comfort classification model; this model is trained, its optimal parameters are learned by minimizing a loss function during training, and the trained model is saved;
Step S8, the left views of the stereoscopic videos in the video set to be tested and the corresponding disparity maps are input into the trained model for classification prediction.
In an embodiment of the present invention, the step S1 specifically includes the following steps:
Step S11, a multimedia video processing tool is used to split the video into individual frame images;
Step S12, a shot segmentation algorithm is used to divide the stereoscopic video into non-overlapping video segments, each of which is called a shot;
Step S13, each frame is split into a left view and a right view, and the SIFT Flow algorithm is used to compute the horizontal displacement of corresponding pixels between the left and right views as the disparity map.
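The preprocessing of steps S11 to S13 can be sketched roughly as follows. This is only an illustrative outline: the detect_shots and sift_flow helpers are hypothetical placeholders for a shot-boundary detector and a SIFT Flow implementation, neither of which is specified beyond the description above.

```python
def preprocess_stereo_video(frames, detect_shots, sift_flow):
    """Split a side-by-side stereo video into shots and compute per-frame disparity maps.

    frames       : list of H x (2*W) x 3 arrays, left and right views side by side
    detect_shots : hypothetical shot-boundary detector, returns [(start, end), ...]
    sift_flow    : hypothetical SIFT Flow matcher, returns a per-pixel (dx, dy) flow field
    """
    shots = detect_shots(frames)                       # non-overlapping shot segments
    processed = []
    for start, end in shots:
        left_views, disparity_maps = [], []
        for frame in frames[start:end]:
            w = frame.shape[1] // 2
            left, right = frame[:, :w], frame[:, w:]   # split into left/right views
            flow = sift_flow(left, right)              # dense left-to-right correspondence
            disparity_maps.append(flow[..., 0])        # keep only the horizontal displacement
            left_views.append(left)
        processed.append((left_views, disparity_maps))
    return processed
```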
In an embodiment of the present invention, the step S2 specifically includes the following steps:
s21, sparsely sampling frames in a lens, and randomly selecting 8 frames in sequence;
Step S22, a frames are randomly extracted in temporal order from the 8 sampled frames, with a ranging from 2 to 8, and a pre-trained temporal inference network is used to perceive the temporal relation among the a frames; given a video V, the temporal relation between two frames, T2(V), is expressed by the following formula:

T2(V) = Σ_{i<j} gθ(fi, fj)

where fi and fj denote the features of the i-th and j-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ is a two-layer multilayer perceptron with 256 units per layer, and θ are the parameters of gθ; similarly, the temporal relations among 3 to 8 frames, T3(V), T4(V), T5(V), T6(V), T7(V) and T8(V), are expressed by the following formulas:

T3(V) = Σ_{i<j<k} gθ(fi, fj, fk)

T4(V) = Σ_{i<j<k<l} gθ(fi, fj, fk, fl)

T5(V) = Σ_{i<j<k<l<m} gθ(fi, fj, fk, fl, fm)

T6(V) = Σ_{i<j<k<l<m<n} gθ(fi, fj, fk, fl, fm, fn)

T7(V) = Σ_{i<j<k<l<m<n<o} gθ(fi, fj, fk, fl, fm, fn, fo)

T8(V) = Σ_{i<j<k<l<m<n<o<p} gθ(fi, fj, fk, fl, fm, fn, fo, fp)

where fi, fj, fk, fl, fm, fn, fo and fp denote the features of the i-th, j-th, k-th, l-th, m-th, n-th, o-th and p-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ denotes the two-layer multilayer perceptron, with 256 units per layer, that models the temporal relation among the corresponding a frames, and θ are its parameters;
Step S23, the inter-frame temporal relations at the various time scales within the shot are concatenated to obtain the frame-level feature Tall(V), computed as:

Tall(V) = [T2(V), T3(V), T4(V), T5(V), T6(V), T7(V), T8(V)].
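A minimal PyTorch-style sketch of the frame-level processing of steps S21 to S23 is given below, assuming per-frame features have already been extracted by a base network such as ResNet. The two-layer perceptrons with 256 units per layer follow the description above; the number of frame tuples sampled per time scale and the decision to return the per-scale relations stacked (so the frame-level attention of step S3 can weight them) are illustrative assumptions.

```python
import random
from itertools import combinations

import torch
import torch.nn as nn

class MultiScaleTemporalRelation(nn.Module):
    """Temporal relations T_2(V)..T_8(V) over 8 sparsely sampled frames of one shot."""

    def __init__(self, feat_dim, hidden=256, scales=range(2, 9), tuples_per_scale=3):
        super().__init__()
        self.scales = list(scales)
        self.tuples_per_scale = tuples_per_scale
        self.mlps = nn.ModuleDict({
            str(a): nn.Sequential(            # two-layer MLP g_theta for scale a, 256 units per layer
                nn.Linear(a * feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            for a in self.scales})

    def forward(self, frame_feats):
        # frame_feats: (8, feat_dim) features of the 8 sampled frames of one shot
        relations = []
        for a in self.scales:
            idx_tuples = list(combinations(range(frame_feats.size(0)), a))
            idx_tuples = random.sample(idx_tuples, min(self.tuples_per_scale, len(idx_tuples)))
            t_a = sum(self.mlps[str(a)](frame_feats[list(idx)].flatten())
                      for idx in idx_tuples)  # sum of relation scores over sampled a-frame tuples
            relations.append(t_a)
        # stack T_2..T_8 so the frame-level attention can weight each time scale
        return torch.stack(relations)         # (7, hidden)
```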
in an embodiment of the present invention, the step S3 specifically includes the following steps:
Step S31, first, for the temporal relation feature Ta(V) output by the temporal inference network at each time scale, the hidden-layer vector ua is computed:

ua = tanh(WfTa(V) + bf)

where Wf and bf are parameters of a single-layer perceptron;
Step S32, to measure the importance of the temporal relation at each time scale, ua is normalized:

αa = exp(ua · uf) / Σa exp(ua · uf)
where uf is a context vector representing the importance of the temporal relation at the corresponding time scale; it is randomly initialized at the start of training and learned;
Step S33, the final temporal feature x, i.e. the frame-level feature, is computed as:

x = Σa αa Ta(V)
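The frame-level attention of steps S31 to S33 can be sketched as below; proj and context correspond to the single-layer perceptron (Wf, bf) and the learned context vector uf described above, and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class FrameLevelAttention(nn.Module):
    """Weights the per-scale temporal relations T_a(V) and sums them into x."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                  # W_f, b_f (single-layer perceptron)
        self.context = nn.Parameter(torch.randn(dim))    # u_f, randomly initialised and learned

    def forward(self, relations):
        # relations: (num_scales, dim), the T_2(V)..T_8(V) of one shot
        u = torch.tanh(self.proj(relations))             # u_a = tanh(W_f T_a(V) + b_f)
        alpha = torch.softmax(u @ self.context, dim=0)   # importance of each time scale
        return (alpha.unsqueeze(-1) * relations).sum(0)  # x = sum_a alpha_a * T_a(V)
```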
in an embodiment of the present invention, the step S4 specifically includes the following steps:
Step S41, the frame-level features of each of the s consecutive shots obtained in step S33 are concatenated; each shot has one frame-level feature x, and the frame-level feature of the t-th shot, t = 1, 2, ..., s, is denoted xt; these frame-level features are used as the input of the bidirectional gated recurrent unit; at moment t, t = 1, 2, ..., s, the input of a gated recurrent unit is the hidden state ht-1 of the previous moment together with the frame-level feature xt of the t-th shot, and it outputs the hidden state ht of the current moment; the gated recurrent unit contains 2 gates, a reset gate rt and an update gate zt; the former is used when computing the candidate hidden state h̃t and controls how much information of the previous hidden state ht-1 is retained, while the latter controls how much of the candidate hidden state h̃t is added to obtain the output hidden state ht; rt, zt, h̃t and ht are computed as follows:

zt = σ(Wzxt + Uzht-1)

rt = σ(Wrxt + Urht-1)

h̃t = tanh(Wxt + U(rt ⊙ ht-1))

ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t
where σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, tanh is the activation function, and Wz, Uz, Wr, Ur, W and U are weight matrices learned during training;
Step S42, since the bidirectional gated recurrent unit consists of 2 unidirectional gated recurrent units with opposite directions, the final output ht is jointly determined by the hidden states of the two units; at each moment the input is fed simultaneously to the 2 oppositely directed gated recurrent units, the output is jointly determined by them, and the outputs of the 2 unidirectional units are concatenated as the output of the bidirectional gated recurrent unit, giving the set of hidden states it outputs; when the input is the video frame sequence, the output of the bidirectional gated recurrent unit is the hidden state set hf; when the input is the disparity map sequence, the output is the hidden state set hd; hf and hd are given by:

hf = {h1^f, h2^f, ..., hs^f}

hd = {h1^d, h2^d, ..., hs^d}

where ht^f denotes the hidden state output at moment t, t = 1, 2, ..., s, of the video frame sequence, and ht^d denotes the hidden state output at moment t, t = 1, 2, ..., s, of the disparity map sequence.
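Steps S41 and S42 correspond to a standard bidirectional GRU, as in the following sketch; the hidden size is an illustrative choice, and the same module is applied once to the video-frame stream and once to the disparity-map stream.

```python
import torch
import torch.nn as nn

class ShotLevelBiGRU(nn.Module):
    """Runs a bidirectional GRU over the frame-level features of s consecutive shots."""

    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, shot_feats):
        # shot_feats: (batch, s, in_dim), one frame-level feature x_t per shot
        hidden_states, _ = self.bigru(shot_feats)
        # hidden_states: (batch, s, 2*hidden), forward and backward GRU outputs
        # concatenated at every step -- the hidden state set h_f (or h_d)
        return hidden_states
```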
In an embodiment of the present invention, the step S5 specifically includes the following steps:
Step S51, for the video frame sequence, the model first computes, from the hidden state ht^f output by the gated recurrent unit at each moment, the hidden-layer vector ut:

ut = tanh(Ws ht^f + bs)

where Ws and bs are parameters of a single-layer perceptron;
Step S52, to measure the importance of each shot, ut is normalized:

αt = exp(ut · us) / Σt exp(ut · us)
where us is a context vector representing the importance of the corresponding shot; it is randomly initialized at the start of training and learned;
Step S53, the hidden state hf of the video frame sequence is then computed as:

hf = Σt αt ht^f
Step S54, similarly, the hidden state hd of the disparity map sequence is obtained through the same process; hf and hd are concatenated to obtain ha, computed as:

ha = [hf, hd]
At this point the final shot-level features have been obtained.
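A sketch of the shot-level attention of steps S51 to S54, applied separately to the video-frame stream and the disparity-map stream before the two results are concatenated into ha; the parameter names mirror Ws, bs and us above, and the shapes are illustrative.

```python
import torch
import torch.nn as nn

class ShotLevelAttention(nn.Module):
    """Weights the BiGRU hidden states of the s shots and sums them."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                   # W_s, b_s
        self.context = nn.Parameter(torch.randn(dim))     # u_s, learned context vector

    def forward(self, hidden_states):
        # hidden_states: (batch, s, dim), output of the bidirectional GRU
        u = torch.tanh(self.proj(hidden_states))                  # u_t
        alpha = torch.softmax(u @ self.context, dim=1)            # per-shot importance
        return (alpha.unsqueeze(-1) * hidden_states).sum(dim=1)   # h_f or h_d

# the two streams are then concatenated: h_a = torch.cat([h_f, h_d], dim=-1)
```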
In an embodiment of the present invention, the step S6 specifically includes the following steps:
Step S61, channel attention is used to compute the weight of each hidden state in ha, denoted Fscale(·,·), as follows:
Fscale(·,·)=σ(W2δ(W1ha))
where δ is the ReLU function, σ is the sigmoid function, and W1 and W2 are the parameter matrices of two single-layer perceptrons, obtained through training;
Step S62, h̃a denotes the finally obtained importance of each channel; it is the product of Fscale(·,·) and ha, computed as:

h̃a = Fscale(·,·) ⊙ ha

The weighted final hidden state h̃a then passes through the classification network to obtain the final classification probability.
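The two-stream fusion of steps S61 and S62 can be sketched as a squeeze-and-excitation style gating of ha, with fc1 and fc2 playing the roles of the two single-layer perceptrons W1 and W2; the reduction ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Re-weights each channel of the concatenated shot-level feature h_a."""

    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)   # W_1
        self.fc2 = nn.Linear(dim // reduction, dim)   # W_2

    def forward(self, h_a):
        # h_a: (batch, dim), concatenation of h_f and h_d
        scale = torch.sigmoid(self.fc2(torch.relu(self.fc1(h_a))))  # F_scale = sigma(W2 delta(W1 h_a))
        return scale * h_a                                          # weighted final hidden state
```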
In an embodiment of the present invention, the step S7 specifically includes the following steps:
Step S71, to prevent the network from overfitting, h̃a is fed into the first layer of the classification network, a dropout layer;
Step S72, the output after dropout is fed into the second layer of the classification network, a fully connected layer; the output of the fully connected layer is converted by a softmax function into classification probabilities in the range (0, 1), and the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch;
step S73, calculating the parameter gradient of the professional stereoscopic video visual comfort classification model by using a back propagation method according to the cross entropy loss function, and updating the parameter by using a self-adaptive gradient descent method;
wherein the cross entropy loss function L is defined as follows:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

where N denotes the number of samples in each batch, yi is the label of sample i (yi = 1 for a positive sample, meaning suitable for children to watch; yi = 0 for a negative sample, meaning suitable for adults to watch only), and pi denotes the probability that the model predicts sample i as a positive sample;
Step S74, training proceeds in batches until the value of L computed in step S73 converges to a threshold or the number of iterations reaches the threshold; network training is then complete, the optimal parameters of the professional stereoscopic video visual comfort classification model have been learned, and the model parameters are saved.
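Steps S71 to S74 amount to a dropout layer, a fully connected classifier and binary cross-entropy training; a minimal sketch follows. The sigmoid output used here is the binary equivalent of a two-way softmax, and the choice of Adam as the adaptive gradient-descent method is an assumption, since the description only requires such a method in general.

```python
import torch
import torch.nn as nn

class ComfortClassifier(nn.Module):
    """Dropout layer followed by a fully connected layer with a sigmoid output."""

    def __init__(self, dim, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Dropout(p_drop), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, h):
        return self.net(h).squeeze(-1)   # probability of "suitable for children"

def train_step(model, classifier, batch, labels, optimizer, loss_fn=nn.BCELoss()):
    # model maps a batch of samples (s consecutive shots each) to fused hidden states
    optimizer.zero_grad()
    probs = classifier(model(batch))
    loss = loss_fn(probs, labels.float())   # cross-entropy loss L
    loss.backward()                         # back-propagate the parameter gradients
    optimizer.step()                        # adaptive gradient-descent update
    return loss.item()
```

A typical setup would be `optimizer = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()))`, followed by repeated `train_step` calls over the training batches until the loss converges.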
In an embodiment of the present invention, the step S8 specifically includes the following steps:
step S81, preprocessing a video set to be tested by using the step S1 to obtain a disparity map;
step S82, performing frame level processing on the left view of the stereoscopic video and the corresponding disparity map in the video set to be tested by using the step S2;
Step S83, the video set to be tested is processed and predicted through steps S3 to S7 using the trained model parameters saved in step S7; every s consecutive shots are taken as one sample, and when the probability that the model predicts the sample as positive is greater than 0.5, the sample is classified as a positive sample, otherwise as a negative sample.
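Prediction on the test set (steps S81 to S83) then reduces to thresholding the predicted probability at 0.5 for each group of s consecutive shots, as in this sketch; full_model is a placeholder for the trained network assembled from the components above.

```python
import torch

@torch.no_grad()
def predict_samples(full_model, samples, threshold=0.5):
    """samples: iterable of preprocessed groups of s consecutive shots (two streams each)."""
    labels = []
    for sample in samples:
        prob = full_model(sample)            # probability of being a positive sample
        labels.append("suitable for children" if prob.item() > threshold
                      else "suitable for adults only")
    return labels
```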
Compared with the prior art, the invention has the following beneficial effects. First, addressing the problem that current stereoscopic video comfort evaluation algorithms do not consider children as part of the audience, the invention provides a professional stereoscopic video visual comfort classification method based on multi-layer attention and a recurrent neural network that can distinguish whether a professional stereoscopic video is suitable for children to watch. Second, considering that the main factors causing visual discomfort include video content and parallax, the method adopts a two-stream structure to learn the features and temporal relations of the stereoscopic video frames and of the disparity map sequence separately, evaluating the stereoscopic visual comfort of the video more comprehensively. Finally, because visual discomfort usually occurs only in particular video segments, which increases the difficulty of classification, the method applies frame-level attention, shot-level attention and channel attention so that the model pays more attention to the segments and feature branches that cause visual discomfort, thereby improving classification accuracy.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of an overall structure of a professional stereoscopic video visual comfort classification model according to an embodiment of the present invention;
fig. 3 is a diagram of a frame-level processing temporal inference network model architecture in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, the present embodiment provides a professional stereoscopic video comfort classification method based on multi-layer attention and BiGRU, including the following steps:
step S1, carrying out scene segmentation on the training video set and the video set to be predicted and obtaining a disparity map through preprocessing; the method specifically comprises the following steps:
Step S11: a multimedia video processing tool is used to split the video into individual frame images;
Step S12: a shot segmentation algorithm is used to divide the stereoscopic video into non-overlapping video segments, each of which is called a shot;
Step S13: each frame is split into a left view and a right view, and the SIFT Flow algorithm is used to compute the horizontal displacement of corresponding pixels between the left and right views as the disparity map.
Step S2, frame-level processing: the left views of the stereoscopic videos in the training video set and the corresponding disparity maps are taken as two-stream input for frame-level processing, and a temporal inference network is used to perceive the temporal relations between frames within each shot at multiple time scales; this specifically includes the following steps:
s21, sparsely sampling frames in a lens, and randomly selecting 8 frames in sequence;
Step S22, a frames are randomly extracted in temporal order from the 8 sampled frames, with a ranging from 2 to 8, and a pre-trained temporal inference network is used to perceive the temporal relation among the a frames; given a video V, the temporal relation between two frames, T2(V), is expressed by the following formula:

T2(V) = Σ_{i<j} gθ(fi, fj)

where fi and fj denote the features of the i-th and j-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ is a two-layer multilayer perceptron with 256 units per layer, and θ are the parameters of gθ; similarly, the temporal relations among 3 to 8 frames, T3(V), T4(V), T5(V), T6(V), T7(V) and T8(V), are expressed by the following formulas:

T3(V) = Σ_{i<j<k} gθ(fi, fj, fk)

T4(V) = Σ_{i<j<k<l} gθ(fi, fj, fk, fl)

T5(V) = Σ_{i<j<k<l<m} gθ(fi, fj, fk, fl, fm)

T6(V) = Σ_{i<j<k<l<m<n} gθ(fi, fj, fk, fl, fm, fn)

T7(V) = Σ_{i<j<k<l<m<n<o} gθ(fi, fj, fk, fl, fm, fn, fo)

T8(V) = Σ_{i<j<k<l<m<n<o<p} gθ(fi, fj, fk, fl, fm, fn, fo, fp)

where fi, fj, fk, fl, fm, fn, fo and fp denote the features of the i-th, j-th, k-th, l-th, m-th, n-th, o-th and p-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ denotes the two-layer multilayer perceptron, with 256 units per layer, that models the temporal relation among the corresponding a frames, and θ are its parameters;
Step S23, the inter-frame temporal relations at the various time scales within the shot are concatenated to obtain the frame-level feature Tall(V), computed as:

Tall(V) = [T2(V), T3(V), T4(V), T5(V), T6(V), T7(V), T8(V)].
Step S3, frame-level attention processing: the temporal relations between frames within each shot are weighted and summed to obtain the final frame-level features; this specifically includes the following steps:
Step S31, first, for the temporal relation feature Ta(V) output by the temporal inference network at each time scale, the hidden-layer vector ua is computed:

ua = tanh(WfTa(V) + bf)

where Wf and bf are parameters of a single-layer perceptron;
Step S32, to measure the importance of the temporal relation at each time scale, ua is normalized:

αa = exp(ua · uf) / Σa exp(ua · uf)
where uf is a context vector representing the importance of the temporal relation at the corresponding time scale; it is randomly initialized at the start of training and learned;
Step S33, the final temporal feature x, i.e. the frame-level feature, is computed as:

x = Σa αa Ta(V)
Step S4, shot-level processing: a bidirectional gated recurrent unit (a recurrent neural network) is used to perceive the frame-level features of several consecutive shots and output a set of hidden states; this specifically includes the following steps:
Step S41, the frame-level features of each of the s consecutive shots obtained in step S33 are concatenated; each shot has one frame-level feature x, and the frame-level feature of the t-th shot, t = 1, 2, ..., s, is denoted xt; these frame-level features are used as the input of the bidirectional gated recurrent unit; at moment t, t = 1, 2, ..., s, the input of a gated recurrent unit is the hidden state ht-1 of the previous moment together with the frame-level feature xt of the t-th shot, and it outputs the hidden state ht of the current moment; the gated recurrent unit contains 2 gates, a reset gate rt and an update gate zt; the former is used when computing the candidate hidden state h̃t and controls how much information of the previous hidden state ht-1 is retained, while the latter controls how much of the candidate hidden state h̃t is added to obtain the output hidden state ht; rt, zt, h̃t and ht are computed as follows:

zt = σ(Wzxt + Uzht-1)

rt = σ(Wrxt + Urht-1)

h̃t = tanh(Wxt + U(rt ⊙ ht-1))

ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t
where σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, tanh is the activation function, and Wz, Uz, Wr, Ur, W and U are weight matrices learned during training;
Step S42, since the bidirectional gated recurrent unit consists of 2 unidirectional gated recurrent units with opposite directions, the final output ht is jointly determined by the hidden states of the two units; at each moment the input is fed simultaneously to the 2 oppositely directed gated recurrent units, the output is jointly determined by them, and the outputs of the 2 unidirectional units are concatenated as the output of the bidirectional gated recurrent unit, giving the set of hidden states it outputs; when the input is the video frame sequence, the output of the bidirectional gated recurrent unit is the hidden state set hf; when the input is the disparity map sequence, the output is the hidden state set hd; hf and hd are given by:

hf = {h1^f, h2^f, ..., hs^f}

hd = {h1^d, h2^d, ..., hs^d}

where ht^f denotes the hidden state output at moment t, t = 1, 2, ..., s, of the video frame sequence, and ht^d denotes the hidden state output at moment t, t = 1, 2, ..., s, of the disparity map sequence.
Step S5, shot-level attention processing: the set of hidden states output in step S4 is weighted and summed to obtain the final shot-level features; this specifically includes the following steps:
Step S51, for the video frame sequence, the model first computes, from the hidden state ht^f output by the gated recurrent unit at each moment, the hidden-layer vector ut:

ut = tanh(Ws ht^f + bs)

where Ws and bs are parameters of a single-layer perceptron;
Step S52, to measure the importance of each shot, ut is normalized:

αt = exp(ut · us) / Σt exp(ut · us)
where us is a context vector representing the importance of the corresponding shot; it is randomly initialized at the start of training and learned;
Step S53, the hidden state hf of the video frame sequence is then computed as:

hf = Σt αt ht^f
Step S54, similarly, the hidden state hd of the disparity map sequence is obtained through the same process; hf and hd are concatenated to obtain ha, computed as:

ha = [hf, hd]
At this point the final shot-level features have been obtained.
Step S6, two-stream fusion: the shot-level features output in step S5 are fused using a channel attention network to obtain the final hidden state; this specifically includes the following steps:
Step S61, channel attention is used to compute the weight of each hidden state in ha, denoted Fscale(·,·), as follows:
Fscale(·,·)=σ(W2δ(W1ha))
where δ is the ReLU function, σ is the sigmoid function, and W1 and W2 are the parameter matrices of two single-layer perceptrons, obtained through training;
Step S62, h̃a denotes the finally obtained importance of each channel; it is the product of Fscale(·,·) and ha, computed as:

h̃a = Fscale(·,·) ⊙ ha

The weighted final hidden state h̃a then passes through the classification network to obtain the final classification probability.
Step S7, the final hidden state is passed through a classification network to output classification probabilities, and the professional stereoscopic video is classified as suitable for children to watch or only suitable for adults to watch; steps S2 to S7 constitute the constructed professional stereoscopic video visual comfort classification model; this model is trained, its optimal parameters are learned by minimizing a loss function during training, and the trained model is saved; this specifically includes the following steps:
Step S71, to prevent the network from overfitting, h̃a is fed into the first layer of the classification network, a dropout layer;
Step S72, the output after dropout is fed into the second layer of the classification network, a fully connected layer; the output of the fully connected layer is converted by a softmax function into classification probabilities in the range (0, 1), and the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch;
step S73, calculating the parameter gradient of the professional stereoscopic video visual comfort classification model by using a back propagation method according to the cross entropy loss function, and updating the parameter by using a self-adaptive gradient descent method;
wherein the cross entropy loss function L is defined as follows:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

where N denotes the number of samples in each batch, yi is the label of sample i (yi = 1 for a positive sample, meaning suitable for children to watch; yi = 0 for a negative sample, meaning suitable for adults to watch only), and pi denotes the probability that the model predicts sample i as a positive sample;
Step S74, training proceeds in batches until the value of L computed in step S73 converges to a threshold or the number of iterations reaches the threshold; network training is then complete, the optimal parameters of the professional stereoscopic video visual comfort classification model have been learned, and the model parameters are saved.
Step S8, the left views of the stereoscopic videos in the video set to be tested and the corresponding disparity maps are input into the trained model for classification prediction; this specifically includes the following steps:
step S81, preprocessing a video set to be tested by using the step S1 to obtain a disparity map;
step S82, performing frame level processing on the left view of the stereoscopic video and the corresponding disparity map in the video set to be tested by using the step S2;
Step S83, the video set to be tested is processed and predicted through steps S3 to S7 using the trained model parameters saved in step S7; every s consecutive shots are taken as one sample, and when the probability that the model predicts the sample as positive is greater than 0.5, the sample is classified as a positive sample, otherwise as a negative sample.
Preferably, in the present embodiment, the professional stereoscopic video visual comfort classification model is composed of the networks constructed in steps S2 to S7.
Preferably, in this embodiment, the video frames and disparity maps of several consecutive shots of a professional stereoscopic video are used as input; a temporal inference network and a bidirectional gated recurrent unit are used to perceive and evaluate the short- and long-range temporal relations of the video at the frame level and the shot level respectively; multi-layer attention is used to integrate information from the video segments and feature branches that cause visual discomfort; and finally the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch.
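As a rough illustration of how the components of this embodiment fit together, the following sketch assembles the two streams from the modules sketched earlier in this description (MultiScaleTemporalRelation, FrameLevelAttention, ShotLevelBiGRU, ShotLevelAttention, ChannelAttentionFusion, ComfortClassifier); it assumes those classes are in scope, and the feature dimensions are illustrative assumptions rather than values prescribed by the invention.

```python
import torch
import torch.nn as nn

class StereoComfortNet(nn.Module):
    """Two-stream (left-view + disparity) comfort classification model."""

    def __init__(self, feat_dim=2048, rel_dim=256, gru_hidden=128):
        super().__init__()
        make_stream = lambda: nn.ModuleDict({
            "relation":  MultiScaleTemporalRelation(feat_dim, rel_dim),
            "frame_att": FrameLevelAttention(rel_dim),
            "bigru":     ShotLevelBiGRU(rel_dim, gru_hidden),
            "shot_att":  ShotLevelAttention(2 * gru_hidden)})
        self.frame_stream, self.disp_stream = make_stream(), make_stream()
        self.fusion = ChannelAttentionFusion(4 * gru_hidden)
        self.classifier = ComfortClassifier(4 * gru_hidden)

    def run_stream(self, stream, shots):
        # shots: (batch, s, 8, feat_dim) pre-extracted features of 8 frames per shot
        b, s = shots.shape[:2]
        x = torch.stack([stream["frame_att"](stream["relation"](shots[i, j]))
                         for i in range(b) for j in range(s)]).view(b, s, -1)
        return stream["shot_att"](stream["bigru"](x))      # (batch, 2*gru_hidden)

    def forward(self, frame_shots, disp_shots):
        h_f = self.run_stream(self.frame_stream, frame_shots)   # left-view stream
        h_d = self.run_stream(self.disp_stream, disp_shots)     # disparity-map stream
        return self.classifier(self.fusion(torch.cat([h_f, h_d], dim=-1)))
```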
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (9)

1. A professional stereoscopic video comfort classification method based on multilayer attention and BiGRU is characterized by comprising the following steps:
Step S1, perform scene segmentation on the training video set and the video set to be predicted, and obtain disparity maps through preprocessing;
Step S2, frame-level processing: the left views of the stereoscopic videos in the training video set and the corresponding disparity maps are taken as two-stream input for frame-level processing, and a temporal inference network is used to perceive the temporal relations between frames within each shot at multiple time scales;
Step S3, frame-level attention processing: the temporal relations between frames within each shot are weighted and summed to obtain the final frame-level features;
Step S4, shot-level processing: a bidirectional gated recurrent unit (a recurrent neural network) is used to perceive the frame-level features of several consecutive shots and output a set of hidden states;
Step S5, shot-level attention processing: the set of hidden states output in step S4 is weighted and summed to obtain the final shot-level features;
Step S6, two-stream fusion: the shot-level features output in step S5 are fused using a channel attention network to obtain the final hidden state;
Step S7, the final hidden state is passed through a classification network to output classification probabilities, and the professional stereoscopic video is classified as suitable for children to watch or only suitable for adults to watch; steps S2 to S7 constitute the constructed professional stereoscopic video visual comfort classification model; this model is trained, its optimal parameters are learned by minimizing a loss function during training, and the trained model is saved;
Step S8, the left views of the stereoscopic videos in the video set to be tested and the corresponding disparity maps are input into the trained model for classification prediction.
2. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU as claimed in claim 1, wherein the step S1 specifically comprises the following steps:
Step S11, a multimedia video processing tool is used to split the video into individual frame images;
Step S12, a shot segmentation algorithm is used to divide the stereoscopic video into non-overlapping video segments, each of which is called a shot;
Step S13, each frame is split into a left view and a right view, and the SIFT Flow algorithm is used to compute the horizontal displacement of corresponding pixels between the left and right views as the disparity map.
3. The method for classifying the comfort of the professional stereoscopic video based on multi-layer attention and BiGRU according to claim 2, wherein the step S2 specifically comprises the following steps:
s21, sparsely sampling frames in a lens, and randomly selecting 8 frames in sequence;
Step S22, a frames are randomly extracted in temporal order from the 8 sampled frames, with a ranging from 2 to 8, and a pre-trained temporal inference network is used to perceive the temporal relation among the a frames; given a video V, the temporal relation between two frames, T2(V), is expressed by the following formula:

T2(V) = Σ_{i<j} gθ(fi, fj)

where fi and fj denote the features of the i-th and j-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ is a two-layer multilayer perceptron with 256 units per layer, and θ are the parameters of gθ; similarly, the temporal relations among 3 to 8 frames, T3(V), T4(V), T5(V), T6(V), T7(V) and T8(V), are expressed by the following formulas:

T3(V) = Σ_{i<j<k} gθ(fi, fj, fk)

T4(V) = Σ_{i<j<k<l} gθ(fi, fj, fk, fl)

T5(V) = Σ_{i<j<k<l<m} gθ(fi, fj, fk, fl, fm)

T6(V) = Σ_{i<j<k<l<m<n} gθ(fi, fj, fk, fl, fm, fn)

T7(V) = Σ_{i<j<k<l<m<n<o} gθ(fi, fj, fk, fl, fm, fn, fo)

T8(V) = Σ_{i<j<k<l<m<n<o<p} gθ(fi, fj, fk, fl, fm, fn, fo, fp)

where fi, fj, fk, fl, fm, fn, fo and fp denote the features of the i-th, j-th, k-th, l-th, m-th, n-th, o-th and p-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ denotes the two-layer multilayer perceptron, with 256 units per layer, that models the temporal relation among the corresponding a frames, and θ are its parameters;
Step S23, the inter-frame temporal relations at the various time scales within the shot are concatenated to obtain the frame-level feature Tall(V), computed as:

Tall(V) = [T2(V), T3(V), T4(V), T5(V), T6(V), T7(V), T8(V)].
4. the method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 3, wherein the step S3 specifically comprises the following steps:
Step S31, first, for the temporal relation feature Ta(V) output by the temporal inference network at each time scale, the hidden-layer vector ua is computed:

ua = tanh(WfTa(V) + bf)

where Wf and bf are parameters of a single-layer perceptron;
Step S32, to measure the importance of the temporal relation at each time scale, ua is normalized:

αa = exp(ua · uf) / Σa exp(ua · uf)
where uf is a context vector representing the importance of the temporal relation at the corresponding time scale; it is randomly initialized at the start of training and learned;
Step S33, the final temporal feature x, i.e. the frame-level feature, is computed as:

x = Σa αa Ta(V)
5. the method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 4, wherein the step S4 specifically comprises the following steps:
Step S41, the frame-level features of each of the s consecutive shots obtained in step S33 are concatenated; each shot has one frame-level feature x, and the frame-level feature of the t-th shot, t = 1, 2, ..., s, is denoted xt; these frame-level features are used as the input of the bidirectional gated recurrent unit; at moment t, t = 1, 2, ..., s, the input of a gated recurrent unit is the hidden state ht-1 of the previous moment together with the frame-level feature xt of the t-th shot, and it outputs the hidden state ht of the current moment; the gated recurrent unit contains 2 gates, a reset gate rt and an update gate zt; the former is used when computing the candidate hidden state h̃t and controls how much information of the previous hidden state ht-1 is retained, while the latter controls how much of the candidate hidden state h̃t is added to obtain the output hidden state ht; rt, zt, h̃t and ht are computed as follows:

zt = σ(Wzxt + Uzht-1)

rt = σ(Wrxt + Urht-1)

h̃t = tanh(Wxt + U(rt ⊙ ht-1))

ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t
where σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, tanh is the activation function, and Wz, Uz, Wr, Ur, W and U are weight matrices learned during training;
Step S42, since the bidirectional gated recurrent unit consists of 2 unidirectional gated recurrent units with opposite directions, the final output ht is jointly determined by the hidden states of the two units; at each moment the input is fed simultaneously to the 2 oppositely directed gated recurrent units, the output is jointly determined by them, and the outputs of the 2 unidirectional units are concatenated as the output of the bidirectional gated recurrent unit, giving the set of hidden states it outputs; when the input is the video frame sequence, the output of the bidirectional gated recurrent unit is the hidden state set hf; when the input is the disparity map sequence, the output is the hidden state set hd; hf and hd are given by:

hf = {h1^f, h2^f, ..., hs^f}

hd = {h1^d, h2^d, ..., hs^d}

where ht^f denotes the hidden state output at moment t, t = 1, 2, ..., s, of the video frame sequence, and ht^d denotes the hidden state output at moment t, t = 1, 2, ..., s, of the disparity map sequence.
6. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 5, wherein the step S5 specifically comprises the following steps:
Step S51, for the video frame sequence, the model first computes, from the hidden state ht^f output by the gated recurrent unit at each moment, the hidden-layer vector ut:

ut = tanh(Ws ht^f + bs)

where Ws and bs are parameters of a single-layer perceptron;
Step S52, to measure the importance of each shot, ut is normalized:

αt = exp(ut · us) / Σt exp(ut · us)
where us is a context vector representing the importance of the corresponding shot; it is randomly initialized at the start of training and learned;
Step S53, the hidden state hf of the video frame sequence is then computed as:

hf = Σt αt ht^f
Step S54, similarly, the hidden state hd of the disparity map sequence is obtained through the same process; hf and hd are concatenated to obtain ha, computed as:

ha = [hf, hd]
At this point the final shot-level features have been obtained.
7. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 6, wherein the step S6 specifically comprises the following steps:
Step S61, channel attention is used to compute the weight of each hidden state in ha, denoted Fscale(·,·), as follows:
Fscale(·,·)=σ(W2δ(W1ha))
where δ is the ReLU function, σ is the sigmoid function, and W1 and W2 are the parameter matrices of two single-layer perceptrons, obtained through training;
Step S62, h̃a denotes the finally obtained importance of each channel; it is the product of Fscale(·,·) and ha, computed as:

h̃a = Fscale(·,·) ⊙ ha

The weighted final hidden state h̃a then passes through the classification network to obtain the final classification probability.
8. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU as claimed in claim 7, wherein the step S7 specifically comprises the following steps:
Step S71, to prevent the network from overfitting, h̃a is fed into the first layer of the classification network, a dropout layer;
Step S72, the output after dropout is fed into the second layer of the classification network, a fully connected layer; the output of the fully connected layer is converted by a softmax function into classification probabilities in the range (0, 1), and the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch;
step S73, calculating the parameter gradient of the professional stereoscopic video visual comfort classification model by using a back propagation method according to the cross entropy loss function, and updating the parameter by using a self-adaptive gradient descent method;
wherein the cross entropy loss function L is defined as follows:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

where N denotes the number of samples in each batch, yi is the label of sample i (yi = 1 for a positive sample, meaning suitable for children to watch; yi = 0 for a negative sample, meaning suitable for adults to watch only), and pi denotes the probability that the model predicts sample i as a positive sample;
Step S74, training proceeds in batches until the value of L computed in step S73 converges to a threshold or the number of iterations reaches the threshold; network training is then complete, the optimal parameters of the professional stereoscopic video visual comfort classification model have been learned, and the model parameters are saved.
9. The method for classifying the comfort of the professional stereoscopic video based on multi-layer attention and BiGRU according to claim 8, wherein the step S8 specifically comprises the following steps:
step S81, preprocessing a video set to be tested by using the step S1 to obtain a disparity map;
step S82, performing frame level processing on the left view of the stereoscopic video and the corresponding disparity map in the video set to be tested by using the step S2;
Step S83, the video set to be tested is processed and predicted through steps S3 to S7 using the trained model parameters saved in step S7; every s consecutive shots are taken as one sample, and when the probability that the model predicts the sample as positive is greater than 0.5, the sample is classified as a positive sample, otherwise as a negative sample.
CN202110016985.7A 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU Active CN112613486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110016985.7A CN112613486B (en) 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110016985.7A CN112613486B (en) 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU

Publications (2)

Publication Number Publication Date
CN112613486A true CN112613486A (en) 2021-04-06
CN112613486B CN112613486B (en) 2023-08-08

Family

ID=75253406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110016985.7A Active CN112613486B (en) 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU

Country Status (1)

Country Link
CN (1) CN112613486B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807318A (en) * 2021-10-11 2021-12-17 南京信息工程大学 Action identification method based on double-current convolutional neural network and bidirectional GRU
CN116935292A (en) * 2023-09-15 2023-10-24 山东建筑大学 Short video scene classification method and system based on self-attention model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109508642A (en) * 2018-10-17 2019-03-22 杭州电子科技大学 Ship monitor video key frame extracting method based on two-way GRU and attention mechanism
CN111860691A (en) * 2020-07-31 2020-10-30 福州大学 Professional stereoscopic video visual comfort degree classification method based on attention and recurrent neural network
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109508642A (en) * 2018-10-17 2019-03-22 杭州电子科技大学 Ship monitor video key frame extracting method based on two-way GRU and attention mechanism
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN111860691A (en) * 2020-07-31 2020-10-30 福州大学 Professional stereoscopic video visual comfort degree classification method based on attention and recurrent neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XUANZHEN FENG ET AL.: "Sentiment Classification of Reviews Based on BiGRU Neural Network and Fine-grained Attention", Journal of Physics: Conference Series *
LI ZHAOGUANG: "Research on sports video classification based on deep learning and transfer learning", Electronic Measurement Technology *
SANG HAIFENG; ZHAO ZIYU; HE DAKUO: "Design of a video action recognition network based on recurrent region attention and video frame attention", Acta Electronica Sinica, no. 06 *
WEI LESONG ET AL.: "No-reference screen content image quality assessment based on edge and structure", Journal of Beijing University of Aeronautics and Astronautics *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807318A (en) * 2021-10-11 2021-12-17 南京信息工程大学 Action identification method based on double-current convolutional neural network and bidirectional GRU
CN113807318B (en) * 2021-10-11 2023-10-31 南京信息工程大学 Action recognition method based on double-flow convolutional neural network and bidirectional GRU
CN116935292A (en) * 2023-09-15 2023-10-24 山东建筑大学 Short video scene classification method and system based on self-attention model
CN116935292B (en) * 2023-09-15 2023-12-08 山东建筑大学 Short video scene classification method and system based on self-attention model

Also Published As

Publication number Publication date
CN112613486B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN111860691B (en) Stereo video visual comfort degree classification method based on attention and recurrent neural network
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
Das et al. Where to focus on for human action recognition?
CN112149459B (en) Video saliency object detection model and system based on cross attention mechanism
CN112446476A (en) Neural network model compression method, device, storage medium and chip
Saputra et al. Learning monocular visual odometry through geometry-aware curriculum learning
CN110532871A (en) The method and apparatus of image procossing
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN114529984B (en) Bone action recognition method based on learning PL-GCN and ECLSTM
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN112613486B (en) Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU
CN113469958A (en) Method, system, equipment and storage medium for predicting development potential of embryo
CN114360073B (en) Image recognition method and related device
CN110599443A (en) Visual saliency detection method using bidirectional long-term and short-term memory network
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN114842542A (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114677730A (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN116402811B (en) Fighting behavior identification method and electronic equipment
Zhong A convolutional neural network based online teaching method using edge-cloud computing platform
CN111611852A (en) Method, device and equipment for training expression recognition model
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
Saif et al. Aggressive action estimation: a comprehensive review on neural network based human segmentation and action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant