CN112613486A - Professional stereoscopic video comfort classification method based on multilayer attention and BiGRU - Google Patents

Professional stereoscopic video comfort classification method based on multilayer attention and BiGRU

Info

Publication number
CN112613486A
CN112613486A
Authority
CN
China
Prior art keywords
video
frame
layer
attention
stereoscopic video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110016985.7A
Other languages
Chinese (zh)
Other versions
CN112613486B (en
Inventor
牛玉贞
郑愈明
彭丹泓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202110016985.7A priority Critical patent/CN112613486B/en
Publication of CN112613486A publication Critical patent/CN112613486A/en
Application granted granted Critical
Publication of CN112613486B publication Critical patent/CN112613486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Testing, Inspecting, Measuring Of Stereoscopic Televisions And Televisions (AREA)

Abstract

The invention relates to a professional stereoscopic video comfort classification method based on multi-layer attention and BiGRU. The method comprises the following steps: 1. perform scene segmentation on the training video set and the video set to be predicted and obtain disparity maps through preprocessing; 2. perform frame-level processing to obtain preliminary frame-level features; 3. perform frame-level attention processing to obtain the final frame-level features; 4. perform shot-level processing to obtain preliminary shot-level features; 5. perform shot-level attention processing to obtain the final shot-level features; 6. two-stream fusion: fuse the outputs of the previous step using channel attention to obtain the final hidden state; 7. pass the final hidden state through a classification network to output classification probabilities and classify the professional stereoscopic video as suitable for children to watch or only suitable for adults to watch; 8. input the left views of the stereoscopic videos in the set to be tested and the corresponding disparity maps into the trained model for classification. The method can effectively distinguish whether a professional stereoscopic video is suitable for children to watch.

Description

Professional stereoscopic video comfort classification method based on multilayer attention and BiGRU
Technical Field
The invention relates to the field of image and video processing and computer vision, in particular to a professional stereoscopic video comfort classification method based on multilayer attention and BiGRU.
Background
Stereoscopic video is also called 3D video. Unlike 2D video, its most important feature is depth information, so the scenes presented in the video are no longer confined to the screen plane. The rapid development of stereoscopic technology gives people a better viewing experience but also brings problems: watching uncomfortable stereoscopic video for a long time can cause dizziness, dry eyes, nausea and similar symptoms, and these adverse reactions dampen viewers' enthusiasm for watching and can even affect their physiological health. Therefore, how to evaluate the visual comfort of stereoscopic content has become a concern. One of the main factors affecting the visual comfort of stereoscopic video is parallax, including excessive horizontal parallax, vertical parallax and rapidly changing parallax; the other main factor is the video content, including the salient objects in the video, the way the video is presented, and the motion of objects.
Although current comfort evaluation methods have achieved good results, this work generally does not consider children's interpupillary distance. Children's interpupillary distance is narrower than that of adults, their binocular fusion mechanism is less mature, and the magnitude of the parallax imaged on their retinas differs from that of adults, so children's stereoscopic perception differs from adults'. A stereoscopic video that is comfortable for adults may not be suitable for children to watch, and for children who have had eye disease, visually uncomfortable stereoscopic movies can cause headaches, eye strain and an inability to see the images clearly.
Disclosure of Invention
The invention aims to provide a professional stereoscopic video comfort classification method based on multi-layer attention and BiGRU, which solves the problem that current stereoscopic video comfort evaluation algorithms do not consider children as part of the audience and can effectively distinguish whether a professional stereoscopic video is suitable for children to watch.
In order to achieve the purpose, the technical scheme of the invention is as follows: a professional stereoscopic video comfort classification method based on multilayer attention and BiGRU comprises the following steps:
Step S1, perform scene segmentation on the training video set and the video set to be predicted, and obtain disparity maps through preprocessing;
Step S2, frame-level processing: the left views of the stereoscopic videos in the training video set and the corresponding disparity maps are taken as two-stream input for frame-level processing, and a temporal inference network is used to perceive the temporal relations between frames within each shot at multiple time scales;
Step S3, frame-level attention processing: the temporal relations between frames within each shot are weighted and summed to obtain the final frame-level features;
Step S4, shot-level processing: a bidirectional gated recurrent unit (a recurrent neural network) is used to perceive the frame-level features of several consecutive shots and output a set of hidden states;
Step S5, shot-level attention processing: the set of hidden states output in step S4 is weighted and summed to obtain the final shot-level features;
Step S6, two-stream fusion: the shot-level features output in step S5 are fused using a channel attention network to obtain the final hidden state;
Step S7, the final hidden state is passed through a classification network to output classification probabilities, and the professional stereoscopic video is classified as suitable for children to watch or only suitable for adults to watch; steps S2 to S7 constitute the constructed professional stereoscopic video visual comfort classification model; this model is trained, its optimal parameters are learned by minimizing a loss function during training, and the trained model is saved;
Step S8, the left views of the stereoscopic videos in the video set to be tested and the corresponding disparity maps are input into the trained model for classification prediction.
In an embodiment of the present invention, the step S1 specifically includes the following steps:
Step S11, a multimedia video processing tool is used to split the video into individual frame images;
Step S12, a shot segmentation algorithm is used to divide the stereoscopic video into non-overlapping video segments, each of which is called a shot;
Step S13, each frame is split into a left view and a right view, and the SIFT Flow algorithm is used to compute the horizontal displacement of corresponding pixels between the left and right views as the disparity map.
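The preprocessing of steps S11 to S13 can be sketched roughly as follows. This is only an illustrative outline: the detect_shots and sift_flow helpers are hypothetical placeholders for a shot-boundary detector and a SIFT Flow implementation, neither of which is specified beyond the description above.

```python
def preprocess_stereo_video(frames, detect_shots, sift_flow):
    """Split a side-by-side stereo video into shots and compute per-frame disparity maps.

    frames       : list of H x (2*W) x 3 arrays, left and right views side by side
    detect_shots : hypothetical shot-boundary detector, returns [(start, end), ...]
    sift_flow    : hypothetical SIFT Flow matcher, returns a per-pixel (dx, dy) flow field
    """
    shots = detect_shots(frames)                       # non-overlapping shot segments
    processed = []
    for start, end in shots:
        left_views, disparity_maps = [], []
        for frame in frames[start:end]:
            w = frame.shape[1] // 2
            left, right = frame[:, :w], frame[:, w:]   # split into left/right views
            flow = sift_flow(left, right)              # dense left-to-right correspondence
            disparity_maps.append(flow[..., 0])        # keep only the horizontal displacement
            left_views.append(left)
        processed.append((left_views, disparity_maps))
    return processed
```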
In an embodiment of the present invention, the step S2 specifically includes the following steps:
s21, sparsely sampling frames in a lens, and randomly selecting 8 frames in sequence;
Step S22, a frames are randomly extracted in temporal order from the 8 sampled frames, with a ranging from 2 to 8, and a pre-trained temporal inference network is used to perceive the temporal relation among the a frames; given a video V, the temporal relation between two frames, T2(V), is expressed by the following formula:

T2(V) = Σ_{i<j} gθ(fi, fj)

where fi and fj denote the features of the i-th and j-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ is a two-layer multilayer perceptron with 256 units per layer, and θ are the parameters of gθ; similarly, the temporal relations among 3 to 8 frames, T3(V), T4(V), T5(V), T6(V), T7(V) and T8(V), are expressed by the following formulas:

T3(V) = Σ_{i<j<k} gθ(fi, fj, fk)

T4(V) = Σ_{i<j<k<l} gθ(fi, fj, fk, fl)

T5(V) = Σ_{i<j<k<l<m} gθ(fi, fj, fk, fl, fm)

T6(V) = Σ_{i<j<k<l<m<n} gθ(fi, fj, fk, fl, fm, fn)

T7(V) = Σ_{i<j<k<l<m<n<o} gθ(fi, fj, fk, fl, fm, fn, fo)

T8(V) = Σ_{i<j<k<l<m<n<o<p} gθ(fi, fj, fk, fl, fm, fn, fo, fp)

where fi, fj, fk, fl, fm, fn, fo and fp denote the features of the i-th, j-th, k-th, l-th, m-th, n-th, o-th and p-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ denotes the two-layer multilayer perceptron, with 256 units per layer, that models the temporal relation among the corresponding a frames, and θ are its parameters;
Step S23, the inter-frame temporal relations at the various time scales within the shot are concatenated to obtain the frame-level feature Tall(V), computed as:

Tall(V) = [T2(V), T3(V), T4(V), T5(V), T6(V), T7(V), T8(V)].
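A minimal PyTorch-style sketch of the frame-level processing of steps S21 to S23 is given below, assuming per-frame features have already been extracted by a base network such as ResNet. The two-layer perceptrons with 256 units per layer follow the description above; the number of frame tuples sampled per time scale and the decision to return the per-scale relations stacked (so the frame-level attention of step S3 can weight them) are illustrative assumptions.

```python
import random
from itertools import combinations

import torch
import torch.nn as nn

class MultiScaleTemporalRelation(nn.Module):
    """Temporal relations T_2(V)..T_8(V) over 8 sparsely sampled frames of one shot."""

    def __init__(self, feat_dim, hidden=256, scales=range(2, 9), tuples_per_scale=3):
        super().__init__()
        self.scales = list(scales)
        self.tuples_per_scale = tuples_per_scale
        self.mlps = nn.ModuleDict({
            str(a): nn.Sequential(            # two-layer MLP g_theta for scale a, 256 units per layer
                nn.Linear(a * feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
            for a in self.scales})

    def forward(self, frame_feats):
        # frame_feats: (8, feat_dim) features of the 8 sampled frames of one shot
        relations = []
        for a in self.scales:
            idx_tuples = list(combinations(range(frame_feats.size(0)), a))
            idx_tuples = random.sample(idx_tuples, min(self.tuples_per_scale, len(idx_tuples)))
            t_a = sum(self.mlps[str(a)](frame_feats[list(idx)].flatten())
                      for idx in idx_tuples)  # sum of relation scores over sampled a-frame tuples
            relations.append(t_a)
        # stack T_2..T_8 so the frame-level attention can weight each time scale
        return torch.stack(relations)         # (7, hidden)
```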
in an embodiment of the present invention, the step S3 specifically includes the following steps:
Step S31, first, for the temporal relation feature Ta(V) output by the temporal inference network at each time scale, the hidden-layer vector ua is computed:

ua = tanh(WfTa(V) + bf)

where Wf and bf are parameters of a single-layer perceptron;
Step S32, to measure the importance of the temporal relation at each time scale, ua is normalized:

αa = exp(ua · uf) / Σa exp(ua · uf)
where uf is a context vector representing the importance of the temporal relation at the corresponding time scale; it is randomly initialized at the start of training and learned;
Step S33, the final temporal feature x, i.e. the frame-level feature, is computed as:

x = Σa αa Ta(V)
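The frame-level attention of steps S31 to S33 can be sketched as below; proj and context correspond to the single-layer perceptron (Wf, bf) and the learned context vector uf described above, and the tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class FrameLevelAttention(nn.Module):
    """Weights the per-scale temporal relations T_a(V) and sums them into x."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                  # W_f, b_f (single-layer perceptron)
        self.context = nn.Parameter(torch.randn(dim))    # u_f, randomly initialised and learned

    def forward(self, relations):
        # relations: (num_scales, dim), the T_2(V)..T_8(V) of one shot
        u = torch.tanh(self.proj(relations))             # u_a = tanh(W_f T_a(V) + b_f)
        alpha = torch.softmax(u @ self.context, dim=0)   # importance of each time scale
        return (alpha.unsqueeze(-1) * relations).sum(0)  # x = sum_a alpha_a * T_a(V)
```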
in an embodiment of the present invention, the step S4 specifically includes the following steps:
Step S41, the frame-level features of each of the s consecutive shots obtained in step S33 are concatenated; each shot has one frame-level feature x, and the frame-level feature of the t-th shot, t = 1, 2, ..., s, is denoted xt; these frame-level features are used as the input of the bidirectional gated recurrent unit; at moment t, t = 1, 2, ..., s, the input of a gated recurrent unit is the hidden state ht-1 of the previous moment together with the frame-level feature xt of the t-th shot, and it outputs the hidden state ht of the current moment; the gated recurrent unit contains 2 gates, a reset gate rt and an update gate zt; the former is used when computing the candidate hidden state h̃t and controls how much information of the previous hidden state ht-1 is retained, while the latter controls how much of the candidate hidden state h̃t is added to obtain the output hidden state ht; rt, zt, h̃t and ht are computed as follows:

zt = σ(Wzxt + Uzht-1)

rt = σ(Wrxt + Urht-1)

h̃t = tanh(Wxt + U(rt ⊙ ht-1))

ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t
where σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, tanh is the activation function, and Wz, Uz, Wr, Ur, W and U are weight matrices learned during training;
Step S42, since the bidirectional gated recurrent unit consists of 2 unidirectional gated recurrent units with opposite directions, the final output ht is jointly determined by the hidden states of the two units; at each moment the input is fed simultaneously to the 2 oppositely directed gated recurrent units, the output is jointly determined by them, and the outputs of the 2 unidirectional units are concatenated as the output of the bidirectional gated recurrent unit, giving the set of hidden states it outputs; when the input is the video frame sequence, the output of the bidirectional gated recurrent unit is the hidden state set hf; when the input is the disparity map sequence, the output is the hidden state set hd; hf and hd are given by:

hf = {h1^f, h2^f, ..., hs^f}

hd = {h1^d, h2^d, ..., hs^d}

where ht^f denotes the hidden state output at moment t, t = 1, 2, ..., s, of the video frame sequence, and ht^d denotes the hidden state output at moment t, t = 1, 2, ..., s, of the disparity map sequence.
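Steps S41 and S42 correspond to a standard bidirectional GRU, as in the following sketch; the hidden size is an illustrative choice, and the same module is applied once to the video-frame stream and once to the disparity-map stream.

```python
import torch
import torch.nn as nn

class ShotLevelBiGRU(nn.Module):
    """Runs a bidirectional GRU over the frame-level features of s consecutive shots."""

    def __init__(self, in_dim=256, hidden=128):
        super().__init__()
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, shot_feats):
        # shot_feats: (batch, s, in_dim), one frame-level feature x_t per shot
        hidden_states, _ = self.bigru(shot_feats)
        # hidden_states: (batch, s, 2*hidden), forward and backward GRU outputs
        # concatenated at every step -- the hidden state set h_f (or h_d)
        return hidden_states
```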
In an embodiment of the present invention, the step S5 specifically includes the following steps:
Step S51, for the video frame sequence, the model first computes, from the hidden state ht^f output by the gated recurrent unit at each moment, the hidden-layer vector ut:

ut = tanh(Ws ht^f + bs)

where Ws and bs are parameters of a single-layer perceptron;
Step S52, to measure the importance of each shot, ut is normalized:

αt = exp(ut · us) / Σt exp(ut · us)
where us is a context vector representing the importance of the corresponding shot; it is randomly initialized at the start of training and learned;
Step S53, the hidden state hf of the video frame sequence is then computed as:

hf = Σt αt ht^f
Step S54, similarly, the hidden state hd of the disparity map sequence is obtained through the same process; hf and hd are concatenated to obtain ha, computed as:

ha = [hf, hd]
At this point the final shot-level features have been obtained.
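A sketch of the shot-level attention of steps S51 to S54, applied separately to the video-frame stream and the disparity-map stream before the two results are concatenated into ha; the parameter names mirror Ws, bs and us above, and the shapes are illustrative.

```python
import torch
import torch.nn as nn

class ShotLevelAttention(nn.Module):
    """Weights the BiGRU hidden states of the s shots and sums them."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)                   # W_s, b_s
        self.context = nn.Parameter(torch.randn(dim))     # u_s, learned context vector

    def forward(self, hidden_states):
        # hidden_states: (batch, s, dim), output of the bidirectional GRU
        u = torch.tanh(self.proj(hidden_states))                  # u_t
        alpha = torch.softmax(u @ self.context, dim=1)            # per-shot importance
        return (alpha.unsqueeze(-1) * hidden_states).sum(dim=1)   # h_f or h_d

# the two streams are then concatenated: h_a = torch.cat([h_f, h_d], dim=-1)
```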
In an embodiment of the present invention, the step S6 specifically includes the following steps:
Step S61, channel attention is used to compute the weight of each hidden state in ha, denoted Fscale(·,·), as follows:
Fscale(·,·)=σ(W2δ(W1ha))
where δ is the ReLU function, σ is the sigmoid function, and W1 and W2 are the parameter matrices of two single-layer perceptrons, obtained through training;
Step S62, h̃a denotes the finally obtained importance of each channel; it is the product of Fscale(·,·) and ha, computed as:

h̃a = Fscale(·,·) ⊙ ha

The weighted final hidden state h̃a then passes through the classification network to obtain the final classification probability.
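The two-stream fusion of steps S61 and S62 can be sketched as a squeeze-and-excitation style gating of ha, with fc1 and fc2 playing the roles of the two single-layer perceptrons W1 and W2; the reduction ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Re-weights each channel of the concatenated shot-level feature h_a."""

    def __init__(self, dim, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)   # W_1
        self.fc2 = nn.Linear(dim // reduction, dim)   # W_2

    def forward(self, h_a):
        # h_a: (batch, dim), concatenation of h_f and h_d
        scale = torch.sigmoid(self.fc2(torch.relu(self.fc1(h_a))))  # F_scale = sigma(W2 delta(W1 h_a))
        return scale * h_a                                          # weighted final hidden state
```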
In an embodiment of the present invention, the step S7 specifically includes the following steps:
Step S71, to prevent the network from overfitting, h̃a is fed into the first layer of the classification network, a dropout layer;
Step S72, the output after dropout is fed into the second layer of the classification network, a fully connected layer; the output of the fully connected layer is converted by a softmax function into classification probabilities in the range (0, 1), and the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch;
step S73, calculating the parameter gradient of the professional stereoscopic video visual comfort classification model by using a back propagation method according to the cross entropy loss function, and updating the parameter by using a self-adaptive gradient descent method;
wherein the cross entropy loss function L is defined as follows:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

where N denotes the number of samples in each batch, yi is the label of sample i (yi = 1 for a positive sample, meaning suitable for children to watch; yi = 0 for a negative sample, meaning suitable for adults to watch only), and pi denotes the probability that the model predicts sample i as a positive sample;
Step S74, training proceeds in batches until the value of L computed in step S73 converges to a threshold or the number of iterations reaches the threshold; network training is then complete, the optimal parameters of the professional stereoscopic video visual comfort classification model have been learned, and the model parameters are saved.
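Steps S71 to S74 amount to a dropout layer, a fully connected classifier and binary cross-entropy training; a minimal sketch follows. The sigmoid output used here is the binary equivalent of a two-way softmax, and the choice of Adam as the adaptive gradient-descent method is an assumption, since the description only requires such a method in general.

```python
import torch
import torch.nn as nn

class ComfortClassifier(nn.Module):
    """Dropout layer followed by a fully connected layer with a sigmoid output."""

    def __init__(self, dim, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(nn.Dropout(p_drop), nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, h):
        return self.net(h).squeeze(-1)   # probability of "suitable for children"

def train_step(model, classifier, batch, labels, optimizer, loss_fn=nn.BCELoss()):
    # model maps a batch of samples (s consecutive shots each) to fused hidden states
    optimizer.zero_grad()
    probs = classifier(model(batch))
    loss = loss_fn(probs, labels.float())   # cross-entropy loss L
    loss.backward()                         # back-propagate the parameter gradients
    optimizer.step()                        # adaptive gradient-descent update
    return loss.item()
```

A typical setup would be `optimizer = torch.optim.Adam(list(model.parameters()) + list(classifier.parameters()))`, followed by repeated `train_step` calls over the training batches until the loss converges.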
In an embodiment of the present invention, the step S8 specifically includes the following steps:
step S81, preprocessing a video set to be tested by using the step S1 to obtain a disparity map;
step S82, performing frame level processing on the left view of the stereoscopic video and the corresponding disparity map in the video set to be tested by using the step S2;
Step S83, the video set to be tested is processed and predicted through steps S3 to S7 using the trained model parameters saved in step S7; every s consecutive shots are taken as one sample, and when the probability that the model predicts the sample as positive is greater than 0.5, the sample is classified as a positive sample, otherwise as a negative sample.
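Prediction on the test set (steps S81 to S83) then reduces to thresholding the predicted probability at 0.5 for each group of s consecutive shots, as in this sketch; full_model is a placeholder for the trained network assembled from the components above.

```python
import torch

@torch.no_grad()
def predict_samples(full_model, samples, threshold=0.5):
    """samples: iterable of preprocessed groups of s consecutive shots (two streams each)."""
    labels = []
    for sample in samples:
        prob = full_model(sample)            # probability of being a positive sample
        labels.append("suitable for children" if prob.item() > threshold
                      else "suitable for adults only")
    return labels
```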
Compared with the prior art, the invention has the following beneficial effects. First, addressing the problem that current stereoscopic video comfort evaluation algorithms do not consider children as part of the audience, the invention provides a professional stereoscopic video visual comfort classification method based on multi-layer attention and a recurrent neural network that can distinguish whether a professional stereoscopic video is suitable for children to watch. Second, considering that the main factors causing visual discomfort include video content and parallax, the method adopts a two-stream structure to learn the features and temporal relations of the stereoscopic video frames and of the disparity map sequence separately, evaluating the stereoscopic visual comfort of the video more comprehensively. Finally, because visual discomfort usually occurs only in particular video segments, which increases the difficulty of classification, the method applies frame-level attention, shot-level attention and channel attention so that the model pays more attention to the segments and feature branches that cause visual discomfort, thereby improving classification accuracy.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of an overall structure of a professional stereoscopic video visual comfort classification model according to an embodiment of the present invention;
fig. 3 is a diagram of a frame-level processing temporal inference network model architecture in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
As shown in fig. 1 and fig. 2, the present embodiment provides a professional stereoscopic video comfort classification method based on multi-layer attention and BiGRU, including the following steps:
step S1, carrying out scene segmentation on the training video set and the video set to be predicted and obtaining a disparity map through preprocessing; the method specifically comprises the following steps:
Step S11: a multimedia video processing tool is used to split the video into individual frame images;
Step S12: a shot segmentation algorithm is used to divide the stereoscopic video into non-overlapping video segments, each of which is called a shot;
Step S13: each frame is split into a left view and a right view, and the SIFT Flow algorithm is used to compute the horizontal displacement of corresponding pixels between the left and right views as the disparity map.
Step S2, frame-level processing: the left views of the stereoscopic videos in the training video set and the corresponding disparity maps are taken as two-stream input for frame-level processing, and a temporal inference network is used to perceive the temporal relations between frames within each shot at multiple time scales; this specifically includes the following steps:
s21, sparsely sampling frames in a lens, and randomly selecting 8 frames in sequence;
Step S22, a frames are randomly extracted in temporal order from the 8 sampled frames, with a ranging from 2 to 8, and a pre-trained temporal inference network is used to perceive the temporal relation among the a frames; given a video V, the temporal relation between two frames, T2(V), is expressed by the following formula:

T2(V) = Σ_{i<j} gθ(fi, fj)

where fi and fj denote the features of the i-th and j-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ is a two-layer multilayer perceptron with 256 units per layer, and θ are the parameters of gθ; similarly, the temporal relations among 3 to 8 frames, T3(V), T4(V), T5(V), T6(V), T7(V) and T8(V), are expressed by the following formulas:

T3(V) = Σ_{i<j<k} gθ(fi, fj, fk)

T4(V) = Σ_{i<j<k<l} gθ(fi, fj, fk, fl)

T5(V) = Σ_{i<j<k<l<m} gθ(fi, fj, fk, fl, fm)

T6(V) = Σ_{i<j<k<l<m<n} gθ(fi, fj, fk, fl, fm, fn)

T7(V) = Σ_{i<j<k<l<m<n<o} gθ(fi, fj, fk, fl, fm, fn, fo)

T8(V) = Σ_{i<j<k<l<m<n<o<p} gθ(fi, fj, fk, fl, fm, fn, fo, fp)

where fi, fj, fk, fl, fm, fn, fo and fp denote the features of the i-th, j-th, k-th, l-th, m-th, n-th, o-th and p-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ denotes the two-layer multilayer perceptron, with 256 units per layer, that models the temporal relation among the corresponding a frames, and θ are its parameters;
Step S23, the inter-frame temporal relations at the various time scales within the shot are concatenated to obtain the frame-level feature Tall(V), computed as:

Tall(V) = [T2(V), T3(V), T4(V), T5(V), T6(V), T7(V), T8(V)].
Step S3, frame-level attention processing: the temporal relations between frames within each shot are weighted and summed to obtain the final frame-level features; this specifically includes the following steps:
Step S31, first, for the temporal relation feature Ta(V) output by the temporal inference network at each time scale, the hidden-layer vector ua is computed:

ua = tanh(WfTa(V) + bf)

where Wf and bf are parameters of a single-layer perceptron;
Step S32, to measure the importance of the temporal relation at each time scale, ua is normalized:

αa = exp(ua · uf) / Σa exp(ua · uf)
where uf is a context vector representing the importance of the temporal relation at the corresponding time scale; it is randomly initialized at the start of training and learned;
Step S33, the final temporal feature x, i.e. the frame-level feature, is computed as:

x = Σa αa Ta(V)
Step S4, shot-level processing: a bidirectional gated recurrent unit (a recurrent neural network) is used to perceive the frame-level features of several consecutive shots and output a set of hidden states; this specifically includes the following steps:
Step S41, the frame-level features of each of the s consecutive shots obtained in step S33 are concatenated; each shot has one frame-level feature x, and the frame-level feature of the t-th shot, t = 1, 2, ..., s, is denoted xt; these frame-level features are used as the input of the bidirectional gated recurrent unit; at moment t, t = 1, 2, ..., s, the input of a gated recurrent unit is the hidden state ht-1 of the previous moment together with the frame-level feature xt of the t-th shot, and it outputs the hidden state ht of the current moment; the gated recurrent unit contains 2 gates, a reset gate rt and an update gate zt; the former is used when computing the candidate hidden state h̃t and controls how much information of the previous hidden state ht-1 is retained, while the latter controls how much of the candidate hidden state h̃t is added to obtain the output hidden state ht; rt, zt, h̃t and ht are computed as follows:

zt = σ(Wzxt + Uzht-1)

rt = σ(Wrxt + Urht-1)

h̃t = tanh(Wxt + U(rt ⊙ ht-1))

ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t
where σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, tanh is the activation function, and Wz, Uz, Wr, Ur, W and U are weight matrices learned during training;
Step S42, since the bidirectional gated recurrent unit consists of 2 unidirectional gated recurrent units with opposite directions, the final output ht is jointly determined by the hidden states of the two units; at each moment the input is fed simultaneously to the 2 oppositely directed gated recurrent units, the output is jointly determined by them, and the outputs of the 2 unidirectional units are concatenated as the output of the bidirectional gated recurrent unit, giving the set of hidden states it outputs; when the input is the video frame sequence, the output of the bidirectional gated recurrent unit is the hidden state set hf; when the input is the disparity map sequence, the output is the hidden state set hd; hf and hd are given by:

hf = {h1^f, h2^f, ..., hs^f}

hd = {h1^d, h2^d, ..., hs^d}

where ht^f denotes the hidden state output at moment t, t = 1, 2, ..., s, of the video frame sequence, and ht^d denotes the hidden state output at moment t, t = 1, 2, ..., s, of the disparity map sequence.
Step S5, shot-level attention processing: the set of hidden states output in step S4 is weighted and summed to obtain the final shot-level features; this specifically includes the following steps:
Step S51, for the video frame sequence, the model first computes, from the hidden state ht^f output by the gated recurrent unit at each moment, the hidden-layer vector ut:

ut = tanh(Ws ht^f + bs)

where Ws and bs are parameters of a single-layer perceptron;
Step S52, to measure the importance of each shot, ut is normalized:

αt = exp(ut · us) / Σt exp(ut · us)
where us is a context vector representing the importance of the corresponding shot; it is randomly initialized at the start of training and learned;
Step S53, the hidden state hf of the video frame sequence is then computed as:

hf = Σt αt ht^f
Step S54, similarly, the hidden state hd of the disparity map sequence is obtained through the same process; hf and hd are concatenated to obtain ha, computed as:

ha = [hf, hd]
At this point the final shot-level features have been obtained.
Step S6, two-stream fusion: the shot-level features output in step S5 are fused using a channel attention network to obtain the final hidden state; this specifically includes the following steps:
Step S61, channel attention is used to compute the weight of each hidden state in ha, denoted Fscale(·,·), as follows:
Fscale(·,·)=σ(W2δ(W1ha))
where δ is the ReLU function, σ is the sigmoid function, and W1 and W2 are the parameter matrices of two single-layer perceptrons, obtained through training;
Step S62, h̃a denotes the finally obtained importance of each channel; it is the product of Fscale(·,·) and ha, computed as:

h̃a = Fscale(·,·) ⊙ ha

The weighted final hidden state h̃a then passes through the classification network to obtain the final classification probability.
Step S7, the final hidden state is passed through a classification network to output classification probabilities, and the professional stereoscopic video is classified as suitable for children to watch or only suitable for adults to watch; steps S2 to S7 constitute the constructed professional stereoscopic video visual comfort classification model; this model is trained, its optimal parameters are learned by minimizing a loss function during training, and the trained model is saved; this specifically includes the following steps:
Step S71, to prevent the network from overfitting, h̃a is fed into the first layer of the classification network, a dropout layer;
Step S72, the output after dropout is fed into the second layer of the classification network, a fully connected layer; the output of the fully connected layer is converted by a softmax function into classification probabilities in the range (0, 1), and the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch;
step S73, calculating the parameter gradient of the professional stereoscopic video visual comfort classification model by using a back propagation method according to the cross entropy loss function, and updating the parameter by using a self-adaptive gradient descent method;
wherein the cross entropy loss function L is defined as follows:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

where N denotes the number of samples in each batch, yi is the label of sample i (yi = 1 for a positive sample, meaning suitable for children to watch; yi = 0 for a negative sample, meaning suitable for adults to watch only), and pi denotes the probability that the model predicts sample i as a positive sample;
Step S74, training proceeds in batches until the value of L computed in step S73 converges to a threshold or the number of iterations reaches the threshold; network training is then complete, the optimal parameters of the professional stereoscopic video visual comfort classification model have been learned, and the model parameters are saved.
Step S8, the left views of the stereoscopic videos in the video set to be tested and the corresponding disparity maps are input into the trained model for classification prediction; this specifically includes the following steps:
step S81, preprocessing a video set to be tested by using the step S1 to obtain a disparity map;
step S82, performing frame level processing on the left view of the stereoscopic video and the corresponding disparity map in the video set to be tested by using the step S2;
Step S83, the video set to be tested is processed and predicted through steps S3 to S7 using the trained model parameters saved in step S7; every s consecutive shots are taken as one sample, and when the probability that the model predicts the sample as positive is greater than 0.5, the sample is classified as a positive sample, otherwise as a negative sample.
Preferably, in the present embodiment, the professional stereoscopic video visual comfort classification model is composed of the networks constructed in steps S2 to S7.
Preferably, in this embodiment, the video frames and disparity maps of several consecutive shots of a professional stereoscopic video are used as input; a temporal inference network and a bidirectional gated recurrent unit are used to perceive and evaluate the short- and long-range temporal relations of the video at the frame level and the shot level respectively; multi-layer attention is used to integrate information from the video segments and feature branches that cause visual discomfort; and finally the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch.
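As a rough illustration of how the components of this embodiment fit together, the following sketch assembles the two streams from the modules sketched earlier in this description (MultiScaleTemporalRelation, FrameLevelAttention, ShotLevelBiGRU, ShotLevelAttention, ChannelAttentionFusion, ComfortClassifier); it assumes those classes are in scope, and the feature dimensions are illustrative assumptions rather than values prescribed by the invention.

```python
import torch
import torch.nn as nn

class StereoComfortNet(nn.Module):
    """Two-stream (left-view + disparity) comfort classification model."""

    def __init__(self, feat_dim=2048, rel_dim=256, gru_hidden=128):
        super().__init__()
        make_stream = lambda: nn.ModuleDict({
            "relation":  MultiScaleTemporalRelation(feat_dim, rel_dim),
            "frame_att": FrameLevelAttention(rel_dim),
            "bigru":     ShotLevelBiGRU(rel_dim, gru_hidden),
            "shot_att":  ShotLevelAttention(2 * gru_hidden)})
        self.frame_stream, self.disp_stream = make_stream(), make_stream()
        self.fusion = ChannelAttentionFusion(4 * gru_hidden)
        self.classifier = ComfortClassifier(4 * gru_hidden)

    def run_stream(self, stream, shots):
        # shots: (batch, s, 8, feat_dim) pre-extracted features of 8 frames per shot
        b, s = shots.shape[:2]
        x = torch.stack([stream["frame_att"](stream["relation"](shots[i, j]))
                         for i in range(b) for j in range(s)]).view(b, s, -1)
        return stream["shot_att"](stream["bigru"](x))      # (batch, 2*gru_hidden)

    def forward(self, frame_shots, disp_shots):
        h_f = self.run_stream(self.frame_stream, frame_shots)   # left-view stream
        h_d = self.run_stream(self.disp_stream, disp_shots)     # disparity-map stream
        return self.classifier(self.fusion(torch.cat([h_f, h_d], dim=-1)))
```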
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.

Claims (9)

1. A professional stereoscopic video comfort classification method based on multilayer attention and BiGRU is characterized by comprising the following steps:
Step S1, perform scene segmentation on the training video set and the video set to be predicted, and obtain disparity maps through preprocessing;
Step S2, frame-level processing: the left views of the stereoscopic videos in the training video set and the corresponding disparity maps are taken as two-stream input for frame-level processing, and a temporal inference network is used to perceive the temporal relations between frames within each shot at multiple time scales;
Step S3, frame-level attention processing: the temporal relations between frames within each shot are weighted and summed to obtain the final frame-level features;
Step S4, shot-level processing: a bidirectional gated recurrent unit (a recurrent neural network) is used to perceive the frame-level features of several consecutive shots and output a set of hidden states;
Step S5, shot-level attention processing: the set of hidden states output in step S4 is weighted and summed to obtain the final shot-level features;
Step S6, two-stream fusion: the shot-level features output in step S5 are fused using a channel attention network to obtain the final hidden state;
Step S7, the final hidden state is passed through a classification network to output classification probabilities, and the professional stereoscopic video is classified as suitable for children to watch or only suitable for adults to watch; steps S2 to S7 constitute the constructed professional stereoscopic video visual comfort classification model; this model is trained, its optimal parameters are learned by minimizing a loss function during training, and the trained model is saved;
Step S8, the left views of the stereoscopic videos in the video set to be tested and the corresponding disparity maps are input into the trained model for classification prediction.
2. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU as claimed in claim 1, wherein the step S1 specifically comprises the following steps:
Step S11, a multimedia video processing tool is used to split the video into individual frame images;
Step S12, a shot segmentation algorithm is used to divide the stereoscopic video into non-overlapping video segments, each of which is called a shot;
Step S13, each frame is split into a left view and a right view, and the SIFT Flow algorithm is used to compute the horizontal displacement of corresponding pixels between the left and right views as the disparity map.
3. The method for classifying the comfort of the professional stereoscopic video based on multi-layer attention and BiGRU according to claim 2, wherein the step S2 specifically comprises the following steps:
s21, sparsely sampling frames in a lens, and randomly selecting 8 frames in sequence;
Step S22, a frames are randomly extracted in temporal order from the 8 sampled frames, with a ranging from 2 to 8, and a pre-trained temporal inference network is used to perceive the temporal relation among the a frames; given a video V, the temporal relation between two frames, T2(V), is expressed by the following formula:

T2(V) = Σ_{i<j} gθ(fi, fj)

where fi and fj denote the features of the i-th and j-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ is a two-layer multilayer perceptron with 256 units per layer, and θ are the parameters of gθ; similarly, the temporal relations among 3 to 8 frames, T3(V), T4(V), T5(V), T6(V), T7(V) and T8(V), are expressed by the following formulas:

T3(V) = Σ_{i<j<k} gθ(fi, fj, fk)

T4(V) = Σ_{i<j<k<l} gθ(fi, fj, fk, fl)

T5(V) = Σ_{i<j<k<l<m} gθ(fi, fj, fk, fl, fm)

T6(V) = Σ_{i<j<k<l<m<n} gθ(fi, fj, fk, fl, fm, fn)

T7(V) = Σ_{i<j<k<l<m<n<o} gθ(fi, fj, fk, fl, fm, fn, fo)

T8(V) = Σ_{i<j<k<l<m<n<o<p} gθ(fi, fj, fk, fl, fm, fn, fo, fp)

where fi, fj, fk, fl, fm, fn, fo and fp denote the features of the i-th, j-th, k-th, l-th, m-th, n-th, o-th and p-th frames of the video, extracted with a base feature extraction network such as AlexNet, VGG, GoogLeNet, ResNet or BN-Inception, gθ denotes the two-layer multilayer perceptron, with 256 units per layer, that models the temporal relation among the corresponding a frames, and θ are its parameters;
Step S23, the inter-frame temporal relations at the various time scales within the shot are concatenated to obtain the frame-level feature Tall(V), computed as:

Tall(V) = [T2(V), T3(V), T4(V), T5(V), T6(V), T7(V), T8(V)].
4. the method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 3, wherein the step S3 specifically comprises the following steps:
Step S31, first, for the temporal relation feature Ta(V) output by the temporal inference network at each time scale, the hidden-layer vector ua is computed:

ua = tanh(WfTa(V) + bf)

where Wf and bf are parameters of a single-layer perceptron;
Step S32, to measure the importance of the temporal relation at each time scale, ua is normalized:

αa = exp(ua · uf) / Σa exp(ua · uf)
where uf is a context vector representing the importance of the temporal relation at the corresponding time scale; it is randomly initialized at the start of training and learned;
Step S33, the final temporal feature x, i.e. the frame-level feature, is computed as:

x = Σa αa Ta(V)
5. the method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 4, wherein the step S4 specifically comprises the following steps:
Step S41, the frame-level features of each of the s consecutive shots obtained in step S33 are concatenated; each shot has one frame-level feature x, and the frame-level feature of the t-th shot, t = 1, 2, ..., s, is denoted xt; these frame-level features are used as the input of the bidirectional gated recurrent unit; at moment t, t = 1, 2, ..., s, the input of a gated recurrent unit is the hidden state ht-1 of the previous moment together with the frame-level feature xt of the t-th shot, and it outputs the hidden state ht of the current moment; the gated recurrent unit contains 2 gates, a reset gate rt and an update gate zt; the former is used when computing the candidate hidden state h̃t and controls how much information of the previous hidden state ht-1 is retained, while the latter controls how much of the candidate hidden state h̃t is added to obtain the output hidden state ht; rt, zt, h̃t and ht are computed as follows:

zt = σ(Wzxt + Uzht-1)

rt = σ(Wrxt + Urht-1)

h̃t = tanh(Wxt + U(rt ⊙ ht-1))

ht = (1 - zt) ⊙ ht-1 + zt ⊙ h̃t
where σ is the logistic sigmoid function, ⊙ denotes element-wise multiplication, tanh is the activation function, and Wz, Uz, Wr, Ur, W and U are weight matrices learned during training;
Step S42, since the bidirectional gated recurrent unit consists of 2 unidirectional gated recurrent units with opposite directions, the final output ht is jointly determined by the hidden states of the two units; at each moment the input is fed simultaneously to the 2 oppositely directed gated recurrent units, the output is jointly determined by them, and the outputs of the 2 unidirectional units are concatenated as the output of the bidirectional gated recurrent unit, giving the set of hidden states it outputs; when the input is the video frame sequence, the output of the bidirectional gated recurrent unit is the hidden state set hf; when the input is the disparity map sequence, the output is the hidden state set hd; hf and hd are given by:

hf = {h1^f, h2^f, ..., hs^f}

hd = {h1^d, h2^d, ..., hs^d}

where ht^f denotes the hidden state output at moment t, t = 1, 2, ..., s, of the video frame sequence, and ht^d denotes the hidden state output at moment t, t = 1, 2, ..., s, of the disparity map sequence.
6. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 5, wherein the step S5 specifically comprises the following steps:
Step S51, for the video frame sequence, the model first computes, from the hidden state ht^f output by the gated recurrent unit at each moment, the hidden-layer vector ut:

ut = tanh(Ws ht^f + bs)

where Ws and bs are parameters of a single-layer perceptron;
Step S52, to measure the importance of each shot, ut is normalized:

αt = exp(ut · us) / Σt exp(ut · us)
where us is a context vector representing the importance of the corresponding shot; it is randomly initialized at the start of training and learned;
Step S53, the hidden state hf of the video frame sequence is then computed as:

hf = Σt αt ht^f
Step S54, similarly, the hidden state hd of the disparity map sequence is obtained through the same process; hf and hd are concatenated to obtain ha, computed as:

ha = [hf, hd]
At this point the final shot-level features have been obtained.
7. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU according to claim 6, wherein the step S6 specifically comprises the following steps:
Step S61, channel attention is used to compute the weight of each hidden state in ha, denoted Fscale(·,·), as follows:
Fscale(·,·)=σ(W2δ(W1ha))
where δ is the ReLU function, σ is the sigmoid function, and W1 and W2 are the parameter matrices of two single-layer perceptrons, obtained through training;
Step S62, h̃a denotes the finally obtained importance of each channel; it is the product of Fscale(·,·) and ha, computed as:

h̃a = Fscale(·,·) ⊙ ha

The weighted final hidden state h̃a then passes through the classification network to obtain the final classification probability.
8. The method for classifying the comfort of professional stereoscopic video based on multi-layer attention and BiGRU as claimed in claim 7, wherein the step S7 specifically comprises the following steps:
Step S71, to prevent the network from overfitting, h̃a is fed into the first layer of the classification network, a dropout layer;
Step S72, the output after dropout is fed into the second layer of the classification network, a fully connected layer; the output of the fully connected layer is converted by a softmax function into classification probabilities in the range (0, 1), and the professional stereoscopic video is judged as suitable for children to watch or only suitable for adults to watch;
step S73, calculating the parameter gradient of the professional stereoscopic video visual comfort classification model by using a back propagation method according to the cross entropy loss function, and updating the parameter by using a self-adaptive gradient descent method;
wherein the cross entropy loss function L is defined as follows:
L = -(1/N) Σi [ yi log(pi) + (1 - yi) log(1 - pi) ]

where N denotes the number of samples in each batch, yi is the label of sample i (yi = 1 for a positive sample, meaning suitable for children to watch; yi = 0 for a negative sample, meaning suitable for adults to watch only), and pi denotes the probability that the model predicts sample i as a positive sample;
Step S74, training proceeds in batches until the value of L computed in step S73 converges to a threshold or the number of iterations reaches the threshold; network training is then complete, the optimal parameters of the professional stereoscopic video visual comfort classification model have been learned, and the model parameters are saved.
9. The method for classifying the comfort of the professional stereoscopic video based on multi-layer attention and BiGRU according to claim 8, wherein the step S8 specifically comprises the following steps:
step S81, preprocessing a video set to be tested by using the step S1 to obtain a disparity map;
step S82, performing frame level processing on the left view of the stereoscopic video and the corresponding disparity map in the video set to be tested by using the step S2;
Step S83, the video set to be tested is processed and predicted through steps S3 to S7 using the trained model parameters saved in step S7; every s consecutive shots are taken as one sample, and when the probability that the model predicts the sample as positive is greater than 0.5, the sample is classified as a positive sample, otherwise as a negative sample.
CN202110016985.7A 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU Active CN112613486B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110016985.7A CN112613486B (en) 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110016985.7A CN112613486B (en) 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU

Publications (2)

Publication Number Publication Date
CN112613486A true CN112613486A (en) 2021-04-06
CN112613486B CN112613486B (en) 2023-08-08

Family

ID=75253406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110016985.7A Active CN112613486B (en) 2021-01-07 2021-01-07 Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU

Country Status (1)

Country Link
CN (1) CN112613486B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807318A (en) * 2021-10-11 2021-12-17 南京信息工程大学 Action identification method based on double-current convolutional neural network and bidirectional GRU
CN116935292A (en) * 2023-09-15 2023-10-24 山东建筑大学 Short video scene classification method and system based on self-attention model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109508642A (en) * 2018-10-17 2019-03-22 杭州电子科技大学 Ship monitor video key frame extracting method based on two-way GRU and attention mechanism
CN111860691A (en) * 2020-07-31 2020-10-30 福州大学 Professional stereoscopic video visual comfort degree classification method based on attention and recurrent neural network
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN109508642A (en) * 2018-10-17 2019-03-22 杭州电子科技大学 Ship monitor video key frame extracting method based on two-way GRU and attention mechanism
WO2020221278A1 (en) * 2019-04-29 2020-11-05 北京金山云网络技术有限公司 Video classification method and model training method and apparatus thereof, and electronic device
CN111860691A (en) * 2020-07-31 2020-10-30 福州大学 Professional stereoscopic video visual comfort degree classification method based on attention and recurrent neural network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
XUANZHEN FENG ET AL.: "Sentiment Classification of Reviews Based on BiGRU Neural Network and Fine-grained Attention", Journal of Physics: Conference Series *
LI ZHAOGUANG: "Research on sports video classification based on deep learning and transfer learning", Electronic Measurement Technology *
SANG HAIFENG; ZHAO ZIYU; HE DAKUO: "Design of a video action recognition network based on recurrent region attention and video frame attention", Acta Electronica Sinica, no. 06 *
WEI LESONG ET AL.: "No-reference screen content image quality assessment based on edge and structure", Journal of Beijing University of Aeronautics and Astronautics *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807318A (en) * 2021-10-11 2021-12-17 南京信息工程大学 Action identification method based on double-current convolutional neural network and bidirectional GRU
CN113807318B (en) * 2021-10-11 2023-10-31 南京信息工程大学 Action recognition method based on double-flow convolutional neural network and bidirectional GRU
CN116935292A (en) * 2023-09-15 2023-10-24 山东建筑大学 Short video scene classification method and system based on self-attention model
CN116935292B (en) * 2023-09-15 2023-12-08 山东建筑大学 Short video scene classification method and system based on self-attention model

Also Published As

Publication number Publication date
CN112613486B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN111860691B (en) Stereo video visual comfort degree classification method based on attention and recurrent neural network
WO2021093468A1 (en) Video classification method and apparatus, model training method and apparatus, device and storage medium
Das et al. Where to focus on for human action recognition?
CN112149459B (en) Video saliency object detection model and system based on cross attention mechanism
CN112446476A (en) Neural network model compression method, device, storage medium and chip
Saputra et al. Learning monocular visual odometry through geometry-aware curriculum learning
CN110532871A (en) The method and apparatus of image procossing
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN114529984B (en) Bone action recognition method based on learning PL-GCN and ECLSTM
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN112613486B (en) Professional stereoscopic video comfort level classification method based on multilayer attention and BiGRU
CN113469958A (en) Method, system, equipment and storage medium for predicting development potential of embryo
CN114360073B (en) Image recognition method and related device
CN110599443A (en) Visual saliency detection method using bidirectional long-term and short-term memory network
CN112507920A (en) Examination abnormal behavior identification method based on time displacement and attention mechanism
CN114842542A (en) Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN114677730A (en) Living body detection method, living body detection device, electronic apparatus, and storage medium
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN116402811B (en) Fighting behavior identification method and electronic equipment
Zhong A convolutional neural network based online teaching method using edge-cloud computing platform
CN111611852A (en) Method, device and equipment for training expression recognition model
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
Saif et al. Aggressive action estimation: a comprehensive review on neural network based human segmentation and action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant