CN110365966B - Video quality evaluation method and device based on window - Google Patents

Video quality evaluation method and device based on window

Info

Publication number
CN110365966B
CN110365966B (application number CN201910500485.3A)
Authority
CN
China
Prior art keywords
window
video frame
video
quality
damaged
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910500485.3A
Other languages
Chinese (zh)
Other versions
CN110365966A (en)
Inventor
李辰
徐迈
蒋铼
张善翌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201910500485.3A priority Critical patent/CN110365966B/en
Publication of CN110365966A publication Critical patent/CN110365966A/en
Application granted granted Critical
Publication of CN110365966B publication Critical patent/CN110365966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to a window-based video quality evaluation method and device. The window-based video quality evaluation method comprises the following steps: acquiring a damaged video and a reference video, wherein the damaged video comprises a plurality of damaged video frames and the reference video comprises a plurality of reference video frames; determining the quality score of each damaged video frame, based on each damaged video frame and the corresponding reference video frame, by using a pre-trained window extraction network and a window quality network; and determining a quality score for the damaged video based on the quality score of each damaged video frame. The windows of interest can be extracted more accurately, and in the embodiment of the invention the quality scores of the window images can be aggregated into the quality score of the whole video.

Description

Video quality evaluation method and device based on window
Technical Field
The embodiment of the invention relates to the technical field of video quality evaluation, in particular to a video quality evaluation method and device based on a window.
Background
With the rapid development of virtual reality technology, panoramic video has entered people's daily life as a new multimedia form. Typically, a viewer watches panoramic video through a head-mounted display (HMD), so only the video content within the window (viewport) of the HMD is visible. However, spherical video content requires extremely high resolution to be presented clearly, and in order to transmit high-resolution panoramic video over a channel with limited bandwidth, video compression is required to reduce the bit rate, which in turn degrades visual quality. Therefore, research on panoramic video quality evaluation is urgently needed to guide the panoramic video encoding process.
At present, deep-learning-based quality evaluation methods for planar video perform quality evaluation on cropped image blocks. Inspired by these methods, one panoramic video quality evaluation method estimates the quality score and weight of each image block with a convolutional neural network, and then obtains the overall quality score of the panoramic video as the weighted average of the quality scores of all the image blocks. However, when viewing panoramic video, a viewer sees the content of a window rather than image blocks. Therefore, window-based panoramic video quality evaluation is more reasonable and better reflects the visual quality perceived by people. However, no window-based video quality evaluation method has been proposed so far.
Disclosure of Invention
In order to solve at least one problem in the prior art, at least one embodiment of the present invention provides a method and an apparatus for evaluating video quality based on a window.
In a first aspect, an embodiment of the present invention provides a method for evaluating video quality based on a window, where the method includes:
acquiring a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
based on each damaged video frame and the corresponding reference video frame, determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network;
determining a quality score for the marred video based on the quality score for each marred video frame.
In some embodiments, determining the quality score for any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, obtaining, by the window extraction network, a plurality of candidate window positions and a weight corresponding to each of the candidate window positions based on the corrupted video frame comprises:
determining a temporal variation based on the corrupted video frame;
and obtaining a plurality of candidate window positions and the weight corresponding to each candidate window position through the window extraction network based on the damaged video frame and the time domain variation.
In some embodiments, determining a temporal variation based on the marred video frame comprises:
a temporal change between the corrupted video frame and a previous nth corrupted video frame is calculated.
In some embodiments, obtaining the saliency image and the quality score of the extracted window by a window quality network based on the plurality of candidate window positions, the corrupted video frame, and the corresponding reference video frame comprises:
determining a spatial variation between the damaged video frame and the corresponding reference video frame based on the damaged video frame and the corresponding reference video frame;
obtaining an image of the extracted window and a corresponding spatial variation based on the plurality of candidate window positions, the spatial variation, the damaged video frame and the corresponding reference video frame;
and obtaining the significant image and the quality score of the extracted window through the window quality network based on the image of the extracted window and the corresponding airspace variable quantity.
In some embodiments, obtaining the image of the extracted window and the corresponding spatial variance based on the candidate window positions, the spatial variance, the damaged video frame and the corresponding reference video frame comprises:
extracting at least one window position from the candidate window positions based on the corresponding weight of each candidate window position, and outputting the extracted window position and the corresponding weight;
and aligning the extracted window position, the damaged video frame, the airspace variation and the reference video frame to obtain an image of the extracted window and a corresponding airspace variation.
In some embodiments, averaging the quality scores of all of the extracted windows to obtain the quality score of the corrupted video frame comprises:
and based on the weight corresponding to the position of the extracted window, carrying out weighted average on the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, averaging the quality scores of all of the extracted windows to obtain the quality score of the corrupted video frame comprises:
and arithmetically averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, determining the quality score of the marred video based on the quality score of each marred video frame comprises:
and averaging the quality scores of all damaged video frames to obtain the quality score of the damaged video.
In a second aspect, an embodiment of the present invention further provides a video quality evaluation apparatus based on a window, including:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
a first determining unit, configured to determine the quality score of each damaged video frame based on each damaged video frame and the corresponding reference video frame by using a pre-trained window extraction network and a window quality network; and
a second determining unit, configured to determine the quality score of the damaged video based on the quality score of each damaged video frame.
In the embodiment of the invention, the windows of interest are extracted more accurately by predicting the viewer's head-movement positions, which improves the accuracy of the window extraction network; in addition, the saliency prediction task is used to assist quality evaluation, which further improves the accuracy of the video quality evaluation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flow chart of a window-based video quality assessment according to an embodiment of the present invention;
FIG. 2 is a block diagram of a window-based video quality assessment framework according to an embodiment of the present invention;
FIG. 3 is a VP-net network architecture provided by embodiments of the present invention;
FIG. 4 is a VQ-net network architecture provided by embodiments of the present invention;
FIG. 5 is a VP-net network architecture provided by embodiments of the present invention;
fig. 6 is a VQ-net network structure according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described in detail with reference to the accompanying drawings and examples. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention, are within the scope of the invention.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
As shown in fig. 1, the method disclosed in this embodiment may include the following steps 101 to 103:
101: a corrupted video and a reference video are acquired.
102: a quality score is determined for each corrupted video frame based on each corrupted video frame and the corresponding reference video frame.
103: based on the quality score of each corrupted video frame, a quality score of the corrupted video is determined.
In the embodiment of the present invention, in step 101, the damaged video includes a plurality of damaged video frames and the reference video includes a plurality of reference video frames. From the damaged video frames and the reference video frames, a temporal variation between each damaged video frame and the nth damaged video frame preceding it, and a spatial variation between each damaged video frame and its corresponding reference video frame, are calculated.
The temporal variation can be calculated, for example, by directly taking the difference between the two frame images: F_t − F_{t−Δt} or F_{t−Δt} − F_t.
The spatial variation can be calculated, for example, by directly taking the difference between the damaged video frame and the corresponding reference video frame F_t^ref: F_t − F_t^ref or F_t^ref − F_t.
In the embodiment of the present invention, step 102 is mainly divided into two stages: window extraction and quality evaluation. In the window extraction stage, the windows that receive more attention are extracted by predicting the head-movement positions of a person watching the panoramic video; in the quality evaluation stage, the quality score of each extracted window image is predicted.
It can be understood that, in the embodiment of the present invention, deep convolutional neural networks are applied to window extraction and quality evaluation, namely a viewport proposal network (VP-net) for window extraction and a viewport quality network (VQ-net) for quality evaluation.
The t-th frame of the damaged video, F_t, and the temporal variation obtained in step 101 are input into VP-net, which outputs a series of candidate window positions V = {v_1, …, v_I} and the corresponding weights W = {w_1, …, w_I}, where I is the number of candidate windows and is determined by the VP-net structure.
Then, according to the obtained candidate window positions and corresponding weights, a plurality of windows are extracted from the candidate window positions, and the weights of the extracted windows are output at the same time. The method provides a window-softening non-maximum suppression (NMS) procedure for extracting windows according to the candidate window positions and corresponding weights. The specific implementation of the NMS may be, but is not limited to, the window-softening NMS proposed by this method.
The extracted window positions are aligned to the original video frame, and the content of each window and the spatial variation within the window are obtained through the inverse gnomonic projection (also called sundial projection, central projection or point-tangent projection). The method provides a calculation of the inverse gnomonic projection transform, which may be, but is not limited to, the calculation given herein.
The images of the extracted windows and their spatial variations are input into VQ-net, which outputs the saliency image and the quality score of each window.
There are two methods for computing the quality score of a damaged video frame: the quality scores of all the extracted windows can be weighted-averaged according to the weights corresponding to the extracted window positions, or the quality scores of all the extracted windows can be arithmetically averaged.
The method comprises the following specific steps:
1. Input a damaged video and a reference video, denote the total number of video frames by T, and number the video frames 1, 2, …, T in temporal order. Each time, the method inputs the t-th frame of the damaged video F_t, the (t−Δt)-th frame F_{t−Δt}, and the t-th frame of the reference video F_t^ref into the framework shown in fig. 1. Here Δt is a parameter whose legal value range is an integer in [1, T−1], with a suggested value of 1; t ranges over the integers in [Δt, T].
2. Preprocessing. The temporal variation is calculated from the t-th frame F_t and the (t−Δt)-th frame F_{t−Δt} of the damaged video, and the spatial variation is calculated from the t-th frame of the damaged video and the t-th frame of the reference video.
2.1 The temporal variation can be calculated by, but is not limited to, the following methods: the two frame images can be directly subtracted, i.e. F_t − F_{t−Δt} or F_{t−Δt} − F_t; alternatively, the dense optical flow between the two frames can be extracted, with extraction methods including but not limited to the Farneback algorithm, the Horn-Schunck algorithm, FlowNet and FlowNet 2.0.
2.2 The spatial variation can be calculated by, but is not limited to, the following methods: the two frame images can be directly subtracted, i.e. F_t − F_t^ref or F_t^ref − F_t; alternatively, the structural similarity between the two frames can be calculated. Denote the resulting spatial variation by S_t.
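As a minimal sketch of step 2, assuming the frames are given as NumPy arrays and OpenCV is available, the temporal and spatial variations could be computed as follows; the function names are illustrative and not part of the patent.

```python
import cv2
import numpy as np

def temporal_variation(frame_t, frame_t_minus, use_optical_flow=False):
    """Temporal variation of step 2.1: frame difference F_t - F_{t-dt},
    or optionally a dense optical flow field (one of the listed alternatives)."""
    if use_optical_flow:
        g_prev = cv2.cvtColor(frame_t_minus, cv2.COLOR_BGR2GRAY)
        g_curr = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow, as mentioned in step 2.1.
        return cv2.calcOpticalFlowFarneback(g_prev, g_curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    return frame_t.astype(np.float32) - frame_t_minus.astype(np.float32)

def spatial_variation(frame_t, frame_t_ref):
    """Spatial variation of step 2.2: difference between the damaged frame F_t
    and the corresponding reference frame."""
    return frame_t.astype(np.float32) - frame_t_ref.astype(np.float32)
```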
3. The t-th frame of the damaged video, F_t, and the temporal variation obtained in step 2 are input into the window extraction network (VP-net), which outputs a series of candidate window positions V = {v_1, …, v_I} and the corresponding weights W = {w_1, …, w_I}, where I is the number of candidate windows and is determined by the VP-net structure.
4. Non-maximum suppression (NMS). According to the candidate window positions and corresponding weights obtained in step 3, a number of windows (denoted K) are extracted from the candidate window positions, and the weights of the extracted windows are output at the same time. The method provides a window-softening NMS procedure for extracting windows according to the candidate window positions and corresponding weights; the specific implementation of the NMS may be, but is not limited to, the window-softening NMS proposed by this method (a sketch of this procedure is given after step 4.11). The window-softening NMS proceeds as follows:
4.1 Input: the candidate window positions V = {v_1, …, v_I} and the corresponding weights W = {w_1, …, w_I} obtained in step 3. Set a great-circle distance threshold d_th and a threshold K_th on the number of extracted windows, where the legal value range of d_th is (0, π] with a suggested value of π/24, and the legal value range of K_th is [1, I] with a suggested value of min{20, I}.
4.2 Initialize the variable k ← 1 and the output sets V_p ← ∅, W_p ← ∅.
4.3 Find the index of the maximum weight: ι ← argmax_i w_i.
4.4 Let d(v′, v″) denote the great-circle distance between window positions v′ and v″. Among the candidate window positions remaining in the set V, find the indices of those whose great-circle distance to v_ι is smaller than d_th, forming the set I′ ← {ι′ | d(v_{ι′}, v_ι) < d_th, v_{ι′} ∈ V}.
The method provides a calculation of the great-circle distance; the calculation may be, but is not limited to, the one given here.
4.4.1 Given two spherical positions v′ = (φ′, θ′) and v″ = (φ″, θ″), where φ′ and φ″ denote longitude and θ′ and θ″ denote latitude, in rad.
4.4.2 The great-circle distance between the two spherical positions is
d(v′, v″) = arccos(sin θ′ sin θ″ + cos θ′ cos θ″ cos(φ′ − φ″)).
4.5 Compute the weight w_k^p of the k-th extracted window from the weights of the candidates indexed by I′.
4.6 Compute the position v_k^p of the k-th extracted window from the positions and weights of the candidates indexed by I′.
4.7 Add the position and weight of the extracted window to the output sets: V_p ← V_p ∪ {v_k^p}, W_p ← W_p ∪ {w_k^p}.
4.8 Exclude the candidate windows and weights indexed by I′ from the input sets: V ← V \ {v_{ι′} | ι′ ∈ I′}, W ← W \ {w_{ι′} | ι′ ∈ I′}.
4.9 Update the variable k ← k + 1.
4.10 Repeat steps 4.3 to 4.9 until V = ∅ or k > K_th.
4.11 Output the set of extracted windows V_p and their weight set W_p.
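A minimal sketch of the window-softening NMS above, assuming positions are (longitude, latitude) pairs in rad. In particular, the merge rules used in steps 4.5 and 4.6 here (sum of clustered weights, weight-averaged position) are assumptions of this sketch, since the exact formulas are not reproduced above.

```python
import numpy as np

def great_circle_distance(v1, v2):
    """Great-circle distance of step 4.4.2; v = (longitude, latitude) in rad."""
    phi1, th1 = v1
    phi2, th2 = v2
    cos_d = (np.sin(th1) * np.sin(th2)
             + np.cos(th1) * np.cos(th2) * np.cos(phi1 - phi2))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

def window_softening_nms(positions, weights, d_th=np.pi / 24, k_th=20):
    """Window-softening NMS (steps 4.1-4.11).
    positions: sequence of (phi, theta) pairs; weights: sequence of floats."""
    positions = list(positions)
    weights = list(weights)
    out_pos, out_w = [], []
    while positions and len(out_pos) < k_th:
        iota = int(np.argmax(weights))                        # 4.3: index of maximum weight
        cluster = [i for i, v in enumerate(positions)         # 4.4: candidates within d_th of v_iota
                   if great_circle_distance(v, positions[iota]) < d_th]
        w_cluster = np.array([weights[i] for i in cluster])
        v_cluster = np.array([positions[i] for i in cluster])
        out_w.append(float(w_cluster.sum()))                  # 4.5 (assumed merge rule: sum of weights)
        out_pos.append(tuple(np.average(v_cluster, axis=0,    # 4.6 (assumed merge rule: weighted mean)
                                        weights=w_cluster)))
        for i in sorted(cluster, reverse=True):               # 4.8: remove clustered candidates
            positions.pop(i)
            weights.pop(i)
    return out_pos, out_w                                     # 4.11: V_p and W_p
```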
5. Window alignment. The window positions extracted in step 4 are aligned to the original video frame, and the content of each window and the spatial variation within the window are obtained through the inverse gnomonic projection (also called sundial projection, central projection or point-tangent projection).
The method provides a calculation of the inverse gnomonic projection transform; the calculation may be, but is not limited to, the one given here (a sketch of this procedure is given after these steps).
5.1 Input: the position of the k-th extracted window v_k^p = (φ_k^p, θ_k^p), where φ_k^p and θ_k^p are the longitude and latitude of the extracted window position, in rad; the t-th frame of the damaged video F_t, with width W_F and height H_F; and the spatial variation S_t obtained in step 2, whose resolution is the same as that of F_t.
5.2 Initialize the window image C_k with width W and height H. The legal value range of W and H is any positive integer; the suggested values are W = 540 and H = 600.
5.3 For a pixel position (x, y) of C_k, calculate an intermediate coordinate (f_x, f_y) on the tangent plane of the unit sphere. The mapping from (x, y) to (f_x, f_y) is determined by a_W and a_H, the angular ranges of the viewport corresponding to W and H, which are related to the viewport image resolution and the physical size of the HMD. Corresponding to the suggested window resolution in 5.2 and the prevailing conditions of HMDs on the market, the suggested values are a_W = 71π/180 rad and a_H = 74π/180 rad.
5.4 From (f_x, f_y), the spherical position (φ_{x,y}, θ_{x,y}) corresponding to the pixel position (x, y) can be obtained, in rad, by the inverse gnomonic projection centred at (φ_k^p, θ_k^p):
θ_{x,y} = arcsin( cos c · sin θ_k^p + f_y · sin c · cos θ_k^p / ρ ),
φ_{x,y} = φ_k^p + arctan( f_x · sin c / (ρ · cos θ_k^p · cos c − f_y · sin θ_k^p · sin c) ),
where ρ = √(f_x² + f_y²) and c = arctan ρ.
5.5 Map the spherical position (φ_{x,y}, θ_{x,y}) to the pixel coordinate (p_{x,y}, q_{x,y}) of F_t; the mapping relationship depends on the mapping format used by the input video.
The method provides a calculation corresponding to the equirectangular projection (ERP); when the input video uses ERP, the calculation may be, but is not limited to, the one given here. Meanwhile, the input video may use, but is not limited to, ERP; the calculations for other mapping formats can be derived analogously.
If the input video uses ERP, (p_{x,y}, q_{x,y}) is obtained by linearly mapping the longitude φ_{x,y} and the latitude θ_{x,y} to the pixel coordinate ranges [0, W_F] and [0, H_F] of F_t.
5.6 Set C_k(x, y) ← ψ(F_t; p_{x,y}, q_{x,y}), where ψ(·) is an interpolation function, including but not limited to nearest-neighbor interpolation, bilinear interpolation and spline interpolation.
5.7 Repeat steps 5.3 to 5.6 for all integer coordinate positions (x, y), x ∈ [1, W], y ∈ [1, H].
5.8 Initialize the window spatial variation image S_k^p, whose resolution is the same as that of C_k. Replace C_k by S_k^p and F_t by S_t, and repeat steps 5.3 to 5.7.
Repeat steps 5.1 to 5.8 for all k = 1, …, K to obtain the images C_k, k = 1, …, K, of all extracted windows and their spatial variation images S_k^p, k = 1, …, K.
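A minimal sketch of steps 5.1-5.8 for a single window, assuming an ERP input frame and bilinear sampling via OpenCV's cv2.remap. The pixel-to-tangent-plane mapping of step 5.3, the ERP pixel mapping of step 5.5 and the sign/orientation conventions are assumptions of this sketch, and the function name is illustrative.

```python
import cv2
import numpy as np

def extract_viewport(frame, lon_c, lat_c, W=540, H=600,
                     a_w=71 * np.pi / 180, a_h=74 * np.pi / 180):
    """Render a W x H viewport centred at (lon_c, lat_c) from an ERP frame
    via the inverse gnomonic projection (steps 5.3-5.6)."""
    H_F, W_F = frame.shape[:2]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # 5.3 (assumed form): pixel -> tangent-plane coordinates scaled by the viewport FoV.
    fx = (2.0 * (xs + 0.5) / W - 1.0) * np.tan(a_w / 2.0)
    fy = (2.0 * (ys + 0.5) / H - 1.0) * np.tan(a_h / 2.0)
    # 5.4: inverse gnomonic projection, rho = sqrt(fx^2 + fy^2), c = arctan(rho).
    rho = np.sqrt(fx ** 2 + fy ** 2)
    rho = np.where(rho == 0, 1e-12, rho)
    c = np.arctan(rho)
    lat = np.arcsin(np.cos(c) * np.sin(lat_c) + fy * np.sin(c) * np.cos(lat_c) / rho)
    lon = lon_c + np.arctan2(fx * np.sin(c),
                             rho * np.cos(lat_c) * np.cos(c) - fy * np.sin(lat_c) * np.sin(c))
    # 5.5 (assumed ERP convention): map (lon, lat) to pixel coordinates of the ERP frame.
    p = ((lon + np.pi) % (2 * np.pi)) / (2 * np.pi) * W_F
    q = (0.5 - lat / np.pi) * H_F
    # 5.6: bilinear interpolation; BORDER_WRAP approximates the longitude wrap-around.
    return cv2.remap(frame, p.astype(np.float32), q.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)
```

Running the same function with the spatial variation image S_t in place of the frame gives the window spatial variation image of step 5.8.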
6. For all k = 1, …, K, take in turn the image C_k of the extracted window and its spatial variation image S_k^p, input them into the window quality network (VQ-net), and output the saliency image and the quality score s_k of the k-th window.
7. Average the quality scores s_k, k = 1, …, K, of all the extracted windows to obtain the quality score ŝ_t of the input damaged video frame F_t.
The method provides two ways of averaging the scores to obtain the quality score ŝ_t; the calculation of ŝ_t may be, but is not limited to, the calculations given here.
7.1 Given the quality scores s_k, k = 1, …, K, of all the extracted windows, ŝ_t may be their arithmetic mean:
ŝ_t = (1/K) Σ_{k=1}^{K} s_k.
7.2 Given the quality scores s_k, k = 1, …, K, of all the extracted windows and the weights w_k^p, k = 1, …, K, of the extracted windows, ŝ_t may be a weighted average of the window quality scores:
ŝ_t = ( Σ_{k=1}^{K} w_k^p s_k ) / ( Σ_{k=1}^{K} w_k^p ).
The quality scores of all the damaged video frames are then averaged to obtain the quality score of the damaged video.
8. For all integer t ∈ [Δt, T], repeat steps 2 to 7 to obtain the quality score ŝ_t of each damaged video frame F_t. The quality score of the input damaged video is the arithmetic mean of the quality scores obtained in this step:
ŝ = ( 1 / (T − Δt + 1) ) Σ_{t=Δt}^{T} ŝ_t.
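A minimal sketch of steps 7 and 8, assuming the per-window scores and weights are plain Python sequences; both averaging variants from 7.1 and 7.2 are shown, and the function names are illustrative.

```python
import numpy as np

def frame_quality_score(window_scores, window_weights=None):
    """Step 7: aggregate per-window scores into a frame score.
    Arithmetic mean (7.1) if no weights are given, weighted average (7.2) otherwise."""
    scores = np.asarray(window_scores, dtype=np.float64)
    if window_weights is None:
        return float(scores.mean())
    weights = np.asarray(window_weights, dtype=np.float64)
    return float((weights * scores).sum() / weights.sum())

def video_quality_score(frame_scores):
    """Step 8: the video score is the arithmetic mean of the per-frame scores."""
    return float(np.mean(frame_scores))
```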
This embodiment also discloses an apparatus, which may include the following units, described in detail as follows:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
the first determining unit is used for determining the quality score of each damaged video frame based on each damaged video frame and a corresponding reference video frame and a pre-trained window extraction network and a window quality network;
a second determining unit for determining a quality score of the marred video based on the quality score of each marred video frame.
I. Embodiments of the present invention use VP-net to output a list of candidate window positions and corresponding weights; two implementations of VP-net are provided here. VP-net is described in detail as follows:
1. VP-net may be a deep convolutional neural network as described below
1.1 network architecture
The network input is the damaged video frame F_t and the temporal variation obtained in step 101; the network outputs a series of window position offsets {Δv_i} and their weights {ŵ_i}. The network topology is shown in fig. 3. A description of the different components in the network structure is given below.
1.1.1 Resampling. Given a series of predefined spherical sampling positions, the corresponding pixel coordinates are obtained according to the mapping format used by the input video, and the corresponding pixel values are obtained after interpolation. The predefined spherical sampling schemes include, but are not limited to, SOFT, Clenshaw-Curtis and Gauss-Legendre.
1.1.2 Downsampling. For each tensor in fig. 3 that needs to be downsampled, its size after downsampling is the same as the size of the tensor it is connected with after downsampling. The downsampling uses differentiable interpolation, including but not limited to bilinear/trilinear interpolation.
1.1.3 SO(3) tensor to S2 tensor conversion. Let T_SO(3) denote a tensor defined on SO(3), whose coordinates are represented by (α, β, γ), and let T_S2 denote the corresponding tensor on S2 with coordinates (α, β). The method provides two ways of converting the SO(3) tensor into the S2 tensor; the conversion may be, but is not limited to, the ways given here.
1.1.3.1 The SO(3) tensor can be converted to the S2 tensor by averaging over the γ dimension:
T_S2(α, β) = mean_γ T_SO(3)(α, β, γ).
1.1.3.2 The SO(3) tensor can be converted to the S2 tensor by taking the maximum over the γ dimension:
T_S2(α, β) = max_γ T_SO(3)(α, β, γ).
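A minimal sketch of the two conversions in 1.1.3, assuming the SO(3) tensor is stored as an array whose last axis is the γ dimension (the memory layout is an assumption of this sketch).

```python
import numpy as np

def so3_to_s2(t_so3, reduce="mean"):
    """Convert an SO(3) tensor with axes (..., alpha, beta, gamma) to an S2 tensor
    with axes (..., alpha, beta) by reducing over gamma (1.1.3.1 / 1.1.3.2)."""
    if reduce == "mean":
        return t_so3.mean(axis=-1)   # 1.1.3.1: average over the gamma dimension
    if reduce == "max":
        return t_so3.max(axis=-1)    # 1.1.3.2: maximum over the gamma dimension
    raise ValueError("reduce must be 'mean' or 'max'")
```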
1.1.4 The softmax function is computed over the width and height dimensions of the input tensor.
1.1.5 The specific configuration of each layer of the network containing learnable parameters is shown in Table 1.
Table 1. Specific configuration of the layers of the network of fig. 3 containing learnable parameters.
1.2 Objective function. The following items define the objective function used in VP-net training.
1.2.1 Spherical anchor definition. Let T_a denote the S2 tensor obtained by converting the SO(3) tensor output by SO3Conv10. The feature vector at each pixel position of T_a corresponds to a specific coordinate position on the sphere, and the correspondence is the same as for ERP. These specific coordinate positions can be calculated through the inverse of the mapping transform; they are defined as the spherical anchors and denoted v_a = (φ_a, θ_a), where φ_a and θ_a are the longitude and latitude respectively. If the number of pixel positions in T_a is I, then the total number of spherical anchors is I.
As described in step 3, VP-net outputs I candidate window positions, which correspond to I spherical anchors one-to-one, and which do not change with input changes, and which can be regarded as a constant attribute of the network.
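A minimal sketch of computing the spherical anchor positions of 1.2.1, assuming the feature map uses the same pixel-to-sphere correspondence as ERP; the exact grid offsets are assumptions of this sketch.

```python
import numpy as np

def spherical_anchors(height, width):
    """Longitude/latitude anchors v_a = (phi_a, theta_a) for an H x W S2 feature map,
    using an ERP-like correspondence; returns an (H, W, 2) array, i.e. I = H * W anchors."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    phi_a = ((xs + 0.5) / width) * 2.0 * np.pi - np.pi       # longitude in [-pi, pi)
    theta_a = np.pi / 2.0 - ((ys + 0.5) / height) * np.pi    # latitude in (-pi/2, pi/2)
    return np.stack([phi_a, theta_a], axis=-1)
```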
1.2.2 Window weight objective function. Given J experimenters viewing the input damaged video, denote the head-movement position truth values at the t-th frame by v_HM^j, j = 1, …, J. For the spherical anchor v_a^i, its weight truth value w_i is defined through a Gaussian kernel of the great-circle distances d(v_a^i, v_HM^j), j = 1, …, J, where d(·, ·) is the great-circle distance defined above, and σ is a parameter whose legal value range is (0, ∞), with a suggested value of 18.33π/180.
The window weight objective function is defined as the relative entropy, also known as the Kullback-Leibler (KL) divergence, between the distribution of the window weight truth values {w_i} and the distribution of the window weights {ŵ_i} predicted by the network:
L_w = Σ_i w_i log(w_i / ŵ_i).
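A minimal sketch of the window-weight objective in 1.2.2, assuming the ground-truth weights are obtained by summing a Gaussian kernel of the great-circle distances to the subjects' head-movement positions and normalising into a distribution; the exact kernel form is an assumption of this sketch. great_circle_distance is the function from the NMS sketch above.

```python
import numpy as np

def anchor_weight_truth(anchors, hm_positions, sigma=18.33 * np.pi / 180):
    """Ground-truth anchor weights w_i from J head-movement positions (assumed Gaussian kernel)."""
    w = np.zeros(len(anchors))
    for i, v_a in enumerate(anchors):
        d = np.array([great_circle_distance(v_a, v_hm) for v_hm in hm_positions])
        w[i] = np.exp(-d ** 2 / (2.0 * sigma ** 2)).sum()
    return w / w.sum()                                   # normalise into a distribution

def window_weight_loss(w_true, w_pred, eps=1e-12):
    """KL divergence between the ground-truth and predicted weight distributions (L_w)."""
    w_true = np.asarray(w_true, dtype=np.float64)
    w_pred = np.asarray(w_pred, dtype=np.float64)
    return float(np.sum(w_true * np.log((w_true + eps) / (w_pred + eps))))
```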
1.2.3 Window position offset objective function. For the spherical anchor v_a^i, define the corresponding window position offset truth value Δv_i* as the offset from v_a^i to the head-movement position truth value nearest to it. The window position offset objective function is defined as the smooth ℓ1 distance between each predicted window position offset Δv_i and its truth value Δv_i*, summed over the anchors:
L_v = Σ_i smooth_ℓ1(Δv_i − Δv_i*).
In summary, the objective function in training VP-net is defined as follows:
L_VP = λ_w L_w + λ_v L_v,
where λ_w and λ_v are parameters whose legal value range is positive; the suggested values are λ_w = 1 and λ_v = 5.
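A minimal sketch of the VP-net objective of section 1.2, using the standard smooth ℓ1 form; summing over all anchors is an assumption of this sketch, and window_weight_loss is the function from the sketch above.

```python
import numpy as np

def smooth_l1(x):
    """Standard smooth-l1 (Huber-like) penalty, applied elementwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def vp_net_loss(w_true, w_pred, dv_pred, dv_true, lam_w=1.0, lam_v=5.0):
    """Combined VP-net objective: lam_w * L_w + lam_v * L_v."""
    l_w = window_weight_loss(w_true, w_pred)                 # KL term from the sketch above
    l_v = float(smooth_l1(np.asarray(dv_pred) - np.asarray(dv_true)).sum())
    return lam_w * l_w + lam_v * l_v
```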
2. VP-net may be a deep convolutional neural network as described below
2.1 Network structure. The network input is the damaged video frame F_t and the temporal variation obtained in step 2; the network outputs the probability p̂ that the input video frame contains shot motion, a series of window position offsets {Δv_i} and their weights {ŵ_i}. The network topology is shown in fig. 5. A description of the different components in the network structure is given below.
2.1.1 resampling. As described above.
The 2.1.2SO (3) tensor is converted to the S2 tensor. As described above.
2.1.3 Softmax function. As described above.
2.1.4 Center Gaussian weight map, denoted T_G, with (x, y) as pixel coordinates. T_G is generated as a two-dimensional Gaussian centred on the centre of the tensor T_s, where the side length of T_s corresponds to the suggested value in 1.1.1 above, and the bandwidth of the Gaussian is a parameter whose legal value range is (0, ∞).
2.1.5 The specific configuration of each layer of the network containing learnable parameters is shown in Table 2.
Table 2. Specific configuration of the layers of the network of fig. 5 containing learnable parameters.
2.2 Objective function. The following items define the objective function used when training VP-net.
2.2.1 spherical anchor definition. As described above.
2.2.2 window weight objective function. As described above.
2.2.3 Window position offset objective function. As described above.
2.2.4 Shot-motion detection objective function. The ground-truth label of whether the input video frame contains shot motion is l (l = 1 when the input video frame contains shot motion, otherwise l = 0). The shot-motion detection objective function is defined as the binary cross-entropy:
L_l = −[ l log p̂ + (1 − l) log(1 − p̂) ].
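A minimal sketch of the shot-motion detection objective in 2.2.4: binary cross-entropy between the label l and the predicted probability p̂.

```python
import numpy as np

def shot_motion_bce(label, p_hat, eps=1e-12):
    """Binary cross-entropy L_l for the shot-motion label (label in {0, 1})."""
    return float(-(label * np.log(p_hat + eps) + (1 - label) * np.log(1.0 - p_hat + eps)))
```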
2.2.5 In summary, the objective function when training this VP-net combines the shot-motion detection objective L_l with the window weight objective L_w and the window position offset objective L_v, weighted by parameters whose legal value range is positive; the suggested values are λ_w = 1 and λ_v = 5.
II. Embodiments of the present invention use VQ-net to output the saliency maps and quality scores of multiple windows; two implementations of VQ-net are provided here. VQ-net is described in detail as follows:
1. VQ-net may be a deep convolutional neural network as described below
1.1 Network structure. The network input is a window image C_k and its spatial variation image S_k^p; the outputs are the saliency map M̂_k of the input window and its quality score s_k. The network topology is shown in fig. 6. A description of the different components in the network structure is given below.
1.1.1 The softmax function is computed over the width and height dimensions of the input tensor.
1.1.2 Upsampling. For each tensor in fig. 6 that needs to be upsampled, its size after upsampling is the same as the size of the tensor it is multiplied with after upsampling. The upsampling uses differentiable interpolation, including but not limited to bilinear interpolation.
1.1.3 The specific configuration of the convolutional and pooling layers in the network is shown in Table 3.
Table 3. Specific configuration of the convolutional and pooling layers of the network of fig. 6.
1.1.4 Densely connected block (DenseBlock). The configuration is as defined in DenseNet, as shown in Table 4.
Table 4. Specific configuration of the densely connected blocks of fig. 6.
1.2 Objective function. The following items define the objective function used when training VQ-net.
1.2.1 Saliency prediction objective function. Given the ground-truth eye-movement saliency map M_k within the range of window C_k, obtained from experimenters watching the input damaged video, the saliency map is regarded as a probability distribution, and the saliency prediction objective function is defined as the relative entropy between the saliency map M̂_k predicted by the network for the window and its truth value M_k:
L_M = Σ_{x′,y′} M_k(x′, y′) log( M_k(x′, y′) / M̂_k(x′, y′) ),
where M_k(x′, y′) and M̂_k(x′, y′) are the saliency values of M_k and M̂_k at pixel position (x′, y′) respectively.
1.2.2 Quality score objective function. Given the subjective quality score s of the video corresponding to the input window, the quality score objective function is defined as the squared error between the quality score s_k predicted by the network for the window and s:
L_s = (s_k − s)².
In summary, the objective function when training VQ-net is defined as follows:
L_VQ = λ_M L_M + λ_s L_s,
where λ_M and λ_s are parameters whose legal value range is positive; the suggested values are λ_M = 10 and λ_s = 1×10³.
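A minimal sketch of the VQ-net objective in 1.2, treating the saliency maps as probability distributions; the explicit normalisation step is an assumption of this sketch.

```python
import numpy as np

def vq_net_loss(sal_true, sal_pred, s_pred, s_true,
                lam_m=10.0, lam_s=1e3, eps=1e-12):
    """VQ-net objective: lam_m * KL(M_k || M_hat_k) + lam_s * (s_k - s)^2."""
    m = np.asarray(sal_true, dtype=np.float64)
    m_hat = np.asarray(sal_pred, dtype=np.float64)
    m = m / (m.sum() + eps)                 # treat saliency maps as distributions
    m_hat = m_hat / (m_hat.sum() + eps)
    l_m = np.sum(m * np.log((m + eps) / (m_hat + eps)))
    l_s = (s_pred - s_true) ** 2
    return float(lam_m * l_m + lam_s * l_s)
```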
2. VQ-net may be a deep convolutional neural network as described below.
2.1 Network structure. The network input is a window image C_k and its spatial variation image S_k^p; the outputs are the saliency map M̂_k of the input window and its quality score s_k. The network topology is shown in fig. 4. A description of the different components in the network structure is given below.
2.1.1 Softmax function. As described above.
2.1.2 downsampling. For the tensor requiring downsampling in fig. 4, the downsampled size is the same as the size of the tensor to be connected after downsampling. The downsampling uses gradient-guided interpolation, including but not limited to bilinear interpolation.
2.1.3 Densely connected blocks. As described above.
2.1.4 The specific configuration of the convolutional and pooling layers in the network is shown in Table 5.
Table 5. Specific configuration of the convolutional and pooling layers of the network of fig. 4.
2.2 Objective function. The following items define the objective function used when training VQ-net.
2.2.1 Saliency prediction objective function. As described above.
2.2.2 Quality score objective function. As described above.
In summary, the objective function when training VQ-net is defined as follows:
L_VQ = λ_M L_M + λ_s L_s,
where λ_M and λ_s are parameters whose legal value range is positive; the suggested values are λ_M = 10 and λ_s = 1×10⁴.
The apparatus disclosed in the above embodiments can implement the processes of the methods disclosed in the above method embodiments, and in order to avoid repetition, the details are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (9)

1. A method for evaluating video quality based on windows is characterized by comprising the following steps:
acquiring a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
based on each damaged video frame and the corresponding reference video frame, determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network;
determining a quality score for the marred video based on the quality score for each marred video frame;
wherein determining the quality score of any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
2. The method of claim 1, wherein obtaining a plurality of candidate window positions and a weight corresponding to each of the candidate window positions via the window extraction network based on the corrupted video frame comprises:
determining a temporal variation based on the corrupted video frame;
and obtaining a plurality of candidate window positions and the weight corresponding to each candidate window position through the window extraction network based on the damaged video frame and the time domain variation.
3. The method of claim 2, wherein determining the temporal variation based on the corrupted video frame comprises:
a temporal change between the corrupted video frame and a previous nth corrupted video frame is calculated.
4. The method of claim 1, wherein obtaining the saliency image and the quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame comprises:
determining a spatial variation between the damaged video frame and the corresponding reference video frame based on the damaged video frame and the corresponding reference video frame;
obtaining an image of the extracted window and a corresponding spatial variation based on the plurality of candidate window positions, the spatial variation, the damaged video frame and the corresponding reference video frame;
and obtaining the significant image and the quality score of the extracted window through the window quality network based on the image of the extracted window and the corresponding airspace variable quantity.
5. The method of claim 4, wherein obtaining the image of the extracted window and the corresponding spatial variance based on the candidate window positions, the spatial variance, the corrupted video frame and the corresponding reference video frame comprises:
extracting at least one window position from the candidate window positions based on the corresponding weight of each candidate window position, and outputting the extracted window position and the corresponding weight;
and aligning the extracted window position, the damaged video frame, the airspace variation and the reference video frame to obtain an image of the extracted window and a corresponding airspace variation.
6. The method of claim 1, wherein averaging the quality scores of all extracted windows to obtain the quality score of the corrupted video frame comprises:
and based on the weight corresponding to the position of the extracted window, carrying out weighted average on the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
7. The method of claim 1, wherein averaging the quality scores of all extracted windows to obtain the quality score of the corrupted video frame comprises:
and arithmetically averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
8. The method of claim 1, wherein determining the quality score of the marred video based on the quality score of each marred video frame comprises:
and averaging the quality scores of all damaged video frames to obtain the quality score of the damaged video.
9. A window-based video quality evaluation apparatus, comprising:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
the first determining unit is used for determining the quality score of each damaged video frame based on each damaged video frame and a corresponding reference video frame and a pre-trained window extraction network and a window quality network;
a second determining unit for determining a quality score of the marred video based on a quality score of each marred video frame;
wherein determining the quality score of any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
CN201910500485.3A 2019-06-11 2019-06-11 Video quality evaluation method and device based on window Active CN110365966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910500485.3A CN110365966B (en) 2019-06-11 2019-06-11 Video quality evaluation method and device based on window

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910500485.3A CN110365966B (en) 2019-06-11 2019-06-11 Video quality evaluation method and device based on window

Publications (2)

Publication Number Publication Date
CN110365966A CN110365966A (en) 2019-10-22
CN110365966B true CN110365966B (en) 2020-07-28

Family

ID=68216886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910500485.3A Active CN110365966B (en) 2019-06-11 2019-06-11 Video quality evaluation method and device based on window

Country Status (1)

Country Link
CN (1) CN110365966B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163990B (en) 2020-09-08 2022-10-25 上海交通大学 Significance prediction method and system for 360-degree image
US20220415037A1 (en) * 2021-06-24 2022-12-29 Meta Platforms, Inc. Video corruption detection
CN115953727B (en) * 2023-03-15 2023-06-09 浙江天行健水务有限公司 Method, system, electronic equipment and medium for detecting floc sedimentation rate

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103338379B (en) * 2013-06-05 2015-04-29 宁波大学 Stereoscopic video objective quality evaluation method based on machine learning
CN104506852B (en) * 2014-12-25 2016-08-24 北京航空航天大学 A kind of objective quality assessment method towards video conference coding
CN106412571B (en) * 2016-10-12 2018-06-19 天津大学 A kind of method for evaluating video quality based on gradient similarity standard difference
CN108337504A (en) * 2018-01-30 2018-07-27 中国科学技术大学 A kind of method and device of evaluation video quality
CN108449595A (en) * 2018-03-22 2018-08-24 天津大学 Virtual reality method for evaluating video quality is referred to entirely based on convolutional neural networks
CN108900864B (en) * 2018-07-23 2019-12-10 西安电子科技大学 full-reference video quality evaluation method based on motion trail

Also Published As

Publication number Publication date
CN110365966A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
US11468697B2 (en) Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof
CN113822969B (en) Training neural radiation field model, face generation method, device and server
CN110910447B (en) Visual odometer method based on dynamic and static scene separation
CN110365966B (en) Video quality evaluation method and device based on window
CN107622257A (en) A kind of neural network training method and three-dimension gesture Attitude estimation method
EP3598387B1 (en) Learning method and program
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN113393522A (en) 6D pose estimation method based on monocular RGB camera regression depth information
CN115035171B (en) Self-supervision monocular depth estimation method based on self-attention guide feature fusion
CN111626308B (en) Real-time optical flow estimation method based on lightweight convolutional neural network
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN111325784A (en) Unsupervised pose and depth calculation method and system
CN110992414B (en) Indoor monocular scene depth estimation method based on convolutional neural network
CN105488759B (en) A kind of image super-resolution rebuilding method based on local regression model
CN108462868A The prediction technique of user's fixation point in 360 degree of panorama VR videos
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN102231844A (en) Video image fusion performance evaluation method based on structure similarity and human vision
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN115496663A (en) Video super-resolution reconstruction method based on D3D convolution intra-group fusion network
CN112819697A (en) Remote sensing image space-time fusion method and system
CN116934592A (en) Image stitching method, system, equipment and medium based on deep learning
CN111861949A (en) Multi-exposure image fusion method and system based on generation countermeasure network
CN114821434A (en) Space-time enhanced video anomaly detection method based on optical flow constraint
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN113034398A (en) Method and system for eliminating jelly effect in urban surveying and mapping based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant