CN110365966B - Video quality evaluation method and device based on window - Google Patents
- Publication number
- CN110365966B (application CN201910500485.3A)
- Authority
- CN
- China
- Prior art keywords
- window
- video frame
- video
- quality
- damaged
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N17/00—Diagnosis, testing or measuring for television systems or their details
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
Abstract
The embodiment of the invention relates to a window-based video quality evaluation method and device, wherein the method comprises the following steps: acquiring a damaged video and a reference video, the damaged video comprising a plurality of damaged video frames and the reference video comprising a plurality of reference video frames; determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network, based on each damaged video frame and the corresponding reference video frame; and determining a quality score for the damaged video based on the quality score of each damaged video frame. The windows of interest can be extracted more accurately, and the quality scores of the window images can be integrated into the quality score of the whole video.
Description
Technical Field
The embodiment of the invention relates to the technical field of video quality evaluation, in particular to a video quality evaluation method and device based on a window.
Background
With the rapid development of virtual reality technology, panoramic video has entered people's daily life as a new multimedia form. Typically, a viewer watches panoramic video through a head-mounted display (HMD), so only the video content within the HMD's window is visible. However, spherical video content requires extremely high resolution to be presented clearly, and transmitting high-resolution panoramic video over a bandwidth-limited channel requires video compression to reduce the encoded bit rate, which in turn reduces visual quality. Therefore, research on panoramic video quality evaluation is urgently needed to guide the panoramic video encoding process.
At present, deep-learning-based quality evaluation methods for planar video perform quality evaluation on cropped image blocks. Inspired by these methods, one panoramic video quality evaluation method estimates the quality score and weight of each image block with a convolutional neural network, and then obtains the overall quality score of the panoramic video as the weighted average of the quality scores of all image blocks. However, when viewing panoramic video, a viewer sees window content rather than image blocks. Window-based panoramic video quality evaluation is therefore more reasonable and better reflects the visual quality people perceive. However, no window-based video quality evaluation method has been proposed so far.
Disclosure of Invention
In order to solve at least one problem in the prior art, at least one embodiment of the present invention provides a method and an apparatus for evaluating video quality based on a window.
In a first aspect, an embodiment of the present invention provides a method for evaluating video quality based on a window, where the method includes:
acquiring a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
based on each damaged video frame and the corresponding reference video frame, determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network;
determining a quality score for the damaged video based on the quality score of each damaged video frame.
In some embodiments, determining the quality score for any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, obtaining, by the window extraction network, a plurality of candidate window positions and a weight corresponding to each of the candidate window positions based on the corrupted video frame comprises:
determining a temporal variation based on the corrupted video frame;
and obtaining a plurality of candidate window positions and the weight corresponding to each candidate window position through the window extraction network based on the damaged video frame and the time domain variation.
In some embodiments, determining a temporal variation based on the damaged video frame comprises:
a temporal change between the corrupted video frame and a previous nth corrupted video frame is calculated.
In some embodiments, obtaining the saliency image and the quality score of the extracted window by a window quality network based on the plurality of candidate window positions, the corrupted video frame, and the corresponding reference video frame comprises:
determining a spatial variation between the damaged video frame and the corresponding reference video frame based on the damaged video frame and the corresponding reference video frame;
obtaining an image of the extracted window and a corresponding spatial variation based on the plurality of candidate window positions, the spatial variation, the damaged video frame and the corresponding reference video frame;
and obtaining the significant image and the quality score of the extracted window through the window quality network based on the image of the extracted window and the corresponding airspace variable quantity.
In some embodiments, obtaining the image of the extracted window and the corresponding spatial variance based on the candidate window positions, the spatial variance, the damaged video frame and the corresponding reference video frame comprises:
extracting at least one window position from the candidate window positions based on the corresponding weight of each candidate window position, and outputting the extracted window position and the corresponding weight;
and aligning the extracted window position, the damaged video frame, the airspace variation and the reference video frame to obtain an image of the extracted window and a corresponding airspace variation.
In some embodiments, averaging the quality scores of all of the extracted windows to obtain the quality score of the corrupted video frame comprises:
and based on the weight corresponding to the position of the extracted window, carrying out weighted average on the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, averaging the quality scores of all of the extracted windows to obtain the quality score of the corrupted video frame comprises:
and arithmetically averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, determining the quality score of the damaged video based on the quality score of each damaged video frame comprises:
averaging the quality scores of all damaged video frames to obtain the quality score of the damaged video.
In a second aspect, an embodiment of the present invention further provides a video quality evaluation apparatus based on a window, including:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
a first determining unit, configured to determine the quality score of each damaged video frame based on each damaged video frame and the corresponding reference video frame, using a pre-trained window extraction network and a window quality network;
a second determining unit, configured to determine the quality score of the damaged video based on the quality score of each damaged video frame.
In the embodiment of the invention, the windows of interest are extracted more accurately by predicting viewers' head movement positions, which improves the accuracy of the window extraction network's results; in addition, a saliency prediction task is used to assist quality evaluation, further improving the accuracy of video quality evaluation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a flow chart of a window-based video quality assessment according to an embodiment of the present invention;
FIG. 2 is a block diagram of a window-based video quality assessment framework according to an embodiment of the present invention;
FIG. 3 is a first VP-net network structure provided by an embodiment of the present invention;
FIG. 4 is a second VQ-net network structure provided by an embodiment of the present invention;
FIG. 5 is a second VP-net network structure provided by an embodiment of the present invention;
FIG. 6 is a first VQ-net network structure provided by an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described in detail with reference to the accompanying drawings and examples. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention, are within the scope of the invention.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
As shown in fig. 1, the method disclosed in this embodiment may include the following steps 101 to 103:
101: a corrupted video and a reference video are acquired.
102: a quality score is determined for each corrupted video frame based on each corrupted video frame and the corresponding reference video frame.
103: based on the quality score of each corrupted video frame, a quality score of the corrupted video is determined.
In this embodiment of the present invention, in step 101, the damaged video includes a plurality of damaged video frames, and the reference video includes a plurality of reference video frames. From the damaged video frames and the reference video frames, the temporal variation between a damaged video frame and the n-th damaged video frame before it, and the spatial variation between the damaged video frame and the corresponding reference video frame, are calculated.
The temporal variation can be calculated by, among other methods, directly differencing the two frames: $F_t - F_{t-\Delta t}$ or $F_{t-\Delta t} - F_t$;
The spatial variation can be calculated by, among other methods, directly differencing the two frames: $F_t - F^{\mathrm{ref}}_t$ or $F^{\mathrm{ref}}_t - F_t$, where $F^{\mathrm{ref}}_t$ denotes the t-th reference video frame.
In the embodiment of the present invention, step 102 is divided into two stages: window extraction and quality evaluation. In the window extraction stage, the windows that draw more attention are extracted by predicting the head movement positions of people watching the panoramic video; in the quality evaluation stage, the quality scores of the extracted window images are predicted.
It can be understood that deep convolutional neural networks are applied to window extraction and quality evaluation in the embodiment of the present invention: the window extraction network (VP-net) and the window quality network (VQ-net), respectively.
The t-th frame $F_t$ of the damaged video and the temporal variation obtained in step 101 are input into VP-net, which outputs a series of candidate window positions $V = \{v_1, \ldots, v_I\}$ and the corresponding weights $W = \{w_1, \ldots, w_I\}$, where $I$ is the number of candidate windows and is determined by the VP-net structure.
Then, according to the obtained candidate window positions and corresponding weights, a plurality of windows are extracted from the candidate window positions, and the weights of the extracted windows are output simultaneously. The method provides a window-softening non-maximum suppression (NMS) method for extracting windows from the candidate window positions and corresponding weights. The specific implementation of the NMS may be, but is not limited to, the window-softening NMS proposed by the present method.
The extracted window positions are aligned to the original video frame, and the window content and the spatial variation within the window are obtained through the inverse gnomonic projection (also called sundial projection, spherical-center projection, or point-tangent projection). The method provides a calculation of the inverse gnomonic projection transform, which may be, but is not limited to, the calculation given herein.
The image of the extracted window and its spatial variation are input into VQ-net, which outputs the saliency image and quality score of the window.
There are two methods for computing the quality score of the damaged video frame, as follows:
The quality scores of all extracted windows can be weighted-averaged according to the weights corresponding to the extracted window positions to obtain the quality score of the damaged video frame; alternatively, the quality scores of all extracted windows can be arithmetically averaged to obtain the quality score of the damaged video frame.
The method comprises the following specific steps:
1. Input a damaged video and a reference video. Denote the total number of video frames by $T$, with the frames numbered $1, 2, \ldots, T$ in temporal order. Each time, the method takes the t-th damaged frame $F_t$, the $(t-\Delta t)$-th damaged frame $F_{t-\Delta t}$, and the t-th reference frame $F^{\mathrm{ref}}_t$ as input to the framework shown in fig. 1. Here $\Delta t$ is a parameter whose legal value range is an integer in $[1, T-1]$; the suggested value is 1. The value range of $t$ is an integer in $[\Delta t, T]$.
2. Preprocessing. The temporal variation is calculated from the t-th frame $F_t$ and the $(t-\Delta t)$-th frame $F_{t-\Delta t}$ of the damaged video, and the spatial variation is calculated from the t-th frame of the damaged video and the t-th frame of the reference video.
2.1 Calculate the temporal variation, including but not limited to the following methods: the two frames can be differenced directly, i.e., $F_t - F_{t-\Delta t}$ or $F_{t-\Delta t} - F_t$; the dense optical flow between the two frames can also be extracted, using methods including but not limited to the Farneback algorithm, the Horn-Schunck algorithm, FlowNet, and FlowNet 2.0.
2.2 Calculate the spatial variation, including but not limited to the following methods: the two frames can be differenced directly, i.e., $F_t - F^{\mathrm{ref}}_t$ or $F^{\mathrm{ref}}_t - F_t$; the structural similarity between the two frames can also be calculated.
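A minimal Python sketch of step 2 follows, assuming frames are given as numpy arrays; the direct-difference variants are shown, with OpenCV's Farneback flow as one of the optical-flow alternatives named above.

```python
import cv2
import numpy as np

def temporal_variation(frame_t, frame_t_minus_dt):
    # Direct frame difference F_t - F_{t-dt} (step 2.1).
    return frame_t.astype(np.float32) - frame_t_minus_dt.astype(np.float32)

def spatial_variation(damaged_t, reference_t):
    # Direct difference between the damaged and reference frames (step 2.2).
    return damaged_t.astype(np.float32) - reference_t.astype(np.float32)

def temporal_variation_flow(prev_gray, curr_gray):
    # Dense optical flow alternative (Farneback) for step 2.1.
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```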
3. The t-th frame $F_t$ of the damaged video and the temporal variation obtained in step 2 are input into the window extraction network (VP-net), which outputs a series of candidate window positions $V = \{v_1, \ldots, v_I\}$ and the corresponding weights $W = \{w_1, \ldots, w_I\}$, where $I$ is the number of candidate windows and is determined by the VP-net structure.
4. Non-maximum suppression (NMS). According to the candidate window positions and corresponding weights from step 3, $K$ windows are extracted from the candidate window positions, and the weights of the extracted windows are output simultaneously. The method provides a window-softening NMS method for extracting windows from the candidate window positions and corresponding weights; the specific implementation of the NMS may be, but is not limited to, the window-softening NMS proposed by the present method. The specific steps of the window-softening NMS method are as follows:
4.1 Input: the candidate window positions $V = \{v_1, \ldots, v_I\}$ and corresponding weights $W = \{w_1, \ldots, w_I\}$ obtained in step 3. Set a great-circle distance threshold $d_{th}$ and an extracted-window number threshold $K_{th}$, where $d_{th}$ has a legal value range of $(0, \pi]$ with suggested value $\pi/24$, and $K_{th}$ has a legal value range of $[1, I]$ with suggested value $\min\{20, I\}$.
4.4 The function $d(v', v'')$ is the great-circle distance between the input window positions $v'$ and $v''$. For the remaining candidate window positions in the set $V$, find the indices of the candidate window positions whose great-circle distance to $v_\iota$ is less than $d_{th}$, forming a set $I' \leftarrow \{\iota' \mid d(v_{\iota'}, v_\iota) < d_{th},\ v_{\iota'} \in V\}$.
The method provides a calculation of the great-circle distance; the calculation may be, but is not limited to, the one given herein.
4.4.1 Two spherical positions $v' = (\phi', \theta')$ and $v'' = (\phi'', \theta'')$ are given, where $\phi'$ and $\phi''$ denote longitude and $\theta'$ and $\theta''$ denote latitude, in rad.
4.4.2 The great-circle distance between the two spherical positions is

$$d(v', v'') = \arccos\left(\sin\theta'\sin\theta'' + \cos\theta'\cos\theta''\cos(\phi' - \phi'')\right).$$
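The great-circle distance of 4.4.2 can be implemented directly; a small sketch, with a clip added only to guard arccos against floating-point round-off:

```python
import numpy as np

def great_circle_distance(v1, v2):
    """d(v', v'') of step 4.4.2; v = (longitude, latitude) in rad."""
    lon1, lat1 = v1
    lon2, lat2 = v2
    c = (np.sin(lat1) * np.sin(lat2)
         + np.cos(lat1) * np.cos(lat2) * np.cos(lon1 - lon2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))
```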
4.9 update variable k ← k + 1.
4.11 Output the extracted window set $V_p$ and its weight set $W_p$.
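Since steps 4.2-4.3 and 4.5-4.10 survive only partially in the source text, the following is a sketch of a conventional greedy NMS consistent with the steps that do survive (keep the highest-weight candidate, drop neighbors within $d_{th}$, stop at $K_{th}$); it uses the `great_circle_distance` function above and is an assumed reconstruction, not the verbatim window-softening NMS.

```python
def window_nms(positions, weights, d_th=np.pi / 24, k_th=20):
    """Greedy window NMS sketch over spherical positions (longitude, latitude)."""
    candidates = sorted(zip(positions, weights), key=lambda c: c[1], reverse=True)
    kept_pos, kept_w = [], []
    while candidates and len(kept_pos) < k_th:
        v, w = candidates.pop(0)                     # highest-weight candidate
        kept_pos.append(v)
        kept_w.append(w)
        candidates = [(v2, w2) for v2, w2 in candidates
                      if great_circle_distance(v2, v) >= d_th]  # suppress neighbors
    return kept_pos, kept_w                          # V_p and W_p
```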
5. Window alignment. The window positions extracted in step 4 are aligned to the original video frame, and the window content and the spatial variation within each window are obtained through the inverse gnomonic projection (also called sundial projection, spherical-center projection, or point-tangent projection).
The method provides a calculation of the inverse gnomonic projection transform; the calculation may be, but is not limited to, the one given herein.
5.1 Input: the position of the k-th extracted window $v^p_k = (\phi^p_k, \theta^p_k)$, where $\phi^p_k$ and $\theta^p_k$ are respectively the longitude and latitude of the extracted window position, in rad; the t-th frame $F_t$ of the damaged video, with width $W_F$ and height $H_F$; and the spatial variation obtained in step 2, whose resolution is the same as that of $F_t$.
5.2 Initialize the window image $C_k$ with width $W$ and height $H$. The legal value range of $W$ and $H$ is any positive integer; the suggested values are $W = 540$ and $H = 600$.
5.3 For each pixel position $(x, y)$ of $C_k$, calculate the intermediate coordinates $(f_x, f_y)$, which have the same spatial scale as the unit sphere, where $a_W$ and $a_H$ are the angular ranges of the viewport corresponding to $W$ and $H$, related to the viewport image resolution and the physical size of the HMD. Corresponding to the suggested window resolution in 5.2 and the prevailing conditions of HMDs on the market, the suggested values are $a_W = 71\pi/180$ rad and $a_H = 74\pi/180$ rad.
5.4 From $(f_x, f_y)$, the spherical position $(\phi_{x,y}, \theta_{x,y})$ corresponding to pixel position $(x, y)$ can be obtained, in rad.
5.5 Map the spherical position $(\phi_{x,y}, \theta_{x,y})$ to the pixel coordinates $(p_{x,y}, q_{x,y})$ of $F_t$; the mapping relationship depends on the mapping format used by the input video.
The method provides a calculation corresponding to the equirectangular projection (ERP); when the input video uses ERP, the calculation may be, but is not limited to, the one provided herein. Likewise, the input video format may be, but is not limited to, ERP; the calculations for other mapping formats can be derived analogously.
If the input video uses ERP, then $(p_{x,y}, q_{x,y})$ can be obtained from the spherical position $(\phi_{x,y}, \theta_{x,y})$, in rad.
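The formulas for steps 5.3 to 5.5 did not survive in the source text. Under the standard inverse gnomonic projection with the window centered at $(\phi^p_k, \theta^p_k)$ and the usual ERP conventions, they would commonly read as follows; this is an assumed reconstruction, not the patent's verbatim equations:

$$f_x = \left(\frac{2x}{W} - 1\right)\tan\frac{a_W}{2}, \qquad f_y = \left(\frac{2y}{H} - 1\right)\tan\frac{a_H}{2},$$

$$\rho = \sqrt{f_x^2 + f_y^2}, \qquad c = \arctan\rho,$$

$$\theta_{x,y} = \arcsin\!\left(\cos c\,\sin\theta^p_k + \frac{f_y \sin c\,\cos\theta^p_k}{\rho}\right), \qquad \phi_{x,y} = \phi^p_k + \arctan\!\left(\frac{f_x \sin c}{\rho\cos\theta^p_k\cos c - f_y\sin\theta^p_k\sin c}\right),$$

$$p_{x,y} = \left(\frac{\phi_{x,y}}{2\pi} + \frac{1}{2}\right) W_F, \qquad q_{x,y} = \left(\frac{1}{2} - \frac{\theta_{x,y}}{\pi}\right) H_F.$$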
5.6 Set $C_k(x, y) = \psi(F_t; (p_{x,y}, q_{x,y}))$, where $\psi(\cdot)$ is an interpolation function, including but not limited to nearest-neighbor interpolation, bilinear interpolation, and spline interpolation.
5.7 Repeat steps 5.3 to 5.6 for all integer coordinate positions $(x, y)$, $x \in [1, W]$, $y \in [1, H]$.
5.8 Initialize the window spatial-variation image $D_k$, whose resolution is the same as that of $C_k$. Replace $C_k$ with $D_k$ and $F_t$ with the spatial variation obtained in step 2, and repeat steps 5.3 to 5.7.
Repeat steps 5.1 to 5.8 for all $k = 1, \ldots, K$ to obtain the images $C_k$, $k = 1, \ldots, K$, of all extracted windows and their spatial-variation images $D_k$.
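A compact Python sketch of steps 5.1-5.7 under the reconstruction above, assuming an ERP input frame and using nearest-neighbor sampling for $\psi(\cdot)$:

```python
import numpy as np

def extract_window(frame, lon_c, lat_c, W=540, H=600,
                   a_w=71 * np.pi / 180, a_h=74 * np.pi / 180):
    """Sample a W x H viewport image from an ERP frame by inverse gnomonic
    projection, centered at (lon_c, lat_c) in rad."""
    H_F, W_F = frame.shape[:2]
    fx = (2 * np.arange(1, W + 1) / W - 1) * np.tan(a_w / 2)
    fy = (2 * np.arange(1, H + 1) / H - 1) * np.tan(a_h / 2)
    fx, fy = np.meshgrid(fx, fy)                    # (H, W) coordinate grids
    rho = np.sqrt(fx ** 2 + fy ** 2)
    c = np.arctan(rho)
    rho = np.where(rho == 0, 1e-12, rho)            # guard the division below
    lat = np.arcsin(np.cos(c) * np.sin(lat_c)
                    + fy * np.sin(c) * np.cos(lat_c) / rho)
    lon = lon_c + np.arctan2(
        fx * np.sin(c),
        rho * np.cos(lat_c) * np.cos(c) - fy * np.sin(lat_c) * np.sin(c))
    # ERP mapping (step 5.5), then nearest-neighbor lookup (step 5.6).
    p = ((lon / (2 * np.pi) + 0.5) % 1.0) * (W_F - 1)
    q = np.clip((0.5 - lat / np.pi) * (H_F - 1), 0, H_F - 1)
    return frame[np.round(q).astype(int), np.round(p).astype(int)]
```

The same routine applied to the spatial variation in place of $F_t$ yields the window spatial-variation image of step 5.8.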
6. For all $k = 1, \ldots, K$, input in turn the extracted window image $C_k$ and its spatial-variation image $D_k$ into the window quality network (VQ-net), which outputs the saliency image $\hat{M}_k$ and the quality score $s_k$ of the k-th window.
7. Average the quality scores $s_k$, $k = 1, \ldots, K$, of all extracted windows to obtain the quality score $s_t$ of the input damaged video frame $F_t$. The method provides two averaging calculations; the calculation of $s_t$ may be, but is not limited to, the ones given herein.
7.1 The quality scores of all extracted windows can be arithmetically averaged: $s_t = \frac{1}{K}\sum_{k=1}^{K} s_k$.
7.2 Given the quality scores $s_k$, $k = 1, \ldots, K$, of all extracted windows and the weights $w^p_k$ of the extracted windows, $s_t$ may be a weighted average of the window quality scores: $s_t = \sum_{k=1}^{K} w^p_k s_k \,/\, \sum_{k=1}^{K} w^p_k$.
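Both averaging variants of step 7 in a small Python sketch:

```python
import numpy as np

def frame_quality_score(scores, weights=None):
    """Step 7: arithmetic mean (7.1) or weighted mean (7.2) of window scores."""
    scores = np.asarray(scores, dtype=np.float64)
    if weights is None:
        return float(scores.mean())                          # 7.1
    weights = np.asarray(weights, dtype=np.float64)
    return float((weights * scores).sum() / weights.sum())   # 7.2
```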
the quality scores of all the marred video frames are then averaged to obtain the quality score of the marred video.
8. For all integer $t \in [\Delta t, T]$, repeat steps 2 to 7 to obtain the quality score $s_t$ of each damaged video frame $F_t$.
9. Calculate the arithmetic mean of the quality scores obtained in step 8 to obtain the quality score of the input damaged video: $s = \frac{1}{T - \Delta t + 1} \sum_{t=\Delta t}^{T} s_t$.
The embodiment discloses an apparatus, which may include the following units, which are specifically described as follows:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
the first determining unit is used for determining the quality score of each damaged video frame based on each damaged video frame and a corresponding reference video frame and a pre-trained window extraction network and a window quality network;
a second determining unit for determining a quality score of the marred video based on the quality score of each marred video frame.
First: embodiments of the present invention use VP-net to output a series of candidate window positions and corresponding weights; two implementations of VP-net are provided herein. The first VP-net is described in detail as follows:
1. VP-net may be a deep convolutional neural network as described below
1.1 network architecture
The network input is the damaged video frame $F_t$ and the temporal variation obtained in step 2; the output is a series of window position offsets $\Delta v_i$ and their weights $\hat{w}_i$, $i = 1, \ldots, I$. The network topology is shown in fig. 3. A description of the different components in the network architecture is given below.
1.1.1 Resampling. Given a series of predefined spherical positions, the corresponding pixel coordinates are obtained according to the mapping format used by the input video, and the corresponding pixel values are obtained after interpolation. The predefined spherical sampling schemes used include but are not limited to SOFT, Clenshaw-Curtis, and Gauss-Legendre.
1.1.2 Downsampling. For each tensor in fig. 3 that needs to be downsampled, its downsampled size is the same as the size of the tensor it is concatenated with after downsampling. Downsampling uses gradient-conducting interpolation, including but not limited to bilinear/trilinear interpolation.
1.1.3 Converting the SO(3) tensor to the S2 tensor. Let $T_{SO(3)}$ denote a tensor defined on SO(3), whose coordinates are represented by $(\alpha, \beta, \gamma)$, and let $T_{S^2}$ denote a tensor defined on $S^2$, whose coordinates are represented by $(\alpha, \beta)$. The method provides two ways of converting the SO(3) tensor to the S2 tensor; the conversion may be, but is not limited to, the ways provided herein.
1.1.3.1 The SO(3) tensor can be converted to the S2 tensor by averaging over the γ dimension:

$$T_{S^2}(\alpha, \beta) = \frac{1}{N_\gamma} \sum_{\gamma} T_{SO(3)}(\alpha, \beta, \gamma),$$

where $N_\gamma$ is the number of samples in the γ dimension.
the 1.1.3.2SO (3) tensor can be converted to the S2 tensor by maximizing in the γ dimension:
1.1.4 The softmax function is computed over the width and height dimensions of the input tensor.
1.1.5 the specific configuration of each layer of the network containing learnable parameters is shown in Table 1.
Table 1 corresponds to the specific configuration of the layers of the network of fig. 3 containing learnable parameters
1.2 Objective function. The objective function used in training VP-net is defined item by item as follows.
1.2.1 Spherical anchor definition. The feature vector at each pixel position of the S2 tensor converted from the SO(3) tensor output by SO3Conv10 corresponds to a specific coordinate position on the sphere, and the correspondence is the same as in ERP. These coordinate positions can be calculated from the inverse of the mapping transformation; they are defined as spherical anchors, denoted $v^a = (\phi^a, \theta^a)$, where $\phi^a$ and $\theta^a$ are respectively longitude and latitude.
If the number of pixel positions in this tensor is $I$, then the total number of spherical anchors is $I$. As described in step 3, VP-net outputs $I$ candidate window positions, which correspond one-to-one to the $I$ spherical anchors; the anchors do not change as the input changes and can be regarded as a constant attribute of the network.
1.2.2 Window weight objective function. Given $J$ experimenters viewing the input damaged video, with head-movement position truth values $v^{HM}_{t,j}$, $j = 1, \ldots, J$, for the t-th frame, the weight truth value $w_i$ for the spherical anchor $v^a_i$ is defined as follows:
where $d(\cdot, \cdot)$ and $v^a_i$ are as defined above, and σ is a parameter with legal value range $(0, \infty)$; the suggested value is $18.33\pi/180$.
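The truth-value formula itself did not survive in the source text. Given the named ingredients (the great-circle distance $d$, the bandwidth parameter σ, and $J$ viewers), a plausible reconstruction is a normalized Gaussian kernel over anchor-to-head-position distances; this is an assumption, not the patent's verbatim definition:

$$w_i \propto \sum_{j=1}^{J} \exp\!\left(-\frac{d\!\left(v^a_i,\, v^{HM}_{t,j}\right)^2}{2\sigma^2}\right), \qquad \sum_{i=1}^{I} w_i = 1.$$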
the objective function of window weight is defined as window weight true valueAnd window weights for network predictionThe relative entropy between distributions, also known as Kullback-L eibler (K L) divergence, is defined as follows:
1.2.3 Window position offset objective function. For the spherical anchor $v^a_i$, define the corresponding window position offset truth value $\Delta v^{gt}_i$ as the offset from the anchor to the head-movement position truth value nearest to it:

$$\Delta v^{gt}_i = v^{HM}_{t,j^*} - v^a_i, \qquad j^* = \arg\min_{j} d\!\left(v^a_i, v^{HM}_{t,j}\right).$$
the window position shift objective function is defined as each predicted window position shift Δ νiAnd its true value ΔSmoothing betweenDistance, is recorded asThe objective function is defined as follows:
In summary, the objective function in training VP-net is defined as follows:

$$L_{VP} = \lambda_w L_w + \lambda_v L_v,$$

where $\lambda_w$ and $\lambda_v$ are parameters with positive legal value range; the suggested values are $\lambda_w = 1$ and $\lambda_v = 5$.
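A numpy sketch of this VP-net objective under the reconstructions of 1.2.2 and 1.2.3 (KL divergence on window weights plus elementwise smooth-$\ell_1$ on offsets); the epsilon terms are numerical guards, not part of the definition:

```python
import numpy as np

def vp_net_loss(w_true, w_pred, dv_pred, dv_true, lam_w=1.0, lam_v=5.0):
    """L_VP = lam_w * KL(w_true || w_pred) + lam_v * smooth_l1(dv_pred - dv_true)."""
    eps = 1e-12
    l_w = np.sum(w_true * np.log((w_true + eps) / (w_pred + eps)))  # KL divergence
    diff = np.abs(dv_pred - dv_true)
    l_v = np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()   # smooth l1
    return lam_w * l_w + lam_v * l_v
```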
2. VP-net may be a deep convolutional neural network as described below
2.1 Network architecture. The network input is the damaged video frame $F_t$ and the temporal variation obtained in step 2; the output is the network-predicted probability $\hat{p}$ that the input video frame contains shot motion, a series of window position offsets $\Delta v_i$, and their weights $\hat{w}_i$. The network topology is shown in fig. 5. A description of the different components in the network structure is given below.
2.1.1 resampling. As described above.
2.1.2 Converting the SO(3) tensor to the S2 tensor. As described above.
2.1.3 Softmax function. As described above.
2.1.4 where the first quantity is the side length of the tensor $T_s$, taking the value implied by the suggested configuration in 1.1.1 above, and the second is a parameter with legal value range $(0, \infty)$.
2.1.5 the specific configuration of each layer in the network containing learnable parameters is shown in table 2.
Table 2 corresponds to the specific configuration of the network of fig. 5 with layers having learnable parameters
2.2 Objective function. The objective function used in training this VP-net is defined item by item as follows.
2.2.1 spherical anchor definition. As described above.
2.2.2 window weight objective function. As described above.
2.2.3 Window position offset objective function. As described above.
2.2.4 Shot motion detection objective function. The true label of whether the input video frame contains shot motion is $l$ ($l = 1$ when the input video frame contains shot motion, otherwise $l = 0$), and the shot motion detection objective function is defined as the binary cross-entropy:

$$L_l = -\left[\, l \log \hat{p} + (1 - l)\log(1 - \hat{p}) \,\right],$$

where $\hat{p}$ is the probability predicted by the network.
2.2.5 In summary, the objective function when training this VP-net is defined as the weighted sum of the shot motion detection, window weight, and window position offset objective functions, where $\lambda_w$ and $\lambda_v$ are parameters with positive legal value range; the suggested values are $\lambda_w = 1$ and $\lambda_v = 5$.
Second: embodiments of the present invention use VQ-net to output the saliency maps and quality scores of multiple windows; two implementations of VQ-net are provided herein. The first VQ-net is described in detail as follows:
1. VQ-net may be a deep convolutional neural network as described below
1.1 network architecture
The network input is the window image $C_k$ and its spatial-variation image $D_k$; the output is the saliency map $\hat{M}_k$ of the input window and its quality score $s_k$. The network topology is shown in fig. 6. A description of the different components in the network structure is given below.
1.1.1 The softmax function is computed over the width and height dimensions of the input tensor.
1.1.2 Upsampling. For each tensor in fig. 6 that needs to be upsampled, its upsampled size is the same as the size of the tensor it is multiplied with after upsampling. Upsampling uses gradient-conducting interpolation, including but not limited to bilinear interpolation.
1.1.3 the specific configuration of the convolutional and pooling layers in the network is shown in Table 3.
Table 3 corresponds to the specific configuration of the network convolution layer and the pooling layer of FIG. 6
1.1.4 Densely connected block (DenseBlock), configured as defined in DenseNet, as shown in Table 4.
Table 4 corresponds to the specific configuration of the tight connection blocks of fig. 6
1.2 Objective function. The objective function used in training VQ-net is defined item by item as follows.
1.2.1 Saliency prediction objective function. Given experimenters watching the input damaged video, the eye-movement saliency map truth value within the range of window $C_k$ is $M_k$. Regarding the saliency map as a probability distribution, the saliency prediction objective function is defined as the relative entropy between the network-predicted window saliency map $\hat{M}_k$ and its truth value $M_k$:

$$L_M = \sum_{x, y} M_k(x, y) \log \frac{M_k(x, y)}{\hat{M}_k(x, y)}.$$
1.2.2 Quality score objective function. Given the subjective quality score $s$ of the video corresponding to the input window, the quality score objective function is defined as the squared error between the network-predicted window quality score $s_k$ and $s$:

$$L_s = (s_k - s)^2.$$
In summary, the objective function when training VQ-net is defined as follows:

$$L_{VQ} = \lambda_M L_M + \lambda_s L_s,$$

where $\lambda_M$ and $\lambda_s$ are parameters with positive legal value range; the suggested values are $\lambda_M = 10$ and $\lambda_s = 1 \times 10^3$.
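A numpy sketch of the VQ-net objective (KL divergence between saliency maps treated as distributions, plus squared score error); the normalizations and epsilon guards are implementation details assumed here:

```python
import numpy as np

def vq_net_loss(m_true, m_pred, s_pred, s_true, lam_m=10.0, lam_s=1e3):
    """L_VQ = lam_m * KL(M_k || M_k_pred) + lam_s * (s_k - s)^2."""
    eps = 1e-12
    p = m_true / (m_true.sum() + eps)    # truth saliency map as a distribution
    q = m_pred / (m_pred.sum() + eps)    # predicted saliency map as a distribution
    l_m = np.sum(p * np.log((p + eps) / (q + eps)))
    l_s = (s_pred - s_true) ** 2
    return lam_m * l_m + lam_s * l_s
```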
2. VQ-net may be a deep convolutional neural network as described below.
2.1 Network architecture. The network input is the window image $C_k$ and its spatial-variation image $D_k$; the output is the saliency map $\hat{M}_k$ of the input window and its quality score $s_k$. The network topology is shown in fig. 4. A description of the different components in the network structure is given below.
2.1.1 Softmax function. As described above.
2.1.2 downsampling. For the tensor requiring downsampling in fig. 4, the downsampled size is the same as the size of the tensor to be connected after downsampling. The downsampling uses gradient-guided interpolation, including but not limited to bilinear interpolation.
2.1.3 Densely connected blocks. As described above.
2.1.4 the specific configuration of the convolutional and pooling layers in the network is shown in Table 5.
Table 5 corresponds to the specific configuration of the network convolution layer and the pooling layer of FIG. 4
2.2 Objective function. The objective function used in training this VQ-net is defined item by item as follows.
2.2.1 Saliency prediction objective function. As described above.
2.2.2 Quality score objective function. As described above.
In summary, the objective function when training this VQ-net is defined as follows:

$$L_{VQ} = \lambda_M L_M + \lambda_s L_s,$$

where $\lambda_M$ and $\lambda_s$ are parameters with positive legal value range; the suggested values are $\lambda_M = 10$ and $\lambda_s = 1 \times 10^4$.
The apparatus disclosed in the above embodiments can implement the processes of the methods disclosed in the above method embodiments, and in order to avoid repetition, the details are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.
Claims (9)
1. A method for evaluating video quality based on windows is characterized by comprising the following steps:
acquiring a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
based on each damaged video frame and the corresponding reference video frame, determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network;
determining a quality score for the damaged video based on the quality score of each damaged video frame;
wherein determining the quality score of any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
2. The method of claim 1, wherein obtaining a plurality of candidate window positions and a weight corresponding to each of the candidate window positions via the window extraction network based on the corrupted video frame comprises:
determining a temporal variation based on the corrupted video frame;
and obtaining a plurality of candidate window positions and the weight corresponding to each candidate window position through the window extraction network based on the damaged video frame and the time domain variation.
3. The method of claim 2, wherein determining the temporal variation based on the corrupted video frame comprises:
a temporal change between the corrupted video frame and a previous nth corrupted video frame is calculated.
4. The method of claim 1, wherein obtaining the saliency image and the quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame comprises:
determining a spatial variation between the damaged video frame and the corresponding reference video frame based on the damaged video frame and the corresponding reference video frame;
obtaining an image of the extracted window and a corresponding spatial variation based on the plurality of candidate window positions, the spatial variation, the damaged video frame and the corresponding reference video frame;
and obtaining the significant image and the quality score of the extracted window through the window quality network based on the image of the extracted window and the corresponding airspace variable quantity.
5. The method of claim 4, wherein obtaining the image of the extracted window and the corresponding spatial variance based on the candidate window positions, the spatial variance, the corrupted video frame and the corresponding reference video frame comprises:
extracting at least one window position from the candidate window positions based on the corresponding weight of each candidate window position, and outputting the extracted window position and the corresponding weight;
and aligning the extracted window position, the damaged video frame, the airspace variation and the reference video frame to obtain an image of the extracted window and a corresponding airspace variation.
6. The method of claim 1, wherein averaging the quality scores of all extracted windows to obtain the quality score of the corrupted video frame comprises:
and based on the weight corresponding to the position of the extracted window, carrying out weighted average on the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
7. The method of claim 1, wherein averaging the quality scores of all extracted windows to obtain the quality score of the corrupted video frame comprises:
and arithmetically averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
8. The method of claim 1, wherein determining the quality score of the damaged video based on the quality score of each damaged video frame comprises:
averaging the quality scores of all damaged video frames to obtain the quality score of the damaged video.
9. A window-based video quality evaluation apparatus, comprising:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
the first determining unit is used for determining the quality score of each damaged video frame based on each damaged video frame and a corresponding reference video frame and a pre-trained window extraction network and a window quality network;
a second determining unit for determining a quality score of the damaged video based on the quality score of each damaged video frame;
wherein determining the quality score of any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910500485.3A CN110365966B (en) | 2019-06-11 | 2019-06-11 | Video quality evaluation method and device based on window |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110365966A CN110365966A (en) | 2019-10-22 |
CN110365966B true CN110365966B (en) | 2020-07-28 |
Family
ID=68216886
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910500485.3A Active CN110365966B (en) | 2019-06-11 | 2019-06-11 | Video quality evaluation method and device based on window |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110365966B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112163990B (en) | 2020-09-08 | 2022-10-25 | 上海交通大学 | Significance prediction method and system for 360-degree image |
US20220415037A1 (en) * | 2021-06-24 | 2022-12-29 | Meta Platforms, Inc. | Video corruption detection |
CN115953727B (en) * | 2023-03-15 | 2023-06-09 | 浙江天行健水务有限公司 | Method, system, electronic equipment and medium for detecting floc sedimentation rate |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103338379B (en) * | 2013-06-05 | 2015-04-29 | 宁波大学 | Stereoscopic video objective quality evaluation method based on machine learning |
CN104506852B (en) * | 2014-12-25 | 2016-08-24 | 北京航空航天大学 | A kind of objective quality assessment method towards video conference coding |
CN106412571B (en) * | 2016-10-12 | 2018-06-19 | 天津大学 | A kind of method for evaluating video quality based on gradient similarity standard difference |
CN108337504A (en) * | 2018-01-30 | 2018-07-27 | 中国科学技术大学 | A kind of method and device of evaluation video quality |
CN108449595A (en) * | 2018-03-22 | 2018-08-24 | 天津大学 | Virtual reality method for evaluating video quality is referred to entirely based on convolutional neural networks |
CN108900864B (en) * | 2018-07-23 | 2019-12-10 | 西安电子科技大学 | full-reference video quality evaluation method based on motion trail |
- 2019-06-11: application CN201910500485.3A filed (CN); patent CN110365966B, status Active
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||