CN110365966B - Video quality evaluation method and device based on window - Google Patents

Video quality evaluation method and device based on window

Info

Publication number
CN110365966B
CN110365966B (application number CN201910500485.3A)
Authority
CN
China
Prior art keywords
window
video frame
video
quality
damaged
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910500485.3A
Other languages
Chinese (zh)
Other versions
CN110365966A (en)
Inventor
李辰
徐迈
蒋铼
张善翌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201910500485.3A priority Critical patent/CN110365966B/en
Publication of CN110365966A publication Critical patent/CN110365966A/en
Application granted granted Critical
Publication of CN110365966B publication Critical patent/CN110365966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N17/00Diagnosis, testing or measuring for television systems or their details
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to a window-based video quality evaluation method and device. The window-based video quality evaluation method comprises the following steps: acquiring a damaged video and a reference video, wherein the damaged video comprises a plurality of damaged video frames and the reference video comprises a plurality of reference video frames; determining the quality score of each damaged video frame, based on each damaged video frame and the corresponding reference video frame, by using a pre-trained window extraction network and a window quality network; and determining a quality score for the damaged video based on the quality score of each damaged video frame. The windows of interest can be extracted more accurately, and in the embodiment of the invention the quality scores of the window images can be aggregated into the quality score of the whole video.

Description

Video quality evaluation method and device based on window
Technical Field
The embodiment of the invention relates to the technical field of video quality evaluation, in particular to a video quality evaluation method and device based on a window.
Background
With the rapid development of virtual reality technology, panoramic video has entered people's daily life as a new multimedia form. Typically, a viewer watches panoramic video through a head-mounted display (HMD), so only the video content within the window (viewport) of the HMD is visible. However, spherical video content requires extremely high resolution to be presented clearly, and in order to transmit high-resolution panoramic video over a channel with limited bandwidth, video compression is required to reduce the bit rate, which in turn degrades visual quality. Therefore, research on panoramic video quality evaluation is urgently needed to guide the panoramic video encoding process.
At present, deep-learning-based quality evaluation methods for planar video perform quality evaluation on cropped image blocks. Inspired by these methods, one panoramic video quality evaluation method estimates the quality score and weight of each image block with a convolutional neural network, and then obtains the overall quality score of the panoramic video as the weighted average of the quality scores of all the image blocks. However, when viewing panoramic video, a viewer sees the content of a window rather than image blocks. Therefore, window-based panoramic video quality evaluation is more reasonable and better reflects the visual quality perceived by people. However, no window-based video quality evaluation method has been proposed so far.
Disclosure of Invention
In order to solve at least one problem in the prior art, at least one embodiment of the present invention provides a method and an apparatus for evaluating video quality based on a window.
In a first aspect, an embodiment of the present invention provides a method for evaluating video quality based on a window, where the method includes:
acquiring a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
based on each damaged video frame and the corresponding reference video frame, determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network;
determining a quality score for the marred video based on the quality score for each marred video frame.
In some embodiments, determining the quality score for any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, obtaining, by the window extraction network, a plurality of candidate window positions and a weight corresponding to each of the candidate window positions based on the corrupted video frame comprises:
determining a temporal variation based on the corrupted video frame;
and obtaining a plurality of candidate window positions and the weight corresponding to each candidate window position through the window extraction network based on the damaged video frame and the time domain variation.
In some embodiments, determining a temporal variation based on the marred video frame comprises:
a temporal change between the corrupted video frame and a previous nth corrupted video frame is calculated.
In some embodiments, obtaining the saliency image and the quality score of the extracted window by a window quality network based on the plurality of candidate window positions, the corrupted video frame, and the corresponding reference video frame comprises:
determining a spatial variation between the damaged video frame and the corresponding reference video frame based on the damaged video frame and the corresponding reference video frame;
obtaining an image of the extracted window and a corresponding spatial variation based on the plurality of candidate window positions, the spatial variation, the damaged video frame and the corresponding reference video frame;
and obtaining the significant image and the quality score of the extracted window through the window quality network based on the image of the extracted window and the corresponding airspace variable quantity.
In some embodiments, obtaining the image of the extracted window and the corresponding spatial variance based on the candidate window positions, the spatial variance, the damaged video frame and the corresponding reference video frame comprises:
extracting at least one window position from the candidate window positions based on the corresponding weight of each candidate window position, and outputting the extracted window position and the corresponding weight;
and aligning the extracted window position, the damaged video frame, the airspace variation and the reference video frame to obtain an image of the extracted window and a corresponding airspace variation.
In some embodiments, averaging the quality scores of all of the extracted windows to obtain the quality score of the corrupted video frame comprises:
and based on the weight corresponding to the position of the extracted window, carrying out weighted average on the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, averaging the quality scores of all of the extracted windows to obtain the quality score of the corrupted video frame comprises:
and arithmetically averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
In some embodiments, determining the quality score of the marred video based on the quality score of each marred video frame comprises:
and averaging the quality scores of all damaged video frames to obtain the quality score of the damaged video.
In a second aspect, an embodiment of the present invention further provides a video quality evaluation apparatus based on a window, including:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
a first determining unit, configured to determine the quality score of each damaged video frame based on each damaged video frame and the corresponding reference video frame by using a pre-trained window extraction network and a window quality network; and
a second determining unit, configured to determine the quality score of the damaged video based on the quality score of each damaged video frame.
In the embodiment of the invention, the windows of interest are extracted more accurately by predicting the viewer's head-movement positions, which improves the accuracy of the window extraction network; in addition, the saliency prediction task is used to assist quality evaluation, which further improves the accuracy of the video quality evaluation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flow chart of a window-based video quality assessment according to an embodiment of the present invention;
FIG. 2 is a block diagram of a window-based video quality assessment framework according to an embodiment of the present invention;
FIG. 3 is a VP-net network architecture provided by embodiments of the present invention;
FIG. 4 is a VQ-net network architecture provided by embodiments of the present invention;
FIG. 5 is a VP-net network architecture provided by embodiments of the present invention;
fig. 6 is a VQ-net network structure according to an embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, the present invention will be further described in detail with reference to the accompanying drawings and examples. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. The specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention, are within the scope of the invention.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
As shown in fig. 1, the method disclosed in this embodiment may include the following steps 101 to 103:
101: a corrupted video and a reference video are acquired.
102: a quality score is determined for each corrupted video frame based on each corrupted video frame and the corresponding reference video frame.
103: based on the quality score of each corrupted video frame, a quality score of the corrupted video is determined.
In the embodiment of the present invention, in step 101, the damaged video includes a plurality of damaged video frames and the reference video includes a plurality of reference video frames. From the damaged video frames and the reference video frames, a temporal variation between each damaged video frame and the nth damaged video frame preceding it, and a spatial variation between each damaged video frame and its corresponding reference video frame, are calculated.
The temporal variation can be calculated, for example, by directly taking the difference between the two frame images: F_t − F_{t−Δt} or F_{t−Δt} − F_t.
The spatial variation can be calculated, for example, by directly taking the difference between the damaged video frame and the corresponding reference video frame F_t^ref: F_t − F_t^ref or F_t^ref − F_t.
In the embodiment of the present invention, step 102 is mainly divided into two stages: window extraction and quality evaluation. In the window extraction stage, the windows that receive more attention are extracted by predicting the head-movement positions of a person watching the panoramic video; in the quality evaluation stage, the quality score of each extracted window image is predicted.
It can be understood that, in the embodiment of the present invention, deep convolutional neural networks are applied to window extraction and quality evaluation, namely a viewport proposal network (VP-net) for window extraction and a viewport quality network (VQ-net) for quality evaluation.
The t-th frame of the damaged video, F_t, and the temporal variation obtained in step 101 are input into VP-net, which outputs a series of candidate window positions V = {v_1, …, v_I} and the corresponding weights W = {w_1, …, w_I}, where I is the number of candidate windows and is determined by the VP-net structure.
Then, according to the obtained candidate window positions and corresponding weights, a plurality of windows are extracted from the candidate window positions, and the weights of the extracted windows are output at the same time. The method provides a window-softening non-maximum suppression (NMS) procedure for extracting windows according to the candidate window positions and corresponding weights. The specific implementation of the NMS may be, but is not limited to, the window-softening NMS proposed by this method.
The extracted window positions are aligned to the original video frame, and the content of each window and the spatial variation within the window are obtained through the inverse gnomonic projection (also called sundial projection, central projection or point-tangent projection). The method provides a calculation of the inverse gnomonic projection transform, which may be, but is not limited to, the calculation given herein.
The images of the extracted windows and their spatial variations are input into VQ-net, which outputs the saliency image and the quality score of each window.
There are two methods for computing the quality score of a damaged video frame: the quality scores of all the extracted windows can be weighted-averaged according to the weights corresponding to the extracted window positions, or the quality scores of all the extracted windows can be arithmetically averaged.
The method comprises the following specific steps:
1. Input a damaged video and a reference video, denote the total number of video frames by T, and number the video frames 1, 2, …, T in temporal order. Each time, the method inputs the t-th frame of the damaged video F_t, the (t−Δt)-th frame F_{t−Δt}, and the t-th frame of the reference video F_t^ref into the framework shown in fig. 1. Here Δt is a parameter whose legal value range is an integer in [1, T−1], with a suggested value of 1; t ranges over the integers in [Δt, T].
2. Preprocessing. The temporal variation is calculated from the t-th frame F_t and the (t−Δt)-th frame F_{t−Δt} of the damaged video, and the spatial variation is calculated from the t-th frame of the damaged video and the t-th frame of the reference video.
2.1 The temporal variation can be calculated by, but is not limited to, the following methods: the two frame images can be directly subtracted, i.e. F_t − F_{t−Δt} or F_{t−Δt} − F_t; alternatively, the dense optical flow between the two frames can be extracted, with extraction methods including but not limited to the Farneback algorithm, the Horn-Schunck algorithm, FlowNet and FlowNet 2.0.
2.2 The spatial variation can be calculated by, but is not limited to, the following methods: the two frame images can be directly subtracted, i.e. F_t − F_t^ref or F_t^ref − F_t; alternatively, the structural similarity between the two frames can be calculated. Denote the resulting spatial variation by S_t.
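As a minimal sketch of step 2, assuming the frames are given as NumPy arrays and OpenCV is available, the temporal and spatial variations could be computed as follows; the function names are illustrative and not part of the patent.

```python
import cv2
import numpy as np

def temporal_variation(frame_t, frame_t_minus, use_optical_flow=False):
    """Temporal variation of step 2.1: frame difference F_t - F_{t-dt},
    or optionally a dense optical flow field (one of the listed alternatives)."""
    if use_optical_flow:
        g_prev = cv2.cvtColor(frame_t_minus, cv2.COLOR_BGR2GRAY)
        g_curr = cv2.cvtColor(frame_t, cv2.COLOR_BGR2GRAY)
        # Farneback dense optical flow, as mentioned in step 2.1.
        return cv2.calcOpticalFlowFarneback(g_prev, g_curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    return frame_t.astype(np.float32) - frame_t_minus.astype(np.float32)

def spatial_variation(frame_t, frame_t_ref):
    """Spatial variation of step 2.2: difference between the damaged frame F_t
    and the corresponding reference frame."""
    return frame_t.astype(np.float32) - frame_t_ref.astype(np.float32)
```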
3. The t-th frame of the damaged video, F_t, and the temporal variation obtained in step 2 are input into the window extraction network (VP-net), which outputs a series of candidate window positions V = {v_1, …, v_I} and the corresponding weights W = {w_1, …, w_I}, where I is the number of candidate windows and is determined by the VP-net structure.
4. Non-maximum suppression (NMS). According to the candidate window positions and corresponding weights obtained in step 3, a number of windows (denoted K) are extracted from the candidate window positions, and the weights of the extracted windows are output at the same time. The method provides a window-softening NMS procedure for extracting windows according to the candidate window positions and corresponding weights; the specific implementation of the NMS may be, but is not limited to, the window-softening NMS proposed by this method (a sketch of this procedure is given after step 4.11). The window-softening NMS proceeds as follows:
4.1 Input: the candidate window positions V = {v_1, …, v_I} and the corresponding weights W = {w_1, …, w_I} obtained in step 3. Set a great-circle distance threshold d_th and a threshold K_th on the number of extracted windows, where the legal value range of d_th is (0, π] with a suggested value of π/24, and the legal value range of K_th is [1, I] with a suggested value of min{20, I}.
4.2 Initialize the variable k ← 1 and the output sets V_p ← ∅, W_p ← ∅.
4.3 Find the index of the maximum weight: ι ← argmax_i w_i.
4.4 Let d(v′, v″) denote the great-circle distance between window positions v′ and v″. Among the candidate window positions remaining in the set V, find the indices of those whose great-circle distance to v_ι is smaller than d_th, forming the set I′ ← {ι′ | d(v_{ι′}, v_ι) < d_th, v_{ι′} ∈ V}.
The method provides a calculation of the great-circle distance; the calculation may be, but is not limited to, the one given here.
4.4.1 Given two spherical positions v′ = (φ′, θ′) and v″ = (φ″, θ″), where φ′ and φ″ denote longitude and θ′ and θ″ denote latitude, in rad.
4.4.2 The great-circle distance between the two spherical positions is
d(v′, v″) = arccos(sin θ′ sin θ″ + cos θ′ cos θ″ cos(φ′ − φ″)).
4.5 Compute the weight w_k^p of the k-th extracted window from the weights of the candidates indexed by I′.
4.6 Compute the position v_k^p of the k-th extracted window from the positions and weights of the candidates indexed by I′.
4.7 Add the position and weight of the extracted window to the output sets: V_p ← V_p ∪ {v_k^p}, W_p ← W_p ∪ {w_k^p}.
4.8 Exclude the candidate windows and weights indexed by I′ from the input sets: V ← V \ {v_{ι′} | ι′ ∈ I′}, W ← W \ {w_{ι′} | ι′ ∈ I′}.
4.9 Update the variable k ← k + 1.
4.10 Repeat steps 4.3 to 4.9 until V = ∅ or k > K_th.
4.11 Output the set of extracted windows V_p and their weight set W_p.
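A minimal sketch of the window-softening NMS above, assuming positions are (longitude, latitude) pairs in rad. In particular, the merge rules used in steps 4.5 and 4.6 here (sum of clustered weights, weight-averaged position) are assumptions of this sketch, since the exact formulas are not reproduced above.

```python
import numpy as np

def great_circle_distance(v1, v2):
    """Great-circle distance of step 4.4.2; v = (longitude, latitude) in rad."""
    phi1, th1 = v1
    phi2, th2 = v2
    cos_d = (np.sin(th1) * np.sin(th2)
             + np.cos(th1) * np.cos(th2) * np.cos(phi1 - phi2))
    return np.arccos(np.clip(cos_d, -1.0, 1.0))

def window_softening_nms(positions, weights, d_th=np.pi / 24, k_th=20):
    """Window-softening NMS (steps 4.1-4.11).
    positions: sequence of (phi, theta) pairs; weights: sequence of floats."""
    positions = list(positions)
    weights = list(weights)
    out_pos, out_w = [], []
    while positions and len(out_pos) < k_th:
        iota = int(np.argmax(weights))                        # 4.3: index of maximum weight
        cluster = [i for i, v in enumerate(positions)         # 4.4: candidates within d_th of v_iota
                   if great_circle_distance(v, positions[iota]) < d_th]
        w_cluster = np.array([weights[i] for i in cluster])
        v_cluster = np.array([positions[i] for i in cluster])
        out_w.append(float(w_cluster.sum()))                  # 4.5 (assumed merge rule: sum of weights)
        out_pos.append(tuple(np.average(v_cluster, axis=0,    # 4.6 (assumed merge rule: weighted mean)
                                        weights=w_cluster)))
        for i in sorted(cluster, reverse=True):               # 4.8: remove clustered candidates
            positions.pop(i)
            weights.pop(i)
    return out_pos, out_w                                     # 4.11: V_p and W_p
```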
5. Window alignment. The window positions extracted in step 4 are aligned to the original video frame, and the content of each window and the spatial variation within the window are obtained through the inverse gnomonic projection (also called sundial projection, central projection or point-tangent projection).
The method provides a calculation of the inverse gnomonic projection transform; the calculation may be, but is not limited to, the one given here (a sketch of this procedure is given after these steps).
5.1 Input: the position of the k-th extracted window v_k^p = (φ_k^p, θ_k^p), where φ_k^p and θ_k^p are the longitude and latitude of the extracted window position, in rad; the t-th frame of the damaged video F_t, with width W_F and height H_F; and the spatial variation S_t obtained in step 2, whose resolution is the same as that of F_t.
5.2 Initialize the window image C_k with width W and height H. The legal value range of W and H is any positive integer; the suggested values are W = 540 and H = 600.
5.3 For a pixel position (x, y) of C_k, calculate an intermediate coordinate (f_x, f_y) on the tangent plane of the unit sphere. The mapping from (x, y) to (f_x, f_y) is determined by a_W and a_H, the angular ranges of the viewport corresponding to W and H, which are related to the viewport image resolution and the physical size of the HMD. Corresponding to the suggested window resolution in 5.2 and the prevailing conditions of HMDs on the market, the suggested values are a_W = 71π/180 rad and a_H = 74π/180 rad.
5.4 From (f_x, f_y), the spherical position (φ_{x,y}, θ_{x,y}) corresponding to the pixel position (x, y) can be obtained, in rad, by the inverse gnomonic projection centred at (φ_k^p, θ_k^p):
θ_{x,y} = arcsin( cos c · sin θ_k^p + f_y · sin c · cos θ_k^p / ρ ),
φ_{x,y} = φ_k^p + arctan( f_x · sin c / (ρ · cos θ_k^p · cos c − f_y · sin θ_k^p · sin c) ),
where ρ = √(f_x² + f_y²) and c = arctan ρ.
5.5 Map the spherical position (φ_{x,y}, θ_{x,y}) to the pixel coordinate (p_{x,y}, q_{x,y}) of F_t; the mapping relationship depends on the mapping format used by the input video.
The method provides a calculation corresponding to the equirectangular projection (ERP); when the input video uses ERP, the calculation may be, but is not limited to, the one given here. Meanwhile, the input video may use, but is not limited to, ERP; the calculations for other mapping formats can be derived analogously.
If the input video uses ERP, (p_{x,y}, q_{x,y}) is obtained by linearly mapping the longitude φ_{x,y} and the latitude θ_{x,y} to the pixel coordinate ranges [0, W_F] and [0, H_F] of F_t.
5.6 Set C_k(x, y) ← ψ(F_t; p_{x,y}, q_{x,y}), where ψ(·) is an interpolation function, including but not limited to nearest-neighbor interpolation, bilinear interpolation and spline interpolation.
5.7 Repeat steps 5.3 to 5.6 for all integer coordinate positions (x, y), x ∈ [1, W], y ∈ [1, H].
5.8 Initialize the window spatial variation image S_k^p, whose resolution is the same as that of C_k. Replace C_k by S_k^p and F_t by S_t, and repeat steps 5.3 to 5.7.
Repeat steps 5.1 to 5.8 for all k = 1, …, K to obtain the images C_k, k = 1, …, K, of all extracted windows and their spatial variation images S_k^p, k = 1, …, K.
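A minimal sketch of steps 5.1-5.8 for a single window, assuming an ERP input frame and bilinear sampling via OpenCV's cv2.remap. The pixel-to-tangent-plane mapping of step 5.3, the ERP pixel mapping of step 5.5 and the sign/orientation conventions are assumptions of this sketch, and the function name is illustrative.

```python
import cv2
import numpy as np

def extract_viewport(frame, lon_c, lat_c, W=540, H=600,
                     a_w=71 * np.pi / 180, a_h=74 * np.pi / 180):
    """Render a W x H viewport centred at (lon_c, lat_c) from an ERP frame
    via the inverse gnomonic projection (steps 5.3-5.6)."""
    H_F, W_F = frame.shape[:2]
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    # 5.3 (assumed form): pixel -> tangent-plane coordinates scaled by the viewport FoV.
    fx = (2.0 * (xs + 0.5) / W - 1.0) * np.tan(a_w / 2.0)
    fy = (2.0 * (ys + 0.5) / H - 1.0) * np.tan(a_h / 2.0)
    # 5.4: inverse gnomonic projection, rho = sqrt(fx^2 + fy^2), c = arctan(rho).
    rho = np.sqrt(fx ** 2 + fy ** 2)
    rho = np.where(rho == 0, 1e-12, rho)
    c = np.arctan(rho)
    lat = np.arcsin(np.cos(c) * np.sin(lat_c) + fy * np.sin(c) * np.cos(lat_c) / rho)
    lon = lon_c + np.arctan2(fx * np.sin(c),
                             rho * np.cos(lat_c) * np.cos(c) - fy * np.sin(lat_c) * np.sin(c))
    # 5.5 (assumed ERP convention): map (lon, lat) to pixel coordinates of the ERP frame.
    p = ((lon + np.pi) % (2 * np.pi)) / (2 * np.pi) * W_F
    q = (0.5 - lat / np.pi) * H_F
    # 5.6: bilinear interpolation; BORDER_WRAP approximates the longitude wrap-around.
    return cv2.remap(frame, p.astype(np.float32), q.astype(np.float32),
                     interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)
```

Running the same function with the spatial variation image S_t in place of the frame gives the window spatial variation image of step 5.8.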
6. For all k = 1, …, K, take in turn the image C_k of the extracted window and its spatial variation image S_k^p, input them into the window quality network (VQ-net), and output the saliency image and the quality score s_k of the k-th window.
7. Average the quality scores s_k, k = 1, …, K, of all the extracted windows to obtain the quality score ŝ_t of the input damaged video frame F_t.
The method provides two ways of averaging the scores to obtain the quality score ŝ_t; the calculation of ŝ_t may be, but is not limited to, the calculations given here.
7.1 Given the quality scores s_k, k = 1, …, K, of all the extracted windows, ŝ_t may be their arithmetic mean:
ŝ_t = (1/K) Σ_{k=1}^{K} s_k.
7.2 Given the quality scores s_k, k = 1, …, K, of all the extracted windows and the weights w_k^p, k = 1, …, K, of the extracted windows, ŝ_t may be a weighted average of the window quality scores:
ŝ_t = ( Σ_{k=1}^{K} w_k^p s_k ) / ( Σ_{k=1}^{K} w_k^p ).
The quality scores of all the damaged video frames are then averaged to obtain the quality score of the damaged video.
8. For all integer t ∈ [Δt, T], repeat steps 2 to 7 to obtain the quality score ŝ_t of each damaged video frame F_t. The quality score of the input damaged video is the arithmetic mean of the quality scores obtained in this step:
ŝ = ( 1 / (T − Δt + 1) ) Σ_{t=Δt}^{T} ŝ_t.
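A minimal sketch of steps 7 and 8, assuming the per-window scores and weights are plain Python sequences; both averaging variants from 7.1 and 7.2 are shown, and the function names are illustrative.

```python
import numpy as np

def frame_quality_score(window_scores, window_weights=None):
    """Step 7: aggregate per-window scores into a frame score.
    Arithmetic mean (7.1) if no weights are given, weighted average (7.2) otherwise."""
    scores = np.asarray(window_scores, dtype=np.float64)
    if window_weights is None:
        return float(scores.mean())
    weights = np.asarray(window_weights, dtype=np.float64)
    return float((weights * scores).sum() / weights.sum())

def video_quality_score(frame_scores):
    """Step 8: the video score is the arithmetic mean of the per-frame scores."""
    return float(np.mean(frame_scores))
```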
This embodiment also discloses an apparatus, which may include the following units, described in detail as follows:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
the first determining unit is used for determining the quality score of each damaged video frame based on each damaged video frame and a corresponding reference video frame and a pre-trained window extraction network and a window quality network;
a second determining unit for determining a quality score of the marred video based on the quality score of each marred video frame.
I. Embodiments of the present invention use VP-net to output a list of candidate window positions and corresponding weights; two implementations of VP-net are provided here. VP-net is described in detail as follows:
1. VP-net may be a deep convolutional neural network as described below
1.1 network architecture
The network input is the damaged video frame F_t and the temporal variation obtained in step 101; the network outputs a series of window position offsets {Δv_i} and their weights {ŵ_i}. The network topology is shown in fig. 3. A description of the different components in the network structure is given below.
1.1.1 Resampling. Given a series of predefined spherical sampling positions, the corresponding pixel coordinates are obtained according to the mapping format used by the input video, and the corresponding pixel values are obtained after interpolation. The predefined spherical sampling schemes include, but are not limited to, SOFT, Clenshaw-Curtis and Gauss-Legendre.
1.1.2 Downsampling. For each tensor in fig. 3 that needs to be downsampled, its size after downsampling is the same as the size of the tensor it is connected with after downsampling. The downsampling uses differentiable interpolation, including but not limited to bilinear/trilinear interpolation.
1.1.3 SO(3) tensor to S2 tensor conversion. Let T_SO(3) denote a tensor defined on SO(3), whose coordinates are represented by (α, β, γ), and let T_S2 denote the corresponding tensor on S2 with coordinates (α, β). The method provides two ways of converting the SO(3) tensor into the S2 tensor; the conversion may be, but is not limited to, the ways given here.
1.1.3.1 The SO(3) tensor can be converted to the S2 tensor by averaging over the γ dimension:
T_S2(α, β) = mean_γ T_SO(3)(α, β, γ).
1.1.3.2 The SO(3) tensor can be converted to the S2 tensor by taking the maximum over the γ dimension:
T_S2(α, β) = max_γ T_SO(3)(α, β, γ).
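A minimal sketch of the two conversions in 1.1.3, assuming the SO(3) tensor is stored as an array whose last axis is the γ dimension (the memory layout is an assumption of this sketch).

```python
import numpy as np

def so3_to_s2(t_so3, reduce="mean"):
    """Convert an SO(3) tensor with axes (..., alpha, beta, gamma) to an S2 tensor
    with axes (..., alpha, beta) by reducing over gamma (1.1.3.1 / 1.1.3.2)."""
    if reduce == "mean":
        return t_so3.mean(axis=-1)   # 1.1.3.1: average over the gamma dimension
    if reduce == "max":
        return t_so3.max(axis=-1)    # 1.1.3.2: maximum over the gamma dimension
    raise ValueError("reduce must be 'mean' or 'max'")
```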
1.1.4 The softmax function is computed over the width and height dimensions of the input tensor.
1.1.5 The specific configuration of each layer of the network containing learnable parameters is shown in Table 1.
Table 1. Specific configuration of the layers of the network of fig. 3 containing learnable parameters.
1.2 Objective function. The following items define the objective function used in VP-net training.
1.2.1 Spherical anchor definition. Let T_a denote the S2 tensor obtained by converting the SO(3) tensor output by SO3Conv10. The feature vector at each pixel position of T_a corresponds to a specific coordinate position on the sphere, and the correspondence is the same as for ERP. These specific coordinate positions can be calculated through the inverse of the mapping transform; they are defined as the spherical anchors and denoted v_a = (φ_a, θ_a), where φ_a and θ_a are the longitude and latitude respectively. If the number of pixel positions in T_a is I, then the total number of spherical anchors is I.
As described in step 3, VP-net outputs I candidate window positions, which correspond to I spherical anchors one-to-one, and which do not change with input changes, and which can be regarded as a constant attribute of the network.
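A minimal sketch of computing the spherical anchor positions of 1.2.1, assuming the feature map uses the same pixel-to-sphere correspondence as ERP; the exact grid offsets are assumptions of this sketch.

```python
import numpy as np

def spherical_anchors(height, width):
    """Longitude/latitude anchors v_a = (phi_a, theta_a) for an H x W S2 feature map,
    using an ERP-like correspondence; returns an (H, W, 2) array, i.e. I = H * W anchors."""
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    phi_a = ((xs + 0.5) / width) * 2.0 * np.pi - np.pi       # longitude in [-pi, pi)
    theta_a = np.pi / 2.0 - ((ys + 0.5) / height) * np.pi    # latitude in (-pi/2, pi/2)
    return np.stack([phi_a, theta_a], axis=-1)
```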
1.2.2 Window weight objective function. Given J experimenters viewing the input damaged video, denote the head-movement position truth values at the t-th frame by v_HM^j, j = 1, …, J. For the spherical anchor v_a^i, its weight truth value w_i is defined through a Gaussian kernel of the great-circle distances d(v_a^i, v_HM^j), j = 1, …, J, where d(·, ·) is the great-circle distance defined above, and σ is a parameter whose legal value range is (0, ∞), with a suggested value of 18.33π/180.
The window weight objective function is defined as the relative entropy, also known as the Kullback-Leibler (KL) divergence, between the distribution of the window weight truth values {w_i} and the distribution of the window weights {ŵ_i} predicted by the network:
L_w = Σ_i w_i log(w_i / ŵ_i).
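A minimal sketch of the window-weight objective in 1.2.2, assuming the ground-truth weights are obtained by summing a Gaussian kernel of the great-circle distances to the subjects' head-movement positions and normalising into a distribution; the exact kernel form is an assumption of this sketch. great_circle_distance is the function from the NMS sketch above.

```python
import numpy as np

def anchor_weight_truth(anchors, hm_positions, sigma=18.33 * np.pi / 180):
    """Ground-truth anchor weights w_i from J head-movement positions (assumed Gaussian kernel)."""
    w = np.zeros(len(anchors))
    for i, v_a in enumerate(anchors):
        d = np.array([great_circle_distance(v_a, v_hm) for v_hm in hm_positions])
        w[i] = np.exp(-d ** 2 / (2.0 * sigma ** 2)).sum()
    return w / w.sum()                                   # normalise into a distribution

def window_weight_loss(w_true, w_pred, eps=1e-12):
    """KL divergence between the ground-truth and predicted weight distributions (L_w)."""
    w_true = np.asarray(w_true, dtype=np.float64)
    w_pred = np.asarray(w_pred, dtype=np.float64)
    return float(np.sum(w_true * np.log((w_true + eps) / (w_pred + eps))))
```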
1.2.3 Window position offset objective function. For the spherical anchor v_a^i, define the corresponding window position offset truth value Δv_i* as the offset from v_a^i to the head-movement position truth value nearest to it. The window position offset objective function is defined as the smooth ℓ1 distance between each predicted window position offset Δv_i and its truth value Δv_i*, summed over the anchors:
L_v = Σ_i smooth_ℓ1(Δv_i − Δv_i*).
In summary, the objective function in training VP-net is defined as follows:
L_VP = λ_w L_w + λ_v L_v,
where λ_w and λ_v are parameters whose legal value range is positive; the suggested values are λ_w = 1 and λ_v = 5.
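A minimal sketch of the VP-net objective of section 1.2, using the standard smooth ℓ1 form; summing over all anchors is an assumption of this sketch, and window_weight_loss is the function from the sketch above.

```python
import numpy as np

def smooth_l1(x):
    """Standard smooth-l1 (Huber-like) penalty, applied elementwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def vp_net_loss(w_true, w_pred, dv_pred, dv_true, lam_w=1.0, lam_v=5.0):
    """Combined VP-net objective: lam_w * L_w + lam_v * L_v."""
    l_w = window_weight_loss(w_true, w_pred)                 # KL term from the sketch above
    l_v = float(smooth_l1(np.asarray(dv_pred) - np.asarray(dv_true)).sum())
    return lam_w * l_w + lam_v * l_v
```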
2. VP-net may be a deep convolutional neural network as described below
2.1 Network structure. The network input is the damaged video frame F_t and the temporal variation obtained in step 2; the network outputs the probability p̂ that the input video frame contains shot motion, a series of window position offsets {Δv_i} and their weights {ŵ_i}. The network topology is shown in fig. 5. A description of the different components in the network structure is given below.
2.1.1 resampling. As described above.
The 2.1.2SO (3) tensor is converted to the S2 tensor. As described above.
2.1.3 Softmax function. As described above.
2.1.4 Center Gaussian weight map, denoted T_G, with (x, y) as pixel coordinates. T_G is generated as a two-dimensional Gaussian centred on the centre of the tensor T_s, where the side length of T_s corresponds to the suggested value in 1.1.1 above, and the bandwidth of the Gaussian is a parameter whose legal value range is (0, ∞).
2.1.5 The specific configuration of each layer of the network containing learnable parameters is shown in Table 2.
Table 2. Specific configuration of the layers of the network of fig. 5 containing learnable parameters.
2.2 Objective function. The following items define the objective function used when training VP-net.
2.2.1 spherical anchor definition. As described above.
2.2.2 window weight objective function. As described above.
2.2.3 Window position offset objective function. As described above.
2.2.4 Shot-motion detection objective function. The ground-truth label of whether the input video frame contains shot motion is l (l = 1 when the input video frame contains shot motion, otherwise l = 0). The shot-motion detection objective function is defined as the binary cross-entropy:
L_l = −[ l log p̂ + (1 − l) log(1 − p̂) ].
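A minimal sketch of the shot-motion detection objective in 2.2.4: binary cross-entropy between the label l and the predicted probability p̂.

```python
import numpy as np

def shot_motion_bce(label, p_hat, eps=1e-12):
    """Binary cross-entropy L_l for the shot-motion label (label in {0, 1})."""
    return float(-(label * np.log(p_hat + eps) + (1 - label) * np.log(1.0 - p_hat + eps)))
```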
2.2.5 In summary, the objective function when training this VP-net combines the shot-motion detection objective L_l with the window weight objective L_w and the window position offset objective L_v, weighted by parameters whose legal value range is positive; the suggested values are λ_w = 1 and λ_v = 5.
II. Embodiments of the present invention use VQ-net to output the saliency maps and quality scores of multiple windows; two implementations of VQ-net are provided here. VQ-net is described in detail as follows:
1. VQ-net may be a deep convolutional neural network as described below
1.1 Network structure. The network input is a window image C_k and its spatial variation image S_k^p; the outputs are the saliency map M̂_k of the input window and its quality score s_k. The network topology is shown in fig. 6. A description of the different components in the network structure is given below.
1.1.1 The softmax function is computed over the width and height dimensions of the input tensor.
1.1.2 Upsampling. For each tensor in fig. 6 that needs to be upsampled, its size after upsampling is the same as the size of the tensor it is multiplied with after upsampling. The upsampling uses differentiable interpolation, including but not limited to bilinear interpolation.
1.1.3 The specific configuration of the convolutional and pooling layers in the network is shown in Table 3.
Table 3. Specific configuration of the convolutional and pooling layers of the network of fig. 6.
1.1.4 Densely connected block (DenseBlock). The configuration is as defined in DenseNet, as shown in Table 4.
Table 4. Specific configuration of the densely connected blocks of fig. 6.
1.2 Objective function. The following items define the objective function used when training VQ-net.
1.2.1 Saliency prediction objective function. Given the ground-truth eye-movement saliency map M_k within the range of window C_k, obtained from experimenters watching the input damaged video, the saliency map is regarded as a probability distribution, and the saliency prediction objective function is defined as the relative entropy between the saliency map M̂_k predicted by the network for the window and its truth value M_k:
L_M = Σ_{x′,y′} M_k(x′, y′) log( M_k(x′, y′) / M̂_k(x′, y′) ),
where M_k(x′, y′) and M̂_k(x′, y′) are the saliency values of M_k and M̂_k at pixel position (x′, y′) respectively.
1.2.2 Quality score objective function. Given the subjective quality score s of the video corresponding to the input window, the quality score objective function is defined as the squared error between the quality score s_k predicted by the network for the window and s:
L_s = (s_k − s)².
In summary, the objective function when training VQ-net is defined as follows:
L_VQ = λ_M L_M + λ_s L_s,
where λ_M and λ_s are parameters whose legal value range is positive; the suggested values are λ_M = 10 and λ_s = 1×10³.
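A minimal sketch of the VQ-net objective in 1.2, treating the saliency maps as probability distributions; the explicit normalisation step is an assumption of this sketch.

```python
import numpy as np

def vq_net_loss(sal_true, sal_pred, s_pred, s_true,
                lam_m=10.0, lam_s=1e3, eps=1e-12):
    """VQ-net objective: lam_m * KL(M_k || M_hat_k) + lam_s * (s_k - s)^2."""
    m = np.asarray(sal_true, dtype=np.float64)
    m_hat = np.asarray(sal_pred, dtype=np.float64)
    m = m / (m.sum() + eps)                 # treat saliency maps as distributions
    m_hat = m_hat / (m_hat.sum() + eps)
    l_m = np.sum(m * np.log((m + eps) / (m_hat + eps)))
    l_s = (s_pred - s_true) ** 2
    return float(lam_m * l_m + lam_s * l_s)
```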
2. VQ-net may be a deep convolutional neural network as described below.
2.1 Network structure. The network input is a window image C_k and its spatial variation image S_k^p; the outputs are the saliency map M̂_k of the input window and its quality score s_k. The network topology is shown in fig. 4. A description of the different components in the network structure is given below.
2.1.1 Softmax function. As described above.
2.1.2 downsampling. For the tensor requiring downsampling in fig. 4, the downsampled size is the same as the size of the tensor to be connected after downsampling. The downsampling uses gradient-guided interpolation, including but not limited to bilinear interpolation.
2.1.3 Densely connected blocks. As described above.
2.1.4 The specific configuration of the convolutional and pooling layers in the network is shown in Table 5.
Table 5. Specific configuration of the convolutional and pooling layers of the network of fig. 4.
2.2 Objective function. The following items define the objective function used when training VQ-net.
2.2.1 Saliency prediction objective function. As described above.
2.2.2 Quality score objective function. As described above.
In summary, the objective function when training VQ-net is defined as follows:
L_VQ = λ_M L_M + λ_s L_s,
where λ_M and λ_s are parameters whose legal value range is positive; the suggested values are λ_M = 10 and λ_s = 1×10⁴.
The apparatus disclosed in the above embodiments can implement the processes of the methods disclosed in the above method embodiments, and in order to avoid repetition, the details are not described here again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Those skilled in the art will appreciate that although some embodiments described herein include some features included in other embodiments instead of others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (9)

1. A method for evaluating video quality based on windows is characterized by comprising the following steps:
acquiring a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
based on each damaged video frame and the corresponding reference video frame, determining the quality score of each damaged video frame by using a pre-trained window extraction network and a window quality network;
determining a quality score for the marred video based on the quality score for each marred video frame;
wherein determining the quality score of any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
2. The method of claim 1, wherein obtaining a plurality of candidate window positions and a weight corresponding to each of the candidate window positions via the window extraction network based on the corrupted video frame comprises:
determining a temporal variation based on the corrupted video frame;
and obtaining a plurality of candidate window positions and the weight corresponding to each candidate window position through the window extraction network based on the damaged video frame and the time domain variation.
3. The method of claim 2, wherein determining the temporal variation based on the corrupted video frame comprises:
a temporal change between the corrupted video frame and a previous nth corrupted video frame is calculated.
4. The method of claim 1, wherein obtaining the saliency image and the quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame comprises:
determining a spatial variation between the damaged video frame and the corresponding reference video frame based on the damaged video frame and the corresponding reference video frame;
obtaining an image of the extracted window and a corresponding spatial variation based on the plurality of candidate window positions, the spatial variation, the damaged video frame and the corresponding reference video frame;
and obtaining the significant image and the quality score of the extracted window through the window quality network based on the image of the extracted window and the corresponding airspace variable quantity.
5. The method of claim 4, wherein obtaining the image of the extracted window and the corresponding spatial variance based on the candidate window positions, the spatial variance, the corrupted video frame and the corresponding reference video frame comprises:
extracting at least one window position from the candidate window positions based on the corresponding weight of each candidate window position, and outputting the extracted window position and the corresponding weight;
and aligning the extracted window position, the damaged video frame, the airspace variation and the reference video frame to obtain an image of the extracted window and a corresponding airspace variation.
6. The method of claim 1, wherein averaging the quality scores of all extracted windows to obtain the quality score of the corrupted video frame comprises:
and based on the weight corresponding to the position of the extracted window, carrying out weighted average on the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
7. The method of claim 1, wherein averaging the quality scores of all extracted windows to obtain the quality score of the corrupted video frame comprises:
and arithmetically averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
8. The method of claim 1, wherein determining the quality score of the marred video based on the quality score of each marred video frame comprises:
and averaging the quality scores of all damaged video frames to obtain the quality score of the damaged video.
9. A window-based video quality evaluation apparatus, comprising:
an acquisition unit configured to acquire a damaged video and a reference video; the corrupted video comprises a plurality of corrupted video frames and the reference video comprises a plurality of reference video frames;
the first determining unit is used for determining the quality score of each damaged video frame based on each damaged video frame and a corresponding reference video frame and a pre-trained window extraction network and a window quality network;
a second determining unit for determining a quality score of the marred video based on a quality score of each marred video frame;
wherein determining the quality score of any corrupted video frame comprises:
based on the damaged video frame, obtaining a plurality of candidate window positions and the corresponding weight of each candidate window position through the window extraction network;
obtaining a saliency image and a quality score of the extracted window through a window quality network based on the candidate window positions, the damaged video frame and the corresponding reference video frame;
and averaging the quality scores of all the extracted windows to obtain the quality score of the damaged video frame.
CN201910500485.3A 2019-06-11 2019-06-11 Video quality evaluation method and device based on window Active CN110365966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910500485.3A CN110365966B (en) 2019-06-11 2019-06-11 Video quality evaluation method and device based on window

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910500485.3A CN110365966B (en) 2019-06-11 2019-06-11 Video quality evaluation method and device based on window

Publications (2)

Publication Number Publication Date
CN110365966A CN110365966A (en) 2019-10-22
CN110365966B true CN110365966B (en) 2020-07-28

Family

ID=68216886

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910500485.3A Active CN110365966B (en) 2019-06-11 2019-06-11 Video quality evaluation method and device based on window

Country Status (1)

Country Link
CN (1) CN110365966B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112163990B (en) 2020-09-08 2022-10-25 上海交通大学 Significance prediction method and system for 360-degree image
US20220415037A1 (en) * 2021-06-24 2022-12-29 Meta Platforms, Inc. Video corruption detection
CN115953727B (en) * 2023-03-15 2023-06-09 浙江天行健水务有限公司 Method, system, electronic equipment and medium for detecting floc sedimentation rate

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103338379B (en) * 2013-06-05 2015-04-29 宁波大学 Stereoscopic video objective quality evaluation method based on machine learning
CN104506852B (en) * 2014-12-25 2016-08-24 北京航空航天大学 A kind of objective quality assessment method towards video conference coding
CN106412571B (en) * 2016-10-12 2018-06-19 天津大学 A kind of method for evaluating video quality based on gradient similarity standard difference
CN108337504A (en) * 2018-01-30 2018-07-27 中国科学技术大学 A kind of method and device of evaluation video quality
CN108449595A (en) * 2018-03-22 2018-08-24 天津大学 Virtual reality method for evaluating video quality is referred to entirely based on convolutional neural networks
CN108900864B (en) * 2018-07-23 2019-12-10 西安电子科技大学 full-reference video quality evaluation method based on motion trail

Also Published As

Publication number Publication date
CN110365966A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
US11468697B2 (en) Pedestrian re-identification method based on spatio-temporal joint model of residual attention mechanism and device thereof
CN113822969B (en) Training neural radiation field model, face generation method, device and server
CN110910447B (en) Visual odometer method based on dynamic and static scene separation
CN110365966B (en) Video quality evaluation method and device based on window
CN107622257A (en) A kind of neural network training method and three-dimension gesture Attitude estimation method
EP3598387B1 (en) Learning method and program
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN113393522A (en) 6D pose estimation method based on monocular RGB camera regression depth information
CN115035171B (en) Self-supervision monocular depth estimation method based on self-attention guide feature fusion
CN111626308B (en) Real-time optical flow estimation method based on lightweight convolutional neural network
CN110942484B (en) Camera self-motion estimation method based on occlusion perception and feature pyramid matching
CN111325784A (en) Unsupervised pose and depth calculation method and system
CN110992414B (en) Indoor monocular scene depth estimation method based on convolutional neural network
CN105488759B (en) A kind of image super-resolution rebuilding method based on local regression model
CN108462868A The prediction technique of user's fixation point in 360 degree of panorama VR videos
CN114429555A (en) Image density matching method, system, equipment and storage medium from coarse to fine
CN102231844A (en) Video image fusion performance evaluation method based on structure similarity and human vision
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN115496663A (en) Video super-resolution reconstruction method based on D3D convolution intra-group fusion network
CN112819697A (en) Remote sensing image space-time fusion method and system
CN116934592A (en) Image stitching method, system, equipment and medium based on deep learning
CN111861949A (en) Multi-exposure image fusion method and system based on generation countermeasure network
CN114821434A (en) Space-time enhanced video anomaly detection method based on optical flow constraint
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN113034398A (en) Method and system for eliminating jelly effect in urban surveying and mapping based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant