CN109783475B - Method for constructing large-scale database of video distortion effect markers

Publication number: CN109783475B (granted; earlier published as application CN109783475A)
Application number: CN201910062151.2A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 赵铁松 (Zhao Tiesong), 何灵璐 (He Linglu), 魏宏安 (Wei Hongan), 林丽群 (Lin Liqun)
Assignee (original and current): Fuzhou University
Filing date: 2019-01-23; publication of application: 2019-05-21; grant: 2022-06-14
Legal status: Active
Classification: Compression or coding systems of TV signals
Abstract

The invention relates to a method for constructing a large-scale database of video distortion effect markers. The method first prepares large-scale test video sequences containing given distortion effects; it then identifies the perceptible distortion regions; next, it performs preliminary segmentation and labeling of the distortion regions with a spatial sliding window, obtaining preliminarily segmented positive and negative samples; finally, it recovers the fine edges of the distortion regions with a small-step sliding strategy, yielding a large-scale, finely labeled database. The resulting database marks video distortion effects objectively, can be used to build corresponding distortion-recognition algorithms, and can guide improvements to video coding and transmission strategies.

Description

Method for constructing large-scale database of video distortion effect markers
Technical Field
The invention relates to the field of video quality evaluation, in particular to a method for constructing a large-scale database of video distortion effect markers.
Background
In the last decade, the rapid development of video coding, network transmission, and display technologies has driven the rise of High Definition (HD), Ultra High Definition (UHD), and 3D/360-degree video. According to the Visual Networking Index (VNI) published by Cisco, video content already accounts for two-thirds of the bandwidth of current broadband and mobile networks and will grow to 80%-90% in the foreseeable future. Worldwide, internet video users will grow from 1.4 billion in 2016 to nearly 1.9 billion in 2021, while the number of mobile users will also grow explosively. By 2021, worldwide internet video viewing will reach three trillion minutes per month, which corresponds to roughly 5 million years of video viewed per month, or approximately 1 million minutes of video viewed per second.
Preprocessing of these digital videos, including image enhancement, visual transformation, stitching, and so on, inevitably introduces visual distortion into the picture. Meanwhile, the continuing growth of video content requires that video quality be maximized under limited bit-rate or bandwidth constraints, typically through lossy video coding. Current state-of-the-art video coding schemes all adopt a common hybrid coding structure, whose standard stages include intra prediction, inter motion estimation and compensation, followed by transform, quantization, and entropy coding. To apply these operations to large frames, the encoder further divides each frame into slices and coding units. Consequently, when the bit rate is not high enough, the compressed video suffers various information losses within and between frames, slices, and coding units, producing visible distortion. These video distortions greatly degrade the user experience of video viewing.
In addition, information transmission over broadband and mobile networks is packet-switched: video stream data is divided into packets, each transmitted independently. Packets may be dropped at intermediate network nodes (e.g., switches or routers) due to buffer overflow, or treated as dropped due to excessive queuing delay. For video streaming or real-time video communication systems, any packet that arrives after the allowed delay is likewise considered lost. Such packet losses, together with failures of subsequent error-correction algorithms, distort the video content and likewise degrade the viewing experience.
Detecting and classifying the distortions described above is a challenging task. Conventional quality metrics such as the Sum of Absolute Differences (SAD), Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity (SSIM) cannot directly detect such distortions. At the user end, the distortion is highly visible yet cannot be measured accurately. Recent advances in software and hardware have greatly accelerated the arrival of the 4K/8K era, user-centered video processing, coding, and transmission have become more important, and the advent of deep learning has made it possible to identify and quantitatively study distortion. Deep learning, however, relies on large labeled databases. Several video distortion databases currently exist, but they only provide quality judgments on the overall distortion of pictures without marking specific distortion regions; moreover, these databases are relatively small, typically hundreds to thousands of pictures, and cannot supply the large sample sizes that deep learning requires.
Disclosure of Invention
In view of the above, the present invention provides a method for constructing a large-scale database of video distortion effect markers. The resulting database marks video distortion effects objectively and can be used to build corresponding distortion-recognition algorithms and to guide improvements in video coding and transmission strategies.
The invention is realized by the following scheme. A method for constructing a large-scale database of video distortion effect markers specifically comprises the following steps:
step S1: preparing large-scale test video sequences containing given distortion effects;
step S2: identifying perceptible distortion regions;
step S3: performing preliminary segmentation and labeling of the distortion regions with a spatial sliding window, to obtain preliminarily segmented positive and negative samples;
step S4: further recovering the fine edges of the distortion regions with a small-step sliding strategy, thereby obtaining a large-scale, finely labeled database.
Further, step S1 is specifically: applying uniform coding and transmission processing to source sequences to generate no fewer than 4 test sequences that may contain distortion effects; each source sequence is an original sequence captured in a natural scene without any coding or transmission; the source sequences should cover more than 4 spatial resolutions (at least 2 of them 720p or above), and each spatial resolution should include more than 4 videos captured in different scenes.
Further, step S3 is specifically: using a sliding window to traverse the perceptible distortion regions identified in step S2 and cut out positive and negative samples.
Further, the way the sliding window is used depends on the type of the perceptible distortion:
when the perceptible distortion is spatial, the sliding window is a two-dimensional rectangle traversing a single frame pixel by pixel; if more than 1/2 of the pixels inside the window belong to the marked spatial distortion region, the image block cut by the window is marked as a positive sample; otherwise it is marked as a negative sample; in addition, image blocks cut by the same sliding window traversing the source sequence are randomly selected and marked as negative samples;
when the perceptible distortion is temporal, the sliding window is a cube whose cross-section equals the two-dimensional rectangle used for spatial distortion and whose long axis extends over several frames before and after the marked frame; if more than 1/2 of the pixels in any cross-section of the cube belong to the marked temporal distortion region, the group of pictures cut by this cubic window is marked as a positive sample; otherwise it is marked as a negative sample; in addition, groups of pictures cut by the same sliding window traversing the source sequence are randomly selected and marked as negative samples.
Further, step S4 is specifically: training a deep convolutional neural network on the positive and negative samples preliminarily segmented in step S3, to obtain a preliminary sample classifier with a certain discrimination capability; setting dual thresholds Th_high and Th_low around the classifier decision threshold Th, with Th_high > Th > Th_low; if the classifier output y is greater than Th, the sample is judged positive, otherwise negative;
for any test video, non-overlapping image blocks are cut and labeled as follows:
if an image block passes through the preliminary sample classifier with an output value greater than Th_high, all points in its region are marked 1; if the output value is smaller than Th_low, all not-yet-marked points in its region are marked 0; if neither condition holds, all image blocks overlapping this block are traversed pixel by pixel, and if any overlapping block yields a classifier output greater than Th_high, all points in that overlapping block's region are marked 1;
if regions marked 1 are not connected to one another, a connection operation is attempted;
all regions marked 1 are then labeled as distortion regions, and steps S3 and S4 are repeated as needed, finally yielding a large-scale, finely labeled database.
Further, step S4 also includes: for any test video, non-overlapping groups of pictures are cut and labeled as follows:
if a group of pictures passes through the preliminary sample classifier with an output value greater than Th_high, all points in its region are marked 1; if the output value is smaller than Th_low, all not-yet-marked points in its region are marked 0; if neither condition holds, all groups of pictures overlapping this group are traversed pixel by pixel, and if any overlapping group yields a classifier output greater than Th_high, all points in that overlapping group's region are marked 1;
if regions marked 1 are not connected to one another, a connection operation is attempted.
Further, the connection operation comprises the following steps:
step S11: extracting the edge points of any marked region to form that region's edge point set, and recording the maximum distance between any two points in the set as the scale of the region;
step S12: extracting the edge point sets of two unconnected marked regions, connecting points between the two sets, processing these connections with a random sample consensus (RANSAC) algorithm, and recording the resulting maximum distance as the distance between the two regions;
step S13: for the two unconnected marked regions, if this distance is smaller than the scale of either region, marking all regions crossed by the connecting line between them as 1.
Compared with the prior art, the invention has the following beneficial effects: it provides a method for constructing a large-scale labeled database for the distortion effects produced by video coding and transmission; the database marks video distortion effects objectively, can be used to build corresponding distortion-recognition algorithms, and can guide improvements in video coding and transmission strategies.
Drawings
Fig. 1 is a schematic diagram of patch label classification for spatial distortion in an embodiment of the present invention.
Fig. 2 is a schematic diagram of patch label classification across consecutive distorted video frames for temporal distortion in an embodiment of the present invention.
Fig. 3 is a schematic diagram of recovering the fine edge of a distortion region by small-step sliding in an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the embodiment provides a method for constructing a large-scale database of video distortion effect markers, which specifically includes the following steps:
Step S1: preparing large-scale test video sequences containing given distortion effects;
step S2: identifying perceptible distortion regions;
step S3: performing preliminary segmentation and labeling of the distortion regions with a spatial sliding window, to obtain preliminarily segmented positive and negative samples;
step S4: further recovering the fine edges of the distortion regions with a small-step sliding strategy, thereby obtaining a large-scale, finely labeled database.
In this embodiment, step S1 is specifically: applying uniform coding and transmission processing to source sequences to generate no fewer than 4 test sequences that may contain distortion effects; each source sequence is an original sequence captured in a natural scene without any coding or transmission; the source sequences should cover more than 4 spatial resolutions (at least 2 of them 720p or above), and each spatial resolution should include more than 4 videos captured in different scenes.
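As a concrete illustration of step S1, the following minimal sketch encodes each source sequence at several bitrates to produce candidate test sequences. It is only a sketch under stated assumptions: the file names, the codec choice (H.264 via ffmpeg), and the bitrate ladder are illustrative, not prescribed by this embodiment.

```python
# Hypothetical step S1 driver: encode every pristine source sequence at
# several bitrates so that coding distortion may appear in the outputs.
# File names, codec, and bitrates are illustrative assumptions.
import itertools
import subprocess

SOURCES = ["scene1_2160p.y4m", "scene2_1080p.y4m",   # a full dataset would
           "scene3_720p.y4m", "scene4_480p.y4m"]     # span >4 resolutions
BITRATES = ["500k", "1M", "2M", "4M"]                # >= 4 test sequences each

for src, bitrate in itertools.product(SOURCES, BITRATES):
    out = f"{src.rsplit('.', 1)[0]}_{bitrate}.mp4"
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-c:v", "libx264", "-b:v", bitrate, out], check=True)
```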
Preferably, in this embodiment, in order to identify all perceptible distortion regions in step S2, testers (i.e., markers) are required to mark all video sequences. The test procedure follows the ITU-R BT.500 recommendation and is divided into two phases. In the pre-training phase, all testers are informed of the test procedure and trained to recognize the distortion effects. In the formal testing phase, all testers view the sequences and use a human-interface device such as a mouse to roughly outline the perceptible distortion regions. To ensure labeling reliability, all test sequences are presented in random order. To avoid visual fatigue, at least one free rest period is scheduled during the test.
In this embodiment, step S3 is specifically: using a sliding window to traverse the perceptible distortion regions outlined in step S2 and cut out positive and negative samples.
In this embodiment, the way the sliding window is used depends on the type of the perceptible distortion.
As shown in fig. 1, when the perceptible distortion is spatial, the sliding window is a two-dimensional rectangle traversing a single frame pixel by pixel; if more than 1/2 of the pixels inside the window belong to the marked spatial distortion region, the image block cut by the window is marked as a positive sample (as shown in (a) of fig. 1); otherwise it is marked as a negative sample (as shown in (b) of fig. 1); in addition, image blocks cut by the same sliding window traversing the source sequence are randomly selected and marked as negative samples.
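A minimal sketch of this spatial labeling rule follows, assuming a hypothetical per-pixel 0/1 mask of the subjectively marked region from step S2; the window size is an illustrative choice, and the default stride of 1 reflects the pixel-by-pixel traversal.

```python
# Sketch of spatial sliding-window labeling. `mask` is a hypothetical
# (H, W) array with 1 at pixels inside the marked distortion region.
import numpy as np

def label_spatial_blocks(frame, mask, win=64, stride=1):
    """Yield (block, label); label is 1 when more than half of the
    window pixels lie in the marked spatial distortion region."""
    h, w = mask.shape
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            block = frame[y:y + win, x:x + win]
            marked_ratio = mask[y:y + win, x:x + win].mean()
            yield block, int(marked_ratio > 0.5)
```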
As shown in fig. 2, when the perceptible distortion is temporal, the sliding window is a cube whose cross-section equals the two-dimensional rectangle used for spatial distortion and whose long axis extends over several frames before and after the marked frame. Considering the reaction time of the tester, if more than 1/2 of the pixels in any cross-section of the cube belong to the marked temporal distortion region, the group of pictures cut by this cubic window is marked as a positive sample (as shown in (a) of fig. 2); otherwise it is marked as a negative sample (as shown in (b) of fig. 2); in addition, groups of pictures cut by the same sliding window traversing the source sequence are randomly selected and marked as negative samples.
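The temporal rule extends the same idea to a cube of frames. The sketch below assumes a hypothetical (frames, H, W) mask array and an illustrative cube depth; a group of pictures is positive as soon as any single cross-section is more than half marked.

```python
# Sketch of temporal (cube) sliding-window labeling. `clip` and `masks`
# are hypothetical (T, H, W[, C]) and (T, H, W) arrays; `t0` is the
# subjectively marked frame index.
import numpy as np

def label_temporal_cube(clip, masks, t0, x, y, win=64, depth=5):
    lo = max(0, t0 - depth // 2)
    hi = min(masks.shape[0], t0 + depth // 2 + 1)
    cube = clip[lo:hi, y:y + win, x:x + win]
    # marked-pixel ratio of each frame's cross-section of the cube
    ratios = masks[lo:hi, y:y + win, x:x + win].mean(axis=(1, 2))
    return cube, int((ratios > 0.5).any())
```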
In this embodiment, step S4 is specifically: training a deep convolutional neural network on the positive and negative samples preliminarily segmented in step S3, to obtain a preliminary sample classifier with a certain discrimination capability; setting dual thresholds Th_high and Th_low around the classifier decision threshold Th, with Th_high > Th > Th_low; if the classifier output y is greater than Th, the sample is judged positive, otherwise negative.
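The dual-threshold rule can be summarized in a few lines. The numeric values below are illustrative assumptions (the embodiment only requires Th_high > Th > Th_low); the classifier is any CNN emitting a score y.

```python
# Sketch of the dual-threshold decision of step S4. Threshold values
# are illustrative; only the ordering Th_high > Th > Th_low matters.
TH, TH_HIGH, TH_LOW = 0.5, 0.8, 0.2

def decide(y):
    """Return 1 for a confident positive, 0 for a confident negative,
    and None when Th_low <= y <= Th_high (such ambiguous blocks are
    resolved by the overlapping small-step scan described below)."""
    if y > TH_HIGH:
        return 1
    if y < TH_LOW:
        return 0
    return None
```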
For any test video, non-overlapping image blocks are cut (for spatial distortion) and labeled as follows:
as shown in (a) of fig. 3, all regions are first marked 0; as shown in (b) of fig. 3, if an image block passes through the preliminary sample classifier with an output value greater than Th_high, all points in its region are marked 1; as shown in (c) of fig. 3, if the output value is smaller than Th_low, all not-yet-marked points in its region are marked 0; as shown in (d) of fig. 3, if neither condition holds, all image blocks overlapping this block are traversed pixel by pixel, and if any overlapping block yields a classifier output greater than Th_high, all points in that overlapping block's region are marked 1.
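Putting the pieces together, the fine-edge pass of fig. 3 might look like the sketch below, where `score` stands for the preliminary classifier and the block size is an illustrative assumption; the pixel-by-pixel rescan of overlapping blocks is written out directly for clarity, at the cost of many classifier calls.

```python
# Sketch of the fine-edge refinement (fig. 3) for spatial distortion.
# `score(block)` is the preliminary CNN classifier; thresholds as above.
import numpy as np

TH_HIGH, TH_LOW = 0.8, 0.2   # illustrative, with Th_high > Th > Th_low

def refine_label_map(frame, score, win=64):
    h, w = frame.shape[:2]
    out = np.zeros((h, w), dtype=np.uint8)        # (a) everything starts at 0
    for y in range(0, h - win + 1, win):          # non-overlapping blocks
        for x in range(0, w - win + 1, win):
            s = score(frame[y:y + win, x:x + win])
            if s > TH_HIGH:                       # (b) confident positive
                out[y:y + win, x:x + win] = 1
            elif s < TH_LOW:                      # (c) stays 0 unless marked later
                continue
            else:                                 # (d) ambiguous: small-step scan
                for yy in range(max(0, y - win + 1), min(h - win, y + win - 1) + 1):
                    for xx in range(max(0, x - win + 1), min(w - win, x + win - 1) + 1):
                        if score(frame[yy:yy + win, xx:xx + win]) > TH_HIGH:
                            # mark the overlapping block's whole region as 1
                            out[yy:yy + win, xx:xx + win] = 1
    return out
```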
If regions marked 1 are not connected to one another, a connection operation is attempted.
All regions marked 1 are then labeled as distortion regions, and steps S3 and S4 are repeated as needed, finally yielding a large-scale, finely labeled database.
In this embodiment, step S4 also includes: for any test video, non-overlapping groups of pictures are cut (for temporal distortion) and labeled as follows:
if a group of pictures passes through the preliminary sample classifier with an output value greater than Th_high, all points in its region are marked 1; if the output value is smaller than Th_low, all not-yet-marked points in its region are marked 0; if neither condition holds, all groups of pictures overlapping this group are traversed pixel by pixel, and if any overlapping group yields a classifier output greater than Th_high, all points in that overlapping group's region are marked 1;
if regions marked 1 are not connected to one another, a connection operation is attempted.
In this embodiment, the connection operation comprises the following steps:
step S11: extracting the edge points of any marked region to form that region's edge point set, and recording the maximum distance between any two points in the set as the scale of the region;
step S12: extracting the edge point sets of two unconnected marked regions, connecting points between the two sets, processing these connections with a random sample consensus (RANSAC) algorithm, and recording the resulting maximum distance as the distance between the two regions;
step S13: for the two unconnected marked regions, if this distance is smaller than the scale of either region, marking all regions crossed by the connecting line between them as 1.
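Since the embodiment leaves the random sample consensus step underspecified, the sketch below substitutes a simple random-sampling outlier rejection over inter-region point pairs; treat every numeric choice (sample count, inlier fraction) as an assumption rather than the patented procedure itself.

```python
# Hedged sketch of the connection operation (steps S11-S13).
import itertools
import random
import numpy as np

def region_scale(edge_pts):
    """S11: the region's scale is the maximum pairwise distance
    between its edge points."""
    return max(float(np.linalg.norm(np.subtract(p, q)))
               for p, q in itertools.combinations(edge_pts, 2))

def region_distance(pts_a, pts_b, n_samples=200, inlier_frac=0.8):
    """S12: randomly sample connections between the two edge point
    sets, discard the longest (1 - inlier_frac) as outliers (a stand-in
    for RANSAC), and return the largest remaining distance."""
    dists = sorted(float(np.linalg.norm(np.subtract(random.choice(pts_a),
                                                    random.choice(pts_b))))
                   for _ in range(n_samples))
    return dists[max(0, int(inlier_frac * n_samples) - 1)]

def should_connect(pts_a, pts_b):
    """S13: connect the two regions when their distance is below the
    scale of either region."""
    d = region_distance(pts_a, pts_b)
    return d < region_scale(pts_a) or d < region_scale(pts_b)
```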
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical essence of the present invention remains within the protection scope of the technical solution of the present invention.

Claims (4)

1. A method for constructing a large-scale database of video distortion effect markers, characterized in that the method comprises the following steps:
step S1: preparing large-scale test video sequences containing distortion effects;
step S2: identifying perceptible distortion regions;
step S3: using a sliding window to perform preliminary segmentation and labeling of the distortion regions, to obtain preliminarily segmented positive and negative samples;
step S4: using a small-step sliding strategy to further recover the fine edges of the distortion regions, thereby obtaining a large-scale, finely labeled database;
step S3 specifically comprises: traversing the perceptible distortion regions identified in step S2 with a sliding window, and cutting positive and negative samples;
the way the sliding window is used depends on the type of the perceptible distortion;
when the perceptible distortion is spatial, the sliding window is a two-dimensional rectangle traversing a single frame pixel by pixel; if more than 1/2 of the pixels inside the window belong to the marked spatial distortion region, the image block cut by the window is marked as a positive sample; otherwise it is marked as a negative sample; in addition, image blocks cut by the same sliding window traversing the source sequence are randomly selected and marked as negative samples;
when the perceptible distortion is temporal, the sliding window is a cube whose cross-section equals the two-dimensional rectangle used for spatial distortion and whose long axis extends over several frames before and after the marked frame; if more than 1/2 of the pixels in any cross-section of the cube belong to the marked temporal distortion region, the group of pictures cut by this cubic sliding window is marked as a positive sample; otherwise it is marked as a negative sample; in addition, groups of pictures cut by the same sliding window traversing the source sequence are randomly selected and marked as negative samples;
step S4 specifically comprises: training a deep convolutional neural network on the positive and negative samples preliminarily segmented in step S3, to obtain a preliminary sample classifier with discrimination capability; setting dual thresholds Th_high and Th_low around the classifier decision threshold Th, with Th_high > Th > Th_low; if the classifier output y is greater than Th, the sample is judged positive, otherwise negative;
for any test video, non-overlapping image blocks are cut and labeled as follows:
if an image block passes through the preliminary sample classifier with an output value greater than Th_high, all points in its region are marked 1; if the output value is smaller than Th_low, all not-yet-marked points in its region are marked 0; if neither condition holds, all image blocks overlapping this block are traversed pixel by pixel, and if any overlapping block yields a classifier output greater than Th_high, all points in that overlapping block's region are marked 1;
if regions marked 1 are not connected to one another, a connection operation is attempted;
all regions marked 1 are labeled as distortion regions, and steps S3 and S4 are repeated as needed, finally yielding the large-scale, finely labeled database.
2. The method for constructing a large-scale database of video distortion effect markers according to claim 1, characterized in that step S1 specifically comprises: applying uniform coding and transmission processing to source sequences to generate no fewer than 4 test sequences that may contain distortion effects; each source sequence is an original sequence captured in a natural scene without any coding or transmission; the source sequences must cover more than 4 spatial resolutions, and each spatial resolution must include more than 4 videos captured in different scenes.
3. The method for constructing a large-scale database of video distortion effect markers according to claim 2, characterized in that step S4 further comprises: for any test video, non-overlapping groups of pictures are cut and labeled as follows:
if a group of pictures passes through the preliminary sample classifier with an output value greater than Th_high, all points in its region are marked 1; if the output value is smaller than Th_low, all not-yet-marked points in its region are marked 0; if neither condition holds, all groups of pictures overlapping this group are traversed pixel by pixel, and if any overlapping group yields a classifier output greater than Th_high, all points in that overlapping group's region are marked 1;
if regions marked 1 are not connected to one another, a connection operation is attempted.
4. The method for constructing a large-scale database of video distortion effect markers according to claim 3, characterized in that the connection operation comprises the following steps:
step S11: extracting the edge points of any marked region to form that region's edge point set, and recording the maximum distance between any two points in the set as the scale of the region;
step S12: extracting the edge point sets of two unconnected marked regions, connecting points between the two sets, processing these connections with a random sample consensus (RANSAC) algorithm, and recording the resulting maximum distance as the distance between the two regions;
step S13: for the two unconnected marked regions, if this distance is smaller than the scale of either region, marking all regions crossed by the connecting line between them as 1.
Priority Application

CN201910062151.2A, filed 2019-01-23 (priority date 2019-01-23) by Fuzhou University: Method for constructing large-scale database of video distortion effect markers. Status: Active.

Publications (2)

CN109783475A (application), published 2019-05-21
CN109783475B (grant, this document), published 2022-06-14

Family ID: 66501063
Country: CN

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104185022A (en) * 2013-09-18 2014-12-03 电子科技大学 Full-reference video quality evaluation method based on visual information distortion decomposition
CN105763876A (en) * 2015-12-21 2016-07-13 中国计量学院 Video quality evaluation method based on time domain distortion fluctuation and region of interest
CN107087173A (en) * 2017-04-10 2017-08-22 电子科技大学 A kind of video encoding optimization method of content oriented analysis
CN108650511A (en) * 2018-05-15 2018-10-12 南京邮电大学 The monitor video rate-distortion optimal coding method propagated based on background distortions

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106028026B (en) * 2016-05-27 2017-09-05 宁波大学 A kind of efficient video assessment method for encoding quality based on space-time domain structure

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Feng Xin et al., "Video quality assessment for network packet loss based on visual attention variation" (基于视觉注意力变化的网络丢包视频质量评估), Acta Automatica Sinica (《自动化学报》), vol. 37, no. 11, 2011-11-15, full text. *



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant