CN116363068A - Video image quality evaluation method and electronic device - Google Patents

Video image quality evaluation method and electronic device

Info

Publication number
CN116363068A
Authority
CN
China
Prior art keywords
video
image quality
image
samples
sample
Prior art date
Legal status
Pending
Application number
CN202310184311.7A
Other languages
Chinese (zh)
Inventors
张家斌
李静
Current Assignee
Beijing Youku Technology Co Ltd
Original Assignee
Beijing Youku Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youku Technology Co Ltd
Priority to CN202310184311.7A
Publication of CN116363068A

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06T 7/11: Image analysis; region-based segmentation
    • G06T 3/4007: Scaling of whole images or parts thereof based on interpolation, e.g. bilinear interpolation
    • G06N 3/0455: Neural networks; auto-encoder networks; encoder-decoder networks
    • G06N 3/09: Neural network learning methods; supervised learning
    • G06V 10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V 20/46: Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames
    • G06T 2207/10016: Image acquisition modality; video; image sequence
    • G06T 2207/20021: Dividing image into blocks, subimages or windows
    • G06T 2207/20081: Training; learning
    • G06T 2207/30168: Image quality inspection
    • Y02P 90/30: Computing systems specially adapted for manufacturing


Abstract

An embodiment of the present application discloses a video image quality evaluation method and an electronic device. The method includes: determining at least one image frame from a target video to be evaluated, and dividing the image frame into a plurality of image blocks in the spatial dimension; inputting the image blocks into an image quality evaluation model that comprises a backbone network, a first fully-connected layer and a second fully-connected layer, where the first fully-connected layer outputs an image quality evaluation score for each image block and the second fully-connected layer outputs a weight for each image block; performing a weighted calculation on the per-block quality scores according to the per-block weights to obtain the quality score of the image frame; and determining the quality score of the target video from the quality scores of the at least one image frame. Embodiments of the present application can improve the accuracy of the image quality evaluation model.

Description

Video image quality evaluation method and electronic device
Technical Field
The present disclosure relates to the field of video image quality assessment, and in particular to a video image quality assessment method and an electronic device.
Background
The quality of video content is affected not only by resolution (higher resolution does not necessarily mean higher quality, but it does determine the upper bound of achievable image quality), but also by factors such as the quality of the source material, the video bitrate, and the video coding protocol. For example, much video material is recorded with mobile phones or cameras, so the final image quality depends heavily on the shooting equipment and footage. Many production teams or individuals choose to record at a resolution higher than the final delivery specification: to produce a 4K video, they may record at 5K, which makes later scaling and cropping easier. Because a 5K capture retains more detail and color, it allows cleaner color grading and clearer color rendition in post-production; even if only a 4K video is ultimately delivered, its overall image quality and color are better than those of a video recorded at 4K and then cropped. In addition, the shooting environment (weather, lighting) and shooting parameters (resolution, frame rate, focus, sensitivity, white balance) directly or indirectly affect the quality of the final footage, and post-production may involve multiple rounds of transcoding and compression (including the remastering of older movies and TV shows) that further change the image quality.
In other words, video content of the same resolution may differ in image quality. Therefore, when a streaming platform acquires video content from a provider, it usually needs to judge its quality, including the image quality (picture quality). By evaluating image quality, the platform can determine whether the content meets its admission criteria, or apply different distribution strategies to content of different quality. For example, content with better image quality can be grouped into a dedicated section and served at a higher bitrate to give users a better viewing experience, while content with poor image quality can be excluded from that section and served at a lower bitrate to save transmission resources.
Traditionally, the image quality of video content is judged manually. However, with the growth of the video entertainment and self-media industries, the number of videos a streaming platform must publish each day keeps increasing; purely manual inspection consumes substantial labor and makes timeliness hard to guarantee. Some video quality evaluation models exist in the prior art that assess content quality with deep learning networks, but because of their huge parameter counts and computational complexity they usually support only low-resolution input. For the 4K and higher-resolution content that is increasingly common in the ultra-high-definition era, the video must be downsampled to a low resolution to fit the model's input-size limit. This downsampling degrades the video image quality, so the model's evaluation results are inaccurate.
Disclosure of Invention
The present application provides a video image quality evaluation method and an electronic device that can improve the accuracy of an image quality evaluation model.
The application provides the following scheme:
a video quality assessment method, comprising:
determining at least one image frame from a target video to be subjected to image quality evaluation, and dividing the image frame into a plurality of image blocks in a space dimension;
inputting the plurality of image blocks into an image quality evaluation model, wherein the image quality evaluation model comprises a backbone network, a first fully-connected layer and a second fully-connected layer, the first fully-connected layer being used for outputting image quality evaluation scores of the image blocks and the second fully-connected layer being used for outputting weights of the image blocks;
performing weighted calculation on the image quality evaluation scores corresponding to the image blocks according to the weights corresponding to the image blocks, so as to obtain an image quality evaluation score of the image frame;
and determining the image quality evaluation score of the target video according to the image quality evaluation score of the at least one image frame.
Wherein dividing the image frame into a plurality of image blocks in the spatial dimension comprises:
dividing the image frame into a plurality of image blocks according to the size of the image frame and the maximum input size supported by the image quality assessment model so that the size of the image blocks is smaller than or equal to the maximum input size supported by the image quality assessment model.
The image quality evaluation model is obtained by supervised learning on a plurality of video samples in a training data set and their corresponding subjective image quality assessment scores;
wherein the subjective image quality assessment scores of the plurality of video samples are obtained by:
generating a plurality of sample pairs from the plurality of video samples when a subject labels the plurality of video samples;
playing the plurality of sample pairs in sequence, wherein while the two video samples of a pair are played simultaneously, operation options are provided for marking which of the two has the better or worse image quality;
recording the subject's labeling results for the plurality of sample pairs;
and determining the subjective image quality assessment scores of the video samples according to the labeling results from a plurality of subjects.
Wherein acquiring the plurality of video samples in the training data set comprises:
sampling a plurality of video contents from a plurality of video categories according to a video category division rule in a target system;
cutting the video content into a plurality of video segments of equal duration in the time dimension, and determining a video segment representing the video content from the plurality of video segments;
and selecting a target number of video segments from the video segments corresponding to the plurality of video contents according to video features in multiple dimensions, and determining the target number of video segments as the plurality of video samples.
A video sample processing method, comprising:
acquiring a plurality of video samples in a training data set;
generating a plurality of sample pairs from the plurality of video samples when a subject labels the plurality of video samples;
playing the plurality of sample pairs in sequence, wherein while the two video samples of a pair are played simultaneously, operation options are provided for marking which of the two has the better or worse image quality;
recording the subject's labeling results for the plurality of sample pairs;
and determining the subjective image quality assessment scores of the video samples according to the labeling results from a plurality of subjects, wherein the video samples and their corresponding subjective image quality assessment scores are used for training an image quality evaluation model.
Wherein playing the plurality of sample pairs in sequence comprises:
outputting the two video samples of the same sample pair to two display devices for simultaneous playing, the two display devices having the same resolution as each other and as the plurality of video samples.
Wherein the labeling process is carried out serially among a plurality of subjects;
and generating a plurality of sample pairs from the plurality of video samples comprises:
selecting video segments whose uncertainty is higher than a threshold according to the labeling results of previous subjects;
and, when generating sample pairs for a new subject, increasing the proportion of the video segments whose uncertainty is higher than the threshold among the sample pairs.
Wherein recording the subject's labeling results for the plurality of sample pairs comprises:
numbering the video samples and generating an N×N matrix, where N is the number of video samples, each element of the matrix corresponds to a sample pair, and the value of an element records the number of times the sample corresponding to its row index was marked as having better image quality than the sample corresponding to its column index;
and saving the subject's labeling results for the plurality of sample pairs by updating the element values at the corresponding positions in the matrix.
Wherein saving the subject's labeling results by updating the element values at the corresponding positions in the matrix comprises:
if a sample pair consists of the video sample of row x and the video sample of column y, and the labeling result is that the image quality of the video sample of row x is better than that of the video sample of column y, adding 1 to the element value at row x, column y of the matrix.
Wherein determining the subjective image quality assessment scores of the video samples according to the labeling results from the plurality of subjects comprises:
determining, from the sum of the element values in a video sample's row, the total number of times n1 the video sample was marked as having better image quality when compared with other video samples;
determining, from the sum of the element values in the video sample's column, the total number of times n2 the video sample was not marked as having better image quality when compared with other video samples;
and determining the subjective image quality assessment score of the video sample from the totals n1 and n2.
Wherein acquiring the plurality of video samples in the training data set comprises:
sampling a plurality of video contents from a plurality of video categories according to a video category division rule in a target system;
cutting the video content into a plurality of video segments of equal duration in the time dimension, and determining a video segment representing the video content from the plurality of video segments;
and selecting a target number of video segments from the video segments corresponding to the plurality of video contents according to video features in multiple dimensions, and determining the target number of video segments as the plurality of video samples.
Wherein the method further comprises:
upsampling or downsampling the video segments so that the plurality of video samples have the same resolution.
A video image quality assessment apparatus comprising:
an image dividing unit for determining at least one image frame from a target video to be subjected to image quality evaluation, and dividing the image frame into a plurality of image blocks in a spatial dimension;
an input unit for inputting the image blocks into an image quality evaluation model, the image quality evaluation model comprising a backbone network, a first fully-connected layer and a second fully-connected layer, wherein the first fully-connected layer is used for outputting image quality evaluation scores of the image blocks and the second fully-connected layer is used for outputting weights of the image blocks;
a calculating unit for performing weighted calculation on the image quality evaluation scores corresponding to the image blocks according to the weights corresponding to the image blocks, so as to obtain image quality evaluation scores of the image frames;
and a score determining unit for determining the image quality evaluation score of the target video according to the image quality evaluation score of the at least one image frame.
A video sample processing device, comprising:
a sample acquisition unit for acquiring a plurality of video samples in a training data set;
a sample pair generating unit for generating a plurality of sample pairs from the plurality of video samples when a subject labels the plurality of video samples;
a playing unit for playing the plurality of sample pairs in sequence, wherein while the two video samples of a pair are played simultaneously, operation options are provided for marking which of the two has the better or worse image quality;
a recording unit for recording the subject's labeling results for the plurality of sample pairs;
and a score determining unit for determining the subjective image quality assessment scores of the video samples according to the labeling results from a plurality of subjects, the video samples and their corresponding subjective image quality assessment scores being used for training an image quality evaluation model.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the preceding claims.
An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions which, when read and executed by the one or more processors, perform the steps of the method of any of the preceding claims.
Through the specific embodiments provided herein, the present application discloses the following technical effects:
according to the embodiment of the application, since the specific image quality evaluation model may include a backbone network, a first full-connection layer and a second full-connection layer, and the first full-connection layer and the second full-connection layer are respectively used for outputting image quality evaluation scores and weights, when the image quality evaluation is specifically performed on a target video, in the target video with higher resolution, an image frame in the video can be divided into a plurality of image blocks in space dimension, and then input into the image quality evaluation model for evaluation, so that the image quality evaluation model can output the image quality evaluation scores of each image block and corresponding adaptive weights, and then weighting calculation processing can be performed on the image quality evaluation scores respectively corresponding to the image blocks according to the weights respectively corresponding to the image blocks, so as to obtain the image quality evaluation scores of the image frames, and then the image quality evaluation scores of the target video can be determined according to the image quality evaluation scores of at least one image frame. In this way, the size of the input data of the image quality assessment model can be within the range of the input size supported by the image quality assessment model, and the resolution of a single image block is not reduced due to the cutting mode in the space dimension, so that the image quality of the image block is not affected, and the accuracy of the assessment result output by the image quality assessment model is improved.
In a preferred embodiment, a plurality of video clips in the training data set are formed into sample pairs, and a subject, after watching the two clips of a pair, simply selects the one with the better (or worse) image quality to complete the labeling. The labeling results given by multiple subjects over many sample pairs are then converted into subjective image quality assessment scores for the clips, which are used for supervised training of the image quality evaluation model. Compared with asking subjects to assign quality scores directly, this simplifies the labeling process and lowers the participation threshold. Moreover, choosing the better or worse of two simultaneously played clips tends to elicit more accurate judgments, which improves the accuracy of the converted subjective scores; supervising the model with more accurate subjective scores in turn improves the training effect and thus the accuracy of the objective scores the model outputs.
Of course, a product implementing the present application need not achieve all of the above advantages at the same time.
Drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person skilled in the art could derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application;
FIGS. 2-1 to 2-3 are schematic diagrams illustrating recording modes of labeling information according to embodiments of the present application;
FIG. 3 is a flow chart of a first method provided by an embodiment of the present application;
FIG. 4 is a flow chart of a second method provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a first apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a second apparatus provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein, without inventive effort, fall within the scope of protection of the present application.
In the embodiments of the present application, in order to evaluate video image quality more accurately with an image quality evaluation model, improvements are made both to model training and to the model structure. Specifically, the following procedure may be adopted: establish a training data set, carry out subjective quality assessment experiments on the data in the set to obtain subjective image quality assessment scores for the videos, and use the training data set together with those scores to perform supervised training of the image quality evaluation model. The trained model can then be used to evaluate the image quality of specific videos. In addition, the structure of the model is improved so that it better handles 4K, 5K and even higher-resolution video.
In the prior art, training data sets are usually taken from open-source video image quality assessment data sets. However, in implementing the present application, the inventors found that a streaming platform may involve many video categories, and different platforms classify them differently; for example, one platform's content categories might include category-A movies, category-A TV series, category-B movies, category-B TV series, category-C movies, category-C TV series, category-D children's/education/variety/documentary content, category-E children's/education/variety/documentary content, and animation. Existing open-source data sets generally cannot fully cover the many categories found on a platform, and therefore cannot provide subjective image quality assessment scores that match the distribution of the platform's content. A model trained on such data sets cannot produce image quality scores matched to the specific application.
Therefore, to ensure accurate quality evaluation of the video content on a specific streaming platform, the data set can be built according to that platform's own category scheme. For the platform in the example above, the collected data set would cover category-A movies, category-A TV series, category-B movies, category-B TV series, category-C movies, category-C TV series, category-D children's/education/variety/documentary content, category-E children's/education/variety/documentary content, and animation, so that the data is well distributed across the different categories.
In addition, to keep the video duration suitable for subjective quality assessment experiments, in a preferred mode the video content selected in each category can be cut into segments of 10 seconds (or another length), and one segment per video content can then be selected to form a candidate pool. Next, the pool is sampled according to video features such as duration, spatial complexity, temporal complexity, noise, contrast and exposure (all but duration can be computed algorithmically) to obtain N ten-second clips; for example, from a pool of 1000 clips, 100 might be sampled based on these features. These clips form the training data set. Because the data is collected per category and sampled by video features, it is distributed comprehensively across categories, feature ranges and other dimensions, so the trained image quality evaluation model produces more accurate results when evaluating video on the specific platform. A sketch of this selection process is given below.
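By way of illustration only, the following Python sketch shows one way the pool construction and feature-spread sampling described above could be organized; the Clip record, the feature_key discretizer and all names here are hypothetical and do not come from the patent.

    import random
    from dataclasses import dataclass, field

    @dataclass
    class Clip:
        content_id: str          # source video this 10-second segment came from
        category: str            # platform category of the source video
        features: dict = field(default_factory=dict)  # complexity, noise, contrast, exposure...

    def build_training_set(clips_by_content, target_n, feature_key, rng=random):
        """Pick one representative clip per video content, then sample target_n
        clips spread across coarse feature buckets (round-robin over buckets)."""
        pool = [rng.choice(clips) for clips in clips_by_content.values()]
        buckets = {}
        for clip in pool:
            buckets.setdefault(feature_key(clip.features), []).append(clip)
        samples = []
        while len(samples) < target_n and any(buckets.values()):
            for key in list(buckets):
                if buckets[key] and len(samples) < target_n:
                    samples.append(buckets[key].pop())
        return samples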
To make the image quality of the clips more comparable, specific clips can be upsampled or downsampled so that all clips share the same resolution. Upsampling enlarges the image resolution and can be implemented by bilinear interpolation or similar methods; downsampling reduces the resolution, for example by decimating pixels from the original image. That is, if the model is trained to evaluate 4K video, some collected clips may have a resolution below 4K and some above; each such clip can be converted into a 4K clip by upsampling or downsampling as appropriate, as in the sketch below.
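A minimal PyTorch sketch of the resampling step, assuming frames are held as float tensors; bilinear interpolation for upscaling follows the text above, while area averaging stands in for the pixel-decimation downscale (an editorial assumption).

    import torch
    import torch.nn.functional as F

    def to_uniform_resolution(frames, target_hw=(2160, 3840)):
        """frames: (T, 3, H, W) float tensor; returns frames at target_hw.
        Upsample with bilinear interpolation, downsample with area averaging."""
        t, c, h, w = frames.shape
        if (h, w) == target_hw:
            return frames
        if h < target_hw[0] or w < target_hw[1]:
            return F.interpolate(frames, size=target_hw, mode="bilinear",
                                 align_corners=False)
        return F.interpolate(frames, size=target_hw, mode="area")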
After the training data set is obtained, each clip in it must be labeled. Since the purpose of training is for the model to output an image quality score when given a video, a conventional approach would annotate each clip with a quality score by manual labeling (such scores are called subjective image quality assessment scores because they come from human judgment): multiple subjects would watch each clip and assign it a score, say between 0 and 10. However, in implementing the present application, the inventors found that in this setting the task is not to score a single distortion dimension or defect type, but to evaluate videos of different content, sharpness, and defect types and degrees. Asking a subject to produce an absolute score after watching a video is difficult, and such scores poorly represent a video's quality level relative to the whole data set. Therefore, the embodiments of the present application run the subjective experiment as a video comparison: the subject only has to select which of two simultaneously played clips has the better (or worse) subjective image quality. Although it is hard for a subject to assign a precise score to a single video, indicating which of two videos looks better is comparatively easy, and the resulting labels are more accurate.
Thus, when a subject participates in the experiment, multiple sample pairs can be generated from the training data set, each containing two video clips. The two clips of a pair are played simultaneously, playback then pauses for several seconds, and the subject indicates which of the two has the better (or worse) image quality; the next pair is then played, and so on. In a preferred mode, since the clips in the training data set may share a relatively high resolution, two display devices of the corresponding resolution (for example, two 4K screens) can be used to obtain more accurate comparisons: the two clips of a pair are output to the two 4K screens, and the subject compares them side by side. For convenience of labeling, an ordinary computer monitor can additionally show a user interface with operation options for marking the better or worse of the two currently playing clips, through which the subject records the judgment.
For example, assuming all clips are 4K, a test system for the subjective experiment may be as shown in FIG. 1, comprising a test host 11, a host display 12, and two 4K display devices 13 and 14. Each time a subject participates, the test host generates a number of sample pairs from the clips in the training data set, for example 100 pairs from 100 clips. The two clips of each pair are output to the two 4K displays for simultaneous playing, and the subject records the comparison result through the test interface shown on the host display (see 15 in FIG. 1), selecting whether the clip on the left or the right display has the better or worse quality. The same procedure is repeated for the next subject, yielding pairwise comparison results from multiple subjects over multiple sample pairs. A minimal sketch of per-subject pair generation follows.
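A sketch of drawing a per-subject subset of pairs, assuming (as in the example above) each subject sees a fixed number of pairs taken from all pairwise combinations; the names are illustrative.

    import itertools
    import random

    def pairs_for_subject(n_clips, pairs_per_subject=100, rng=random):
        """Draw a per-subject subset of all C(n_clips, 2) index pairs."""
        all_pairs = list(itertools.combinations(range(n_clips), 2))
        return rng.sample(all_pairs, min(pairs_per_subject, len(all_pairs)))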
In a preferred embodiment, to conveniently store the comparison results, the N video clips in the training data set can be numbered and an N×N matrix generated, in which each element corresponds to a sample pair; for example, the element at row x, column y represents the pair formed by clip x and clip y, so every pair drawn from the N clips has a position in the matrix. The value of each element records how many times the sample corresponding to its row index was marked as better than the sample corresponding to its column index. Initially all elements are 0; after a subject chooses the better clip of a pair, the two elements corresponding to that pair are located, the one whose row index is the clip marked better is selected, and its value is incremented by 1.
For example, suppose the training data set contains 5 clips, A, B, C, D and E. In the initial state the matrix is as shown in FIG. 2-1, with all elements 0. Suppose 5 sample pairs are generated for the first subject, namely AB, BC, AD, DE and CE, and the clips judged better are A, C, A, E and E respectively; the element values are then updated as shown in FIG. 2-2 by adding 1 at the positions (1,2), (3,2), (1,4), (5,4) and (5,3). For instance, incrementing position (1,2) records that clips A and B were compared and the current subject considered clip A better; the other positions are handled analogously. Suppose a new subject then participates and 5 new pairs are generated, namely BD, AC, CD, AE and BC, with better-quality choices D, A, C, A and B; the matrix is updated as shown in FIG. 2-3 by adding 1 at positions (4,2), (1,3), (3,4), (1,5) and (2,3). In this way element values accumulate across the matrix. The subjective quality distribution of the clips can be computed from the updated matrix, and the experiment can stop once the change in the quality distribution of the N clips falls below a set threshold. A sketch of the matrix update follows.
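The matrix bookkeeping above can be sketched in NumPy as follows (0-based indices, so the patent's 1-based position (1,2) is M[0,1] here).

    import numpy as np

    N_CLIPS = 5                                   # clips A..E in the example
    M = np.zeros((N_CLIPS, N_CLIPS), dtype=np.int64)

    def record(M, better, worse):
        """Row index = clip judged better; +1 per judgment."""
        M[better, worse] += 1

    # First subject: pairs AB, BC, AD, DE, CE; judged better: A, C, A, E, E
    for better, worse in [(0, 1), (2, 1), (0, 3), (4, 3), (4, 2)]:
        record(M, better, worse)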
Note that in the preferred embodiment the subjects participate serially: after subject A finishes viewing and labeling M sample pairs, M pairs are reselected and subject B is tested. For a single subject, usually not all possible pairs are shown; for example, with 100 clips the number of pairwise combinations is 100×99/2 = 4950, so 4950 pairs could be generated, but each subject may see only 100 of them. When the next subject participates, 100 pairs are regenerated and may differ, and only some of the 100 clips will appear together in the same pair. Because each subject's labels yield a partial picture of subjective quality over the current data set, individual clips begin to show different degrees of certainty: some clips have relatively stable quality, being marked better in most of the pairs in which they appear, while others have higher uncertainty, being marked better in roughly as many pairs as not.
For clips whose quality is already certain, repeated testing is generally unnecessary, whereas clips with higher uncertainty should appear more often in subsequent tests so that more certain, more representative subjective results can be obtained for them. For this reason, in a preferred embodiment, when sample pairs are generated for a new subject, clips whose uncertainty exceeds a threshold are selected based on the labeling results of previous subjects and are given a higher probability of appearing in the pairs, increasing their proportion. In other words, more new subjects evaluate the high-uncertainty clips, producing more certain evaluations for them. A sketch of such uncertainty-weighted sampling follows.
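One plausible realization of the uncertainty-weighted pair generation described above; the uncertainty measure and the boosting scheme are editorial assumptions, not specified in the patent.

    import numpy as np

    def clip_uncertainty(M):
        """1.0 when a clip wins about half its comparisons so far,
        0.0 when it always wins or always loses."""
        wins = M.sum(axis=1).astype(float)
        losses = M.sum(axis=0).astype(float)
        ratio = wins / np.maximum(wins + losses, 1.0)
        return 1.0 - 2.0 * np.abs(ratio - 0.5)

    def pairs_for_new_subject(M, n_pairs, threshold=0.5, boost=3.0,
                              rng=np.random.default_rng()):
        """Sample pairs so clips with uncertainty above threshold appear more often."""
        u = clip_uncertainty(M)
        weights = 1.0 + boost * (u > threshold)
        p = weights / weights.sum()
        n_pairs = min(n_pairs, len(u) * (len(u) - 1) // 2)
        pairs = set()
        while len(pairs) < n_pairs:
            i, j = rng.choice(len(u), size=2, replace=False, p=p)
            pairs.add((min(i, j), max(i, j)))
        return sorted(pairs)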
After the experiments are complete, the subjective image quality assessment score of each clip can be determined from the labels of all subjects. Specifically, when the results are stored in the matrix by adding 1 at row x, column y whenever the clip of row x is judged better than the clip of column y, the sum of a clip's row gives the total number n1 of comparisons in which it was marked better, and the sum of its column gives the total number n2 of comparisons in which it was not marked better. Their sum n1 + n2 is the total number of times the clip appeared in sample pairs (accumulated over all subjects), so n1/(n1+n2) is the fraction of its comparisons in which it was marked better, and this ratio can serve as the clip's subjective image quality assessment score, as in the sketch below. Of course, the score can also be determined in other ways, for example by converting the matrix data into per-sample subjective quality scores through maximum likelihood estimation, which is not detailed here.
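The n1/(n1+n2) conversion, continuing the NumPy sketch above:

    import numpy as np

    def subjective_scores(M):
        """Per-clip score n1/(n1+n2): the fraction of its comparisons in
        which the clip was marked as having the better image quality."""
        n1 = M.sum(axis=1)                   # row sums: times marked better
        n2 = M.sum(axis=0)                   # column sums: times not marked better
        return n1 / np.maximum(n1 + n2, 1)   # guard clips never compared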
Once the subjective image quality assessment score of each clip in the training data set is obtained, it can serve as the clip's label for training the image quality evaluation model, giving the model the ability to assign accurate quality scores to new video content.
For the image quality evaluation model, a Swin Transformer or similar network can be used as the backbone for feature extraction. The Swin Transformer performs well across a wide range of visual tasks. Specifically, it takes an H×W×3 image as input (H is the image height, W the width, and 3 the number of RGB channels) and outputs features of shape (H/32)×(W/32)×8C (C is a hyperparameter). A regression head consisting of several fully-connected layers is appended to the Swin Transformer's output features to produce a single numerical output.
Specifically, multiple image frames of a video clip can be used as inputs to the image quality evaluation model, each frame inheriting the subjective score of the clip it belongs to; for example, if a clip's subjective score is 0.8, the frames taken from it are all labeled 0.8, and the model is supervised with these per-frame scores. However, because the backbone has a huge number of parameters, there is a limit on the maximum supported input size; for example, the Swin Transformer here supports image inputs with H and W up to 386×386, whereas the clips in the training data set may be 4K or higher, a 4K frame measuring 4096×2160. Such a frame cannot be fed directly into a network like the Swin Transformer, and if, following the prior art, the high-resolution frames were downsampled to within 386×386, the image quality would be severely degraded and the quality assessment inaccurate.
Therefore, the embodiments of the present application further provide a video quality evaluation strategy based on a spatial attention mechanism: an image frame of a high-resolution video is divided in the spatial dimension into multiple image blocks, each of a size the model can support. For the Swin Transformer network, a frame can be divided into a batch of N image blocks of 386×386×3, which the network maps to features of shape N×(H/32)×(W/32)×8C. Because the frame is divided in the spatial dimension, the resolution of each block is unchanged and its image quality unaltered, so the accuracy of the evaluation result is not compromised. A sketch of the tiling follows.
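A PyTorch sketch of the spatial tiling; edge handling (here, zero-padding so H and W divide evenly) is an editorial assumption, as the patent does not specify it.

    import torch
    import torch.nn.functional as F

    def split_into_blocks(frame, block=386):
        """frame: (3, H, W) tensor -> (N, 3, block, block) batch of blocks."""
        _, h, w = frame.shape
        pad_h = (block - h % block) % block
        pad_w = (block - w % block) % block
        frame = F.pad(frame, (0, pad_w, 0, pad_h))     # pad right and bottom
        blocks = frame.unfold(1, block, block).unfold(2, block, block)
        return blocks.permute(1, 2, 0, 3, 4).reshape(-1, 3, block, block)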
However, the subjective experiments produce scores at the granularity of video clips in the training data set, not per image block, so a training process that takes image blocks as input cannot be supervised directly with the clip-level subjective scores. The network structure of the model is therefore further improved: two fully-connected layers are appended after the backbone (in a fully-connected layer, every node connects to all nodes of the previous layer; it integrates the extracted features, mapping the learned distributed feature representation to the label space and outputting a classification or regression result). The output of one fully-connected layer represents the quality scores of the N image blocks, and the output of the other represents their spatially adaptive weights. The N block scores are then combined with the N weights by weighted summation, weighted averaging, or the like, to produce the quality score of the whole frame. During training, the parameters of the network are adjusted until the computed frame score is as close as possible to the frame's subjective score (which equals that of the clip it came from) and the algorithm converges, completing the training. A sketch of this two-head structure and one training step follows.
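A sketch of the two-head structure and one supervised step. The backbone here is any feature extractor returning one pooled vector per block (standing in for the Swin Transformer features); softmax normalization of the weights and the MSE loss are editorial assumptions, since the patent only requires a weighted combination pushed toward the clip-level subjective score.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BlockQualityModel(nn.Module):
        def __init__(self, backbone, feat_dim):
            super().__init__()
            self.backbone = backbone                   # (N, 3, H, W) -> (N, feat_dim)
            self.score_head = nn.Linear(feat_dim, 1)   # first FC layer: block scores
            self.weight_head = nn.Linear(feat_dim, 1)  # second FC layer: block weights

        def forward(self, blocks):
            feats = self.backbone(blocks)                        # (N, feat_dim)
            scores = self.score_head(feats).squeeze(-1)          # (N,)
            weights = torch.softmax(self.weight_head(feats).squeeze(-1), dim=0)
            return (weights * scores).sum()                      # frame-level score

    def train_step(model, optimizer, blocks, clip_score):
        """One supervised step: the frame score is pushed toward the subjective
        score (a 0-dim tensor) of the clip the frame was taken from."""
        optimizer.zero_grad()
        loss = F.mse_loss(model(blocks), clip_score)
        loss.backward()
        optimizer.step()
        return loss.item()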
After training, the image quality evaluation model can evaluate new video content and output its quality score. Specifically, for a target video requiring evaluation (for example, a video newly uploaded to a streaming platform by a producer), at least one image frame is first determined; the frame is divided into image blocks in the spatial dimension, and the blocks are input into the model, which outputs a quality score and a weight for each block. A linear calculation module then performs the weighted calculation over the per-block scores using the per-block weights to obtain the frame's quality score, and the video's quality score is determined from the frame scores, for example by averaging the scores of several frames of the same target video.
In summary, the embodiments of the present application improve on the prior art in several respects: the construction of the training data set, the collection of subjective image quality assessment scores, and the structure of the image quality evaluation model. Sampling within each category of a specific streaming platform balances the data across categories, so the resulting model better matches the platform's category distribution and yields more accurate evaluations. For subjective scoring, subjects compare clips pairwise, marking the better or worse one, and the pairwise results are converted into per-clip subjective scores; compared with asking subjects to score clips directly, this yields more accurate labels and lowers the participation threshold. Finally, in terms of model structure, a whole frame is divided into blocks that are fed to the model, and two fully-connected layers after the backbone output the blocks' quality scores and adaptive weights respectively, the frame score being a weighted combination of the block scores. This lets the model evaluate high-resolution frames directly; compared with downsampling frames before input, it improves accuracy, and it also lets clip-level subjective scores supervise the model's training.
In practical applications, these aspects can be combined so that the final model has all of the above advantages, or only one or a few of them can be implemented. For example, the subjective experimental method provided herein can supply the subjective scores while the model downsamples high-resolution frames into its supported input range in the conventional way; or, conversely, high-resolution frames can be split into blocks whose scores and weights are output by the two fully-connected layers, while the clips in the training data set are labeled with quality scores directly by subjects in the traditional manner, and so on.
Based on the foregoing, the following describes a technical solution specifically provided in the embodiments of the present application.
Example 1
First, this embodiment describes how a trained image quality evaluation model is used to evaluate the image quality of video content. Specifically, Embodiment 1 provides a video image quality evaluation method which, referring to FIG. 3, may include:
s301: at least one image frame is determined from a target video to be subjected to image quality assessment, and the image frame is divided into a plurality of image blocks in a spatial dimension.
The target video may be a video submitted, uploaded or published by a content producer to a streaming platform. The platform needs to evaluate its image quality to determine whether it meets the platform's admission rules (for example, videos of very poor quality may be barred from publication so as not to harm the viewing experience), or to decide a distribution strategy based on the evaluation result, for example offering videos of good image quality to users through a dedicated channel.
In the embodiments of the present application, the image quality evaluation model may adopt the structure described above, with two fully-connected layers after the backbone. Since the target video typically has a relatively high resolution, its image frames are first processed in blocks, i.e. each frame is divided into multiple image blocks in the spatial dimension. When dividing, the size of each block is kept within the range the model supports: the frame is divided according to its own size and the maximum input size supported by the model, so that each block is no larger than that maximum input size.
S302: the plurality of image blocks are respectively input into an image quality evaluation model, where the image quality evaluation model includes a backbone network, a first fully-connected layer and a second fully-connected layer; the first fully-connected layer is used for outputting the image quality evaluation scores of the image blocks, and the second fully-connected layer is used for outputting the weights of the image blocks.
After the plurality of image blocks are obtained, they may be input into the image quality evaluation model. Since the image quality evaluation model in the embodiment of the present application includes the backbone network, the first fully-connected layer and the second fully-connected layer, the image quality evaluation score and the weight of each image block can be output through the first fully-connected layer and the second fully-connected layer, respectively.
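The backbone-plus-two-heads structure can be sketched in PyTorch as follows. The embodiment does not prescribe a particular backbone, so the ResNet-18 trunk, the 512-dimensional feature size, and the ReLU used to keep the weights positive are all assumptions made for illustration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoHeadIQAModel(nn.Module):
    """Backbone plus two fully-connected heads: one head outputs a
    per-block quality score, the other a per-block weight."""
    def __init__(self):
        super().__init__()
        trunk = resnet18(weights=None)  # assumed backbone, no pretrained weights
        self.backbone = nn.Sequential(*list(trunk.children())[:-1])  # drop classifier
        self.score_head = nn.Linear(512, 1)   # first fully-connected layer
        self.weight_head = nn.Linear(512, 1)  # second fully-connected layer

    def forward(self, blocks: torch.Tensor):
        # blocks: (num_blocks, 3, H, W), one batch entry per image block
        feats = self.backbone(blocks).flatten(1)     # (num_blocks, 512)
        scores = self.score_head(feats).squeeze(-1)  # per-block quality scores
        weights = torch.relu(self.weight_head(feats)).squeeze(-1) + 1e-6  # positive weights
        return scores, weights
```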
S303: weighted calculation processing is performed on the image quality evaluation scores corresponding to the image blocks according to the weights corresponding to the image blocks, so as to obtain the image quality evaluation score of the image frame.
The weighted calculation process herein may include weighted summation, weighted averaging, and the like.
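For instance, using the per-block scores and weights from the model sketch above, S303 and S304 might be realized as follows; normalized weighted averaging is only one of the admissible options (plain weighted summation would also fit the embodiment):

```python
import torch

def frame_score(scores: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """S303: weighted average of per-block scores using per-block weights."""
    return (scores * weights).sum() / weights.sum()

def video_score(frame_scores: list) -> torch.Tensor:
    """S304: average the per-frame scores into a video-level score."""
    return torch.stack(frame_scores).mean()
```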
S304: the image quality evaluation score of the target video is determined according to the image quality evaluation score of the at least one image frame.
After obtaining the image quality evaluation score of at least one image frame, the image quality evaluation score of the target video may be further determined, for example, an average value of the image quality evaluation scores of a plurality of image frames may be taken as the image quality evaluation score of the target video, and so on.
Specifically, the image quality evaluation model may be obtained through supervised learning on a plurality of video samples in a training data set and their corresponding subjective image quality evaluation scores. In a preferred embodiment, the subjective image quality assessment scores of the plurality of video samples are obtained as follows: in the process of having subject users label the plurality of video samples, a plurality of sample pairs are generated from the plurality of video samples; the plurality of sample pairs are played in sequence, where, while the two video samples in the same sample pair are played simultaneously, operation options are provided for labeling which of the two has better/worse image quality; the labeling results of the subject user on the plurality of sample pairs are recorded; and the subjective image quality assessment scores of the video samples are determined according to the labeling results corresponding to a plurality of subject users. The video samples and their corresponding subjective image quality assessment scores are then used to train the image quality assessment model.
In addition, in a preferred embodiment, when acquiring the training data set, in order to make the video samples in the data set more uniform in terms of category, video features and the like, a plurality of video contents may be sampled from a plurality of video categories according to the video category division rules in a target system. Each video content may then be cut in the time dimension into a plurality of video segments of equal length, and a video segment representing the video content may be determined from among them. The video segments corresponding to the plurality of video contents may then be sampled according to video features in multiple dimensions to obtain a target number of video segments, which are determined as the plurality of video samples.
In this embodiment, the image quality assessment model may include a backbone network, a first fully-connected layer and a second fully-connected layer, where the first fully-connected layer and the second fully-connected layer are used to output an image quality assessment score and a weight, respectively. When image quality assessment is performed on a target video with relatively high resolution, an image frame in the video may be divided into a plurality of image blocks in the spatial dimension and then input into the image quality assessment model, so that the model outputs the image quality assessment score and a corresponding adaptive weight for each image block. Weighted calculation may then be performed on the image quality assessment scores of the plurality of image blocks according to their respective weights to obtain the image quality assessment score of the image frame, and the image quality assessment score of the target video may in turn be determined from the image quality assessment score of the at least one image frame. In this way, the size of the input data stays within the input size supported by the image quality assessment model; and because the cutting is done in the spatial dimension, the resolution of a single image block is not reduced, so the image quality of the block is unaffected and the accuracy of the assessment result output by the model is improved.
Example 2
The second embodiment provides a video sample processing method, mainly directed at the process of obtaining the subjective image quality assessment scores of the video clips in the training data set. Referring to fig. 4, the method may include:
S401: a plurality of video samples in a training data set are acquired.
In particular, in order to make the video samples in the training data set as uniformly distributed as possible in terms of category, video features and the like, a plurality of video contents may be sampled from a plurality of video categories according to the video category division rules in a target system. Each video content is then cut in the time dimension into a plurality of video segments of equal length, and a video segment representing the video content is determined from among them. Then, according to video features in multiple dimensions, the video segments corresponding to the plurality of video contents are sampled to obtain a target number of video segments, which are determined as the plurality of video samples. In this way, the samples in the training data set are distributed uniformly across categories, video features and the like, which improves the training of the image quality assessment model and thus the accuracy of the assessment results it gives.
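A simplified sketch of this sampling pipeline is given below. The 10-second segment length, the choice of the middle segment as the representative, the `cut_segment` helper, and the shuffle-then-truncate stand-in for feature-based selection are all assumptions; the embodiment leaves these details open:

```python
import random

SEGMENT_SECONDS = 10  # assumed equal segment length

def build_training_samples(catalog, per_category, target_count):
    """catalog: {category: [content, ...]}, where each content exposes
    duration_s and a hypothetical cut_segment() helper."""
    candidates = []
    for category, contents in catalog.items():
        for content in random.sample(contents, min(per_category, len(contents))):
            n_segments = int(content.duration_s // SEGMENT_SECONDS)
            # one representative segment per content (here: the middle one)
            candidates.append(content.cut_segment(index=n_segments // 2,
                                                  length_s=SEGMENT_SECONDS))
    # stand-in for sampling across multi-dimensional video features
    random.shuffle(candidates)
    return candidates[:target_count]
```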
In addition, in the embodiment of the present application, subjects compare the image quality of the different video clips in the same sample pair and mark which is better/worse. However, the original video contents may not all have the same resolution, and video contents with different resolutions are not directly comparable in image quality. Therefore, in a preferred embodiment, the video segments may also be up-sampled or down-sampled so that the plurality of video samples have the same resolution before being compared by the subjects.
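Resolution normalization can be performed per frame, for example with OpenCV as sketched below; the 4K target resolution is an assumed value:

```python
import cv2

TARGET_W, TARGET_H = 3840, 2160  # assumed common resolution for all samples

def normalize_resolution(frame):
    """Up- or down-sample a frame so all video samples share one resolution."""
    return cv2.resize(frame, (TARGET_W, TARGET_H), interpolation=cv2.INTER_LINEAR)
```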
S402: in the process of having a subject user label the plurality of video samples, a plurality of sample pairs are generated according to the plurality of video samples.
After the training data set is acquired, multiple subjects may be invited to participate in the experiment. Specifically, each time a subject participates, a plurality of sample pairs may be generated from the plurality of video samples in the training data set, so that the subject compares the image quality of the video samples pair by pair.
In a specific implementation, the experiment may be carried out serially across multiple subject users. In that case, video clips whose uncertainty is higher than a threshold may be selected according to the labeling results of previous subject users, and when sample pairs are generated for a new subject user, the proportion of such high-uncertainty clips in the sample pairs may be increased. In this way, the more uncertain video clips appear more often in the sample pairs shown to subsequent subjects, yielding more labeling data, and hence more reliable labeling results, for those clips.
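One possible realization of this uncertainty-driven pair generation is sketched below. The uncertainty measure (Bernoulli variance of a clip's empirical "better" rate), the threshold, and the boost factor are assumptions; the embodiment only requires that high-uncertainty clips appear more often:

```python
import random

def pick_pairs(samples, win, total, num_pairs, threshold=0.2, boost=3):
    """Oversample clips whose labeling outcome is still uncertain.
    win[s] / total[s] is clip s's empirical 'better' rate so far."""
    def uncertainty(s):
        if total[s] == 0:
            return 0.25  # maximal Bernoulli variance: nothing known yet
        p = win[s] / total[s]
        return p * (1 - p)

    pool = []
    for s in samples:
        copies = boost if uncertainty(s) > threshold else 1
        pool.extend([s] * copies)  # uncertain clips appear more often

    pairs = []
    while len(pairs) < num_pairs:
        a, b = random.sample(pool, 2)
        if a != b:
            pairs.append((a, b))
    return pairs
```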
S403: the plurality of sample pairs are played in sequence, where, while the two video samples in the same sample pair are played simultaneously, operation options are provided for labeling which of the two video samples has better/worse image quality.
After the plurality of sample pairs are generated, they may be played in sequence. For the same sample pair, the two video clips can be played at the same time, making it convenient for the subject to compare them. In this embodiment, since the video clips may have a relatively high resolution, for example 4K, the two video samples in the same sample pair may be output to two display devices for simultaneous playing in order to obtain a more accurate comparison result; the two display devices have the same resolution, which is also the same as the resolution of the plurality of video samples. In this way, the two video clips in the same sample pair are played under identical playing conditions, which further improves their comparability.
In addition to the displays playing the sample pairs, an operation interface may be shown on another display, so that the subject can submit the labeling result for each sample pair through it, that is, select the better/worse clip in each pair and submit the choice via the operation interface.
S404: the labeling results of the subject user on the plurality of sample pairs are recorded.
The labeling results submitted by the subject user for the plurality of sample pairs are recorded as they are received. In an optional implementation, the plurality of video samples may be numbered and an N x N matrix generated, where N is the number of video samples. Each element of the matrix corresponds to a sample pair, and its value represents the number of times the sample corresponding to the row number was marked as better than the sample corresponding to the column number when the two were compared. The labeling results of the subject user on the plurality of sample pairs can then be saved by updating the element values at the corresponding positions in the matrix. For example, in one specific implementation, if a sample pair consists of the video sample numbered x and the video sample numbered y, and the labeling result is that the image quality of sample x is better than that of sample y, then the element at row x, column y of the matrix is incremented by 1.
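The matrix bookkeeping can be sketched directly, for example with NumPy; the sample count N is a placeholder:

```python
import numpy as np

N = 100  # assumed number of video samples
M = np.zeros((N, N), dtype=int)

def record_label(x: int, y: int, x_is_better: bool):
    """M[x, y] counts how many times sample x was marked better than sample y."""
    if x_is_better:
        M[x, y] += 1
    else:
        M[y, x] += 1
```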
S405: the subjective image quality assessment scores of the video samples are determined according to the labeling results corresponding to a plurality of subject users, where the video samples and the corresponding subjective image quality assessment scores are used for training an image quality assessment model.
While the labeling results of the multiple subject users are being collected, the change in the quality distribution of each video clip can be calculated, and if the change falls below a certain threshold, the experiment can be ended. In this way, the labeling results of a plurality of subject users are obtained and can then be converted into the subjective image quality assessment scores of the specific video samples.
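The stopping rule might be sketched as follows, assuming the per-clip scores are recomputed after each subject and compared with the previous round; the tolerance is an assumed value:

```python
import numpy as np

def distribution_converged(prev_scores: np.ndarray,
                           curr_scores: np.ndarray,
                           eps: float = 0.01) -> bool:
    """End the experiment once per-clip quality scores barely change
    between consecutive subjects."""
    return float(np.abs(curr_scores - prev_scores).max()) < eps
```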
The labeling results can be converted into subjective image quality assessment scores in various ways. For example, suppose the labeling results of the multiple subjects are stored in the matrix described above, where the element at row x, column y is incremented by 1 whenever sample x is marked as having better image quality than sample y. Then, for a given video sample, the sum of the element values in its row gives the total number of times n1 it was marked as having better image quality when compared with other video samples, and the sum of the element values in its column gives the total number of times n2 it was compared with other video samples but not marked as better. The subjective image quality assessment score of the video sample can then be determined from n1 and n2.
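Under the matrix convention above, n1 is a sample's row sum and n2 its column sum. One simple conversion, given here as an illustrative assumption since the embodiment does not fix the exact mapping, is the win rate n1 / (n1 + n2):

```python
import numpy as np

def subjective_scores(M: np.ndarray) -> np.ndarray:
    """Convert pairwise 'better' counts into per-sample scores in [0, 1]."""
    n1 = M.sum(axis=1)  # times each sample was marked better (its row)
    n2 = M.sum(axis=0)  # times it was compared but not marked better (its column)
    totals = n1 + n2
    # samples never compared default to a neutral 0.5
    return np.divide(n1, totals, out=np.full(M.shape[0], 0.5), where=totals > 0)
```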
In summary, with the second embodiment, a plurality of sample pairs can be generated from the plurality of video clips in the training data set, and a subject only needs to watch the two video clips in a sample pair and select the one with better/worse image quality to complete the labeling. The labeling results given by multiple subjects for the multiple sample pairs can then be converted into subjective image quality assessment scores of the video clips, which are used for supervised training of the image quality assessment model. Compared with having subjects directly assign image quality scores to video clips, this approach simplifies the labeling process and lowers the threshold for participation. Moreover, selecting the better/worse clip from two simultaneously played video clips tends to elicit more accurate labels from subjects, which in turn improves the accuracy of the subjective image quality assessment scores obtained by conversion. Finally, supervised training with these more accurate subjective scores improves the training of the image quality assessment model, and thus the accuracy of the objective image quality assessment scores it outputs.
For the parts of the first and second embodiments that are not described in detail, reference may be made to the descriptions of the other parts in the specification, and the details are not repeated here.
It should be noted that the embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the schemes described herein only within the scope permitted by the applicable laws and regulations of the relevant country, and on a compliant basis (for example, with the user's explicit consent, after the user has been duly informed, etc.).
Corresponding to the first embodiment, the embodiment of the present application further provides a video image quality evaluation apparatus; referring to fig. 5, the apparatus may include:
an image dividing unit 501 for determining at least one image frame from a target video to be subjected to image quality evaluation, and dividing the image frame into a plurality of image blocks in a spatial dimension;
an input unit 502, configured to respectively input the plurality of image blocks into an image quality evaluation model, where the image quality evaluation model includes a backbone network, a first fully-connected layer and a second fully-connected layer; the first fully-connected layer is configured to output the image quality evaluation score of an image block, and the second fully-connected layer is configured to output the weight of the image block;
a calculating unit 503, configured to perform weighted calculation on the image quality evaluation scores corresponding to the image blocks according to the weights corresponding to the image blocks, so as to obtain the image quality evaluation score of the image frame;
a score determining unit 504, configured to determine an image quality evaluation score of the target video according to the image quality evaluation score of the at least one image frame.
The image dividing unit may specifically be configured to:
dividing the image frame into a plurality of image blocks according to the size of the image frame and the maximum input size supported by the image quality assessment model so that the size of the image blocks is smaller than or equal to the maximum input size supported by the image quality assessment model.
Specifically, the image quality evaluation model is obtained through supervised learning using a plurality of video samples in a training data set and their corresponding subjective image quality evaluation scores;
wherein subjective image quality assessment scores of the plurality of video samples are obtained by:
generating, in the process of having a subject user label the plurality of video samples, a plurality of sample pairs according to the plurality of video samples;
sequentially playing the plurality of sample pairs, where, while the two video samples in the same sample pair are played simultaneously, operation options are provided for labeling which of the two video samples has better/worse image quality;
recording labeling results of the subject user on the plurality of sample pairs;
and determining subjective image quality assessment scores of the video samples according to the labeling results corresponding to a plurality of subject users.
In particular, when the plurality of video samples in the training data set are acquired, a plurality of video contents may be sampled from a plurality of video categories according to the video category division rules in a target system; each video content is cut in the time dimension into a plurality of video segments of equal length, and a video segment representing the video content is determined from among them; and a target number of video segments are selected from the video segments corresponding to the plurality of video contents according to video features in multiple dimensions and determined as the plurality of video samples.
Corresponding to the second embodiment, the embodiment of the present application further provides a video sample processing device; referring to fig. 6, the device may include:
a sample acquiring unit 601, configured to acquire a plurality of video samples in a training data set;
a sample pair generating unit 602, configured to generate a plurality of sample pairs according to the plurality of video samples in the process of having a subject user label the plurality of video samples;
a playing unit 603, configured to sequentially play the plurality of sample pairs, where, while the two video samples in the same sample pair are played simultaneously, operation options are provided for labeling which of the two video samples has better/worse image quality;
a recording unit 604, configured to record labeling results of the subject user on the plurality of sample pairs;
a score determining unit 605, configured to determine subjective image quality assessment scores of the video samples according to the labeling results corresponding to a plurality of subject users, where the video samples and the corresponding subjective image quality assessment scores are used for training an image quality assessment model.
Wherein, the playing unit may specifically be used for:
outputting the two video samples in the same sample pair to two display devices for simultaneous playing, where the two display devices have the same resolution, which is also the same as the resolution of the plurality of video samples.
Specifically, the labeling process is carried out serially across a plurality of subject users;
the sample pair generating unit may specifically be configured to:
selecting video segments whose uncertainty is higher than a threshold according to the labeling results of previous subject users;
and, when generating sample pairs for a new subject user, increasing the proportion of the video segments whose uncertainty is higher than the threshold in the sample pairs.
In particular, the recording unit may be specifically configured to:
numbering the plurality of video samples and generating an N x N matrix, where N is the number of video samples, each element of the matrix corresponds to a sample pair, and the element value represents the number of times the sample corresponding to the row number was marked as better than the sample corresponding to the column number when the two were compared;
and saving the labeling results of the subject user on the plurality of sample pairs by updating the element values at the corresponding positions in the matrix.
Specifically, if a sample pair consists of the video sample corresponding to row x and the video sample corresponding to column y, and the labeling result is that the image quality of the former is better than that of the latter, the element value at row x, column y of the matrix is incremented by 1.
The score determining unit may specifically be configured to:
determining, according to the sum of the element values in the row of a given video sample, the total number of times n1 that the video sample was marked as having better image quality when compared with other video samples;
determining, according to the sum of the element values in the column of the video sample, the total number of times n2 that the video sample was compared with other video samples but not marked as having better image quality;
and determining the subjective image quality assessment score of the video sample according to the numbers n1 and n2.
In particular, the sample acquisition unit may be specifically configured to:
sampling a plurality of video contents from a plurality of video categories according to a video category division rule in a target system;
cutting each video content in the time dimension into a plurality of video segments of equal length, and determining a video segment representing the video content from among them;
and selecting a target number of video segments from the video segments corresponding to the plurality of video contents according to video features in multiple dimensions, and determining the target number of video segments as the plurality of video samples.
In addition, the apparatus may further include:
a resolution processing unit, configured to up-sample or down-sample the video segments so that the plurality of video samples have the same resolution.
In addition, the embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the method of any one of the foregoing method embodiments.
And an electronic device comprising:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the steps of the method of any one of the foregoing method embodiments.
Fig. 7 illustrates an architecture of an electronic device, which may include a processor 710, a video display adapter 711, a disk drive 712, an input/output interface 713, a network interface 714, and a memory 720, among others. The processor 710, the video display adapter 711, the disk drive 712, the input/output interface 713, the network interface 714, and the memory 720 may be communicatively connected via a communication bus 730.
The processor 710 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided herein.
The memory 720 may be implemented in the form of ROM (Read-Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 720 may store an operating system 721 used to control the operation of the electronic device 700, and a Basic Input Output System (BIOS) used to control low-level operation of the electronic device 700. In addition, a web browser 723, a data storage management system 724, a video image quality evaluation processing system 725 and the like may be stored. The video image quality evaluation processing system 725 may be an application program that implements the operations of the foregoing steps in the embodiments of the present application. In general, when the technical solutions provided herein are implemented in software or firmware, the relevant program code is stored in the memory 720 and executed by the processor 710.
The input/output interface 713 is used to connect with an input/output module to enable information input and output. The input/output module may be configured as a component in a device (not shown) or may be external to the device to provide corresponding functionality. Wherein the input devices may include a keyboard, mouse, touch screen, microphone, various types of sensors, etc., and the output devices may include a display, speaker, vibrator, indicator lights, etc.
The network interface 714 is used to connect communication modules (not shown) to enable communication interactions of the device with other devices. The communication module may implement communication through a wired manner (such as USB, network cable, etc.), or may implement communication through a wireless manner (such as mobile network, WIFI, bluetooth, etc.).
Bus 730 includes a path to transfer information between various components of the device (e.g., processor 710, video display adapter 711, disk drive 712, input/output interface 713, network interface 714, and memory 720).
It should be noted that although only the processor 710, the video display adapter 711, the disk drive 712, the input/output interface 713, the network interface 714, the memory 720 and the bus 730 are illustrated above, in a specific implementation the device may include other components necessary for proper operation. Furthermore, those skilled in the art will understand that the above-described device may also include only the components necessary to implement the solutions of the present application, rather than all the components shown in the figure.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments, or in some parts of the embodiments, of the present application.
In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system or apparatus embodiments are described relatively simply because they are substantially similar to the method embodiments, and reference may be made to the corresponding parts of the method embodiments. The systems and apparatus described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of an embodiment, and those of ordinary skill in the art can understand and implement this without creative effort.
The video image quality evaluation method and the electronic device provided by the present application have been described in detail above. Specific examples have been used herein to illustrate the principles and implementations of the present application, and the descriptions of the above embodiments are intended only to help understand the method and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the present application. In view of the foregoing, the contents of this specification should not be construed as limiting the present application.

Claims (14)

1. A video quality assessment method, comprising:
determining at least one image frame from a target video to be subjected to image quality evaluation, and dividing the image frame into a plurality of image blocks in a spatial dimension;
respectively inputting the plurality of image blocks into an image quality evaluation model, wherein the image quality evaluation model comprises a backbone network, a first fully-connected layer and a second fully-connected layer, the first fully-connected layer is used for outputting image quality evaluation scores of the image blocks, and the second fully-connected layer is used for outputting weights of the image blocks;
performing weighted calculation processing on the image quality evaluation scores corresponding to the image blocks according to the weights corresponding to the image blocks, so as to obtain an image quality evaluation score of the image frame;
and determining the image quality evaluation score of the target video according to the image quality evaluation score of the at least one image frame.
2. The method according to claim 1, wherein
the dividing the image frame into a plurality of image blocks in a spatial dimension includes:
dividing the image frame into a plurality of image blocks according to the size of the image frame and the maximum input size supported by the image quality assessment model so that the size of the image blocks is smaller than or equal to the maximum input size supported by the image quality assessment model.
3. The method according to claim 1, wherein
the image quality evaluation model is obtained by supervised learning through a plurality of video samples in a training data set and corresponding subjective image quality evaluation scores;
wherein subjective image quality assessment scores of the plurality of video samples are obtained by:
generating, in the process of having a subject user label the plurality of video samples, a plurality of sample pairs according to the plurality of video samples;
sequentially playing the plurality of sample pairs, wherein, while the two video samples in the same sample pair are played simultaneously, operation options are provided for labeling which of the two video samples has better/worse image quality;
recording labeling results of the subject user on the plurality of sample pairs;
and determining subjective image quality assessment scores of the video samples according to the labeling results corresponding to a plurality of subject users.
4. The method according to claim 3, wherein
the acquiring a plurality of video samples in a training dataset includes:
sampling a plurality of video contents from a plurality of video categories according to a video category division rule in a target system;
cutting each video content in the time dimension into a plurality of video segments of equal length, and determining a video segment representing the video content from the plurality of video segments;
and selecting a target number of video segments from the video segments corresponding to the plurality of video contents according to video features in multiple dimensions, and determining the target number of video segments as the plurality of video samples.
5. A method of processing video samples, comprising:
acquiring a plurality of video samples in a training data set;
generating, in the process of having a subject user label the plurality of video samples, a plurality of sample pairs according to the plurality of video samples;
sequentially playing the plurality of sample pairs, wherein, while the two video samples in the same sample pair are played simultaneously, operation options are provided for labeling which of the two video samples has better/worse image quality;
recording labeling results of the subject user on the plurality of sample pairs;
and determining subjective image quality assessment scores of the video samples according to the labeling results corresponding to a plurality of subject users, wherein the video samples and the corresponding subjective image quality assessment scores are used for training an image quality assessment model.
6. The method according to claim 5, wherein
the sequentially playing the plurality of sample pairs includes:
outputting the two video samples in the same sample pair to two display devices for simultaneous playing, wherein the two display devices have the same resolution, which is also the same as the resolution of the plurality of video samples.
7. The method according to claim 5, wherein
the labeling process is carried out serially among a plurality of subject users;
the generating a plurality of sample pairs from the plurality of video samples includes:
selecting video segments with uncertainty higher than a threshold according to the labeling results of previous subject users;
and, when generating sample pairs for a new subject user, increasing the proportion of the video segments with uncertainty higher than the threshold in the sample pairs.
8. The method according to claim 5, wherein
the recording labeling results of the subject user on the plurality of sample pairs comprises:
numbering the plurality of video samples respectively, and generating an N x N matrix, wherein N is the number of the video samples, each element in the matrix corresponds to a sample pair, and the element value represents the number of times the sample corresponding to the row number was marked as better than the sample corresponding to the column number when the two were compared;
and saving the labeling results of the subject user on the plurality of sample pairs by updating the element values at the corresponding positions in the matrix.
9. The method according to claim 8, wherein
the saving the labeling results of the subject user on the plurality of sample pairs by updating the element values at the corresponding positions in the matrix comprises:
if the sample pair consists of the video sample corresponding to row x and the video sample corresponding to column y, and the labeling result is that the image quality of the video sample corresponding to row x is better than that of the video sample corresponding to column y, incrementing the element value at row x, column y of the matrix by 1.
10. The method according to claim 9, wherein
the determining subjective image quality assessment scores of the video samples according to the labeling results corresponding to the plurality of subject users comprises:
determining, according to the sum of the element values in the row of a given video sample, the total number of times n1 that the video sample was marked as having better image quality when compared with other video samples;
determining, according to the sum of the element values in the column of the video sample, the total number of times n2 that the video sample was compared with other video samples but not marked as having better image quality;
and determining the subjective image quality assessment score of the video sample according to the total numbers n1 and n2.
11. The method according to any one of claims 5 to 10, wherein,
the acquiring a plurality of video samples in a training dataset includes:
sampling a plurality of video contents from a plurality of video categories according to a video category division rule in a target system;
cutting each video content in the time dimension into a plurality of video segments of equal length, and determining a video segment representing the video content from the plurality of video segments;
and selecting a target number of video segments from the video segments corresponding to the plurality of video contents according to video features in multiple dimensions, and determining the target number of video segments as the plurality of video samples.
12. The method as recited in claim 11, further comprising:
up-sampling or down-sampling the video segments so that the plurality of video samples have the same resolution.
13. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 12.
14. An electronic device, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory being configured to store program instructions that, when read and executed by the one or more processors, perform the steps of the method according to any one of claims 1 to 12.