CN116416216A - Quality evaluation method based on self-supervision feature extraction, storage medium and terminal - Google Patents

Quality evaluation method based on self-supervision feature extraction, storage medium and terminal

Info

Publication number
CN116416216A
CN116416216A
Authority
CN
China
Prior art keywords
distortion
content
self
encoder
image
Prior art date
Legal status
Pending
Application number
CN202310188690.7A
Other languages
Chinese (zh)
Inventor
周泽宏
周飞
盛巍
邱国平
Current Assignee
Shenzhen University
Peng Cheng Laboratory
Original Assignee
Shenzhen University
Peng Cheng Laboratory
Priority date
Filing date
Publication date
Application filed by Shenzhen University, Peng Cheng Laboratory filed Critical Shenzhen University
Priority to CN202310188690.7A
Publication of CN116416216A
Legal status: Pending

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06N 3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; training; learning
    • G06T 2207/30168: Image quality inspection
    • Y02P 90/30: Computing systems specially adapted for manufacturing

Abstract

The invention discloses a quality evaluation method based on self-supervised feature extraction, a storage medium and a terminal, belonging to the technical fields of digital image processing and computer vision. A cooperative dual self-encoder is constructed; after self-supervised training, it extracts content features from an input content image and distortion features from an input distorted image. A content feature vector is obtained from the content features, and a distortion feature vector is obtained from the distortion features. The content feature vector and the distortion feature vector are concatenated to obtain a quality feature vector. The quality feature vector is passed through a fully connected layer to obtain a predicted quality score, and training with sample data carrying subjective opinion scores yields the trained quality prediction. Through the cooperation of the content self-encoder and the distortion self-encoder, the invention achieves efficient extraction of quality-related features, provides guidance for the digital image and video production workflow, and promotes the development of the digital image and video production industry.

Description

Quality evaluation method based on self-supervision feature extraction, storage medium and terminal
Technical Field
The invention belongs to the technical field of digital image processing and computer vision, and particularly relates to a quality evaluation method, a storage medium and a terminal based on self-supervision feature extraction.
Background
With continuing social, economic and technological development, demand for digital images and videos with high resolution, high frame rate, wide luminance range and wide color gamut has become widespread alongside the convenience of digital media transmission. To improve the consumer experience of the general public, the digital image and video production industry continually employs advanced hardware acquisition technology and digital image processing software to produce refined, high-quality images and videos. The final experience that produced digital images and videos bring to consumers is determined by the visual effect they present on the terminal device. Subjective visual quality assessment by consumers therefore has guiding significance for the development and advancement of the digital image and video production industry.
Objective quality evaluation aims to simulate the human visual system (HVS) with a computational model and to evaluate a given image or video in accordance with human visual characteristics, offering strong real-time performance, high repeatability and high cost-effectiveness. Providing feedback to the digital image and video production pipeline through a suitable objective quality evaluation method promotes the rapid development of the production industry to meet ever-growing consumer demand, which makes objective quality evaluation research significant.
Objective evaluation models can be divided into full-reference (FR), reduced-reference (RR) and no-reference (NR) models according to whether they require the complete reference image or video, partial information from the reference, or no reference information at all. No-reference quality evaluation is the most difficult to research because no reference information is available, but its range of application is the widest. Compared with reference-based methods, a no-reference algorithm must estimate the distortion characteristics of a distorted image from the distorted image alone and use them as the basis for quality evaluation. Early no-reference image quality evaluation algorithms were generally limited in scope: they addressed only a single distortion type, such as JPEG or JPEG2000 compression and blur, or estimated quality by jointly considering several factors that affect it, such as sharpness, contrast, noise and blocking artifacts. Researchers then proposed more general evaluation algorithms along two main lines: natural scene statistics (NSS) and dictionary construction. The first, and the mainstream of traditional algorithms, is NSS: neural processing in the HVS is considered to have adapted to visual information whose natural-image features follow certain typical statistical distributions. Once a natural image suffers distortion, its quality changes, and with it the statistical distribution that the features obey. On this basis, statistics of the extracted feature distributions are correlated with quality, and the correlation is used to learn a feature-to-quality mapping, most commonly with support vector regression (SVR). Common techniques for NSS-based quality feature extraction in recent years include mean-subtracted contrast-normalized (MSCN) coefficients for local luminance normalization, the discrete wavelet transform (DWT), the discrete cosine transform (DCT) and mixed-domain transforms; common fitted distributions include the generalized Gaussian distribution (GGD) and the asymmetric generalized Gaussian distribution (AGGD). However, NSS-based methods essentially rely on manually designed features and prior statistical distributions, which are difficult to adapt to diverse or complex distortion modeling and to rich image content, and thus struggle to achieve good performance on more demanding IQA tasks. The other line of general no-reference algorithms is based on dictionary construction: input image blocks are encoded into features using a dictionary built by unsupervised learning, and the features are then mapped to quality scores. However, dictionary construction is cumbersome, and a suitable dictionary must be rebuilt for each new image set, which hinders practical application; moreover, dictionary coding operates only on image blocks, with no design specific to distortion characteristics, and is inefficient. Consequently, neither NSS-based nor dictionary-based evaluation algorithms achieve satisfactory performance on more demanding quality evaluation tasks.
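For illustration, the following is a minimal sketch (not part of the invention) of the MSCN front end used by such classical NSS models; the Gaussian window scale and the stabilizing constant are typical values assumed here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def mscn_coefficients(image, sigma=7/6, c=1.0):
    """Mean-subtracted contrast-normalized (MSCN) coefficients, the local
    luminance normalization used by classical NSS-based models; a GGD is
    then fitted to these coefficients and its parameters fed to a
    regressor such as SVR."""
    image = image.astype(np.float64)
    mu = gaussian_filter(image, sigma)        # local Gaussian-weighted mean
    var = gaussian_filter(image * image, sigma) - mu * mu
    sigma_map = np.sqrt(np.abs(var))          # local standard deviation
    return (image - mu) / (sigma_map + c)
```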
As DNNs achieved breakthrough performance improvements in image recognition and image classification tasks, researchers began attempting to apply DNNs to quality evaluation tasks. The most direct method is end-to-end training of the network on sample data labeled with subjective mean opinion scores (MOS), jointly optimizing feature extraction and score mapping. However, data with subjective score labels are scarce: the largest labeled image quality evaluation databases in the prior art reach only tens of thousands of samples, so under an end-to-end training strategy it is difficult to design a network deep enough to improve the generalization performance of the evaluation model. To alleviate this problem, researchers have proposed various learning strategies to make full use of the existing labeled data. Besides the data augmentation commonly used in other tasks, researchers have more often tried to use auxiliary tasks to assist the quality assessment task, including multi-task learning of distortion type and degree, gradient extraction, image restoration to obtain reference images, segmentation-assisted content perception, ImageNet pre-trained networks, multi-database training, meta-learning, and so forth. Although these algorithms alleviate the shortage of samples to some extent, targeted training of networks on large-scale samples is still not achieved.
To design deeper neural networks while achieving a no-reference quality assessment algorithm with sufficiently strong generalization performance, some researchers began training the feature extraction module and the score regression module of a deep network separately. The feature extraction module is trained in depth on large-scale samples without subjective score labels, in a weakly supervised or unsupervised manner; the fixed feature extractor is then combined with the regression module and fine-tuned on a small set of samples with subjective score labels. Such algorithms are referred to herein as second-order learning methods. The proxy labels designed for the feature extraction module fall into two classes: one is a quality correlation map (such as an SSIM map) or a pseudo-label score obtained by an FR algorithm from the distorted image and its reference; the other assigns ranking labels to pairs of distorted images based on known prior knowledge and FR algorithm results, followed by pairwise training with a structure such as a Siamese network. Besides using known priors as the basis for ranking labels, distortion-type priors can also be used to pre-train a synthetic-distortion feature extractor, or used as labels for feature contrastive learning on a classification network. However, prior-based proxy labels are typically limited to distortions that can be described explicitly, so the labeled data can hardly cover complex and diverse distortions; proxy labels based on FR algorithm results depend on the accuracy and generalization of the chosen FR algorithm, and once the FR estimate is inaccurate, the training quality of the feature extraction process suffers.
It is therefore of great significance to design a framework that can complete training on large-scale unlabeled sample data and effectively extract quality-related features, with a training process unconstrained by any subjective label or any manually designed or manually assigned label, and then, using the trained quality feature extraction encoders, to obtain an efficient no-reference quality evaluator by further training a regressor.
In a conventional self-encoder, the input of the decoder comes only from the encoder, and the output of the decoder takes the encoder's input as its reconstruction target. The self-encoder then merely performs dimensionality reduction during encoding and decoding; it does not extract or separate content features from distortion features. The extracted features are therefore mixed features, and nothing drives the self-encoder to pay particular attention to the distortion information in the image.
Disclosure of Invention
The invention aims to provide a quality evaluation method, a storage medium and a terminal based on self-supervised feature extraction, so as to solve the above technical problems.
Therefore, the invention provides a quality evaluation method based on self-supervision feature extraction, which comprises the following steps:
constructing a cooperative double self-encoder, wherein the cooperative double self-encoder subjected to self-supervision training extracts content characteristics from an input content image and distortion characteristics from an input distortion image, and the input content image and the input distortion image are identical or have identical contents;
Obtaining a content feature vector according to the content feature, and obtaining a distortion feature vector according to the distortion feature;
splicing the content feature vector and the distortion feature vector to obtain a quality feature vector;
the quality feature vector is passed through a fully connected layer to obtain a predicted quality score, and training with sample data carrying subjective opinion scores yields the trained predicted quality score.
Preferably, the constructed cooperative dual self-encoder includes a constructed content self-encoder and a constructed distortion self-encoder, wherein the constructed content self-encoder includes a constructed content encoder and a constructed content decoder, the constructed distortion self-encoder includes a constructed distortion encoder and a constructed distortion decoder, the content encoder extracts content features from an input content image, the content decoder decodes the content features to obtain a reconstructed content image, the distortion encoder extracts distortion features from the input distortion image, and the distortion decoder decodes the content features and the distortion features to obtain a reconstructed distortion image.
Preferably, the self-supervised training of the cooperative dual self-encoders includes self-supervised training of the content self-encoder and self-supervised training of the distortion self-encoder,
The self-supervised training of the content self-encoder includes:
training the content self-encoder with undistorted images as training data, wherein the output of the content decoder is constrained to be the undistorted image itself;
training the content self-encoder with distorted images as training data, taking an undistorted image sharing the same content as the distorted image as the constraint on the output of the content decoder, and taking the content features of the undistorted image as the constraint on the content features of the distorted image;
the self-supervised training of the distortion self-encoder includes:
training the distortion self-encoder with distorted images as training data, wherein the output of the distortion decoder is constrained to be the distorted image itself, and a loss function guides back-propagation to optimize the distortion self-encoder.
Preferably, the output constraint functions of the content decoder and the distortion decoder are:
$$l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r) \qquad (1)$$

where $l_{overall}$ is the overall constraint, $\mu$ is the balance parameter, $l_{pixel}$ is the pixel-level constraint, $I_o$ is the distorted or undistorted image, $I_r$ is the reconstructed content image or reconstructed distorted image, and $l_{percp}$ is the perceptual constraint;

$$l_{pixel}(I_o, I_r) = \frac{1}{N}\sum_{k=1}^{N}\big(I_o(k) - I_r(k)\big)^2 \qquad (2)$$

where $N$ is the total number of pixels and $k$ indexes a pixel position of the image;

$$l_{percp}(I_o, I_r) = \sum_{j=1}^{L}\frac{1}{C_j H_j W_j}\big\lVert \phi_j(I_o) - \phi_j(I_r) \big\rVert_2^2 \qquad (3)$$

where $\phi_j$ denotes the VGGNet feature layer selected at level $j$, $L$ is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the $j$-th feature layer.
Preferably, when self-supervised training is carried out on the content self-encoder, sparse constraints and/or distance constraints are also applied to the extracted content features.
Preferably, the constructing the distortion encoder includes extracting different feature maps of a plurality of layers of the input distortion image, respectively feeding each feature map into the spatial pyramid pooling module for fusion to obtain a plurality of low-dimensional features with fixed lengths, and finally splicing the plurality of low-dimensional features to serve as final output of the distortion encoder.
Preferably, obtaining the content feature vector from the content features comprises passing the content features through a fusion network composed of a convolution layer and a global pooling layer to obtain the content feature vector.
Preferably, constructing the distortion decoder includes:
the distortion characteristics are subjected to a full connection layer to obtain extension characteristics, wherein the extension characteristics comprise distortion characteristics of each layer and associated information for representing the relationship between the distortion characteristics of each layer;
the extended features are divided into a plurality of decomposition features, which are sequentially used as one of the inputs of the sub-modulation residual blocks.
In addition, the present invention also provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the quality evaluation method based on self-supervised feature extraction as described above.
In addition, the invention also provides a terminal, which comprises: a processor and a memory having stored thereon a computer readable program executable by the processor; the processor, when executing the computer readable program, implements the steps in the quality assessment method based on self-supervised feature extraction as described above.
Compared with the prior art, the invention has the characteristics and beneficial effects that:
(1) The invention designs a cooperative dual self-encoder. A content self-encoder is first trained, mainly on high-quality, undistorted sample data, to obtain a content encoder that encodes content features weakly correlated with distortion information; a distortion self-encoder is then trained, mainly on distorted sample data, with the content features as auxiliary information to assist the extraction of distortion features. Efficient extraction of quality-related features is achieved through the cooperative optimization of the two self-encoders. The output of the cooperative dual self-encoder is supervised by its own input during training; the process requires no MOS label and is fully self-supervised. By this self-supervised feature extraction, the invention reduces dependence on data samples with subjective score labels, avoids dependence on manually designed or manually selected proxy labels, and completes the training of the feature extractor with rich and diverse training samples.
(2) The invention designs a no-reference quality evaluation method based on self-supervised feature extraction. The content encoder and distortion encoder trained within the cooperative self-encoder quality feature extraction framework extract content features and distortion features respectively from an input sample, and a lightweight regression network built on these two features predicts the quality score. The resulting efficient, general quality evaluation model provides guidance for the digital image and video production process and has research significance for promoting the development of the digital image and video production industry.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without creative effort for a person of ordinary skill in the art.
FIG. 1 is a flow chart of a quality evaluation method based on self-supervision feature extraction of the present invention;
FIG. 2a is a scatter plot colored by the distortion type of the test images;
FIG. 2b is a scatter plot colored by the quality score of the test images;
FIG. 3 is a performance comparison of no-reference quality assessment algorithms;
FIG. 4 is a schematic diagram of the overall frame of a cooperative dual self-encoder of the present invention;
FIG. 5 is a schematic diagram of a network structure of a content self-encoder according to the present invention;
FIG. 6 is a schematic diagram of a network architecture of a distortion encoder according to the present invention;
FIG. 7 is a schematic diagram of a network architecture of a distortion decoder according to the present invention;
FIG. 8 is a schematic diagram of a non-reference quality assessment network framework based on self-supervised feature extraction;
fig. 9 is a schematic structural diagram of a terminal device provided by the present invention.
Reference numerals in the drawings: 11-content encoder, 12-content decoder, 21-distortion encoder, 22-distortion decoder, 30-processor, 31-display, 32-memory, 33-communication interface, 34-bus.
Detailed Description
The invention provides a quality evaluation method, a storage medium and a terminal based on self-supervised feature extraction. To make the purposes, technical schemes and effects of the application clearer and more definite, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes all or any element and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Addressing the problems that the accuracy of existing quality evaluation trained on subjective score labels is insufficient, and that the features extracted by a traditional dual self-encoder mix content features with distortion features, with no way to drive the encoder to pay more attention to the distortion information in images, the invention is designed to extract efficient and effective quality-related features.
As shown in fig. 1, the quality evaluation method based on self-supervision feature extraction includes the following steps:
s10, constructing a cooperative double self-encoder, wherein the cooperative double self-encoder subjected to self-supervision training extracts content characteristics from an input content image and distortion characteristics from an input distortion image, and the input content image and the input distortion image are identical or have identical contents.
Specifically, constructing the collaborative dual self encoder includes constructing a content self encoder and constructing a distortion self encoder, wherein constructing the content self encoder includes constructing the content encoder 11 and constructing the content decoder 12, constructing the distortion self encoder includes constructing the distortion encoder 21 and constructing the distortion decoder 22, the content encoder 11 extracts content features from the input content image, the content decoder 12 decodes the content features to obtain a reconstructed content image, the distortion encoder 21 extracts distortion features from the input distortion image, and the distortion decoder 22 decodes the content features and the distortion features to obtain the reconstructed distortion image.
Constructing the distortion encoder 21 includes extracting different feature maps of multiple layers of the input distortion image, respectively feeding each feature map into a spatial pyramid pooling module for fusion to obtain multiple low-dimensional features with fixed lengths, and finally splicing the multiple low-dimensional features to serve as final output of the distortion encoder 21.
Constructing the distortion decoder 22 includes: 1. The distortion features are passed through a fully connected layer to obtain extension features, which comprise the per-level distortion features and association information representing the relationships among them. 2. The extension features are divided into a plurality of decomposition features, which in turn serve as one of the inputs of the sub-modulated residual blocks.
The self-supervised training of the cooperative dual self-encoder includes self-supervised training of the content self-encoder and of the distortion self-encoder. Self-supervised training of the content self-encoder includes: 1. The content self-encoder is trained with undistorted images as training data, and the output of the content decoder 12 is constrained to be the undistorted image itself. 2. The content self-encoder is trained with distorted images as training data; an undistorted image sharing the same content as the distorted image serves as the constraint on the output of the content decoder 12, and the content features of the undistorted image serve as the constraint on the content features of the distorted image. Self-supervised training of the distortion self-encoder includes: the distortion self-encoder is trained with distorted images as training data, the output of the distortion decoder 22 is constrained to be the distorted image itself, and the loss function guides back-propagation to optimize the distortion self-encoder.
The output constraint function of the content decoder 12 and the distortion decoder 22 is:

$$l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r) \qquad (1)$$

where $l_{overall}$ is the overall constraint, $\mu$ is the balance parameter, $l_{pixel}$ is the pixel-level constraint, $I_o$ is the distorted or undistorted image, $I_r$ is the reconstructed content image or reconstructed distorted image, and $l_{percp}$ is the perceptual constraint;

$$l_{pixel}(I_o, I_r) = \frac{1}{N}\sum_{k=1}^{N}\big(I_o(k) - I_r(k)\big)^2 \qquad (2)$$

where $N$ is the total number of pixels and $k$ indexes a pixel position of the image;

$$l_{percp}(I_o, I_r) = \sum_{j=1}^{L}\frac{1}{C_j H_j W_j}\big\lVert \phi_j(I_o) - \phi_j(I_r) \big\rVert_2^2 \qquad (3)$$

where $\phi_j$ denotes the VGGNet feature layer selected at level $j$, $L$ is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the $j$-th feature layer.
When self-supervised training is carried out on the content self-encoder, sparse constraints and/or distance constraints are also applied to the extracted content features.
S20, a content feature vector is obtained from the content features, and a distortion feature vector is obtained from the distortion features. Obtaining the content feature vector from the content features comprises passing the content features through a fusion network composed of a convolution layer and a global pooling layer to obtain the content feature vector.
And S30, splicing the content feature vector and the distortion feature vector to obtain a quality feature vector.
S40, the quality feature vector is passed through a fully connected layer to obtain a predicted quality score, and training with sample data carrying subjective opinion scores yields the trained predicted quality score.
Efficient quality-related feature extraction can be achieved with the distortion encoder 21 of the distortion self-encoder. The performance of the distortion features is verified with the feature visualization tool t-SNE. In a specific embodiment, 779 distorted images and 29 undistorted images of the LIVE database are used as test samples. The dataset contains five distortion types: Gaussian blur (GB), white noise (WN), JPEG compression, JP2K compression and fast fading (FF). Each image carries a corresponding quality score (the higher the score, the lower the quality). The two-dimensional feature visualization tool t-SNE maps each high-dimensional feature vector to a point in a two-dimensional plane in an unsupervised manner, giving 779 points in total. Coloring the points by the distortion type and by the quality score of the test images yields the scatter plots shown in FIG. 2. As seen in FIG. 2a, the undistorted images, having no distortion, show little feature activation and cluster near the origin. Points of different distortion types are well separated into clusters, especially white noise and JPEG compression, which affect human visual perception the most. This shows that the extracted distortion features characterize the distortion type well. As seen in FIG. 2b, points of the best-quality images concentrate in the central area; points for higher-quality images lie closer to the center, while points for lower-quality images lie farther away. This shows that the extracted distortion features are highly correlated with the quality score.
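Such a visualization can be sketched with scikit-learn's t-SNE as below; the feature array and the per-image labels are assumed inputs, and the plotting details are illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize_distortion_features(features, labels, title):
    """Map high-dimensional distortion features (e.g. a (779, 256) array
    from the distortion encoder) to 2-D points and color them by label
    (distortion type or quality score)."""
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.show()
```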
The quality evaluator trained by the invention achieves industry-leading performance. In a specific embodiment, the proposed quality evaluation model (QACoAE) is trained on five mainstream image quality evaluation databases; 80% of each database is used as the training set and 20% as the test set, with no image content shared between the two sets. The Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SRCC) between the predicted scores and the subjective scores of the test set serve as the performance metrics; larger values indicate better linearity and monotonicity of the predictions, respectively. The evaluation model was trained 10 times on each database and the averages of PLCC and SRCC were computed and compared with classical and currently prevailing no-reference evaluation algorithms, as shown in FIG. 3. The proposed algorithm achieves good performance on both synthetic-distortion and authentic-distortion databases, and is the best in the final weighted-average comparison.
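PLCC and SRCC as used here can be computed directly with SciPy:

```python
from scipy.stats import pearsonr, spearmanr

def correlation_metrics(predicted_scores, subjective_scores):
    """PLCC measures the linearity and SRCC the monotonicity of the
    predictions against subjective scores; closer to 1 is better."""
    plcc, _ = pearsonr(predicted_scores, subjective_scores)
    srcc, _ = spearmanr(predicted_scores, subjective_scores)
    return plcc, srcc
```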
The specific framework for self-supervised feature extraction is a collaborative dual self-encoder, shown in FIG. 4. Since the final visual quality of an image relates both to its content and to the distortion it suffers, the invention provides the content and distortion information of the image separately. First, a content self-encoder is designed to encode image content information; to extract content features weakly correlated with distortion, the learning targets of all input images are undistorted images. A second, distortion self-encoder is then designed to encode image distortion information. Unlike a conventional self-encoder, the distortion decoder 22 of the distortion self-encoder receives, in addition to the distortion features output by the distortion encoder 21, the content features encoded by the content encoder 11. With the content features as auxiliary information, the distortion encoder 21 can encode distortion information in a targeted way under the constraint of extracting low-dimensional features. Because the content self-encoder and the distortion self-encoder reconstruct cooperatively, this new framework is called a cooperative self-encoder. The framework extracts content features and distortion features of the input image separately; these features are highly correlated with image quality and serve the subsequent no-reference quality evaluation. The specific structure and design of the two self-encoders are described below.
Content self-encoder
For a conventional encoder, the extracted features capture the principal information of the input data, which for an image is its content information. The invention extracts image content features with a simply designed CNN-based self-encoder. The network structure of the content self-encoder is shown in FIG. 5. In this embodiment, the content features are first obtained by the content encoder 11, composed of three convolution layers and four basic residual blocks; the content features have 256 channels, a width chosen to better characterize rich and varied image content. The content features are then passed through the content decoder 12, composed of four basic residual blocks and three deconvolution layers, to reconstruct the input image. Note that FIG. 5 is only one embodiment; the content encoder 11 and/or content decoder 12 may adopt structures known in the prior art, the content encoder 11 is not limited to three convolution layers and four basic residual blocks, the content decoder 12 is not limited to four basic residual blocks and three deconvolution layers, and the content features are not limited to 256 channels.
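For concreteness, a minimal PyTorch sketch of this content self-encoder follows; the channel widths, strides and activations are assumptions, since the text fixes only the layer counts and the 256-channel content feature.

```python
import torch
import torch.nn as nn

class BasicResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class ContentAutoencoder(nn.Module):
    """Encoder: three stride-2 conv layers + four basic residual blocks,
    yielding a 256-channel content feature; the decoder mirrors it with
    four residual blocks + three deconvolution layers."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            *[BasicResBlock(256) for _ in range(4)])
        self.decoder = nn.Sequential(
            *[BasicResBlock(256) for _ in range(4)],
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1))

    def forward(self, x):
        feat = self.encoder(x)             # 256-channel content feature
        return self.decoder(feat), feat
```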
To ensure that the features the network extracts are content features for both undistorted and distorted images, two constraints are added when training the network: the first constrains the output of the content decoder 12; the second constrains the content features extracted by the content encoder 11. Note that, with the first constraint in place, the second constraint can optionally be added to strengthen the effect.
Whether the input is undistorted or distorted, the extracted features are expected to be content-related and as free of distortion as possible, so the invention uses only the undistorted image (or the original undistorted version of the distorted image) as the constraint on the decoder output. As in generation tasks, to obtain a visually better reconstruction, the first constraint usually combines two types of constraint, a pixel-level constraint and a perceptual constraint, into an overall constraint so that the reconstructed image better approximates the target image, as shown in formula (1):

$$l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r) \qquad (1)$$

where $l_{overall}$ is the overall constraint, $\mu$ is the balance parameter between the pixel-level constraint and the perceptual constraint, $l_{pixel}$ is the pixel-level constraint, most commonly the mean squared error (MSE), $I_o$ is the distorted or undistorted image, $I_r$ is the reconstructed content image or reconstructed distorted image, and $l_{percp}$ is the perceptual constraint, typically obtained from the differences of features extracted layer by layer from a pre-trained network such as VGGNet;

$$l_{pixel}(I_o, I_r) = \frac{1}{N}\sum_{k=1}^{N}\big(I_o(k) - I_r(k)\big)^2 \qquad (2)$$

where $N$ is the total number of pixels and $k$ indexes a pixel position of the image;

$$l_{percp}(I_o, I_r) = \sum_{j=1}^{L}\frac{1}{C_j H_j W_j}\big\lVert \phi_j(I_o) - \phi_j(I_r) \big\rVert_2^2 \qquad (3)$$

where $\phi_j$ denotes the VGGNet feature layer selected at level $j$, $L$ is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the $j$-th feature layer.
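A minimal PyTorch sketch of the overall constraint of formula (1) follows; the choice of VGG-16, the tapped ReLU layers and the value of μ are assumptions (the text specifies only VGGNet feature layers), and inputs are assumed already normalized for VGG.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ReconstructionLoss(nn.Module):
    """Overall constraint of formula (1): a mu-weighted mix of the
    pixel-level MSE of formula (2) and the perceptual term of formula (3),
    the latter averaged over tapped VGG feature layers."""
    def __init__(self, mu=0.8, layers=(3, 8, 15, 22)):  # relu1_2..relu4_3
        super().__init__()
        self.vgg = vgg16(weights="DEFAULT").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.layers = set(layers)
        self.mu = mu

    def _phi(self, x):
        feats = []
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.layers:
                feats.append(x)
        return feats

    def forward(self, i_r, i_o):
        l_pixel = torch.mean((i_o - i_r) ** 2)             # formula (2)
        l_percp = sum(torch.mean((fo - fr) ** 2)           # formula (3)
                      for fo, fr in zip(self._phi(i_o), self._phi(i_r)))
        return self.mu * l_pixel + (1 - self.mu) * l_percp  # formula (1)
```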
Regarding the second constraint: first, to avoid the over-complete coding that a high-dimensional representation can cause, that is, the content features degenerating into a direct pixel-wise copy of the input, the invention imposes a sparsity constraint on the extracted features, constraining the average activation value of each channel of the features to stay within a certain small value. Second, when the input image of the content self-encoder is a distorted image, besides constraining its target image to be the undistorted version, the invention can also use the features extracted from the undistorted version to constrain the features extracted from the distorted image; the loss function is computed as the distance between the two sets of features, i.e. a distance constraint.
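A minimal sketch of these two feature constraints; the hinge form of the sparsity penalty and the MSE form of the distance constraint are assumptions consistent with, but not fixed by, the description.

```python
import torch

def sparsity_loss(features, rho=0.05):
    """Penalize any channel whose average activation exceeds a small
    target rho (the exact penalty form and rho are assumptions)."""
    channel_means = features.mean(dim=(0, 2, 3))   # (B, C, H, W) -> per-channel mean
    return torch.relu(channel_means - rho).sum()

def distance_loss(distorted_feat, pristine_feat):
    """Pull the content features of a distorted image toward those of
    its undistorted version (the distance constraint)."""
    return torch.mean((distorted_feat - pristine_feat) ** 2)
```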
Thus, by constraining the features output by the content encoder 11 and the images output by the content decoder 12, a content encoder can be trained that robustly encodes the content information of images, and the extracted content features play a positive role both in distortion feature extraction and in score prediction.
Distortion self-encoder
To extract the distortion information of the input image in a targeted way, a dedicated self-encoder needs to be designed. First, considering that distortion may affect the overall structure, local details and overlapping local regions of the image, a multi-layer feature extraction distortion encoder 21 is designed. Its specific structure is shown in FIG. 6: the network first extracts different feature maps at four levels, then sends each into a spatial pyramid pooling (SPP) module for fusion into a fixed-length low-dimensional feature, and finally concatenates the four features into a feature vector of length 256 as the final output of the distortion encoder 21. This multi-layer distortion feature extractor lays the foundation for the efficient distortion feature extraction that follows.
Existing distortion encoders mainly take two forms: the first passes the input through a multi-layer neural network and directly takes the last layer as the feature encoding; the second extracts multi-layer feature maps of the neural network and uses a simple global average pooling result as the feature encoding.
Unlike existing distortion encoders, the distortion encoder of the invention feeds the feature maps of different layers into a spatial pyramid pooling module for fusion into fixed-length low-dimensional features. Compared with existing distortion encoders, it has two advantages. First, by extracting features at different layers, it better perceives the distortion information of the image at both the local-texture and global-structure levels. Second, the extracted features undergo spatial pyramid pooling, which further performs multi-scale analysis over different perceptual fields. The features extracted by this distortion encoder therefore reflect the distortion information of the image better than those of existing distortion encoders. Note that FIG. 6 is only one implementation of the distortion encoder 21: feature extraction is not limited to tapping the input image at four levels (it may be divided into as many levels as needed), and the length of the feature vector extracted at each level is not limited to 64.
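A minimal PyTorch sketch of such a multi-level distortion encoder with SPP fusion; the stage widths, pooling grids and per-level 64-dimensional projections are assumptions, while the four taps, the SPP fusion and the 256-length output follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPP(nn.Module):
    """Spatial pyramid pooling: pool a feature map on several grid sizes
    and flatten, giving a fixed-length vector for any input resolution."""
    def __init__(self, grids=(1, 2, 4)):
        super().__init__()
        self.grids = grids

    def forward(self, x):
        pooled = [F.adaptive_avg_pool2d(x, g).flatten(1) for g in self.grids]
        return torch.cat(pooled, dim=1)

class DistortionEncoder(nn.Module):
    """Four conv stages tap features at different depths; each tap passes
    through SPP and a linear projection to a 64-d vector, and the four
    vectors are concatenated into the 256-d distortion feature."""
    def __init__(self, widths=(32, 64, 128, 256), grids=(1, 2, 4)):
        super().__init__()
        spp_len = sum(g * g for g in grids)
        in_ch, stages, projs = 3, [], []
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, w, 3, stride=2, padding=1),
                nn.ReLU(inplace=True)))
            projs.append(nn.Linear(w * spp_len, 64))
            in_ch = w
        self.stages, self.projs = nn.ModuleList(stages), nn.ModuleList(projs)
        self.spp = SPP(grids)

    def forward(self, x):
        outs = []
        for stage, proj in zip(self.stages, self.projs):
            x = stage(x)
            outs.append(proj(self.spp(x)))   # fixed-length vector per level
        return torch.cat(outs, dim=1)        # 4 x 64 = 256-d feature
```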
Next, a distortion decoder 22 is designed whose input is not only the features obtained by the distortion encoder 21 but also the content features obtained from the content encoder 11. For the distortion encoder 21 to extract highly characteristic distortion features, the content features must assist well inside the distortion decoder 22. A distortion decoder 22 is therefore proposed that modulates the content features according to the distortion features; its network structure is shown in FIG. 7. First, similar to the content decoder 12, the content features are decoded through four sub-modulated residual blocks and three deconvolution layers. In each sub-modulated residual block, the distortion features serve as modulation information to adjust the content features, embedding the distortion information into them. Specifically, the distortion features first pass through a fully connected (FC) layer to obtain extension features, which contain not only the per-level distortion features but also association information representing the relationships among them. The extension features are then divided into four decomposition features, which in turn serve as one of the inputs of the sub-modulated residual blocks. In the prior art, an input feature is generally divided into several decomposition features directly. In the invention, the distortion features pass through the fully connected layer first, and the resulting extension features are then divided into decomposition features, which has two advantages over the prior art: first, by setting different parameters of the fully connected layer, the feature length can be adapted to obtain the decomposition feature length best suited for the subsequent reconstruction; second, the original distortion features are a concatenation of several scale-specific features from the distortion encoder with no mutual information between them, and the fully connected layer allows a degree of information interaction among them, which benefits the reconstruction of the distorted image.
In each sub-modulated residual block, the decomposed distortion feature is replicated pixel by pixel into a distortion feature block matching the spatial size of the content features, and within the residual convolution structure of the content features the distortion feature block is concatenated with the content feature block twice, achieving good modulation of the content features. Note that a person skilled in the art may divide the extension features into as many decomposition features as needed, and the number of deconvolution layers may likewise be adjusted. The sub-modulated residual block of FIG. 7 is only one implementation; other forms of sub-modulated residual block may also be employed.
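A minimal sketch of the modulation path just described; channel sizes and the activation are assumptions, while the FC expansion, the split into four decomposition features, the pixel-wise replication and the two concatenations inside the residual branch follow the description.

```python
import torch
import torch.nn as nn

class SubModulatedResBlock(nn.Module):
    """The decomposed distortion vector is replicated to every spatial
    position and concatenated with the content feature before each of
    the two convolutions in the residual branch."""
    def __init__(self, content_ch=256, piece_len=64):
        super().__init__()
        self.conv1 = nn.Conv2d(content_ch + piece_len, content_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(content_ch + piece_len, content_ch, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, content, dist_vec):
        b, _, h, w = content.shape
        dist_map = dist_vec.view(b, -1, 1, 1).expand(-1, -1, h, w)
        y = self.act(self.conv1(torch.cat([content, dist_map], dim=1)))
        y = self.conv2(torch.cat([y, dist_map], dim=1))
        return self.act(content + y)

class DistortionModulation(nn.Module):
    """An FC layer expands the 256-d distortion feature into an extension
    feature, split into four decomposition features that feed the four
    sub-modulated residual blocks in turn (deconv upsampling follows)."""
    def __init__(self, dist_len=256, piece_len=64):
        super().__init__()
        self.fc = nn.Linear(dist_len, 4 * piece_len)
        self.blocks = nn.ModuleList(
            SubModulatedResBlock(piece_len=piece_len) for _ in range(4))

    def forward(self, content_feat, dist_feat):
        pieces = self.fc(dist_feat).chunk(4, dim=1)  # four decomposition features
        x = content_feat
        for block, piece in zip(self.blocks, pieces):
            x = block(x, piece)
        return x
```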
For the constraints of the distortion self-encoder network, only the input image itself is required as the output constraint of the distortion decoder; the constraint loss function is the same as the first constraint of the content self-encoder, specifically formulas (1), (2) and (3). Under this constraint, the distortion encoder can encode the distortion features in a targeted way with the aid of the already robust content features.
Independent and collaborative training strategy
Complete training of the cooperative self-encoder requires a good learning strategy. The feature extraction framework proposed by the invention is trained in two steps: the first step trains the content self-encoder, and the second step trains the distortion self-encoder.
In training the content self-encoder, a large number of undistorted images are first used as training data, with the output of the content decoder constrained to be the input image itself. Distorted images are then used as training input, with an undistorted image sharing the same content as each distorted image serving as the constraint on the content decoder's output, and the content features of the undistorted image serving as the constraint on the content features of the distorted image. Because an undistorted counterpart is required as a constraint, the distorted data used at this stage can only carry synthetic distortions. After the content self-encoder completes its independent training, its parameters are fixed to ensure that the content encoder provides stable content features, facilitating the training of subsequent modules.
In training the distortion self-encoder, since the content encoder already extracts robust content features, a large number of distorted images, including single-distortion, composite-distortion and authentic-distortion images, can be used as its training data. The output of the distortion decoder is likewise constrained to be the input image itself, and the loss function guides back-propagation to optimize the distortion self-encoder. Through these two independent yet cooperative training steps, the cooperative self-encoder framework is fully trained, and independent, efficient content features and distortion features can be extracted from any given image to play a positive role in the subsequent quality score prediction.
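The two-step schedule can be sketched as follows, assuming the interfaces of the earlier sketches (the content self-encoder returns its reconstruction and feature; the distortion self-encoder takes the image plus the content feature, and the reconstruction and distance losses are passed in); the optimizer, learning rate, epoch counts and data loaders are placeholders.

```python
import torch

def train_cooperative(content_ae, distortion_ae, recon_loss, distance_loss,
                      pristine_loader, paired_loader, distorted_loader,
                      epochs=10, lr=1e-4):
    # Step 1: content self-encoder, first on undistorted images, then on
    # synthetically distorted images constrained by their pristine versions.
    opt = torch.optim.Adam(content_ae.parameters(), lr=lr)
    for _ in range(epochs):
        for img in pristine_loader:
            recon, _ = content_ae(img)
            loss = recon_loss(recon, img)          # target: the input itself
            opt.zero_grad(); loss.backward(); opt.step()
        for distorted, pristine in paired_loader:
            recon, feat = content_ae(distorted)
            with torch.no_grad():
                _, feat_p = content_ae(pristine)   # pristine features as anchor
            loss = recon_loss(recon, pristine) + distance_loss(feat, feat_p)
            opt.zero_grad(); loss.backward(); opt.step()
    for p in content_ae.parameters():              # freeze the content encoder
        p.requires_grad = False

    # Step 2: distortion self-encoder on single, composite and authentic
    # distortions, with frozen content features as auxiliary decoder input.
    opt = torch.optim.Adam(distortion_ae.parameters(), lr=lr)
    for _ in range(epochs):
        for img in distorted_loader:
            with torch.no_grad():
                content_feat = content_ae.encoder(img)
            recon = distortion_ae(img, content_feat)
            loss = recon_loss(recon, img)          # target: the input itself
            opt.zero_grad(); loss.backward(); opt.step()
```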
Non-reference quality evaluation method based on self-supervision feature extraction
With the cooperative self-encoder-based feature extractor training method described above, the content encoder 11 and distortion encoder 21 can be trained on large-scale unlabeled sample data. Since image quality relates to both the content and the distortion of an image, the information encoded by the two encoders serves as an efficient basis for predicting the quality score. The extracted content features, however, are high-dimensional, so a preliminary fusion is required before regression. The no-reference quality evaluation network framework based on self-supervised feature extraction is shown in FIG. 8. The input image first passes through the pre-trained content encoder 11 and distortion encoder 21 to obtain the content features and distortion features respectively; the content features then pass through a fusion network composed of a convolution layer and a global pooling layer to obtain the content feature vector, which is concatenated with the distortion feature vector to form the quality feature vector. The quality feature vector passes through three FC layers to obtain the quality prediction score. The whole network framework can be trained with existing database data carrying subjective score labels, with the loss function set to the MSE between the predicted score and the subjective score, yielding the trained quality predictor.
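A minimal sketch of this regression head; the hidden widths are assumptions, while the conv + global-pooling fusion of the content features, the concatenation with the distortion vector and the three FC layers follow the description. The loss during training is the MSE between this prediction and the subjective score, as stated above.

```python
import torch
import torch.nn as nn

class QualityRegressor(nn.Module):
    """Fuse the 256-channel content feature map into a vector with one conv
    layer and global average pooling, concatenate it with the 256-d
    distortion vector, and regress a score through three FC layers."""
    def __init__(self, content_ch=256, dist_len=256, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(content_ch, content_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(
            nn.Linear(content_ch + dist_len, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 1))

    def forward(self, content_feat, dist_vec):
        quality_vec = torch.cat([self.fuse(content_feat), dist_vec], dim=1)
        return self.head(quality_vec).squeeze(1)   # predicted quality score
```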
Based on the above-described quality evaluation method based on self-supervision feature extraction, the present invention provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the quality evaluation method based on self-supervision feature extraction in the above-described embodiments.
Based on the quality evaluation method based on self-supervision feature extraction, the invention also provides a terminal, as shown in fig. 9, comprising at least one processor 30; a display screen 31; and a memory 32, which may also include a communication interface 33 and a bus 34. Wherein the processor 30, the display 31, the memory 32 and the communication interface 33 may communicate with each other via a bus 34. The display screen 31 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 33 may transmit information. The processor 30 may invoke logic instructions in the memory 32 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 32 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 32, as a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 30 executes functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 32.
The memory 32 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the terminal device, etc. In addition, the memory 32 may include high-speed random access memory, and may also include nonvolatile memory. For example, a plurality of media capable of storing program codes such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or a transitory storage medium may be used.
In addition, the specific processes that the storage medium and the plurality of instruction processors in the terminal device load and execute are described in detail in the above method, and are not stated here.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A quality evaluation method based on self-supervision feature extraction is characterized by comprising the following steps:
constructing a cooperative double self-encoder, wherein the cooperative double self-encoder subjected to self-supervision training extracts content characteristics from an input content image and distortion characteristics from an input distortion image, and the input content image and the input distortion image are identical or have identical contents;
obtaining a content feature vector according to the content feature, and obtaining a distortion feature vector according to the distortion feature;
splicing the content feature vector and the distortion feature vector to obtain a quality feature vector;
the quality feature vector is passed through a fully connected layer to obtain a predicted quality score, and training with sample data carrying subjective opinion scores yields the trained predicted quality score.
2. The self-supervision feature extraction-based quality assessment method according to claim 1, wherein: the constructed cooperative dual self-encoder includes a constructed content self-encoder and a constructed distortion self-encoder, wherein the constructed content self-encoder includes a constructed content encoder and a constructed content decoder, the constructed distortion self-encoder includes a constructed distortion encoder and a constructed distortion decoder, the content encoder extracts content features from an input content image, the content decoder decodes the content features to obtain a reconstructed content image, the distortion encoder extracts distortion features from the input distortion image, and the distortion decoder decodes the content features and the distortion features to obtain a reconstructed distortion image.
3. The quality evaluation method based on self-supervised feature extraction according to claim 2, wherein the self-supervised training of the collaborative dual self-encoder includes self-supervised training of the content self-encoder and self-supervised training of the distortion self-encoder,
the self-supervised training of the content self-encoder includes:
training the content self-encoder with undistorted images as training data, the output of the content decoder being constrained to be the undistorted image; and
training the content self-encoder with distorted images as training data, an undistorted image sharing the same content as the distorted image serving as the constraint on the output of the content decoder, and the content features of the undistorted image serving as the constraint on the content features of the distorted image;
the self-supervised training of the distortion self-encoder includes:
training the distortion self-encoder with distorted images as training data, the output of the distortion decoder being constrained to be the distorted image, with a loss function guiding back-propagation to optimize the distortion self-encoder.
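As a minimal sketch of the three training regimes in claim 3 (all function names are hypothetical; E_c/D_c denote the content encoder and decoder, E_d/D_d the distortion encoder and decoder, and the L2 feature distance is an assumed form of the feature constraint):

```python
def content_ae_step_pristine(E_c, D_c, x_clean, recon_loss):
    # Undistorted image as training data; the content decoder's output is
    # constrained to reproduce the same undistorted image.
    return recon_loss(D_c(E_c(x_clean)), x_clean)

def content_ae_step_distorted(E_c, D_c, x_dist, x_clean, recon_loss):
    # Distorted image as training data; the output is constrained to the
    # undistorted image sharing its content, and the distorted image's
    # content features are pulled toward the undistorted image's features.
    f_dist = E_c(x_dist)
    f_clean = E_c(x_clean).detach()
    return recon_loss(D_c(f_dist), x_clean) + (f_dist - f_clean).pow(2).mean()

def distortion_ae_step(E_c, E_d, D_d, x_dist, recon_loss):
    # Distorted image as training data; the distortion decoder reconstructs
    # it from content features plus distortion features (cf. claim 2), and
    # the loss guides back-propagation to optimize the distortion self-encoder.
    f_c = E_c(x_dist).detach()  # content branch not updated in this step
    return recon_loss(D_d(f_c, E_d(x_dist)), x_dist)
```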
4. The quality evaluation method based on self-supervised feature extraction according to claim 3, wherein the output constraints of the content decoder and the distortion decoder take the form:

l_overall = μ × l_pixel(I_o, I_r) + (1 − μ) × l_percp(I_o, I_r)    (1)

wherein l_overall is the overall constraint, μ is a balance parameter, l_pixel is the pixel-level constraint, I_o is the distorted or undistorted image, I_r is the reconstructed content image or the reconstructed distortion image, and l_percp is the perceptual constraint;

l_pixel(I_o, I_r) = (1/N) Σ_k (I_o(k) − I_r(k))²    (2)

wherein N is the total number of pixels and k is a pixel position index of the image;

l_percp(I_o, I_r) = Σ_{j=1}^{L} (1 / (C_j H_j W_j)) ‖φ_j(I_o) − φ_j(I_r)‖²    (3)

wherein φ_j is the VGGNet feature layer selected at the j-th layer, L is the total number of selected feature layers, and C_j H_j W_j is the size of the output feature map of the j-th feature layer.
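A minimal PyTorch rendering of constraints (1)-(3) might look as follows; the specific VGG16 layer indices and the use of mean squared error are assumptions for illustration, not choices stated in the application:

```python
import torch.nn as nn
import torchvision.models as models

class OverallConstraint(nn.Module):
    """l_overall = mu * l_pixel + (1 - mu) * l_percp, after eqs. (1)-(3).
    The layer indices (ReLU outputs of VGG16) are an assumed choice."""
    def __init__(self, mu=0.5, layers=(3, 8, 15, 22)):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg = vgg
        self.layers = set(layers)
        self.mu = mu

    def _feats(self, x):
        out = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layers:
                out.append(x)
        return out

    def forward(self, i_o, i_r):
        # eq. (2): mean squared pixel difference over all N pixels
        l_pixel = (i_o - i_r).pow(2).mean()
        # eq. (3): feature differences, each normalized by C_j * H_j * W_j
        l_percp = sum((f_o - f_r).pow(2).mean()
                      for f_o, f_r in zip(self._feats(i_o), self._feats(i_r)))
        # eq. (1): balance of pixel-level and perceptual constraints
        return self.mu * l_pixel + (1 - self.mu) * l_percp
```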
5. The quality evaluation method based on self-supervised feature extraction according to claim 4, wherein, during the self-supervised training of the content self-encoder, a sparsity constraint and/or a distance constraint is also applied to the extracted content features.
6. The quality evaluation method based on self-supervised feature extraction according to claim 2, wherein constructing the distortion encoder includes: first extracting feature maps at a plurality of levels from the input distortion image; then feeding each feature map into a spatial pyramid pooling module for fusion to obtain a plurality of fixed-length low-dimensional features; and finally concatenating the plurality of low-dimensional features as the final output of the distortion encoder.
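A sketch of the spatial-pyramid-pooling fusion described in this claim, assuming max pooling and pyramid levels (1, 2, 4), which the application does not specify:

```python
import torch
import torch.nn.functional as F

def spp(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling: pool one feature map at several grid sizes
    and concatenate, giving a fixed-length vector regardless of input size."""
    n = feature_map.shape[0]
    pooled = [F.adaptive_max_pool2d(feature_map, s).reshape(n, -1)
              for s in levels]
    return torch.cat(pooled, dim=1)  # length = channels * sum(s * s)

def distortion_feature(feature_maps):
    """Concatenate the fused low-dimensional features from every level of
    the backbone as the distortion encoder's final output."""
    return torch.cat([spp(f) for f in feature_maps], dim=1)
```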
7. The quality evaluation method based on self-supervised feature extraction according to claim 1, wherein obtaining the content feature vector from the content features includes passing the content features through a fusion network consisting of a convolution layer and a global pooling layer to obtain the content feature vector.
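A minimal sketch of such a fusion network, with assumed channel widths and kernel size:

```python
import torch.nn as nn

class ContentFusion(nn.Module):
    """Fusion network of claim 7: a convolution layer followed by a global
    pooling layer turns the content feature map into a feature vector."""
    def __init__(self, in_channels=512, out_channels=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pooling

    def forward(self, content_features):
        return self.pool(self.conv(content_features)).flatten(1)
```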
8. The quality evaluation method based on self-supervised feature extraction according to claim 2, wherein constructing the distortion decoder includes:
passing the distortion features through a fully connected layer to obtain an extension feature, the extension feature including the distortion features of each layer and association information characterizing the relationships between the distortion features of the layers; and
dividing the extension feature into a plurality of decomposition features, which are used in sequence as one of the inputs to successive modulation residual blocks.
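The expansion-and-split step of this claim might be sketched as follows; the block count, widths, and the use of torch.chunk for the division into decomposition features are assumptions:

```python
import torch
import torch.nn as nn

class DistortionExpansion(nn.Module):
    """Expand the distortion feature through a fully connected layer into an
    extension feature, then divide it into decomposition features that feed
    successive modulation residual blocks one by one."""
    def __init__(self, d_in=256, n_blocks=4, d_block=128):
        super().__init__()
        self.fc = nn.Linear(d_in, n_blocks * d_block)
        self.n_blocks = n_blocks

    def forward(self, f_distortion):
        extension = self.fc(f_distortion)  # extension feature
        return torch.chunk(extension, self.n_blocks, dim=1)
```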
9. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the quality evaluation method based on self-supervised feature extraction according to any one of claims 1-8.
10. A terminal, characterized by comprising: a processor and a memory having stored thereon a computer-readable program executable by the processor; wherein the processor, when executing the computer-readable program, implements the steps in the quality evaluation method based on self-supervised feature extraction according to any one of claims 1-8.
CN202310188690.7A 2023-02-17 2023-02-17 Quality evaluation method based on self-supervision feature extraction, storage medium and terminal Pending CN116416216A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310188690.7A CN116416216A (en) 2023-02-17 2023-02-17 Quality evaluation method based on self-supervision feature extraction, storage medium and terminal

Publications (1)

Publication Number Publication Date
CN116416216A 2023-07-11

Family

ID=87052301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310188690.7A Pending CN116416216A (en) 2023-02-17 2023-02-17 Quality evaluation method based on self-supervision feature extraction, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN116416216A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274170A (en) * 2023-09-01 2023-12-22 国家广播电视总局广播电视规划院 No-reference image evaluation method, device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination