CN113128517B - Tone mapping image mixed visual feature extraction model establishment and quality evaluation method - Google Patents

Tone mapping image mixed visual feature extraction model establishment and quality evaluation method

Info

Publication number
CN113128517B
Authority
CN
China
Prior art keywords
image
distorted
evaluated
distortion
distorted image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110300592.9A
Other languages
Chinese (zh)
Other versions
CN113128517A (en)
Inventor
张敏
许筱敏
张汝雪
石小妹
冯筠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY filed Critical NORTHWEST UNIVERSITY
Priority to CN202110300592.9A
Publication of CN113128517A
Application granted
Publication of CN113128517B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/20 Image enhancement or restoration using local operators
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30168 Image quality inspection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image quality evaluation method, a modeling method and a system based on mixed visual features. The modeling method comprises the following steps: dividing a distorted image into several non-overlapping image blocks, sending the image blocks into a multi-scale feature fusion network to extract the multi-scale content features of the image, and computing the gradient map corresponding to the distorted image to obtain mixed visual perception features, which are mapped to human subjective scores using support vector regression. Drawing on the hierarchical perception mechanism of the human visual system, the method designs a new multi-scale feature fusion network that expresses the layered degradation of image quality and represents image distortion more comprehensively; in addition, combining the primary perception characteristics of the human eye, a two-stream feature extraction model comprising an image stream and a gradient stream is constructed. The improved tone-mapped image quality evaluation model extracts richer image quality perception features and achieves better accuracy and generality.

Description

Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a tone mapping image mixed visual feature extraction model establishment and quality evaluation method.
Background
With the development of digital imaging technology, high dynamic range (HDR) images have emerged. Thanks to their wide dynamic range and rich real-scene detail, HDR images have great application value in video production, virtual reality, remote sensing, medicine, the military and other fields. However, HDR displays are still not widespread; existing image processing systems mainly use conventional 8-bit display devices, whose capabilities HDR content far exceeds. Visualizing HDR images on conventional displays therefore inevitably causes loss of image information and degradation of perceived quality. To visualize HDR images on standard 8-bit displays, various tone mapping operators (TMOs) have been proposed to convert HDR images into low dynamic range (LDR) images. Since the conversion of dynamic range inevitably introduces complex distortions and degrades perceived visual quality, an objective method is required to evaluate the quality of tone-mapped images (TMIs). No-reference image quality assessment (NRIQA) is a research task in the field of image processing aimed at designing computational models that do not depend on any prior knowledge and can automatically evaluate image quality; its results quantify image performance and provide an important basis for research in other areas of image processing.
Existing no-reference tone-mapped image quality evaluation methods fall into two categories. The first designs hand-crafted feature descriptors to extract effective image quality degradation features and then uses a nonlinear regression method, such as support vector regression (SVR), to regress the high-dimensional features to quality scores. Such methods are knowledge-driven and require feature descriptors designed manually from human visual system (HVS) or natural scene statistics (NSS) characteristics. However, it is difficult to design hand-crafted features that effectively represent no-reference image quality degradation.
Because of the rich and efficient feature representation capability of convolutional neural networks (CNNs), CNN-based NRIQA methods have been proposed; such methods are data-driven. In 2017, Abhinau et al. proposed extracting features of tone-mapped images with transfer learning and then mapping the extracted features to quality scores with SVR. In 2018, He et al., considering the complexity of distortion in tone-mapped images and the need to extract information at different scales and levels when predicting image quality, constructed a new no-reference tone-mapped image quality evaluation method that extracts multi-scale, multi-level features from a pre-trained deep convolutional neural network model, improving performance.
In summary, the existing no-reference tone-mapped image quality evaluation methods mainly have the following disadvantages:
(1) Current data-driven methods mainly use the output features of transfer-learned or pre-trained deep neural networks for quality prediction; they neither extract TMI-specific features nor fully consider image quality degradation, so model accuracy is limited.
(2) They neglect that the dynamic range conversion of a TMI may create a halation effect, which affects image quality to some extent and makes the image perceived by the human visual system differ from the real world; as a result, the models are less consistent with subjective human perception.
Disclosure of Invention
The invention aims to provide an image quality evaluation method, a modeling method and a system based on mixed visual features, to solve the problem in the prior art that evaluation models have low accuracy because image quality degradation is insufficiently considered.
To achieve this, the invention adopts the following technical scheme:
the method for establishing the tone mapping image mixed visual characteristic extraction model comprises the following steps:
step 1: obtaining a distorted image set and the quality fraction of each distorted image in the distorted image set, and calculating a gradient image corresponding to each distorted image through a Sobel operator to obtain a gradient image set; respectively blocking each distorted image in the distorted image set and each gradient image in the gradient image set to obtain a distorted image block set and a gradient image block set; the quality score of each distorted image block is made to be the quality score of the distorted image before the distorted image block is divided;
step 2: establishing a feature extraction network based on ResNet-50, taking a distorted image block set as a training set, taking the quality score of each distorted image block as a tag set, training the feature extraction network, and taking the trained feature extraction network as a feature extractor;
step 3: respectively inputting the distorted image block set and the gradient image block set obtained in the step 1 into the feature extractor obtained in the step 2 to perform feature extraction to respectively obtain multi-scale content features and primary visual features of each distorted image, and performing feature fusion on the scale content features and the primary visual features of each distorted image to obtain mixed visual features of each distorted image in the distorted image set;
step 4: and (3) establishing a support vector regressor, taking the mixed visual characteristics of all the distorted images obtained in the step (3) as a training set, taking the score of all the distorted images as a label set, taking the score of each distorted image as the average value of the quality scores of all the distorted image blocks contained in the distorted images, training the support vector regressor, and taking the trained support vector regressor as an image quality model.
Further, the feature extraction network comprises a residual block layer, a convolution layer and a global average pooling layer, wherein the residual block layer comprises a Conv1 layer, a Conv2 layer, a Conv3 layer, a Conv4 layer and a Conv5 layer, and the convolution layer comprises three 1×1 convolutions and one 3×3 convolutions.
Further, the primary visual features are extracted through Conv1 layer of the feature extraction network.
A tone-mapped image quality evaluation method comprises the following steps:
Step one: obtaining the distorted image to be evaluated, computing its gradient map with the Sobel operator to obtain the gradient map to be evaluated, and partitioning the distorted image to be evaluated and the gradient map to be evaluated into blocks, respectively, to obtain the distorted image block set to be evaluated and the gradient image block set to be evaluated;
Step two: inputting the distorted image block set to be evaluated and the gradient image block set to be evaluated, respectively, into a feature extractor obtained by the tone-mapped image mixed visual feature extraction model establishment method to obtain the multi-scale content features and primary visual features of the distorted image to be evaluated, and fusing them to obtain the mixed visual features of the distorted image to be evaluated;
Step three: inputting the mixed visual features of the distorted image to be evaluated into an image quality model obtained by the tone-mapped image mixed visual feature extraction model establishment method to obtain the quality score of the distorted image to be evaluated.
Compared with the prior art, the invention has the following technical characteristics:
(1) Combining the hierarchical perception mechanism of the human visual system, the invention exploits the layered degradation of image quality in the design of the no-reference framework to construct a multi-scale feature fusion network, thereby expressing image distortion more comprehensively.
(2) Combining the primary perception characteristics of human vision, the invention constructs a two-stream feature extraction model consisting of an image stream and a gradient stream. The distorted image is input into the image stream to extract multi-scale content features; considering that a TMI may exhibit a halation effect causing edge distortion, the gradient map corresponding to the distorted image is added to extract primary visual features that better express edge distortion information.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a frame diagram of the modeling of the present invention;
FIG. 3 is an exemplary graph of image quality degradation;
FIG. 4 is a diagram of a multi-scale feature fusion network;
FIG. 5 is a distortion map and a corresponding gradient map in an embodiment.
Detailed Description
Specific examples of the invention are given below. Note that: (1) the invention is not limited to the following specific examples; (2) when training the model, the data set must be divided into a training set and a test set, where the training set is not all of the data but only part of it, and the test set is used for testing after training to obtain the complete trained model; (3) the embodiments use the Python language to implement the construction of the whole model.
The embodiment discloses a method for establishing a tone-mapped image mixed visual feature extraction model, comprising the following steps:
Step 1: obtaining a distorted image set and the quality score of each distorted image in the set, and computing the gradient map corresponding to each distorted image with the Sobel operator to obtain a gradient image set; partitioning each distorted image in the distorted image set and each gradient image in the gradient image set into blocks, respectively, to obtain a distorted image block set and a gradient image block set; and setting the quality score of each distorted image block to the quality score of the distorted image from which it was divided;
Step 2: establishing a feature extraction network based on ResNet-50; taking the distorted image block set as the training set and the quality score of each distorted image block as the label set, training the feature extraction network, and taking the trained network as the feature extractor;
Step 3: inputting the distorted image block set and the gradient image block set obtained in step 1, respectively, into the feature extractor obtained in step 2 for feature extraction to obtain the multi-scale content features and primary visual features of each distorted image, and fusing the multi-scale content features and primary visual features of each distorted image to obtain the mixed visual features of each distorted image in the distorted image set;
Step 4: establishing a support vector regressor; taking the mixed visual features of all the distorted images obtained in step 3 as the training set and the scores of all the distorted images as the label set, where the score of each distorted image is the average of the quality scores of all the distorted image blocks it contains, training the support vector regressor, and taking the trained regressor as the image quality model.
Specifically, no two image blocks in the distorted image block set overlap each other, and no two image blocks in the gradient image block set overlap each other; the distorted image blocks are denoted x_i and the gradient image blocks y_i, i = 1, 2, .... Cropping several non-overlapping image blocks from the distortion map increases the amount of data on the one hand; on the other hand, a direct resizing operation would mask certain image artifacts, whereas cropping ensures that perceived image quality is unchanged.
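A minimal Python sketch of this blocking step follows (Python is the embodiment's stated implementation language); the function name and NumPy representation are illustrative assumptions, while the 224×224 block size follows the embodiment below:

```python
import numpy as np

def crop_non_overlapping_blocks(image: np.ndarray, size: int = 224):
    """Split an H x W (x C) image into non-overlapping size x size blocks.

    No resizing is performed, so the perceived quality of each block is
    unchanged; blocks that would extend past the border are discarded.
    """
    h, w = image.shape[0], image.shape[1]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

# Each block inherits the quality score of the distorted image it was cut from:
# samples = [(block, score) for block in crop_non_overlapping_blocks(img)]
```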
Specifically, the quality score of each distorted image block is denoted f(x_i), and the quality score of each distorted image is the average of the scores of all the distorted image blocks it contains:

$$\frac{1}{M}\sum_{i=1}^{M} f(x_i),$$

where M is the total number of distorted image blocks in the distorted image.
Specifically, the training phase for any distorted image X in the data set comprises the following substeps:
1) Crop the distorted image X into 4 non-overlapping 224×224 distorted image blocks x_1, x_2, x_3, x_4. Since the quality score of each distorted image block is the score of the distorted image it belongs to, given the quality score y of the distorted image X, the quality scores of x_1, x_2, x_3 and x_4 are all y;
2) Send x_1, x_2, x_3, x_4 and their scores {(x_1, y), (x_2, y), (x_3, y), (x_4, y)} into the feature extraction network to train the model, and extract the mixed visual features to train the support vector regression model.
In the test phase, a distorted image X' with true score y' = 3.5 is given. To verify the performance of the proposed model, the distorted image X' is first divided into image blocks x_1', x_2', x_3', x_4' of the same size as in the training phase; the mixed visual features are then extracted for each distorted image block, and the trained quality prediction model maps them to quality scores, giving the predicted score of each distorted image block: y_1' = 3.35, y_2' = 3.61, y_3' = 3.44, y_4' = 3.52. The quality scores of the 4 distorted image blocks are averaged as the predicted quality score of the final distortion map:

$$\hat{y}' = \frac{3.35 + 3.61 + 3.44 + 3.52}{4} = 3.48,$$

which is close to the true label 3.5 of the given distorted image.
Specifically, the feature extraction network comprises a residual block layer, a convolution layer and a global average pooling layer; the residual block layer comprises the Conv1, Conv2, Conv3, Conv4 and Conv5 layers, and the convolution layer comprises three 1×1 convolutions and one 3×3 convolution.
The feature extraction network is built following the hierarchical process of visual perception. First, the residual network ResNet-50 is used as the base network for semantic feature extraction, with the last two layers of ResNet-50 removed so that it outputs a feature stream. Second, multi-scale features are output from the four residual blocks Conv2, Conv3, Conv4 and Conv5 of ResNet-50. Next, the channel size is reduced by a 1×1 convolution and the spatial resolution is upsampled by a factor of 2; the upsampled feature map is fused with the corresponding higher-resolution feature map by element-wise addition, and this process is repeated until the feature map with the highest resolution is generated. Finally, a 3×3 convolution and global average pooling are applied to the resulting feature map to obtain the multi-scale content features.
Specifically, the network establishment and feature extraction in step 2 comprise the following substeps:
Step 2.1: use the residual network ResNet-50 as the base network for semantic feature extraction, initializing the network with a model pre-trained on ImageNet; visualizing the five residual blocks of the ResNet-50 model shows that distortion affects features at different levels and, from the IQA perspective, causes image quality degradation;
Step 2.2: output the multi-scale features F1, F2, F3 and F4 from the last layer of each of the four residual blocks Conv2, Conv3, Conv4 and Conv5 of ResNet-50;
Step 2.3: reduce the channel size of the output multi-scale features with a 1×1 convolution and upsample the spatial resolution by a factor of 2 to obtain a feature map; fuse the upsampled feature map with the corresponding higher-resolution feature map by element-wise addition, and repeat this process until the feature map with the highest resolution is generated;
Step 2.4: apply a 3×3 convolution and global average pooling to the resulting feature map to obtain the multi-scale content features.
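As a concrete illustration of steps 2.1 to 2.4, a minimal PyTorch sketch of the multi-scale feature fusion network follows. The class name, the 256-channel lateral width (matching Conv2's output) and the torchvision backbone API are illustrative assumptions; the text itself fixes only the structure: ResNet-50 with its last two layers removed, three 1×1 convolutions, ×2 upsampling fused by element-wise addition, and a final 3×3 convolution with global average pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class MultiScaleFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # pre-trained on ImageNet (step 2.1)
        # Conv1 stem; in the two-stream model its output also yields the
        # primary visual feature f_p when fed the gradient map.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        # Residual blocks Conv2-Conv5; the last two layers of ResNet-50
        # (global pooling and the fc classifier) are discarded.
        self.conv2, self.conv3 = backbone.layer1, backbone.layer2
        self.conv4, self.conv5 = backbone.layer3, backbone.layer4
        # Three 1x1 convolutions reduce the channel sizes of F2-F4 to F1's 256.
        self.lat3 = nn.Conv2d(512, 256, kernel_size=1)
        self.lat4 = nn.Conv2d(1024, 256, kernel_size=1)
        self.lat5 = nn.Conv2d(2048, 256, kernel_size=1)
        # One 3x3 convolution smooths the fused highest-resolution map.
        self.smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (N, 3, 224, 224) image block
        f1 = self.conv2(self.stem(x))          # F1: 256 ch,  56x56
        f2 = self.conv3(f1)                    # F2: 512 ch,  28x28
        f3 = self.conv4(f2)                    # F3: 1024 ch, 14x14
        f4 = self.conv5(f3)                    # F4: 2048 ch,  7x7
        # Step 2.3: reduce channels, upsample x2, fuse by element-wise addition.
        p3 = self.lat4(f3) + F.interpolate(self.lat5(f4), scale_factor=2)
        p2 = self.lat3(f2) + F.interpolate(p3, scale_factor=2)
        p1 = f1 + F.interpolate(p2, scale_factor=2)
        # Step 2.4: 3x3 convolution and global average pooling give f_m.
        return torch.flatten(F.adaptive_avg_pool2d(self.smooth(p1), 1), 1)
```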
Specifically, in step 1 the distorted image I(i,j) is input and convolved with the Sobel operator to obtain the gradient map M(i,j):

$$M(i,j) = \sqrt{\big(H_x * I(i,j)\big)^2 + \big(H_y * I(i,j)\big)^2},$$

where H_x and H_y are the horizontal and vertical components of the Sobel operator:

$$H_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad
H_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.$$
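For concreteness, a minimal Python sketch of this gradient-map computation; the choice of scipy.signal.convolve2d and the symmetric boundary handling are assumptions, while H_x and H_y follow the equations above:

```python
import numpy as np
from scipy.signal import convolve2d

# Horizontal and vertical Sobel components from the equations above.
H_X = np.array([[-1, 0, 1],
                [-2, 0, 2],
                [-1, 0, 1]], dtype=np.float64)
H_Y = H_X.T

def sobel_gradient_map(gray: np.ndarray) -> np.ndarray:
    """Return M(i,j) = sqrt((H_x * I)^2 + (H_y * I)^2) for a grayscale image I."""
    gx = convolve2d(gray, H_X, mode="same", boundary="symm")
    gy = convolve2d(gray, H_Y, mode="same", boundary="symm")
    return np.sqrt(gx ** 2 + gy ** 2)
```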
in particular, the primary visual feature f p Conv1 layer extraction through feature extraction network.
Specifically, in step 3, feature fusion of the multi-scale content features and primary visual features of each distorted image means cascading, i.e. horizontally concatenating, the multi-scale content features f_m and the primary visual features f_p to obtain the fused mixed visual feature F.
Specifically, step 4 uses SVR to map the resulting mixed visual feature F to the human subjective score (MOS).
Specifically, the human subjective score MOS is a quality score; depending on the data set and evaluation criteria, its range may be 1-100 or 1-8.
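A minimal sketch of the fusion and regression of steps 3 and 4, assuming f_m and f_p have already been extracted as NumPy vectors; scikit-learn's SVR with an RBF kernel is an illustrative choice, as the text specifies only that SVR maps the fused feature to the MOS:

```python
import numpy as np
from sklearn.svm import SVR

def fuse(f_m: np.ndarray, f_p: np.ndarray) -> np.ndarray:
    """Cascade (horizontal concatenation) of the image-stream feature f_m and
    the gradient-stream feature f_p, giving the mixed visual feature F."""
    return np.concatenate([f_m, f_p], axis=-1)

# train_features: (n_blocks, d) fused vectors; train_mos: (n_blocks,) MOS labels.
# regressor = SVR(kernel="rbf").fit(train_features, train_mos)
# block_scores = regressor.predict(test_features)
# image_score = block_scores.reshape(-1, 4).mean(axis=1)  # average the 4 blocks per image
```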
The embodiment also discloses a tone-mapped image quality evaluation method, comprising the following steps:
Step one: obtaining the distorted image to be evaluated, computing its gradient map with the Sobel operator to obtain the gradient map to be evaluated, and partitioning the distorted image to be evaluated and the gradient map to be evaluated into blocks, respectively, to obtain the distorted image block set to be evaluated and the gradient image block set to be evaluated;
Step two: inputting the distorted image block set to be evaluated and the gradient image block set to be evaluated, respectively, into a feature extractor obtained by the tone-mapped image mixed visual feature extraction model establishment method to obtain the multi-scale content features and primary visual features of the distorted image to be evaluated, and fusing them to obtain the mixed visual features of the distorted image to be evaluated;
Step three: inputting the mixed visual features of the distorted image to be evaluated into an image quality model obtained by the tone-mapped image mixed visual feature extraction model establishment method to obtain the quality score of the distorted image to be evaluated.
The embodiment also discloses an image quality evaluation system based on mixed visual features, comprising a data acquisition and segmentation unit, a feature extractor establishment unit, a feature fusion unit, an image quality model establishment unit and an image quality scoring unit.
The data acquisition and segmentation unit is used to acquire a distorted image set and the quality score of each distorted image in the set, and to compute the gradient map corresponding to each distorted image with the Sobel operator to obtain a gradient image set; it partitions each distorted image in the distorted image set and each gradient image in the gradient image set into blocks to obtain a distorted image block set and a gradient image block set, and sets the quality score of each distorted image block to the quality score of the distorted image from which it was divided.
The feature extractor establishment unit is used to establish a feature extraction network based on ResNet-50, take the distorted image block set as the training set and the quality score of each distorted image block as the label set, train the feature extraction network, and take the trained network as the feature extractor.
The feature fusion unit is used to perform feature extraction, via the feature extractor, on the distorted image block set and the gradient image block set obtained by the data acquisition and segmentation unit to obtain the multi-scale content features and primary visual features of each distorted image, and to fuse the multi-scale content features and primary visual features of each distorted image to obtain the mixed visual features of each distorted image in the distorted image set.
The image quality model establishment unit is used to establish a support vector regressor, take the mixed visual features of all the distorted images obtained by the feature fusion unit as the training set and the scores of all the distorted images as the label set, where the score of each distorted image is the average of the quality scores of all the distorted image blocks it contains, train the support vector regressor, and take the trained regressor as the image quality model.
The image quality scoring unit is used to acquire the distorted image to be evaluated and obtain, through the data acquisition and segmentation unit, the distorted image block set to be evaluated and the gradient image block set to be evaluated; it inputs these sets into the feature extractor to obtain the multi-scale content features and primary visual features of the distorted image to be evaluated and fuses them into its mixed visual features; it is also used to input the mixed visual features of the distorted image to be evaluated into the image quality model to obtain its quality score.
Example 1
In this embodiment, two data sets, the TMID data set and the ESPL-LIVE data set, are used to verify the performance of the method. The TMID data set contains 120 images divided into 15 groups, each group containing an HDR image and 8 corresponding TMIs generated by different TMOs; the quality score of each distorted image in the TMID data set ranges from 1 to 8. The ESPL-LIVE data set contains 1811 HDR-processed images generated by three processing algorithms (tone mapping, multi-exposure fusion and post-processing), of which 747 TMIs were used in the experiment; the quality score of each distorted image in the ESPL-LIVE data set ranges from 1 to 100. This embodiment provides an image quality evaluation method and, on the basis of the above embodiment, discloses the following technical features:
specifically, in step 1, each image is divided into a plurality of non-overlapping image blocks with the same size, and each image block is resized to 224×224 to be sent into a multi-scale feature fusion network;
this example compares experimentally the seven NRIQA methods proposed by QAC, GM-LOG, BRISQUE, HIGRADE, chen, abhinau and He et al. The results of the experiment are shown in Table 1, wherein the Spearman correlation coefficient (Spearman rank correlation coefficient, SROCC) and Pearson correlation coefficient (Pearson Correlation Coefficient, PLCC) are the evaluation indexes of the experiment, the values are [0,1], and the higher the values are, the better the performance of the method is.
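The two evaluation indexes can be computed directly with SciPy; a minimal sketch follows (the function name is illustrative):

```python
from scipy.stats import spearmanr, pearsonr

def evaluate(predicted, mos):
    """Return (SROCC, PLCC) between predicted quality scores and subjective MOS."""
    srocc, _ = spearmanr(predicted, mos)
    plcc, _ = pearsonr(predicted, mos)
    return srocc, plcc
```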
Table 1 comparison results between different methods
[Table 1 is provided as an image in the original publication.]
The results in Table 1 show that, compared with the other seven NRIQA methods, the NRIQA method proposed by the invention achieves the best and relatively stable performance.
To further demonstrate that the innovations presented in the invention benefit the final results, ablation experiments were performed in this example; the results are shown in Tables 2 and 3.
Table 2 analysis of performance under single-scale and multi-scale feature representation
[Table 2 is provided as an image in the original publication.]
Table 3 performance analysis under the single-stream and two-stream models
[Table 3 is provided as an image in the original publication.]
Table 2 lists the performance under single-scale and multi-scale feature representations on the TMID and ESPL-LIVE data sets. Conv2-5 denotes extracting only the features of the corresponding module during feature extraction, and f_m denotes the multi-scale content features. As Table 2 shows, the result of any single scale alone is lower than the result of all scale features acting together, while the features of each scale are effective for the model and can serve as image quality perception features. The proposed multi-scale model takes the layered degradation into account and obtains the best results.
Table 3 lists the performance under the single-stream and two-stream models on the TMID and ESPL-LIVE data sets. IS denotes the image stream and TS the two-stream model, where TS_M denotes a gradient stream that also uses multi-scale content features and TS_P denotes a gradient stream that uses primary visual features. As Table 3 shows, introducing a gradient stream into the NRIQA method further enhances its performance and further verifies that image local structure statistics derived from primary visual features are highly correlated with perceived image quality.
FIG. 5 shows different distortion maps and their corresponding gradient maps. FIGS. 5(a), (c) and (e) are tone-mapped images of the same scene with different distortions, and FIGS. 5(b), (d) and (f) are the corresponding gradient maps. The gradient maps clearly reflect the structural components of the image, such as its edges. The tone-mapped images in the first and second rows exhibit different degrees of halation and appear overexposed, losing image detail and color information, which hinders recognition and degrades perceived quality; the image in the third row appears more natural and recognizable. The gradient map thus clearly reveals the edges of the image and reflects its degree of distortion.
Thus, the innovations presented in the invention benefit the final result and further enhance the performance of the tone-mapped image quality evaluation model.

Claims (1)

1. A tone-mapped image quality evaluation method, comprising the following steps:
step one: obtaining the distorted image to be evaluated, computing its gradient map with the Sobel operator to obtain the gradient map to be evaluated, and partitioning the distorted image to be evaluated and the gradient map to be evaluated into blocks, respectively, to obtain the distorted image block set to be evaluated and the gradient image block set to be evaluated;
step two: inputting the distorted image block set to be evaluated and the gradient image block set to be evaluated, respectively, into a feature extractor obtained by the tone-mapped image mixed visual feature extraction model establishment method to obtain the multi-scale content features and primary visual features of the distorted image to be evaluated, and fusing the multi-scale content features and primary visual features of the distorted image to be evaluated to obtain its mixed visual features;
step three: inputting the mixed visual features of the distorted image to be evaluated into an image quality model obtained by the tone-mapped image mixed visual feature extraction model establishment method to obtain the quality score of the distorted image to be evaluated;
wherein the tone-mapped image mixed visual feature extraction model establishment method comprises the following steps:
step 1: obtaining a distorted image set and the quality score of each distorted image in the set, and computing the gradient map corresponding to each distorted image with the Sobel operator to obtain a gradient image set; partitioning each distorted image in the distorted image set and each gradient image in the gradient image set into blocks, respectively, to obtain a distorted image block set and a gradient image block set; and setting the quality score of each distorted image block to the quality score of the distorted image from which it was divided;
step 2: establishing a feature extraction network based on ResNet-50; taking the distorted image block set as the training set and the quality score of each distorted image block as the label set, training the feature extraction network, and taking the trained network as the feature extractor; the feature extraction network comprises a residual block layer, a convolution layer and a global average pooling layer, wherein the residual block layer comprises the Conv1, Conv2, Conv3, Conv4 and Conv5 layers, and the convolution layer comprises three 1×1 convolutions and one 3×3 convolution;
step 3: inputting the distorted image block set and the gradient image block set obtained in step 1, respectively, into the feature extractor obtained in step 2 for feature extraction to obtain the multi-scale content features and primary visual features of each distorted image, and fusing the multi-scale content features and primary visual features of each distorted image to obtain the mixed visual features of each distorted image in the distorted image set; the primary visual features are extracted through the Conv1 layer of the feature extraction network;
step 4: establishing a support vector regressor; taking the mixed visual features of all the distorted images obtained in step 3 as the training set and the scores of all the distorted images as the label set, where the score of each distorted image is the average of the quality scores of all the distorted image blocks it contains, training the support vector regressor, and taking the trained regressor as the image quality model.
CN202110300592.9A 2021-03-22 2021-03-22 Tone mapping image mixed visual feature extraction model establishment and quality evaluation method Active CN113128517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110300592.9A CN113128517B (en) 2021-03-22 2021-03-22 Tone mapping image mixed visual feature extraction model establishment and quality evaluation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110300592.9A CN113128517B (en) 2021-03-22 2021-03-22 Tone mapping image mixed visual feature extraction model establishment and quality evaluation method

Publications (2)

Publication Number Publication Date
CN113128517A CN113128517A (en) 2021-07-16
CN113128517B true CN113128517B (en) 2023-06-13

Family

ID=76773578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110300592.9A Active CN113128517B (en) 2021-03-22 2021-03-22 Tone mapping image mixed visual feature extraction model establishment and quality evaluation method

Country Status (1)

Country Link
CN (1) CN113128517B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463318B (en) * 2022-02-14 2022-10-14 宁波大学科学技术学院 Visual quality evaluation method for multi-exposure fusion image
CN114863241A (en) * 2022-04-22 2022-08-05 厦门大学 Movie and television animation evaluation method based on spatial layout and deep learning


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI579800B (en) * 2013-04-10 2017-04-21 國立清華大學 Image processing method applicable to images captured by wide-angle zoomable lens
CN107172418B * 2017-06-08 2019-01-04 宁波大学 Tone-mapped image quality evaluation method based on exposure status analysis

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090902A * 2017-12-30 2018-05-29 中国传媒大学 No-reference image quality assessment method based on multi-scale generative adversarial network
CN108391121A * 2018-04-24 2018-08-10 中国科学技术大学 No-reference stereoscopic image quality evaluation method based on deep neural network
CN112132774A (en) * 2019-07-29 2020-12-25 方玉明 Quality evaluation method of tone mapping image
CN111429402A (en) * 2020-02-25 2020-07-17 西北大学 Image quality evaluation method for fusing advanced visual perception features and depth features

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Chandra Sekhar Ravuri et al., "Deep No-Reference Tone Mapped Image Quality Assessment", arXiv, 2020-02-08, pp. 1-5 *
Yang Fuzheng et al., "Full-reference video quality assessment method based on image content distortion", Journal of Xidian University, 2005-12-25 (No. 06), pp. 47-50 *
Cao Xin et al., "No-reference image quality assessment combined with sharpness", Computer & Digital Engineering, 2020-04-20 (No. 04), pp. 193-198 *

Also Published As

Publication number Publication date
CN113128517A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
Ying et al. From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality
CN111709902B (en) Infrared and visible light image fusion method based on self-attention mechanism
CN112734646B (en) Image super-resolution reconstruction method based on feature channel division
EP4105877A1 (en) Image enhancement method and image enhancement apparatus
CN108830796B (en) Hyperspectral image super-resolution reconstruction method based on spectral-spatial combination and gradient domain loss
CN109146831A (en) Remote sensing image fusion method and system based on double branch deep learning networks
CN109671023A (en) A kind of secondary method for reconstructing of face image super-resolution
CN113128517B (en) Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
CN110070517A (en) Blurred picture synthetic method based on degeneration imaging mechanism and generation confrontation mechanism
CN105550989B (en) The image super-resolution method returned based on non local Gaussian process
CN108875900A (en) Method of video image processing and device, neural network training method, storage medium
CN108447059B (en) Full-reference light field image quality evaluation method
CN112767385B (en) No-reference image quality evaluation method based on significance strategy and feature fusion
CN110163855B (en) Color image quality evaluation method based on multi-path deep convolutional neural network
CN113658091A (en) Image evaluation method, storage medium and terminal equipment
CN112508847A (en) Image quality evaluation method based on depth feature and structure weighted LBP feature
CN110415816B (en) Skin disease clinical image multi-classification method based on transfer learning
CN116524387A (en) Ultra-high definition video compression damage grade assessment method based on deep learning network
CN114898096A (en) Segmentation and annotation method and system for figure image
CN114998252A (en) Image quality evaluation method based on electroencephalogram signals and memory characteristics
CN113077385A (en) Video super-resolution method and system based on countermeasure generation network and edge enhancement
CN111127587A (en) Non-reference image quality map generation method based on countermeasure generation network
Li et al. Delving Deeper Into Image Dehazing: A Survey
CN117291855B (en) High resolution image fusion method
Luo et al. LCDA-Net: Efficient Image Dehazing with Contrast-Regularized and Dilated Attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant