CN116403063A - No-reference screen content image quality assessment method based on multi-region feature fusion - Google Patents
No-reference screen content image quality assessment method based on multi-region feature fusion
- Publication number: CN116403063A
- Application number: CN202310398032.0A
- Authority: CN
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space; mappings, e.g. subspace methods
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- Y02P90/30 — Computing systems specially adapted for manufacturing
Abstract
The invention relates to a non-reference screen content image quality assessment method based on multi-region feature fusion, comprising the following steps: preprocessing the data in a distorted screen content image dataset; designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of distorted screen content image blocks and fuses the text-region features with the image-region features based on an attention mechanism; designing a local image information interaction module that enhances information interaction between any two image blocks of a distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block; designing a non-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a non-reference screen content image quality evaluation model; and inputting the distorted screen content image to be evaluated into the trained non-reference screen content image quality evaluation model to output the corresponding quality evaluation score.
Description
Technical Field
The invention relates to the field of image processing and computer vision, in particular to a non-reference screen content image quality assessment method based on multi-region feature fusion.
Background
In recent years, with the rapid development of mobile devices, multimedia applications and information dissemination technologies, remote computing and communication have become widespread, and the real-time screen content images generated by various electronic terminal devices appear ever more frequently in daily life. A screen content image is a new type of image that, compared with a traditional natural image, contains more text, icons and graphics and has special structural layout information and statistical characteristics; the number of sharp edges in a screen content image far exceeds that of a natural image, while its color information is sparser. During the encoding, compression and transmission of screen content images, distortions of various degrees are inevitably introduced by technical or hardware limitations, degrading image quality and in turn harming user experience and interactive performance. Given the demand for clear, high-quality images, an image quality evaluation method that can effectively evaluate the perceived quality of screen content images is needed; such a method can serve as auxiliary reference information for image restoration and enhancement techniques, and also provides a feasible basis for designing and optimizing advanced image/video processing algorithms.
Image quality evaluation methods can be divided into subjective and objective evaluation according to the evaluating subject. Subjective evaluation means that image quality is scored by people, the final recipients of the image information; however, it is easily affected by the subjects' own biases, and in an era of explosive data growth, subjectively evaluating tens of thousands of images is unrealistic, so it can hardly meet the needs of real-time applications. Objective evaluation extracts and analyzes relevant features of the distorted image, constructs a corresponding mathematical model, and computes a quality evaluation score for the distorted image. Compared with subjective evaluation, this process needs no large panels of human scorers; a computer stands in for the human visual system to achieve automatic and efficient quality evaluation, and can therefore meet the demands of big-data applications. According to how much reference image information the quality evaluation process requires, objective methods are divided into full-reference, reduced-reference and non-reference methods, whose dependence on reference image information decreases in that order. In practice, a distortion-free reference image is often hard to obtain, so non-reference image quality evaluation is the most practical and has the broadest development prospects.
With the continuous development of deep learning, many screen content image quality evaluation models based on convolutional neural networks have appeared in recent years. Deep learning is a data-driven approach, yet existing screen content image datasets usually contain only a small number of distorted images, so data augmentation is mainly achieved by partitioning each screen content image into blocks. A single image block, however, cannot fully characterize the quality of the whole distorted image; and because screen content images contain large text regions and image regions simultaneously, blocks with different content can show large quality differences even under identical distortion. It is therefore necessary to fully account for how differently the image regions and text regions of a screen content image affect its overall visual quality, to design a weighting strategy reflecting that difference, and to fuse the image features of both region types effectively, thereby improving the accuracy of non-reference screen content image quality evaluation models.
Disclosure of Invention
The invention aims to provide a non-reference screen content image quality assessment method based on multi-region feature fusion. Following the idea of multi-region local feature fusion, the features of several local image blocks are fused to represent the overall quality of the image, reducing the quality-score deviation caused by training on a single image block. Meanwhile, in the convolution layers of the network, convolution kernels of different sizes extract the different characteristics of the text regions and image regions of a screen content image more effectively, an attention mechanism fuses the multi-scale text and image features, and each image block can be given a different degree of attention. Overall, the method achieves higher consistency with subjective visual perception than other methods.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a non-reference screen content image quality assessment method based on multi-region feature fusion, comprising the following steps:
step S1, preprocessing the data in the distorted screen content image dataset: first cropping image blocks from each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data augmentation on the training set;
step S2, designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text-region features with the image-region features based on an attention mechanism;
step S3, designing a local image information interaction module that enhances information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block;
step S4, designing a non-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a non-reference screen content image quality evaluation model based on multi-region feature fusion;
and step S5, inputting the distorted screen content image to be evaluated into the trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
In an embodiment of the present invention, the step S1 is specifically implemented as follows:
step S11, first cropping image blocks from each distorted screen content image I in the distorted screen content image dataset; specifically, each distorted screen content image I is divided into four regions (upper-left, upper-right, lower-left and lower-right), and an image block of size H×W is randomly cropped from each region, the blocks being denoted I_1, I_2, I_3 and I_4, where H and W denote the height and width of an image block, respectively;
step S12, dividing the distorted screen content images in the distorted screen content image data set into a training set and a testing set according to a preset proportion;
step S13, for each distorted screen content image I_train in the training set, applying a shared horizontal random flip and normalization to its four cropped image blocks to complete the data augmentation and obtain the distorted screen content image blocks for training, denoted I_train^1, I_train^2, I_train^3 and I_train^4; for each distorted screen content image I_test in the test set, applying the same normalization to its four cropped image blocks to obtain the distorted screen content image blocks for testing, denoted I_test^1, I_test^2, I_test^3 and I_test^4.
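The following is a minimal sketch of this preprocessing step. The patent names no framework; PyTorch/torchvision is assumed here, the CxHxW tensor layout, helper names and the requirement that each quadrant be at least H×W are illustrative assumptions:

```python
import random
import torch
import torchvision.transforms.functional as TF

def crop_four_patches(img, patch_h, patch_w):
    """Randomly crop one patch_h x patch_w block from each quadrant
    (upper-left, upper-right, lower-left, lower-right) of a CxHxW image.
    Assumes each quadrant is at least patch_h x patch_w."""
    _, height, width = img.shape
    patches = []
    for top0, left0 in [(0, 0), (0, width // 2),
                        (height // 2, 0), (height // 2, width // 2)]:
        top = top0 + random.randint(0, height // 2 - patch_h)
        left = left0 + random.randint(0, width // 2 - patch_w)
        patches.append(img[:, top:top + patch_h, left:left + patch_w])
    return patches  # [I_1, I_2, I_3, I_4]

def augment_train(patches, mean, std):
    """Apply one shared horizontal flip to all four patches, then normalize;
    test patches would receive only the normalization."""
    if random.random() < 0.5:
        patches = [TF.hflip(p) for p in patches]
    return [TF.normalize(p, mean, std) for p in patches]
```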
In an embodiment of the present invention, the step S2 is specifically implemented as follows:
step S21, designing a text feature extraction submodule consisting of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; the 3×3 convolution layer performs feature extraction on the text regions of a distorted screen content image block; the feature map input to the text feature extraction submodule is denoted F_t, of size H_t × W_t × C_t, where H_t, W_t and C_t denote the height, width and number of channels of the input feature map F_t, respectively; specifically, the feature map F_t is first passed through a 3×3 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_t1 of size H_t × W_t × C′_t, where H_t, W_t and C′_t denote the height, width and number of channels of F′_t1, respectively; the input feature map F_t is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_t2 of size H_t × W_t × C′_t, the same size as F′_t1; finally, the intermediate feature maps F′_t1 and F′_t2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_t of the text feature extraction submodule, of size H_t × W_t × C′_t; the specific calculation formulas are as follows:

F′_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))

F′_t2 = BN(Conv3(F_t))

F′_t = LeakyReLU(F′_t1 ⊕ F′_t2)

where Conv1(·) denotes the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
step S22, designing an image feature extraction submodule consisting of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; the 5×5 convolution layer performs feature extraction on the image regions of a distorted screen content image block; the feature map input to the image feature extraction submodule is denoted F_p, of size H_p × W_p × C_p, where H_p, W_p and C_p denote the height, width and number of channels of the input feature map F_p, respectively; specifically, the feature map F_p is first passed through a 5×5 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_p1 of size H_p × W_p × C′_p, where H_p, W_p and C′_p denote the height, width and number of channels of F′_p1, respectively; the input feature map F_p is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_p2 of size H_p × W_p × C′_p, the same size as F′_p1; finally, the intermediate feature maps F′_p1 and F′_p2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_p of the image feature extraction submodule, of size H_p × W_p × C′_p; the specific calculation formulas are as follows:

F′_p1 = BN(Conv2(LeakyReLU(BN(Conv1′(F_p)))))

F′_p2 = BN(Conv3(F_p))

F′_p = LeakyReLU(F′_p1 ⊕ F′_p2)

where Conv1′(·) denotes the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
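A minimal PyTorch-style sketch of the submodules of steps S21 and S22 (framework assumed): a single parameterized module, where kernel_size = 3 reproduces the text feature extraction submodule and kernel_size = 5 the image feature extraction submodule; the padding choice is an assumption made to preserve spatial size:

```python
import torch.nn as nn

class FeatureExtractionSubmodule(nn.Module):
    """Main branch: kxk conv -> BN -> LeakyReLU -> 1x1 conv -> BN;
    residual branch: 1x1 conv -> BN; output is LeakyReLU(main + residual)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad),  # Conv1 / Conv1'
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),                        # Conv2
            nn.BatchNorm2d(out_ch),
        )
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1),                         # Conv3
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.residual(x))
```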
step S23, designing an attention feature fusion submodule consisting of four convolution layers with kernel size 1×1, one global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers; through learning, the attention feature fusion submodule fuses text features and image features of different scales; the two features input to the attention feature fusion submodule are denoted F′_t and F′_p, both of size H_a × W_a × C_a, where H_a, W_a and C_a denote the height, width and number of channels of the input features F′_t and F′_p, respectively; specifically, the two input features are first added pixel by pixel to obtain the intermediate feature map F_b, of size H_a × W_a × C_a; the intermediate feature map F_b is then fed into a local attention extraction branch and a global attention extraction branch to extract different attention features, where the local attention extraction branch is a series connection of a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer, and the global attention extraction branch is a series connection of a global average pooling layer, a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer; the output of the local attention extraction branch applied to F_b is denoted F_local and the output of the global attention extraction branch is denoted F_global, both of size H_a × W_a × C_a; the features F_local and F_global are then added pixel by pixel and passed through a Sigmoid function to obtain the corresponding learnable weight λ; finally, the learnable weight λ is used to weight and fuse the input features F′_t and F′_p, giving the final output F′_b of the attention feature fusion submodule, of size H_a × W_a × C_a; the specific calculation formulas are as follows:

F_b = F′_t ⊕ F′_p

F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))

F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))

λ = Sigmoid(F_local ⊕ F_global)

F′_b = λ ⊙ F′_t ⊕ (1−λ) ⊙ F′_p

where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) denote the four convolution layers with kernel size 1×1, ⊕ denotes matrix addition, ⊙ denotes element-wise multiplication, BN(·) denotes the batch normalization operation, GAP(·) denotes the global average pooling operation, ReLU(·) denotes the ReLU activation function, Sigmoid(·) denotes the Sigmoid activation function, and λ is a learnable weight output by the network;
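A sketch of the attention feature fusion submodule under the same PyTorch assumption; the channel reduction ratio inside the two branches and the complementary weighting λ·F′_t + (1−λ)·F′_p follow the reconstruction above and are assumptions where the original formulas are unavailable:

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    """Fuse text features f_t and image features f_p with a learned
    weight lambda = Sigmoid(F_local + F_global)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.local_branch = nn.Sequential(               # per-pixel attention
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.global_branch = nn.Sequential(              # GAP-based attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, f_t, f_p):
        f_b = f_t + f_p                                  # pixel-wise sum of inputs
        lam = torch.sigmoid(self.local_branch(f_b) + self.global_branch(f_b))
        return lam * f_t + (1.0 - lam) * f_p             # weighted fusion
```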
step S24, designing a channel attention submodule to enhance the feature representation and capture the key channel information of the input features; the submodule consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function; the feature map input to the channel attention submodule is denoted F_c, of size H_c × W_c × C_c, where H_c, W_c and C_c denote the height, width and number of channels of the input feature map F_c, respectively; specifically, the input feature F_c is first aggregated by a global average pooling operation, then passed through a 1×1 convolution layer for dimension reduction and another 1×1 convolution layer for dimension restoration, after which a Sigmoid function yields the corresponding channel attention weights; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F′_c of the channel attention submodule, of size H_c × W_c × C_c, the same size as the input feature map F_c; the specific calculation formula is as follows:

F′_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c

where GAP(·) denotes the global average pooling operation, Conv_1b(·) and Conv_2b(·) denote the two convolution layers with kernel size 1×1, ⊙ denotes element-wise multiplication, Sigmoid(·) denotes the Sigmoid activation function, and ReLU(·) denotes the ReLU activation function;
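A sketch of the channel attention submodule, again assuming PyTorch; the reduction ratio is an assumption, since the patent does not state the intermediate channel count:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: GAP -> 1x1 conv (reduce) -> ReLU ->
    1x1 conv (restore) -> Sigmoid, then scale the input element-wise."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.gap(x))   # weights broadcast over H and W
```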
step S25, designing an adaptive feature extraction module comprising four text feature extraction submodules as in step S21, four image feature extraction submodules as in step S22, four attention feature fusion submodules as in step S23, one channel attention submodule as in step S24 and eight spatial average pooling layers with stride 2; through a text feature extraction branch and an image feature extraction branch, the adaptive feature extraction module adaptively extracts multi-scale text features and image features from the input distorted screen content image block, and fuses the different types of image features through an attention mechanism; specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer;

the size of the input distorted screen content image block is denoted H×W×3; the block is first fed into the text feature extraction branch and the image feature extraction branch to extract multi-scale text features and image features; the multi-level text features output by the text feature extraction branch are denoted {F_t^1, F_t^2, F_t^3, F_t^4}, where F_t^1 has size H/2 × W/2 × C′, F_t^2 has size H/4 × W/4 × 2C′, F_t^3 has size H/8 × W/8 × 4C′, F_t^4 has size H/16 × W/16 × 8C′, and C′ = 64; the multi-level image features output by the image feature extraction branch are denoted {F_p^1, F_p^2, F_p^3, F_p^4}, with F_p^i having the same size as F_t^i at each level; the multi-level text features F_t^i and the corresponding multi-level image features F_p^i are then fed into the four attention feature fusion submodules, which fuse the text and image features to give the fused multi-level trunk features {F_b^1, F_b^2, F_b^3, F_b^4}, with F_b^i having the same size as F_t^i at each level; a global average pooling operation is then applied to each F_b^i, and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F′_tp of the input image block, of size 1 × 1 × 15C′; the specific calculation formula is as follows:

F′_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))

where Concat(·) denotes feature concatenation and GAP(·) denotes the global average pooling operation; finally, the fused multi-scale image-text feature F′_tp is fed into the channel attention submodule to capture the key information across channels, giving the final output feature F_tp of the adaptive feature extraction module, which is then flattened into a one-dimensional vector of size 1 × D, where D denotes the feature dimension of each image block and D = 960.
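Putting the pieces together, a sketch of the adaptive feature extraction module reusing the three module sketches above; the per-stage channel progression C′, 2C′, 4C′, 8C′ is inferred from the concatenated size 15C′ (15 × 64 = 960 = D) and is otherwise an assumption:

```python
import torch
import torch.nn as nn

class AdaptiveFeatureExtraction(nn.Module):
    """Four stages of (text submodule + avg-pool) and (image submodule +
    avg-pool) run in parallel branches; each stage pair is fused by an
    attention feature fusion submodule, the fused trunk features are
    globally pooled and concatenated (C'+2C'+4C'+8C' = 15C'), then
    refined by channel attention and flattened to a 1-D vector of size D."""
    def __init__(self, base_ch=64):
        super().__init__()
        chans = [3, base_ch, 2 * base_ch, 4 * base_ch, 8 * base_ch]
        self.text_stages = nn.ModuleList()
        self.image_stages = nn.ModuleList()
        self.fusions = nn.ModuleList()
        for i in range(4):
            self.text_stages.append(nn.Sequential(
                FeatureExtractionSubmodule(chans[i], chans[i + 1], 3),
                nn.AvgPool2d(2)))                        # stride-2 pooling
            self.image_stages.append(nn.Sequential(
                FeatureExtractionSubmodule(chans[i], chans[i + 1], 5),
                nn.AvgPool2d(2)))
            self.fusions.append(AttentionFeatureFusion(chans[i + 1]))
        self.channel_attn = ChannelAttention(15 * base_ch)

    def forward(self, x):
        f_t, f_p, pooled = x, x, []
        for text, image, fuse in zip(self.text_stages, self.image_stages,
                                     self.fusions):
            f_t, f_p = text(f_t), image(f_p)             # per-branch streams
            f_b = fuse(f_t, f_p)                         # fused trunk feature
            pooled.append(f_b.mean(dim=(2, 3), keepdim=True))  # GAP per level
        f_tp = self.channel_attn(torch.cat(pooled, dim=1))     # 15C' channels
        return f_tp.flatten(1)                           # (batch, D) with D=960
```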
In an embodiment of the present invention, the step S3 is specifically implemented as follows:
designing a local image information interaction module consisting of four fully connected layers and a Softmax function; the local image information interaction module uses a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block receives a different degree of attention and the local features of the image blocks are better aggregated; specifically, the input feature of the local image information interaction module is denoted F_l, of size N × D, where N denotes the number of image blocks, N = 4, and D denotes the feature dimension of each image block, D = 960; first, the input feature F_l is fed into three fully connected layers to generate three new intermediate features F_Q, F_K and F_V, all of size N × D′, where D′ denotes the second dimension of the intermediate features F_Q, F_K and F_V, D′ = 480; then a matrix multiplication is performed between F_Q and the transpose of F_K and a Softmax function is applied to generate the attention map A, of size N × N; the intermediate feature F_V is then matrix-multiplied with the attention map A to obtain the two-dimensional feature matrix S, of size N × D′; the two-dimensional feature matrix S is then fed into a fully connected layer to obtain the feature matrix F_s, of size N × D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection to give the final output F′_l of the local image information interaction module; the specific calculation formulas are as follows:

F_Q = Linear1(F_l)

F_K = Linear2(F_l)

F_V = Linear3(F_l)

A = Softmax(F_Q ⊗ Transpose(F_K))

S = A ⊗ F_V

F_s = Linear4(S)

F′_l = α ⊙ F_s ⊕ F_l

where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) denote the four fully connected layers, Softmax(·) denotes the Softmax function, Transpose(·) denotes the transpose of a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α denotes a learnable scale parameter used for fusion, and F′_l is the output feature of the local image information interaction module, of size N × D, the same size as the input feature F_l.
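A sketch of the local image information interaction module under the same PyTorch assumption; initializing α to zero is an assumption, not stated in the patent:

```python
import torch
import torch.nn as nn

class LocalImageInteraction(nn.Module):
    """Self-attention over the N patch features:
    A = Softmax(Q K^T), S = A V, F_s = Linear4(S),
    output = alpha * F_s + input (residual connection)."""
    def __init__(self, dim=960, inner_dim=480):
        super().__init__()
        self.q = nn.Linear(dim, inner_dim)    # Linear1
        self.k = nn.Linear(dim, inner_dim)    # Linear2
        self.v = nn.Linear(dim, inner_dim)    # Linear3
        self.out = nn.Linear(inner_dim, dim)  # Linear4
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scale

    def forward(self, f_l):                   # f_l: (..., N, D) patch features
        q, k, v = self.q(f_l), self.k(f_l), self.v(f_l)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # (..., N, N)
        s = attn @ v                          # (..., N, D')
        return self.alpha * self.out(s) + f_l
```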
In an embodiment of the present invention, the step S4 is specifically implemented as follows:
step S41, designing a non-reference image quality evaluation network based on multi-region feature fusion, comprising the four adaptive feature extraction modules of step S25, the local image information interaction module of step S3 and a fully connected layer; the four image blocks I_train^1, I_train^2, I_train^3 and I_train^4 corresponding to each distorted screen content image in the training set obtained in step S13 serve as the input of the network, each of size H×W×3; specifically, the four image blocks are first fed into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block, the output feature of the ith adaptive feature extraction module for the ith input image block I_train^i being denoted F_i, of size 1 × D; the four one-dimensional output features F_i are then stacked into a two-dimensional feature matrix to obtain the initial fused feature F, of size N × D; the initial fused feature F is then fed into the local image information interaction module to strengthen the information interaction between the image blocks, giving the final output feature F_out of the network, of size N × D, the same size as the initial fused feature F;
step S42, performing a dimension transformation on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose size changes from N × D to 1 × C, where C = N × D; the flattened one-dimensional feature vector is then fed into a fully connected layer to obtain the quality evaluation score F_score of the distorted screen content image; the specific calculation formula is as follows:

F_score = Linear(Reshape(F_out))

where Linear(·) denotes a fully connected layer and Reshape(·) denotes the dimension transformation operation;
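A sketch of the overall network of steps S41 and S42, reusing the module sketches above; the batched input layout is an assumption:

```python
import torch
import torch.nn as nn

class MultiRegionIQANetwork(nn.Module):
    """Four adaptive feature extraction modules (one per patch), the local
    image information interaction module, then a fully connected regressor."""
    def __init__(self, n_patches=4, dim=960):
        super().__init__()
        self.extractors = nn.ModuleList(
            [AdaptiveFeatureExtraction() for _ in range(n_patches)])
        self.interaction = LocalImageInteraction(dim=dim)
        self.regressor = nn.Linear(n_patches * dim, 1)   # quality score head

    def forward(self, patches):               # list of four (B, 3, H, W) patches
        feats = [ext(p) for ext, p in zip(self.extractors, patches)]
        f = torch.stack(feats, dim=1)          # initial fused feature (B, N, D)
        f_out = self.interaction(f)            # (B, N, D)
        return self.regressor(f_out.flatten(1)).squeeze(-1)  # F_score, (B,)
```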
step S43, designing a loss function of the non-reference image quality evaluation network based on multi-region feature fusion, which is specifically as follows:
where n is the number of samples in the training set, y_i denotes the true quality score of the ith distorted screen content image, and ŷ_i denotes the predicted quality score output by the network for the ith distorted screen content image;
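The explicit loss formula appears only as an image in the source text; assuming an ordinary mean-squared-error regression objective over the variables defined above, it would take the form

L = (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)²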
and S44, repeating the steps S41 to S43 by taking a batch as a unit until the loss value calculated in the step S43 converges and tends to be stable, saving network parameters, and completing the training process of the non-reference image quality evaluation network based on multi-region feature fusion.
In an embodiment of the present invention, the step S5 is specifically implemented as follows:
The four image blocks I_test^1, I_test^2, I_test^3 and I_test^4 corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation score is output.
Compared with the prior art, the invention has the following beneficial effects: the invention addresses the quality-score deviation caused when convolutional-neural-network-based image quality evaluation models are trained on individual blocks of a distorted screen content image, as well as the problem of extracting and fusing features from the different types of regions in a screen content image. To this end, the invention provides a non-reference screen content image quality evaluation method based on multi-region feature fusion, which represents the overall quality of an image by fusing the local features of several regions of the distorted image and performs adaptive feature extraction with convolution kernels of different sizes for the different region types. In addition, the method uses an attention mechanism to adaptively fuse the different types of image features and enhances the information interaction between the local image blocks, thereby effectively improving the accuracy of the non-reference screen content image quality evaluation model.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the invention.
Fig. 2 is a diagram illustrating a network model structure according to an embodiment of the present invention.
Fig. 3 is a block diagram of a text feature extraction submodule according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of an image feature extraction submodule according to an embodiment of the present invention.
FIG. 5 is a diagram of an attention feature fusion submodule architecture according to an embodiment of the present invention.
Fig. 6 is a block diagram of an adaptive feature extraction module according to an embodiment of the invention.
Fig. 7 is a block diagram of a local image information interaction module according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
The invention provides a multi-region feature fusion-based reference-free screen content image quality assessment method, which is shown in fig. 1 and comprises the following steps:
step S1, preprocessing the data in the distorted screen content image dataset: first cropping image blocks from each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data augmentation on the training set;
step S2, designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text-region features with the image-region features based on an attention mechanism;
step S3, designing a local image information interaction module that enhances information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block;
step S4, designing a non-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a non-reference screen content image quality evaluation model;
and step S5, inputting the distorted screen content image to be evaluated into the trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
FIG. 2 is a diagram of a network model constructed by the method of the present invention.
Further, step S1 includes the steps of:
step S11, first cropping image blocks from each distorted image I in the distorted screen content image dataset. Specifically, each distorted image I is divided into four regions (upper-left, upper-right, lower-left and lower-right), and an image block of size H×W is randomly cropped from each region, the blocks being denoted I_1, I_2, I_3 and I_4, where H and W denote the height and width of an image block, respectively;
s12, dividing images in the distorted screen content image dataset into a training set and a testing set according to a certain proportion;
Step S13, for each distorted screen content image I_train in the training set, a shared horizontal random flip and normalization are applied to its four cropped image blocks, completing the data augmentation and yielding the distorted screen content image blocks for training, denoted I_train^1, I_train^2, I_train^3 and I_train^4; for each distorted screen content image I_test in the test set, the same normalization is applied to its four cropped image blocks, yielding the distorted screen content image blocks for testing, denoted I_test^1, I_test^2, I_test^3 and I_test^4.
further, step S2 includes the steps of:
step S21, designing a text feature extraction submodule, as shown in fig. 3, consisting of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. Since smaller convolution kernels better capture the local detail information of an image, such as characters and lines, a convolution layer with kernel size 3×3 is used to extract features from the text regions of the image. The feature map input to this submodule is denoted F_t, of size H_t × W_t × C_t, where H_t, W_t and C_t denote the height, width and number of channels of the input feature map F_t, respectively. Specifically, the feature map F_t is first passed through a 3×3 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_t1 of size H_t × W_t × C′_t, where H_t, W_t and C′_t denote the height, width and number of channels of F′_t1, respectively; the input feature map F_t is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_t2 of size H_t × W_t × C′_t, the same size as F′_t1; finally, the intermediate feature maps F′_t1 and F′_t2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_t of the text feature extraction submodule, of size H_t × W_t × C′_t. The specific calculation formulas are as follows:

F′_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))

F′_t2 = BN(Conv3(F_t))

F′_t = LeakyReLU(F′_t1 ⊕ F′_t2)

where Conv1(·) denotes the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
in step S22, an image feature extraction submodule is designed, as shown in fig. 4, consisting of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. Since a larger convolution kernel is better suited to capturing the visual features of an image under a larger receptive field, such as its overall texture and color, a convolution layer with kernel size 5×5 is used to extract features from the image regions. The feature map input to this submodule is denoted F_p, of size H_p × W_p × C_p, where H_p, W_p and C_p denote the height, width and number of channels of the input feature map F_p, respectively. Specifically, the feature map F_p is first passed through a 5×5 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_p1 of size H_p × W_p × C′_p, where H_p, W_p and C′_p denote the height, width and number of channels of F′_p1, respectively; the input feature map F_p is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_p2 of size H_p × W_p × C′_p, the same size as F′_p1; finally, the intermediate feature maps F′_p1 and F′_p2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_p of the image feature extraction submodule, of size H_p × W_p × C′_p. The specific calculation formulas are as follows:

F′_p1 = BN(Conv2(LeakyReLU(BN(Conv1′(F_p)))))

F′_p2 = BN(Conv3(F_p))

F′_p = LeakyReLU(F′_p1 ⊕ F′_p2)

where Conv1′(·) denotes the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
in step S23, an attention feature fusion submodule is designed, as shown in fig. 5, consisting of four convolution layers with kernel size 1×1, one global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers. Through learning, the attention feature fusion submodule effectively fuses text features and image features of different scales, so that the model focuses on the key information of the feature maps and its generalization improves. The two feature maps input to this submodule are denoted F′_t and F′_p, both of size H_a × W_a × C_a, where H_a, W_a and C_a denote the height, width and number of channels of the input features F′_t and F′_p, respectively. Specifically, the two input feature maps are first added pixel by pixel to obtain the intermediate feature map F_b, of size H_a × W_a × C_a; the feature map F_b is then fed into a local attention extraction branch and a global attention extraction branch to extract different attention features, where the local attention extraction branch is a series connection of a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer, and the global attention extraction branch is a series connection of a global average pooling layer, a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer. The output of the local attention extraction branch applied to F_b is denoted F_local and the output of the global attention extraction branch is denoted F_global, both of size H_a × W_a × C_a; the feature maps F_local and F_global are then added pixel by pixel and passed through a Sigmoid function to obtain the corresponding learnable weight λ; finally, the weight λ is used to weight and fuse the input feature maps F′_t and F′_p, giving the final output F′_b of the attention feature fusion submodule, of size H_a × W_a × C_a. The specific calculation formulas are as follows:

F_b = F′_t ⊕ F′_p

F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))

F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))

λ = Sigmoid(F_local ⊕ F_global)

F′_b = λ ⊙ F′_t ⊕ (1−λ) ⊙ F′_p

where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) denote the four convolution layers with kernel size 1×1, ⊕ denotes matrix addition, ⊙ denotes element-wise multiplication, BN(·) denotes the batch normalization operation, GAP(·) denotes the global average pooling operation, ReLU(·) denotes the ReLU activation function, Sigmoid(·) denotes the Sigmoid activation function, and λ is a learnable weight output by the network;
step S24, designing a channel attention submodule to enhance the feature representation and capture the key channel information of the input features. The submodule consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function. The feature map input to this submodule is denoted F_c, of size H_c × W_c × C_c, where H_c, W_c and C_c denote the height, width and number of channels of the input feature map F_c, respectively. Specifically, the input feature F_c is first aggregated by a global average pooling operation, then passed through a 1×1 convolution layer for dimension reduction and another 1×1 convolution layer for dimension restoration, after which a Sigmoid function yields the corresponding channel attention weights; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F′_c of the channel attention submodule, of size H_c × W_c × C_c, the same size as the input feature map F_c. The specific calculation formula is as follows:

F′_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c

where GAP(·) denotes the global average pooling operation, Conv_1b(·) and Conv_2b(·) denote the two convolution layers with kernel size 1×1, ⊙ denotes element-wise multiplication, Sigmoid(·) denotes the Sigmoid activation function, and ReLU(·) denotes the ReLU activation function;
step S25, designing an adaptive feature extraction module, as shown in fig. 6, comprising four text feature extraction submodules as described in step S21, four image feature extraction submodules as described in step S22, four attention feature fusion submodules as described in step S23, one channel attention submodule as described in step S24 and eight spatial average pooling layers with stride 2. Through a text feature extraction branch and an image feature extraction branch, the adaptive feature extraction module adaptively extracts multi-scale text features and image features from the input screen content image block, and effectively fuses the different types of image features through an attention mechanism to improve the modeling capability of the model. Specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer.

The size of the input screen content image block is H×W×3; the block is first fed into the text feature extraction branch and the image feature extraction branch to extract multi-scale text features and image features. The multi-level text features output by the text feature extraction branch are denoted {F_t^1, F_t^2, F_t^3, F_t^4}, where F_t^1 has size H/2 × W/2 × C′, F_t^2 has size H/4 × W/4 × 2C′, F_t^3 has size H/8 × W/8 × 4C′, F_t^4 has size H/16 × W/16 × 8C′, and C′ = 64; the multi-level image features output by the image feature extraction branch are denoted {F_p^1, F_p^2, F_p^3, F_p^4}, with F_p^i having the same size as F_t^i at each level. The text features F_t^i and the corresponding image features F_p^i are then fed into the four attention feature fusion submodules, which fuse the text and image features to give the fused multi-level trunk features {F_b^1, F_b^2, F_b^3, F_b^4}, with F_b^i having the same size as F_t^i at each level. A global average pooling operation is then applied to each F_b^i, and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F′_tp of the input image block, of size 1 × 1 × 15C′. The specific calculation formula is as follows:

F′_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))

where Concat(·) denotes feature concatenation and GAP(·) denotes the global average pooling operation. Finally, the fused multi-scale image-text feature F′_tp is fed into the channel attention submodule to capture the key information across channels, giving the final output feature F_tp of the adaptive feature extraction module, which is then flattened into a one-dimensional vector of size 1 × D (where D = 960).
Further, step S3 includes the steps of:
a local image information interaction module is designed, as shown in fig. 7, consisting of four fully connected layers and a Softmax function. The local image information interaction module uses a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block receives a different degree of attention and the local features of the image blocks are better aggregated. Specifically, the input feature of this module is denoted F_l, of size N × D (where N denotes the number of image blocks, N = 4, and D denotes the feature dimension of each image block, D = 960). First, the input feature F_l is fed into three fully connected layers to generate three new intermediate features F_Q, F_K and F_V, all of size N × D′ (where D′ denotes the second dimension of the intermediate features F_Q, F_K and F_V, D′ = 480); then a matrix multiplication is performed between F_Q and the transpose of F_K and a Softmax function is applied to generate the attention map A, of size N × N; the intermediate feature F_V is then matrix-multiplied with the attention map A to obtain the two-dimensional feature matrix S, of size N × D′; the feature matrix S is then fed into a fully connected layer to obtain the feature matrix F_s, of size N × D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection to give the final output F′_l of the local image information interaction module. The specific calculation formulas are as follows:

F_Q = Linear1(F_l)

F_K = Linear2(F_l)

F_V = Linear3(F_l)

A = Softmax(F_Q ⊗ Transpose(F_K))

S = A ⊗ F_V

F_s = Linear4(S)

F′_l = α ⊙ F_s ⊕ F_l

where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) denote the four fully connected layers, Softmax(·) denotes the Softmax function, Transpose(·) denotes the transpose of a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α denotes a learnable scale parameter used for fusion, and F′_l is the output feature of the local image information interaction module, of size N × D, the same size as the input feature F_l.
Further, step S4 includes the steps of:
step S41, designing a non-reference image quality evaluation network based on multi-region feature fusion, comprising the four adaptive feature extraction modules of step S25, the local image information interaction module of step S3 and a fully connected layer. The four image blocks I_train^1, I_train^2, I_train^3 and I_train^4 corresponding to each distorted screen content image in the training set obtained in step S13 serve as the input of the network, each of size H×W×3. Specifically, the four distorted image blocks are first fed into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block, the output feature of the ith adaptive feature extraction module for the ith input image block I_train^i being denoted F_i (i = 1, 2, 3, 4), of size 1 × D; the four one-dimensional output features F_i are then stacked into a two-dimensional feature matrix to obtain the initial fused feature F, of size N × D (N denotes the number of image blocks, N = 4); the initial fused feature F is then fed into the local image information interaction module to strengthen the information interaction between the image blocks, giving the final output feature F_out of the network, of size N × D, the same size as the initial fused feature F.
Step S42, a dimension transformation operation is performed on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose dimension changes from N×D to 1×C (where C = N×D). The flattened one-dimensional feature vector is then input into a fully connected layer, obtaining the quality evaluation score F_score of the distorted screen content image. The specific calculation formula is as follows:
F_score = Linear(Reshape(F_out))
where Linear represents a fully connected layer and Reshape represents a dimension transformation operation.
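Under the same assumptions, the whole network of steps S41–S42 can be sketched as below. LocalInteraction is the sketch shown earlier; AdaptiveFeatureExtractor is assumed to map one H×W×3 image block to a D-dimensional feature (a sketch of such a module accompanies claim 3 further below).

```python
import torch
import torch.nn as nn

class MRFFQualityNet(nn.Module):
    """Sketch of the no-reference quality network: four adaptive feature
    extractors, the local interaction module, and one scoring layer."""
    def __init__(self, dim=960, n_blocks=4):
        super().__init__()
        self.extractors = nn.ModuleList(AdaptiveFeatureExtractor() for _ in range(n_blocks))
        self.interaction = LocalInteraction(dim=dim)
        self.head = nn.Linear(n_blocks * dim, 1)      # Linear over the flattened 1 x C feature

    def forward(self, blocks):                        # blocks: (B, 4, 3, H, W)
        feats = torch.stack([ext(blocks[:, i]) for i, ext in enumerate(self.extractors)],
                            dim=1)                    # initial fusion feature F: (B, N, D)
        f_out = self.interaction(feats)               # F_out: (B, N, D)
        return self.head(f_out.flatten(1))            # Reshape + Linear -> F_score: (B, 1)
```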
Step S43, designing the loss function of the no-reference image quality evaluation network based on multi-region feature fusion, specifically as follows:
where n is the number of samples in the training set, y_i represents the true quality score of the i-th distorted screen content image, and ŷ_i represents the predicted quality score of the i-th distorted screen content image output by the network.
Step S44, repeating steps S41 to S43 in units of batches until the loss value calculated in step S43 converges and becomes stable; the network parameters are then saved, completing the training process of the no-reference image quality evaluation network based on multi-region feature fusion.
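The loss formula itself does not survive in this text, so the training sketch below assumes a plain L1 regression loss between y_i and ŷ_i, a common choice for quality-score regression; the optimizer, learning rate, epoch count and checkpoint name are likewise assumptions.

```python
import torch

def train(model, loader, epochs=50, lr=1e-4, device="cuda"):
    """Batch training loop for step S4 (hedged sketch)."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()                      # assumed loss: mean |y_i - y_hat_i|
    for _ in range(epochs):                          # repeat until the loss converges
        for blocks, y in loader:                     # blocks: (B, 4, 3, H, W); y: (B,) true scores
            y_hat = model(blocks.to(device)).squeeze(-1)
            loss = loss_fn(y_hat, y.to(device).float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    torch.save(model.state_dict(), "mrff_sciqa.pt")  # save the network parameters
```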
Further, step S5 is implemented as follows:
four image blocks corresponding to each distorted screen content image in the test set obtained in step S13And->And inputting the quality evaluation scores into a trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting corresponding quality evaluation scores.
The above is a preferred embodiment of the present invention; all changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided that the functional effects produced do not exceed the scope of the technical solution of the present invention.
Claims (6)
1. A no-reference screen content image quality assessment method based on multi-region feature fusion, characterized by comprising the following steps:
step S1, performing data preprocessing on the data in a distorted screen content image dataset: first cutting image blocks out of each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data enhancement on the data in the training set;
step S2, designing an adaptive feature extraction module, which can adaptively extract features of different scales from the text region and the image region in a distorted screen content image block, and fuses the text region features and the image region features based on an attention mechanism;
step S3, designing a local image information interaction module, which enhances the information interaction between any two image blocks in the distorted screen content image by introducing a self-attention mechanism, and assigns a different attention weight to each image block;
step S4, designing a no-reference image quality evaluation network based on multi-region feature fusion, and training it to obtain a no-reference screen content image quality evaluation model based on multi-region feature fusion;
step S5, inputting the distorted screen content image to be evaluated into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
2. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 1, wherein step S1 is specifically implemented as follows:
step S11, first cutting image blocks from each distorted screen content image I in the distorted screen content image dataset; specifically, each distorted screen content image I is divided into four areas (upper left, upper right, lower left and lower right), and an image block of size H×W is then randomly cut from each area, denoted I_1, I_2, I_3 and I_4 respectively, where H and W represent the height and width of the image block;
step S12, dividing the distorted screen content images in the distorted screen content image data set into a training set and a testing set according to a preset proportion;
step S13, applying a unified random horizontal flip and normalization to the four image blocks cut from each distorted screen content image I_train in the training set to complete the data enhancement operation, obtaining the distorted screen content image blocks I_1^train, I_2^train, I_3^train and I_4^train for training; applying the same normalization to the four image blocks cut from each distorted screen content image I_test in the test set, obtaining the distorted screen content image blocks I_1^test, I_2^test, I_3^test and I_4^test for testing;
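A minimal sketch of the cropping and enhancement of steps S11–S13, assuming PIL image inputs and torchvision; the normalization mean/std are left as parameters because the claim does not specify them.

```python
import random
import torchvision.transforms.functional as TF

def four_corner_crops(img, h, w):
    """Randomly crop one h x w patch from each quadrant of a PIL image (step S11);
    assumes every quadrant is at least h x w."""
    W_img, H_img = img.size
    quads = [((0, H_img // 2), (0, W_img // 2)),          # upper left
             ((0, H_img // 2), (W_img // 2, W_img)),      # upper right
             ((H_img // 2, H_img), (0, W_img // 2)),      # lower left
             ((H_img // 2, H_img), (W_img // 2, W_img))]  # lower right
    crops = []
    for (y0, y1), (x0, x1) in quads:
        top = random.randint(y0, y1 - h)
        left = random.randint(x0, x1 - w)
        crops.append(TF.crop(img, top, left, h, w))
    return crops

def enhance_for_training(crops, mean, std):
    """Unified horizontal flip plus normalization (step S13)."""
    if random.random() < 0.5:                  # one flip decision shared by all four crops
        crops = [TF.hflip(c) for c in crops]
    return [TF.normalize(TF.to_tensor(c), mean, std) for c in crops]
```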
3. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 2, wherein step S2 is specifically implemented as follows:
step S21, designing a text feature extraction submodule, which consists of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; a convolution layer with kernel size 3×3 is used to extract features from the text region in the distorted screen content image block. The feature map input to the text feature extraction submodule is denoted F_t, of size H_t×W_t×C_t, where H_t, W_t and C_t represent the height, width and number of channels of the input feature map F_t respectively. Specifically, the feature map F_t is first input sequentially into a convolution layer with kernel size 3×3, a batch normalization layer, a LeakyReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer for preliminary feature extraction, obtaining an intermediate feature map F'_t1 of dimension H_t×W_t×C'_t, where H_t, W_t and C'_t represent the height, width and number of channels of F'_t1 respectively; the input feature map F_t is then input sequentially into a convolution layer with kernel size 1×1 and a batch normalization layer for residual feature extraction, obtaining an intermediate feature map F'_t2 of dimension H_t×W_t×C'_t, the same dimension as F'_t1; finally, the intermediate feature maps F'_t1 and F'_t2 are added through a residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_t of the text feature extraction submodule, of dimension H_t×W_t×C'_t; the specific calculation formulas are as follows:
F'_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))
F'_t2 = BN(Conv3(F_t))
F'_t = LeakyReLU(F'_t1 ⊕ F'_t2)
where Conv1(·) represents the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) represent the two convolution layers with kernel size 1×1, ⊕ represents the matrix addition operation, LeakyReLU(·) represents the LeakyReLU activation function, and BN(·) represents the batch normalization operation;
step S22, designing an image feature extraction submodule, which consists of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; a convolution layer with kernel size 5×5 is used to extract features from the image region in the distorted screen content image block. The feature map input to the image feature extraction submodule is denoted F_p, of size H_p×W_p×C_p, where H_p, W_p and C_p represent the height, width and number of channels of the input feature map F_p respectively. Specifically, the feature map F_p is first input sequentially into a convolution layer with kernel size 5×5, a batch normalization layer, a LeakyReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer for preliminary feature extraction, obtaining an intermediate feature map F'_p1 of dimension H_p×W_p×C'_p, where H_p, W_p and C'_p represent the height, width and number of channels of F'_p1 respectively; the input feature map F_p is then input sequentially into a convolution layer with kernel size 1×1 and a batch normalization layer for residual feature extraction, obtaining an intermediate feature map F'_p2 of dimension H_p×W_p×C'_p, the same dimension as F'_p1; finally, the intermediate feature maps F'_p1 and F'_p2 are added through a residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_p of the image feature extraction submodule, of dimension H_p×W_p×C'_p; the specific calculation formulas are as follows:
F'_p1 = BN(Conv2(LeakyReLU(BN(Conv1'(F_p)))))
F'_p2 = BN(Conv3(F_p))
F'_p = LeakyReLU(F'_p1 ⊕ F'_p2)
where Conv1'(·) represents the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) represent the two convolution layers with kernel size 1×1, ⊕ represents the matrix addition operation, LeakyReLU(·) represents the LeakyReLU activation function, and BN(·) represents the batch normalization operation;
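Since steps S21 and S22 share the same residual structure and differ only in the first kernel size, a single hedged sketch can cover both; the class name and channel parameters are illustrative.

```python
import torch.nn as nn

class LocalFeatureBlock(nn.Module):
    """Residual block of steps S21/S22: kernel_size=3 gives the text feature
    extraction submodule, kernel_size=5 the image feature extraction submodule."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),  # Conv1 / Conv1'
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),                                     # Conv2
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1),                                      # Conv3
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))  # residual addition, then LeakyReLU
```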
step S23, designing an attention feature fusion submodule, which consists of four convolution layers with kernel size 1×1, a global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers; the attention feature fusion submodule learns to fuse text features and image features of different scales. The two features input to the attention feature fusion submodule are F'_t and F'_p, both of size H_a×W_a×C_a, where H_a, W_a and C_a represent the height, width and number of channels of the input feature maps F'_t and F'_p respectively. Specifically, the two input features are first added pixel by pixel to obtain an intermediate feature map F_b of size H_a×W_a×C_a. The intermediate feature map F_b is then input into a local attention extraction branch and a global attention extraction branch respectively for different attention feature extraction, where the local attention extraction branch is formed by sequentially connecting in series a convolution layer with kernel size 1×1, a batch normalization layer, a ReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer, and the global attention extraction branch is formed by sequentially connecting in series a global average pooling layer, a convolution layer with kernel size 1×1, a batch normalization layer, a ReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer. Denote the output feature of F_b after the local attention extraction branch F_local and the output feature after the global attention extraction branch F_global, both of size H_a×W_a×C_a. The features F_local and F_global are then added pixel by pixel, and the corresponding learnable weight λ is obtained through a Sigmoid function; finally, the learnable weight λ is used to weight and fuse the input features F'_t and F'_p, obtaining the final output F'_b of the attention feature fusion submodule, of size H_a×W_a×C_a; the specific calculation formulas are as follows:
F_b = F'_t ⊕ F'_p
F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))
F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))
λ = Sigmoid(F_local ⊕ F_global)
where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) represent the four convolution layers with kernel size 1×1, ⊕ represents the matrix addition operation, BN(·) represents the batch normalization operation, GAP(·) represents the global average pooling operation, ReLU(·) represents the ReLU activation function, Sigmoid(·) represents the Sigmoid activation function, and λ is the learnable weight output by the network;
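A sketch of this fusion step under two stated assumptions: the 1×1 convolutions use a bottleneck width (the claim does not give the intermediate channel count), and the learned weight λ fuses the inputs complementarily as λ·F'_t + (1−λ)·F'_p, which the prose implies but does not spell out.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention feature fusion submodule of step S23 (hedged sketch)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        mid = ch // reduction  # bottleneck width: an assumption, not stated in the claim
        def make_branch(use_gap):
            layers = [nn.AdaptiveAvgPool2d(1)] if use_gap else []
            layers += [nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                       nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch)]
            return nn.Sequential(*layers)
        self.local_branch = make_branch(use_gap=False)   # Conv_1a, Conv_2a path
        self.global_branch = make_branch(use_gap=True)   # GAP, Conv_3a, Conv_4a path

    def forward(self, f_t, f_p):
        f_b = f_t + f_p                                  # pixel-wise sum of the two inputs
        lam = torch.sigmoid(self.local_branch(f_b) + self.global_branch(f_b))  # weight lambda
        return lam * f_t + (1.0 - lam) * f_p             # assumed complementary weighted fusion
```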
step S24, designing a channel attention submodule for enhancing the feature representation and capturing the key feature channel information of the input features, which consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function; the feature map input to the channel attention submodule is denoted F_c, of size H_c×W_c×C_c, where H_c, W_c and C_c represent the height, width and number of channels of the input feature map F_c respectively. Specifically, the input feature F_c is first aggregated using a global average pooling operation; a dimension reduction is then performed through a convolution layer with kernel size 1×1, followed by a dimension restoration through another convolution layer with kernel size 1×1; the corresponding channel attention weights are then obtained through a Sigmoid function; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F'_c of the channel attention submodule, of size H_c×W_c×C_c, the same dimension as the input feature map F_c; the specific calculation formula is as follows:
F'_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c
where GAP(·) represents the global average pooling operation, Conv_1b(·) and Conv_2b(·) represent the two convolution layers with kernel size 1×1, ⊙ represents element-wise multiplication, Sigmoid(·) represents the Sigmoid activation function, and ReLU(·) represents the ReLU activation function;
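A squeeze-and-excitation-style sketch of this channel attention submodule; the channel reduction ratio is an assumption.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention submodule of step S24 (SE-style sketch);
    the reduction ratio is an assumption."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)               # aggregate spatial information
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1),           # Conv_1b: dimension reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),           # Conv_2b: dimension restoration
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.gap(x))                  # element-wise channel reweighting
```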
step S25, designing an adaptive feature extraction module, which comprises four of the text feature extraction submodules of step S21, four of the image feature extraction submodules of step S22, four of the attention feature fusion submodules of step S23, one of the channel attention submodules of step S24 and eight spatial average pooling layers with stride 2; the adaptive feature extraction module performs adaptive multi-scale feature extraction on the text features and image features of the input distorted screen content image block through a text feature extraction branch and an image feature extraction branch respectively, and fuses the different types of image features through an attention mechanism. Specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer;
denote the size of the input distorted screen content image block H×W×3; the distorted screen content image block is first input into the text feature extraction branch and the image feature extraction branch respectively to extract multi-scale text features and image features. Denote the multi-level text features output after the input image block passes through the text feature extraction branch F_t^(1), F_t^(2), F_t^(3), F_t^(4), where feature map F_t^(i) has size (H/2^i)×(W/2^i)×(2^(i-1)·C') with C'=64; the multi-level image features output after the image feature extraction branch are denoted F_p^(1), F_p^(2), F_p^(3), F_p^(4), where feature map F_p^(i) likewise has size (H/2^i)×(W/2^i)×(2^(i-1)·C') with C'=64. The multi-level text features F_t^(i) and the corresponding multi-level image features F_p^(i) are then input into the four attention feature fusion submodules respectively, fusing text features and image features to obtain the fused multi-level trunk features F_b^(1), F_b^(2), F_b^(3), F_b^(4), where feature map F_b^(i) has size (H/2^i)×(W/2^i)×(2^(i-1)·C') with C'=64. A global average pooling operation is then performed on each F_b^(i), and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F'_tp of the input image block, of size 1×15C'; the specific calculation formula is as follows:
F'_tp = Concat(GAP(F_b^(1)), GAP(F_b^(2)), GAP(F_b^(3)), GAP(F_b^(4)))
where Concat(·) represents the feature concatenation operation and GAP(·) represents the global average pooling operation. Finally, the fused multi-scale image-text feature F'_tp is input into the channel attention submodule to capture the key information among the different channels, obtaining the final output feature F_tp of the adaptive feature extraction module; the feature F_tp is then flattened into a one-dimensional vector of dimension 1×D, where D represents the dimension of each image block, D=960.
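Pulling the pieces together, here is a hedged sketch of the adaptive feature extraction module, reusing LocalFeatureBlock, AttentionFusion and ChannelAttention from the sketches above; the per-level channel widths C', 2C', 4C', 8C' are an assumption chosen so that their sum 15C' matches the stated 1×15C' output (C'=64, D=960).

```python
import torch
import torch.nn as nn

class AdaptiveFeatureExtractor(nn.Module):
    """Adaptive feature extraction module of step S25 (hedged sketch)."""
    def __init__(self, c=64):
        super().__init__()
        chans = [c, 2 * c, 4 * c, 8 * c]                  # assumed widths; sum is 15C' = 960
        ins = [3] + chans[:-1]
        self.text_blocks = nn.ModuleList(
            LocalFeatureBlock(i, o, kernel_size=3) for i, o in zip(ins, chans))
        self.image_blocks = nn.ModuleList(
            LocalFeatureBlock(i, o, kernel_size=5) for i, o in zip(ins, chans))
        self.pool = nn.AvgPool2d(2)                       # stride-2 spatial average pooling
        self.fusions = nn.ModuleList(AttentionFusion(ch) for ch in chans)
        self.channel_attn = ChannelAttention(sum(chans))  # applied to the 1 x 1 x 15C' feature

    def forward(self, x):                                 # x: (B, 3, H, W) image block
        t, p, trunk = x, x, []
        for tb, ib, fuse in zip(self.text_blocks, self.image_blocks, self.fusions):
            t = self.pool(tb(t))                          # next text-branch level
            p = self.pool(ib(p))                          # next image-branch level
            trunk.append(fuse(t, p))                      # fused multi-level trunk feature
        f = torch.cat([g.mean(dim=(2, 3)) for g in trunk], dim=1)  # GAP + concat: (B, 15C')
        return self.channel_attn(f[..., None, None]).flatten(1)    # F_tp flattened to (B, D)
```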
4. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 3, wherein step S3 is specifically implemented as follows:
designing a local image information interaction module, which consists of four fully connected layers and a Softmax function; the local image information interaction module adopts a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block is given a different degree of attention and the local features of each image block are better aggregated. Specifically, the input feature of the local image information interaction module is F_l, of size N×D, where N represents the number of image blocks, N=4, and D represents the dimension of each image block, D=960. First, the input feature F_l is input into three fully connected layers, generating three new intermediate features F_Q, F_K and F_V, each of dimension N×D', where D' represents the size of the second dimension of F_Q, F_K and F_V, D'=480; then, a matrix multiplication is performed between F_Q and the transpose of F_K and a Softmax function is applied, generating an attention map A with dimension N×N; the intermediate feature F_V is then matrix-multiplied with the attention map A to obtain a two-dimensional feature matrix S of dimension N×D'; the two-dimensional feature matrix S is then input into a fully connected layer to obtain a feature matrix F_s of dimension N×D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection, obtaining the final output F'_l of the local image information interaction module; the specific calculation formulas are as follows:
F_Q = Linear1(F_l)
F_K = Linear2(F_l)
F_V = Linear3(F_l)
A = Softmax(F_Q ⊗ Transpose(F_K))
S = A ⊗ F_V
F_s = Linear4(S)
F'_l = α·F_s ⊕ F_l
where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) represent the four fully connected layers, Softmax(·) represents the Softmax function, Transpose(·) represents the transpose operation on a two-dimensional matrix, ⊗ represents matrix multiplication, ⊕ represents the matrix addition operation, α represents a learnable scale parameter for fusion, and F'_l represents the output feature of the local image information interaction module, whose size N×D is the same as that of the input feature F_l.
5. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 4, wherein step S4 is specifically implemented as follows:
step S41, designing a no-reference image quality evaluation network based on multi-region feature fusion, wherein the network comprises the four adaptive feature extraction modules of step S25, the local image information interaction module of step S31 and a fully connected layer; the four image blocks I_1^train, I_2^train, I_3^train and I_4^train corresponding to each distorted screen content image in the training set obtained in step S13 are taken as the input of the network, with dimensions all H×W×3; specifically, the four image blocks are first input into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block; denoting the i-th input image block I_i^train, its output feature after passing through the i-th adaptive feature extraction module is F_i, with dimension 1×D; the four one-dimensional output features F_i are then concatenated into a two-dimensional feature vector to obtain an initial fusion feature F, whose dimension is N×D; the initial fusion feature F is then input into the local image information interaction module to strengthen the information interaction among the image blocks, obtaining the final output feature F_out of the network, of size N×D, the same dimension as the initial fusion feature F;
step S42, performing a dimension transformation operation on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose dimension changes from N×D to 1×C, where C = N×D; the flattened one-dimensional feature vector is then input into a fully connected layer to obtain the quality evaluation score F_score of the distorted screen content image; the specific calculation formula is as follows:
F_score = Linear(Reshape(F_out))
where Linear represents a fully connected layer and Reshape represents the dimension transformation operation;
step S43, designing the loss function of the no-reference image quality evaluation network based on multi-region feature fusion, specifically as follows:
where n is the number of samples in the training set, y_i represents the true quality score of the i-th distorted screen content image, and ŷ_i represents the predicted quality score of the i-th distorted screen content image output by the network;
and S44, repeating the steps S41 to S43 by taking a batch as a unit until the loss value calculated in the step S43 converges and tends to be stable, saving network parameters, and completing the training process of the non-reference image quality evaluation network based on multi-region feature fusion.
6. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 2, wherein step S5 is specifically implemented as follows:
the four image blocks I_1^test, I_2^test, I_3^test and I_4^test corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation scores are output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310398032.0A CN116403063A (en) | 2023-04-14 | 2023-04-14 | No-reference screen content image quality assessment method based on multi-region feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116403063A true CN116403063A (en) | 2023-07-07 |
Family
ID=87010188
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116738325A (en) * | 2023-08-16 | 2023-09-12 | Hubei University of Technology | Method and system for identifying lower limb exoskeleton movement pattern based on DenseNet-LSTM network |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 