CN116403063A - No-reference screen content image quality assessment method based on multi-region feature fusion - Google Patents

No-reference screen content image quality assessment method based on multi-region feature fusion

Info

Publication number
CN116403063A
Authority
CN
China
Prior art keywords
feature
image
screen content
input
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310398032.0A
Other languages
Chinese (zh)
Inventor
陈羽中
陈友昆
牛玉贞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310398032.0A priority Critical patent/CN116403063A/en
Publication of CN116403063A publication Critical patent/CN116403063A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/08 Learning methods
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The invention relates to a no-reference screen content image quality assessment method based on multi-region feature fusion, comprising the following steps: performing data preprocessing on the data in a distorted screen content image dataset; designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text region features with the image region features based on an attention mechanism; designing a local image information interaction module that enhances the information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block; designing a no-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a no-reference screen content image quality evaluation model; and inputting the distorted screen content image to be evaluated into the trained no-reference screen content image quality evaluation model and outputting the corresponding quality evaluation score.

Description

No-reference screen content image quality assessment method based on multi-region feature fusion
Technical Field
The invention relates to the field of image processing and computer vision, in particular to a non-reference screen content image quality assessment method based on multi-region feature fusion.
Background
In recent years, with the rapid development of mobile devices, multimedia applications and information dissemination technologies, remote computing and communication have become widespread, and screen content images generated in real time by various electronic terminal devices appear ever more frequently in daily life. A screen content image is a new type of image that contains more text, icons, graphics and special structural layout information than a traditional natural image and has distinct statistical characteristics: the number of sharp edges far exceeds that of a natural image, while the color information is sparser. During the encoding, compression and transmission of screen content images, various degrees of distortion are inevitably introduced by technical or hardware limitations, which reduces image quality and in turn degrades user experience and system interaction performance. Given the demand for clear, high-quality images, an image quality evaluation method that can effectively assess the perceived quality of screen content images is needed; it can serve both as auxiliary reference information for image restoration and enhancement techniques and as a feasible basis for designing and optimizing advanced image/video processing algorithms.
Image quality evaluation methods can be divided into subjective and objective evaluation according to the evaluating subject. Subjective evaluation means that image quality is scored by people, who are the final recipients of the image information; however, it is easily affected by the subjects' own subjective awareness, and in an era of explosive data growth, subjectively evaluating tens of thousands of images is unrealistic, so it is difficult to satisfy the needs of real-time applications. Objective image quality evaluation extracts and analyzes relevant features of the distorted image, builds a corresponding mathematical model and computes a quality evaluation score for the distorted image. Compared with subjective evaluation, this process does not require a large number of subjects to score images; a computer takes the place of the human visual system to achieve automatic and efficient quality evaluation, so it can meet application requirements in the big-data era. According to how much reference image information is required during perceptual quality evaluation, objective methods can be divided into full-reference, reduced-reference and no-reference methods, whose dependence on the reference image decreases in that order. In practice, an undistorted reference image is often difficult to obtain, so no-reference image quality evaluation methods are more practical and have broader development prospects.
With the continuous development of deep learning, many screen content image quality evaluation models based on convolutional neural networks have appeared in recent years. Deep learning is a data-driven approach, while existing screen content image datasets usually contain only a small number of distorted images, so data enhancement is mainly achieved by partitioning each screen content image into blocks. However, a single image block cannot fully characterize the quality of the whole distorted image, and because screen content images contain large text regions and image regions at the same time, image blocks with different content can show large quality differences even under the same distortion. Therefore, the different influence of image regions and text regions on the overall visual quality must be fully considered, and a weighting strategy designed accordingly, so that the features of image regions and text regions are fused effectively and the accuracy of the no-reference screen content image quality evaluation model is further improved.
Disclosure of Invention
The invention aims to provide a no-reference screen content image quality assessment method based on multi-region feature fusion. Following the idea of multi-region local feature fusion, the features of several local image blocks are fused to represent the overall quality of the image, which reduces the quality-score deviation caused by training with a single image block. In addition, the convolution layers of the convolutional neural network use convolution kernels of different sizes so that the different characteristics of text regions and image regions in a screen content image are extracted more effectively, an attention mechanism fuses the text features and image features of different scales, and each image block is assigned a different attention weight. Overall, the method achieves higher consistency between objective predictions and subjective visual perception than other methods.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a multi-region feature fusion-based reference-free screen content image quality assessment method comprises the following steps:
Step S1: data preprocessing is carried out on the data in the distorted screen content image dataset: first, image blocks are cropped from each distorted screen content image; then the dataset is divided into a training set and a test set; finally, data enhancement is applied to the data in the training set;
Step S2: an adaptive feature extraction module is designed, which adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text region features with the image region features based on an attention mechanism;
Step S3: a local image information interaction module is designed, which enhances the information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block;
Step S4: a no-reference image quality evaluation network based on multi-region feature fusion is designed and trained to obtain a no-reference screen content image quality evaluation model based on multi-region feature fusion;
Step S5: the distorted screen content image to be evaluated is input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation score is output.
In an embodiment of the present invention, the step S1 is specifically implemented as follows:
Step S11: first, image blocks are cropped from each distorted screen content image I in the distorted screen content image dataset. Specifically, each distorted screen content image I is divided into four areas (upper left, upper right, lower left and lower right), and an image block of size H×W is randomly cropped from each area, denoted I_1, I_2, I_3 and I_4 respectively, where H and W are the height and width of the image block;
Step S12: the distorted screen content images in the distorted screen content image dataset are divided into a training set and a test set according to a preset proportion;
Step S13: for each distorted screen content image I_train in the training set, the four cropped image blocks are jointly subjected to the same random horizontal flip and normalization to complete the data enhancement, giving the training distorted screen content image blocks I_1^train, I_2^train, I_3^train and I_4^train; for each distorted screen content image I_test in the test set, the four cropped image blocks are subjected to the same normalization, giving the test distorted screen content image blocks I_1^test, I_2^test, I_3^test and I_4^test.
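The following is a minimal preprocessing sketch for step S1, written with PyTorch/torchvision. The patch size, normalization statistics and flip probability used here are illustrative assumptions rather than values fixed by the method description; only the structure (one random crop per quadrant, a flip decision shared by all four blocks, then normalization) follows the steps above.

    import random
    import torchvision.transforms.functional as TF

    def crop_four_patches(img, patch_h=224, patch_w=224):
        # img: CxHxW tensor of one distorted screen content image
        _, h, w = img.shape
        quadrants = [(0, 0), (0, w // 2), (h // 2, 0), (h // 2, w // 2)]
        patches = []
        for top0, left0 in quadrants:
            # random position inside the quadrant (assumes each quadrant >= patch size)
            top = top0 + random.randint(0, h // 2 - patch_h)
            left = left0 + random.randint(0, w // 2 - patch_w)
            patches.append(img[:, top:top + patch_h, left:left + patch_w])
        return patches  # [I1, I2, I3, I4]

    def augment_training_patches(patches, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
        # one shared flip decision so the four blocks of an image stay consistent
        if random.random() < 0.5:
            patches = [TF.hflip(p) for p in patches]
        return [TF.normalize(p, mean, std) for p in patches]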
In an embodiment of the present invention, the step S2 is specifically implemented as follows:
Step S21: design a text feature extraction submodule, which consists of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. The 3×3 convolution layer extracts features from the text regions of the distorted screen content image block. Denote the feature map input to the text feature extraction submodule as F_t, of size H_t×W_t×C_t, where H_t, W_t and C_t are the height, width and number of channels of F_t. Specifically, F_t is first passed through a 3×3 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, giving the intermediate feature map F'_t1 of size H_t×W_t×C'_t, where H_t, W_t and C'_t are its height, width and number of channels. The input F_t is also passed through a 1×1 convolution layer and a batch normalization layer for residual feature extraction, giving the intermediate feature map F'_t2 of size H_t×W_t×C'_t, the same size as F'_t1. Finally, F'_t1 and F'_t2 are added through the residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_t of the text feature extraction submodule, of size H_t×W_t×C'_t. The computation is:

F'_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))
F'_t2 = BN(Conv3(F_t))
F'_t = LeakyReLU(F'_t1 ⊕ F'_t2)

where Conv1(·) is the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) are the two convolution layers with kernel size 1×1, ⊕ denotes element-wise (matrix) addition, LeakyReLU(·) is the LeakyReLU activation function, and BN(·) is batch normalization;
Step S22: design an image feature extraction submodule, which consists of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. The 5×5 convolution layer extracts features from the image regions of the distorted screen content image block. Denote the feature map input to the image feature extraction submodule as F_p, of size H_p×W_p×C_p, where H_p, W_p and C_p are the height, width and number of channels of F_p. Specifically, F_p is first passed through a 5×5 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, giving the intermediate feature map F'_p1 of size H_p×W_p×C'_p, where H_p, W_p and C'_p are its height, width and number of channels. The input F_p is also passed through a 1×1 convolution layer and a batch normalization layer for residual feature extraction, giving the intermediate feature map F'_p2 of size H_p×W_p×C'_p, the same size as F'_p1. Finally, F'_p1 and F'_p2 are added through the residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_p of the image feature extraction submodule, of size H_p×W_p×C'_p. The computation is:

F'_p1 = BN(Conv2(LeakyReLU(BN(Conv1'(F_p)))))
F'_p2 = BN(Conv3(F_p))
F'_p = LeakyReLU(F'_p1 ⊕ F'_p2)

where Conv1'(·) is the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) are the two convolution layers with kernel size 1×1, ⊕ denotes element-wise (matrix) addition, LeakyReLU(·) is the LeakyReLU activation function, and BN(·) is batch normalization;
Step S23: design an attention feature fusion submodule, which consists of four convolution layers with kernel size 1×1, one global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers. The attention feature fusion submodule learns to fuse text features and image features of different scales. Denote the two input features as F'_t and F'_p, both of size H_a×W_a×C_a, where H_a, W_a and C_a are the height, width and number of channels. Specifically, the two input features are first added pixel by pixel to obtain the intermediate feature map F_b of size H_a×W_a×C_a. F_b is then fed into a local attention extraction branch and a global attention extraction branch for different kinds of attention feature extraction: the local attention extraction branch is a series connection of a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer; the global attention extraction branch is a series connection of a global average pooling layer, a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer. Denote the output of the local attention extraction branch as F_local and the output of the global attention extraction branch as F_global, both of size H_a×W_a×C_a. F_local and F_global are then added pixel by pixel, and the corresponding learnable weight λ is obtained through a Sigmoid function. Finally, the input features F'_t and F'_p are fused with the weight λ to obtain the final output F'_b of the attention feature fusion submodule, of size H_a×W_a×C_a. The computation is:

F_b = F'_t ⊕ F'_p
F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))
F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))
λ = Sigmoid(F_local ⊕ F_global)
F'_b = λ ⊗ F'_t ⊕ (1 − λ) ⊗ F'_p

where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) are the four convolution layers with kernel size 1×1, ⊕ denotes matrix addition, ⊗ denotes element-wise multiplication, BN(·) is batch normalization, GAP(·) is global average pooling, ReLU(·) and Sigmoid(·) are the ReLU and Sigmoid activation functions, and λ is the learnable weight output by the network;
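A minimal sketch of the attention feature fusion submodule in step S23 is given below. The bottleneck (reduction) ratio inside the two branches and the (1 − λ) weighting of the image feature are assumptions consistent with, but not spelled out by, the weighted-fusion description above.

    import torch
    import torch.nn as nn

    class AttentionFeatureFusion(nn.Module):
        def __init__(self, channels, reduction=4):
            super().__init__()
            mid = channels // reduction
            # local attention branch: 1x1 conv -> BN -> ReLU -> 1x1 conv -> BN
            self.local_branch = nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
            )
            # global attention branch: GAP -> 1x1 conv -> BN -> ReLU -> 1x1 conv -> BN
            self.global_branch = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
            )

        def forward(self, f_t, f_p):
            f_b = f_t + f_p                                   # pixel-wise sum of the two inputs
            lam = torch.sigmoid(self.local_branch(f_b) + self.global_branch(f_b))
            return lam * f_t + (1.0 - lam) * f_p              # weighted fusion of text/image features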
Step S24: design a channel attention submodule for enhancing the feature representation and capturing the key channel information of the input feature. The submodule consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function. Denote the feature map input to the channel attention submodule as F_c, of size H_c×W_c×C_c, where H_c, W_c and C_c are the height, width and number of channels of F_c. Specifically, the input feature F_c is first aggregated by a global average pooling operation, then reduced in dimension by a 1×1 convolution layer and raised back by another 1×1 convolution layer, and the corresponding channel attention weight is obtained through a Sigmoid function. Finally, the channel attention weight is multiplied element-wise with the input feature F_c to obtain the final output F'_c of the channel attention submodule, of size H_c×W_c×C_c, the same size as the input feature map F_c. The computation is:

F'_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c

where GAP(·) is global average pooling, Conv_1b(·) and Conv_2b(·) are the two convolution layers with kernel size 1×1, ⊙ denotes element-wise multiplication, Sigmoid(·) is the Sigmoid activation function and ReLU(·) is the ReLU activation function;
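A minimal sketch of the channel attention submodule in step S24 (a squeeze-and-excitation style gate); the reduction ratio is an illustrative assumption.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)                  # GAP over the spatial dimensions
            self.fc = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),   # dimension reduction
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),   # dimension restoration
            )

        def forward(self, x):
            weight = torch.sigmoid(self.fc(self.pool(x)))        # per-channel attention weights
            return weight * x                                    # element-wise re-weighting of the input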
Step S25: design an adaptive feature extraction module, which comprises four of the text feature extraction submodules of step S21, four of the image feature extraction submodules of step S22, four of the attention feature fusion submodules of step S23, one of the channel attention submodules of step S24 and eight spatial average pooling layers with stride 2. The adaptive feature extraction module performs adaptive multi-scale feature extraction on the text features and image features of the input distorted screen content image block through a text feature extraction branch and an image feature extraction branch respectively, and fuses the different types of image features through the attention mechanism. Specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer.

Denote the size of the input distorted screen content image block as H×W×3. The image block is first fed into the text feature extraction branch and the image feature extraction branch to extract multi-scale text features and image features. Denote the multi-level text features output by the text feature extraction branch as F_t^1, F_t^2, F_t^3 and F_t^4, where F_t^1 has size H/2×W/2×C', F_t^2 has size H/4×W/4×2C', F_t^3 has size H/8×W/8×4C', F_t^4 has size H/16×W/16×8C', and C' = 64. Denote the multi-level image features output by the image feature extraction branch as F_p^1, F_p^2, F_p^3 and F_p^4, where F_p^1 has size H/2×W/2×C', F_p^2 has size H/4×W/4×2C', F_p^3 has size H/8×W/8×4C', F_p^4 has size H/16×W/16×8C', and C' = 64. The multi-level text features F_t^1...F_t^4 and the corresponding multi-level image features F_p^1...F_p^4 are then fed into the four attention feature fusion submodules, which fuse the text features and image features to obtain the fused multi-level trunk features F_b^1, F_b^2, F_b^3 and F_b^4, where F_b^1 has size H/2×W/2×C', F_b^2 has size H/4×W/4×2C', F_b^3 has size H/8×W/8×4C', F_b^4 has size H/16×W/16×8C', and C' = 64. Global average pooling is then applied to F_b^1, F_b^2, F_b^3 and F_b^4 separately, and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F'_tp of the input image block, of size 1×1×15C'. The computation is:

F'_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))

where Concat(·) denotes feature concatenation and GAP(·) denotes global average pooling. Finally, the fused multi-scale image-text feature F'_tp is fed into the channel attention submodule to capture the key information among the different channels, giving the final output feature F_tp of the adaptive feature extraction module; F_tp is then flattened into a one-dimensional vector of size 1×D, where D is the feature dimension of each image block and D = 15C' = 960.
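A minimal sketch of how the adaptive feature extraction module of step S25 can be assembled from the submodules sketched above (RegionFeatureBlock, AttentionFeatureFusion, ChannelAttention). The per-level channel doubling C', 2C', 4C', 8C' matches the 15C' = 960 output dimension stated above, but the exact stage widths remain an assumption.

    import torch
    import torch.nn as nn

    class AdaptiveFeatureExtractor(nn.Module):
        def __init__(self, base_ch=64):
            super().__init__()
            chs = [base_ch, 2 * base_ch, 4 * base_ch, 8 * base_ch]   # C', 2C', 4C', 8C'
            ins = [3] + chs[:-1]
            self.text_blocks = nn.ModuleList([RegionFeatureBlock(i, o, 3) for i, o in zip(ins, chs)])
            self.image_blocks = nn.ModuleList([RegionFeatureBlock(i, o, 5) for i, o in zip(ins, chs)])
            self.pool = nn.AvgPool2d(2)                               # spatial average pooling, stride 2
            self.fusions = nn.ModuleList([AttentionFeatureFusion(c) for c in chs])
            self.channel_attn = ChannelAttention(sum(chs))            # 15C' = 960 channels
            self.gap = nn.AdaptiveAvgPool2d(1)

        def forward(self, x):                                         # x: (B, 3, H, W) image block
            f_t, f_p, fused = x, x, []
            for tb, ib, fu in zip(self.text_blocks, self.image_blocks, self.fusions):
                f_t = self.pool(tb(f_t))                              # multi-scale text features
                f_p = self.pool(ib(f_p))                              # multi-scale image features
                fused.append(self.gap(fu(f_t, f_p)))                  # fuse text/image, then GAP per level
            f_tp = self.channel_attn(torch.cat(fused, dim=1))         # concat to 1x1x15C', channel attention
            return f_tp.flatten(1)                                    # one 960-d vector per image block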
In an embodiment of the present invention, the step S3 is specifically implemented as follows:
Design a local image information interaction module, which consists of four fully connected layers and a Softmax function. The local image information interaction module uses a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block receives a different degree of attention and the local features of the image blocks are better aggregated. Specifically, denote the input feature of the local image information interaction module as F_l, of size N×D, where N is the number of image blocks, N = 4, and D is the feature dimension of each image block, D = 960. First, the input feature F_l is fed into three fully connected layers, producing three new intermediate features F_Q, F_K and F_V, all of size N×D', where D' is the dimension of the second axis of F_Q, F_K and F_V, D' = 480. Next, a matrix multiplication is performed between F_Q and the transpose of F_K, and a Softmax function is applied to generate the attention map A of size N×N. The intermediate feature F_V is then matrix-multiplied with the attention map A to obtain the two-dimensional feature matrix S of size N×D'. The two-dimensional feature matrix S is fed into a fully connected layer to obtain the feature matrix F_s of size N×D. Finally, F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection to obtain the final output F'_l of the local image information interaction module. The computation is:

F_Q = Linear1(F_l)
F_K = Linear2(F_l)
F_V = Linear3(F_l)
A = Softmax(F_Q ⊗ Transpose(F_K))
S = A ⊗ F_V
F_s = Linear4(S)
F'_l = α · F_s ⊕ F_l

where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) are the four fully connected layers, Softmax(·) is the Softmax function, Transpose(·) is the transpose of a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α is a learnable scale parameter used for fusion, and F'_l is the output feature of the local image information interaction module, of size N×D, the same size as the input feature F_l.
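A minimal sketch of the local image information interaction module of step S3, operating on the N = 4 block features of dimension D = 960 with inner dimension D' = 480; the zero initialization of the scale parameter α is an assumption.

    import torch
    import torch.nn as nn

    class LocalInteraction(nn.Module):
        def __init__(self, dim=960, inner_dim=480):
            super().__init__()
            self.q = nn.Linear(dim, inner_dim)          # Linear1
            self.k = nn.Linear(dim, inner_dim)          # Linear2
            self.v = nn.Linear(dim, inner_dim)          # Linear3
            self.out = nn.Linear(inner_dim, dim)        # Linear4
            self.alpha = nn.Parameter(torch.zeros(1))   # learnable scale parameter

        def forward(self, f_l):                         # f_l: (B, N, D) stacked block features
            q, k, v = self.q(f_l), self.k(f_l), self.v(f_l)
            attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)   # (B, N, N) attention map A
            s = attn @ v                                            # (B, N, D') aggregated features
            return self.alpha * self.out(s) + f_l                   # residual connection with scale alpha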
In an embodiment of the present invention, the step S4 is specifically implemented as follows:
Step S41: design a no-reference image quality evaluation network based on multi-region feature fusion, which comprises four of the adaptive feature extraction modules of step S25, the local image information interaction module of step S3 and a fully connected layer. The four image blocks I_1^train, I_2^train, I_3^train and I_4^train corresponding to each distorted screen content image in the training set obtained in step S13 are taken as the input of the network, each of size H×W×3. Specifically, the four image blocks are first fed into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block; the output feature of the i-th adaptive feature extraction module for the i-th input image block I_i^train is denoted F_i, of size 1×D. The four one-dimensional output features F_i are then stacked into a two-dimensional feature matrix to obtain the initial fusion feature F of size N×D. The initial fusion feature F is then fed into the local image information interaction module to enhance the information interaction among the image blocks, giving the final output feature F_out of the network, of size N×D, the same size as the initial fusion feature F;
Step S42: a dimension transformation is applied to the network output feature F_out obtained in step S41, flattening it into a one-dimensional feature vector whose size changes from N×D to 1×C, where C = N×D. The flattened one-dimensional feature vector is then fed into a fully connected layer to obtain the quality evaluation score F_score of the distorted screen content image. The computation is:

F_score = Linear(Reshape(F_out))

where Linear(·) denotes a fully connected layer and Reshape(·) denotes the dimension transformation operation;
Step S43: design the loss function of the no-reference image quality evaluation network based on multi-region feature fusion: the loss is averaged over the n samples of the training set and measures the error between the predicted quality score ŷ_i output by the network for the i-th distorted screen content image and its true quality score y_i;
Step S44: steps S41 to S43 are repeated batch by batch until the loss value calculated in step S43 converges and becomes stable; the network parameters are then saved, completing the training of the no-reference image quality evaluation network based on multi-region feature fusion.
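A minimal end-to-end sketch of the evaluation network of steps S41-S42 and its training loop (steps S43-S44), built on the module sketches above. The exact regression loss is not reproduced in this text, so an L1 loss between predicted and ground-truth quality scores, together with the Adam optimizer and learning rate, are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class MRFFQualityNet(nn.Module):
        def __init__(self, n_blocks=4, dim=960):
            super().__init__()
            self.extractors = nn.ModuleList([AdaptiveFeatureExtractor() for _ in range(n_blocks)])
            self.interaction = LocalInteraction(dim)
            self.head = nn.Linear(n_blocks * dim, 1)               # flatten NxD -> 1xC, then score

        def forward(self, blocks):                                  # blocks: list of 4 (B, 3, H, W) tensors
            feats = [ext(b) for ext, b in zip(self.extractors, blocks)]
            f = torch.stack(feats, dim=1)                           # (B, N, D) initial fusion feature
            f_out = self.interaction(f)                             # local image information interaction
            return self.head(f_out.flatten(1)).squeeze(-1)          # one quality score per image

    model = MRFFQualityNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)       # optimizer/lr are assumptions
    criterion = nn.L1Loss()                                         # assumed regression loss

    def train_one_epoch(loader):
        model.train()
        for blocks, mos in loader:                                  # blocks: 4 patches, mos: subjective score
            pred = model(blocks)
            loss = criterion(pred, mos)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()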
In an embodiment of the present invention, the step S5 is specifically implemented as follows:
The four image blocks I_1^test, I_2^test, I_3^test and I_4^test corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation scores are output.
Compared with the prior art, the invention has the following beneficial effects: the invention addresses the quality-score deviation caused when a convolutional-neural-network-based image quality evaluation model is trained on blocks of a distorted screen content image, as well as the problem of extracting and fusing features from the different types of regions in a screen content image. To this end, the invention provides a no-reference screen content image quality evaluation method based on multi-region feature fusion, which represents the overall quality of an image by fusing the local features of several regions of the distorted image and performs adaptive feature extraction with convolution kernels of different sizes for the different types of regions. In addition, the method uses an attention mechanism to adaptively fuse the different types of image features and enhances the information interaction among the local image blocks, thereby effectively improving the accuracy of the no-reference screen content image quality evaluation model.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the invention.
Fig. 2 is a diagram illustrating a network model structure according to an embodiment of the present invention.
Fig. 3 is a block diagram of a text feature extraction submodule according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of an image feature extraction submodule according to an embodiment of the present invention.
FIG. 5 is a structural diagram of an attention feature fusion submodule according to an embodiment of the present invention.
Fig. 6 is a block diagram of an adaptive feature extraction module according to an embodiment of the invention.
Fig. 7 is a block diagram of a local image information interaction module according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
The invention provides a multi-region feature fusion-based reference-free screen content image quality assessment method, which is shown in fig. 1 and comprises the following steps:
Step S1: data preprocessing is carried out on the data in the distorted screen content image dataset: first, image blocks are cropped from each distorted screen content image; then the dataset is divided into a training set and a test set; finally, data enhancement is applied to the data in the training set;
Step S2: an adaptive feature extraction module is designed, which adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text region features with the image region features based on an attention mechanism;
Step S3: a local image information interaction module is designed, which enhances the information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block;
Step S4: a no-reference image quality evaluation network based on multi-region feature fusion is designed and trained to obtain a no-reference screen content image quality evaluation model;
Step S5: the distorted screen content image to be evaluated is input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation score is output.
FIG. 2 is a diagram of a network model constructed by the method of the present invention.
Further, step S1 includes the steps of:
Step S11: first, image blocks are cropped from each distorted image I in the distorted screen content image dataset. Specifically, each distorted image I is divided into four areas (upper left, upper right, lower left and lower right), and an image block of size H×W is randomly cropped from each area, denoted I_1, I_2, I_3 and I_4, where H and W are the height and width of the image block;
Step S12: the images in the distorted screen content image dataset are divided into a training set and a test set according to a certain proportion;
Step S13: for each distorted screen content image I_train in the training set, the four cropped image blocks are jointly subjected to the same random horizontal flip and normalization to complete the data enhancement, giving the training distorted screen content image blocks I_1^train, I_2^train, I_3^train and I_4^train; for each distorted screen content image I_test in the test set, the four cropped image blocks are subjected to the same normalization, giving the test distorted screen content image blocks I_1^test, I_2^test, I_3^test and I_4^test.
further, step S2 includes the steps of:
Step S21: a text feature extraction submodule is designed, as shown in fig. 3; the submodule consists of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. Since a smaller convolution kernel can better capture local detail information of an image, such as characters and lines, a convolution layer with kernel size 3×3 is used to extract features from the text regions of the image. Denote the feature map input to this module as F_t, of size H_t×W_t×C_t, where H_t, W_t and C_t are the height, width and number of channels of F_t. Specifically, F_t is first passed through a 3×3 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, giving the intermediate feature map F'_t1 of size H_t×W_t×C'_t, where H_t, W_t and C'_t are its height, width and number of channels. The input F_t is also passed through a 1×1 convolution layer and a batch normalization layer for residual feature extraction, giving the intermediate feature map F'_t2 of size H_t×W_t×C'_t, the same size as F'_t1. Finally, F'_t1 and F'_t2 are added through the residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_t of the text feature extraction submodule, of size H_t×W_t×C'_t. The computation is:

F'_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))
F'_t2 = BN(Conv3(F_t))
F'_t = LeakyReLU(F'_t1 ⊕ F'_t2)

where Conv1(·) is the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) are the two convolution layers with kernel size 1×1, ⊕ denotes element-wise (matrix) addition, LeakyReLU(·) is the LeakyReLU activation function, and BN(·) is batch normalization;
In step S22, an image feature extraction submodule is designed, as shown in fig. 4; the submodule consists of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. Since a larger convolution kernel is better suited to capturing visual features of an image under a larger receptive field, such as its overall texture and color, a convolution layer with kernel size 5×5 is used to extract features from the image regions. Denote the feature map input to this module as F_p, of size H_p×W_p×C_p, where H_p, W_p and C_p are the height, width and number of channels of F_p. Specifically, F_p is first passed through a 5×5 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, giving the intermediate feature map F'_p1 of size H_p×W_p×C'_p, where H_p, W_p and C'_p are its height, width and number of channels. The input F_p is also passed through a 1×1 convolution layer and a batch normalization layer for residual feature extraction, giving the intermediate feature map F'_p2 of size H_p×W_p×C'_p, the same size as F'_p1. Finally, F'_p1 and F'_p2 are added through the residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_p of the image feature extraction submodule, of size H_p×W_p×C'_p. The computation is:

F'_p1 = BN(Conv2(LeakyReLU(BN(Conv1'(F_p)))))
F'_p2 = BN(Conv3(F_p))
F'_p = LeakyReLU(F'_p1 ⊕ F'_p2)

where Conv1'(·) is the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) are the two convolution layers with kernel size 1×1, ⊕ denotes element-wise (matrix) addition, LeakyReLU(·) is the LeakyReLU activation function, and BN(·) is batch normalization;
In step S23, an attention feature fusion submodule is designed, as shown in fig. 5; the submodule consists of four convolution layers with kernel size 1×1, one global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers. The attention feature fusion submodule learns to effectively fuse text features and image features of different scales, so that the model focuses more on the key information of the feature maps and its generalization performance is improved. Denote the two feature maps input to this module as F'_t and F'_p, both of size H_a×W_a×C_a, where H_a, W_a and C_a are the height, width and number of channels. Specifically, the two input feature maps are first added pixel by pixel to obtain the intermediate feature map F_b of size H_a×W_a×C_a. F_b is then fed into a local attention extraction branch and a global attention extraction branch for different kinds of attention feature extraction: the local attention extraction branch is a series connection of a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer; the global attention extraction branch is a series connection of a global average pooling layer, a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer. Denote the output of the local attention extraction branch as F_local and the output of the global attention extraction branch as F_global, both of size H_a×W_a×C_a. F_local and F_global are then added pixel by pixel, and the corresponding learnable weight λ is obtained through a Sigmoid function. Finally, the weight λ is used to fuse the input feature maps F'_t and F'_p by weighting, giving the final output F'_b of the attention feature fusion submodule, of size H_a×W_a×C_a. The computation is:

F_b = F'_t ⊕ F'_p
F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))
F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))
λ = Sigmoid(F_local ⊕ F_global)
F'_b = λ ⊗ F'_t ⊕ (1 − λ) ⊗ F'_p

where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) are the four convolution layers with kernel size 1×1, ⊕ denotes matrix addition, ⊗ denotes element-wise multiplication, BN(·) is batch normalization, GAP(·) is global average pooling, ReLU(·) and Sigmoid(·) are the ReLU and Sigmoid activation functions, and λ is the learnable weight output by the network;
Step S24: a channel attention submodule is designed for enhancing the feature representation and capturing the key channel information of the input feature. The module consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function. Denote the feature map input to this module as F_c, of size H_c×W_c×C_c, where H_c, W_c and C_c are the height, width and number of channels of F_c. Specifically, the input feature F_c is first aggregated by a global average pooling operation, then reduced in dimension by a 1×1 convolution layer and raised back by another 1×1 convolution layer, and the corresponding channel attention weight is obtained through a Sigmoid function. Finally, the channel attention weight is multiplied element-wise with the input feature F_c to obtain the final output F'_c of the channel attention submodule, of size H_c×W_c×C_c, the same size as the input feature map F_c. The computation is:

F'_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c

where GAP(·) is global average pooling, Conv_1b(·) and Conv_2b(·) are the two convolution layers with kernel size 1×1, ⊙ denotes element-wise multiplication, Sigmoid(·) is the Sigmoid activation function and ReLU(·) is the ReLU activation function;
Step S25: an adaptive feature extraction module is designed, as shown in fig. 6; the module comprises four of the text feature extraction submodules of step S21, four of the image feature extraction submodules of step S22, four of the attention feature fusion submodules of step S23, one of the channel attention submodules of step S24 and eight spatial average pooling layers with stride 2. The adaptive feature extraction module performs adaptive multi-scale feature extraction on the text features and image features of the input screen content image block through a text feature extraction branch and an image feature extraction branch respectively, and effectively fuses the different types of image features through the attention mechanism, thereby improving the modeling capability of the model. Specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer.
Denote the size of the input screen content image block as H×W×3. The image block is first fed into the text feature extraction branch and the image feature extraction branch to extract multi-scale text features and image features. Denote the multi-level text features output by the text feature extraction branch as F_t^1, F_t^2, F_t^3 and F_t^4, where F_t^1 has size H/2×W/2×C', F_t^2 has size H/4×W/4×2C', F_t^3 has size H/8×W/8×4C', F_t^4 has size H/16×W/16×8C', and C' = 64. Denote the multi-level image features output by the image feature extraction branch as F_p^1, F_p^2, F_p^3 and F_p^4, where F_p^1 has size H/2×W/2×C', F_p^2 has size H/4×W/4×2C', F_p^3 has size H/8×W/8×4C', F_p^4 has size H/16×W/16×8C', and C' = 64. The text features F_t^1...F_t^4 and the corresponding image features F_p^1...F_p^4 are then fed into the four attention feature fusion submodules, which fuse the text features and image features to obtain the fused multi-level trunk features F_b^1, F_b^2, F_b^3 and F_b^4, where F_b^1 has size H/2×W/2×C', F_b^2 has size H/4×W/4×2C', F_b^3 has size H/8×W/8×4C', F_b^4 has size H/16×W/16×8C', and C' = 64. Global average pooling is then applied to F_b^1, F_b^2, F_b^3 and F_b^4 separately, and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F'_tp of the input image block, of size 1×1×15C'. The computation is:

F'_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))

where Concat(·) denotes feature concatenation and GAP(·) denotes global average pooling. Finally, the fused multi-scale image-text feature F'_tp is fed into the channel attention submodule to capture the key information among the different channels, giving the final output feature F_tp of the adaptive feature extraction module; F_tp is then flattened into a one-dimensional vector of size 1×D (where D = 960).
Further, step S3 includes the steps of:
a local image information interaction module is designed, as shown in fig. 7, and the module is composed of four full connection layers and a Softmax function. The local image information interaction module adopts a self-attention mechanism to enhance the information interaction between different image block characteristics, so that each image block is endowed with different attention degrees, and the local characteristics of each image block are better aggregated. Specifically, the input characteristic of the module is recorded as F l The size is n×d (where N represents the number of image blocks, n= 4;D represents the dimension of each image block, and d=960). First, input feature F l Input into three fully connected layers, thereby generating three new intermediate features F Q 、F K And F V The dimensions are all V x D '(where D' represents the intermediate feature F) Q 、F K And F V Dimension in the second dimension, D' =480); then, in the intermediate feature F Q And F K Performing a matrix multiplication operation and applying a Softmax function to generate an attention map a having dimensions N x N; then intermediate feature F V Performing matrix multiplication operation on the two characteristics with the attention diagram A to obtain a two-dimensional characteristic matrix S, wherein the dimension size of the two-dimensional characteristic matrix S is NxD'; then inputting the feature S into a full connection layer to obtain a feature matrix F s The dimension size is Nxd; finally, feature F s Multiplying by a scaling parameter alpha and connecting with the input feature F by residual l Adding to obtain final output F 'of the local image information interaction module' l . The specific calculation formula is as follows:
F Q =Linear1(F l )
F K =Linear2(F l )
F V =Linear3(F l )
Figure BDA0004178390730000141
Figure BDA0004178390730000142
F s =Linear4(S)
Figure BDA0004178390730000143
where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) denote the four fully connected layers, Softmax(·) denotes the Softmax function, Transpose(·) denotes the transpose operation on a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α denotes a learnable scale parameter used for fusion, and F'_l denotes the output feature of the local image information interaction module, whose size N × D is the same as that of the input feature F_l.
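In one possible implementation, the local image information interaction module can be sketched in PyTorch as follows; the zero initialization of the scale parameter α is an assumption, while the rest follows the formulas above:

import torch
import torch.nn as nn

class LocalInteraction(nn.Module):
    # Four fully connected layers plus Softmax, as described above.
    def __init__(self, dim=960, inner_dim=480):
        super().__init__()
        self.linear1 = nn.Linear(dim, inner_dim)   # produces F_Q
        self.linear2 = nn.Linear(dim, inner_dim)   # produces F_K
        self.linear3 = nn.Linear(dim, inner_dim)   # produces F_V
        self.linear4 = nn.Linear(inner_dim, dim)   # produces F_s
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable fusion scale (init is an assumption)
    def forward(self, f_l):                        # f_l: N x D, one row per image block
        f_q, f_k, f_v = self.linear1(f_l), self.linear2(f_l), self.linear3(f_l)
        attn = torch.softmax(f_q @ f_k.transpose(0, 1), dim=-1)  # N x N attention map A
        f_s = self.linear4(attn @ f_v)             # N x D
        return self.alpha * f_s + f_l              # residual connection to the input

print(LocalInteraction()(torch.randn(4, 960)).shape)   # torch.Size([4, 960])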
Further, step S4 includes the steps of:
Step S41, designing a no-reference image quality evaluation network based on multi-region feature fusion, where the network comprises four adaptive feature extraction modules of step S25, the local image information interaction module of step S31 and a fully connected layer. The four image blocks I_1^train, I_2^train, I_3^train and I_4^train corresponding to each distorted screen content image in the training set obtained in step S13 are taken as the input of the network, each of size H × W × 3. Specifically, the four distorted image blocks are first input into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block; the output feature of the i-th input image block I_i^train after passing through the i-th adaptive feature extraction module is denoted F_i (i = 1, 2, 3, 4), of size 1 × D. The four one-dimensional output features F_i are then spliced into a two-dimensional feature vector to obtain the initial fusion feature F, of size N × D (N represents the number of image blocks, N = 4). The initial fusion feature F is then input into the local image information interaction module to strengthen the information interaction among the image blocks, giving the final output feature F_out of the network, of size N × D, the same as the initial fusion feature F.
Step S42, outputting the characteristic F to the network obtained in step S41 out A dimension transformation operation is performed to flatten it into a one-dimensional feature vector whose dimension size is changed from n×d to 1×c (where c=n×d). Then the flattened one-dimensional characteristic vector is input into a full connection layer, so as to obtain the quality evaluation score F of the distorted screen content image score . The specific calculation formula is as follows:
F score =Linear(Reshape(F out ))
where Linear represents a fully connected layer and Reshape represents a dimension transformation operation.
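In one possible implementation, the overall network of steps S41 and S42 can be assembled as follows; the adaptive feature extraction modules are passed in as black boxes (e.g. the sketches given alongside claim 3 below), and only the stacking, interaction and scoring head are shown. The class name and constructor arguments are illustrative assumptions:

import torch
import torch.nn as nn

class MRFFQualityNet(nn.Module):
    # Four adaptive feature extraction modules, one local interaction module,
    # and a fully connected layer mapping the flattened feature to one score.
    def __init__(self, afe_modules, interaction, dim=960, num_blocks=4):
        super().__init__()
        self.afe = nn.ModuleList(afe_modules)           # one module per image block
        self.interaction = interaction                  # e.g. LocalInteraction()
        self.head = nn.Linear(num_blocks * dim, 1)      # C = N x D -> 1 quality score
    def forward(self, blocks):                          # blocks: list of 4 tensors, 1 x 3 x H x W
        feats = [afe(b) for afe, b in zip(self.afe, blocks)]  # each 1 x D
        f = torch.cat(feats, dim=0)                     # N x D initial fusion feature
        f_out = self.interaction(f)                     # N x D after block interaction
        return self.head(f_out.reshape(1, -1))          # F_score, shape 1 x 1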
Step S43, designing a loss function of the non-reference image quality evaluation network based on multi-region feature fusion, which is specifically as follows:
Figure BDA0004178390730000151
wherein n is the number of samples in the training set, y i Representing the true quality score of the ith distorted screen content image,
Figure BDA0004178390730000152
representing the predicted quality fraction of the i-th distorted screen content image output via the network.
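The exact loss formula above is only available as an image; assuming a standard mean-squared-error regression loss between the predicted and true quality scores, a minimal sketch is:

import torch

def quality_loss(pred_scores, true_scores):
    # Assumed MSE-style loss: mean of (y_i - y_hat_i)^2 over the n training samples.
    return torch.mean((pred_scores - true_scores) ** 2)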
Step S44, repeating steps S41 to S43 batch by batch until the loss value calculated in step S43 converges and becomes stable, then saving the network parameters to complete the training process of the no-reference image quality evaluation network based on multi-region feature fusion.
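A minimal batch-wise training loop for step S44 might look as follows; the optimizer choice, learning rate, epoch count and checkpoint path are assumptions, and quality_loss is the assumed MSE sketch above:

import torch

def train(model, loader, epochs=100, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)   # optimizer choice is an assumption
    for _ in range(epochs):
        for blocks, mos in loader:              # four training blocks and the true quality score
            optimizer.zero_grad()
            loss = quality_loss(model(blocks), mos)
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "mrff_iqa.pth")  # save the network parameters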
Further, step S5 is implemented as follows:
The four image blocks I_1^test, I_2^test, I_3^test and I_4^test corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation score is output.
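Inference for step S5 then reduces to a single forward pass over the four normalized test blocks of each distorted image; a minimal sketch, assuming the network class above:

import torch

def predict_quality(model, test_blocks):
    # test_blocks: the four normalized H x W blocks cropped from one distorted test image.
    model.eval()
    with torch.no_grad():
        return model(test_blocks).item()   # the predicted quality evaluation score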
The above is a preferred embodiment of the present invention; any changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided that the resulting functional effects do not exceed the scope of the technical solution of the present invention.

Claims (6)

1. A no-reference screen content image quality assessment method based on multi-region feature fusion, characterized by comprising the following steps:
Step S1, performing data preprocessing on the data in the distorted screen content image dataset: first cropping image blocks from each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data enhancement on the data in the training set;
Step S2, designing an adaptive feature extraction module, which can adaptively extract features of different scales from the text region and the image region in a distorted screen content image block, and fuse the features of the text region and the features of the image region based on an attention mechanism;
Step S3, designing a local image information interaction module, which enhances the information interaction between any two image blocks in the distorted screen content image by introducing a self-attention mechanism, and assigns a different attention weight to each image block;
Step S4, designing a no-reference image quality evaluation network based on multi-region feature fusion, and training it to obtain a no-reference screen content image quality evaluation model based on multi-region feature fusion;
Step S5, inputting the distorted screen content image to be evaluated into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
2. The multi-region feature fusion-based referenceless screen content image quality assessment method according to claim 1, wherein step S1 is specifically implemented as follows:
Step S11, first cropping image blocks from each distorted screen content image I in the distorted screen content image dataset; specifically, each distorted screen content image I is divided into four areas (upper left, upper right, lower left and lower right), and an image block of size H × W is then randomly cropped from each area, denoted I_1, I_2, I_3 and I_4 respectively, where H and W represent the height and width of the image block;
step S12, dividing the distorted screen content images in the distorted screen content image data set into a training set and a testing set according to a preset proportion;
Step S13, for each distorted screen content image I_train in the training set, the four cropped image blocks are subjected to a unified random horizontal flip and normalization processing to complete the data enhancement operation, obtaining the distorted screen content image blocks for training I_1^train, I_2^train, I_3^train and I_4^train; for each distorted screen content image I_test in the test set, the four cropped image blocks are subjected to the same normalization processing, obtaining the distorted screen content image blocks for testing I_1^test, I_2^test, I_3^test and I_4^test.
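In one possible implementation, the preprocessing of steps S11 and S13 can be sketched as follows; the quadrant-local random placement of the crops and the normalization statistics are assumptions:

import random
import torch
import torchvision.transforms.functional as TF

def crop_four_blocks(img, h, w):
    # img: 3 x H_img x W_img tensor; one h x w block is cropped at random from each
    # of the four areas (upper left, upper right, lower left, lower right).
    _, H, W = img.shape
    blocks = []
    for top0, left0 in [(0, 0), (0, W // 2), (H // 2, 0), (H // 2, W // 2)]:
        top = top0 + random.randint(0, H // 2 - h)
        left = left0 + random.randint(0, W // 2 - w)
        blocks.append(img[:, top:top + h, left:left + w])
    return blocks

def enhance_train_blocks(blocks, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    # Unified random horizontal flip (one shared decision for all four blocks) plus normalization.
    if random.random() < 0.5:
        blocks = [torch.flip(b, dims=[-1]) for b in blocks]
    return [TF.normalize(b, mean, std) for b in blocks]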
3. The multi-region feature fusion-based referenceless screen content image quality assessment method according to claim 2, wherein step S2 is specifically implemented as follows:
Step S21, designing a text feature extraction sub-module, which consists of one convolution layer with a 3 × 3 kernel, two convolution layers with 1 × 1 kernels, two LeakyReLU activation functions and three batch normalization layers; the convolution layer with the 3 × 3 kernel is used to extract features from the text region in the distorted screen content image block; the feature map input to the text feature extraction sub-module is denoted F_t, of size H_t × W_t × C_t, where H_t, W_t and C_t respectively represent the height, width and number of channels of the input feature map F_t; specifically, the feature map F_t is first input sequentially into a convolution layer with a 3 × 3 kernel, a batch normalization layer, a LeakyReLU activation function, a convolution layer with a 1 × 1 kernel and a batch normalization layer for preliminary feature extraction, yielding the intermediate feature map F'_t1 of size H_t × W_t × C'_t, where H_t, W_t and C'_t respectively represent the height, width and number of channels of the intermediate feature map F'_t1; the input feature map F_t is also input sequentially into a convolution layer with a 1 × 1 kernel and a batch normalization layer for residual feature extraction, yielding the intermediate feature map F'_t2 of size H_t × W_t × C'_t, the same size as the intermediate feature map F'_t1; finally, the intermediate feature maps F'_t1 and F'_t2 are added through a residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_t of the text feature extraction sub-module, of size H_t × W_t × C'_t; the specific calculation formula is as follows:
F'_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))
F'_t2 = BN(Conv3(F_t))
F'_t = LeakyReLU(F'_t1 ⊕ F'_t2)
where Conv1(·) denotes the convolution layer with a 3 × 3 kernel, Conv2(·) and Conv3(·) denote the two convolution layers with 1 × 1 kernels, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
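In one possible implementation, the text feature extraction sub-module (and, with a 5 × 5 kernel, the image feature extraction sub-module of step S22) can be sketched as a residual convolution block; the class name and the same-size padding are illustrative assumptions:

import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    # Main branch: k x k conv -> BN -> LeakyReLU -> 1x1 conv -> BN;
    # residual branch: 1x1 conv -> BN; the two are added and passed through LeakyReLU.
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch))
        self.act = nn.LeakyReLU(inplace=True)
    def forward(self, f_t):
        return self.act(self.main(f_t) + self.shortcut(f_t))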
Step S22, designing an image feature extraction sub-module, which consists of one convolution layer with a 5 × 5 kernel, two convolution layers with 1 × 1 kernels, two LeakyReLU activation functions and three batch normalization layers; the convolution layer with the 5 × 5 kernel is used to extract features from the image region in the distorted screen content image block; the feature map input to the image feature extraction sub-module is denoted F_p, of size H_p × W_p × C_p, where H_p, W_p and C_p respectively represent the height, width and number of channels of the input feature map F_p; specifically, the feature map F_p is first input sequentially into a convolution layer with a 5 × 5 kernel, a batch normalization layer, a LeakyReLU activation function, a convolution layer with a 1 × 1 kernel and a batch normalization layer for preliminary feature extraction, yielding the intermediate feature map F'_p1 of size H_p × W_p × C'_p, where H_p, W_p and C'_p respectively represent the height, width and number of channels of the intermediate feature map F'_p1; the input feature map F_p is also input sequentially into a convolution layer with a 1 × 1 kernel and a batch normalization layer for residual feature extraction, yielding the intermediate feature map F'_p2 of size H_p × W_p × C'_p, the same size as the intermediate feature map F'_p1; finally, the intermediate feature maps F'_p1 and F'_p2 are added through a residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_p of the image feature extraction sub-module, of size H_p × W_p × C'_p; the specific calculation formula is as follows:
F'_p1 = BN(Conv2(LeakyReLU(BN(Conv1'(F_p)))))
F'_p2 = BN(Conv3(F_p))
F'_p = LeakyReLU(F'_p1 ⊕ F'_p2)
where Conv1'(·) denotes the convolution layer with a 5 × 5 kernel, Conv2(·) and Conv3(·) denote the two convolution layers with 1 × 1 kernels, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
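Under the same sketch, the image feature extraction sub-module differs from the text one only in its kernel size; the input/output channel counts shown here are illustrative:

text_block = FeatureExtractionBlock(3, 64, kernel_size=3)    # step S21, 3 x 3 kernel
image_block = FeatureExtractionBlock(3, 64, kernel_size=5)   # step S22, 5 x 5 kernel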
Step S23, designing an attention feature fusion sub-module, which consists of four convolution layers with 1 × 1 kernels, a global average pooling layer, two ReLU activation functions, a Sigmoid activation function and four batch normalization layers; the attention feature fusion sub-module can fuse the text features and the image features learned at different scales; the two features input to the attention feature fusion sub-module are F'_t and F'_p, both of size H_a × W_a × C_a, where H_a, W_a and C_a respectively represent the height, width and number of channels of the input feature maps F'_t and F'_p; specifically, the two input features are first added pixel by pixel to obtain the intermediate feature map F_b of size H_a × W_a × C_a; the intermediate feature map F_b is then input separately into a local attention extraction branch and a global attention extraction branch to extract different attention features, where the local attention extraction branch is formed by serially connecting, in order, a convolution layer with a 1 × 1 kernel, a batch normalization layer, a ReLU activation function, a convolution layer with a 1 × 1 kernel and a batch normalization layer, and the global attention extraction branch is formed by serially connecting, in order, a global average pooling layer, a convolution layer with a 1 × 1 kernel, a batch normalization layer, a ReLU activation function, a convolution layer with a 1 × 1 kernel and a batch normalization layer; the output feature of the intermediate feature map F_b after the local attention extraction branch is denoted F_local, and the output feature after the global attention extraction branch is denoted F_global, both of size H_a × W_a × C_a; the features F_local and F_global are then added pixel by pixel and passed through a Sigmoid function to obtain the corresponding learnable weight λ; finally, the learnable weight λ is used to perform a weighted fusion of the input features F'_t and F'_p, giving the final output F'_b of the attention feature fusion sub-module, of size H_a × W_a × C_a; the specific calculation formula is as follows:
F_b = F'_t ⊕ F'_p
F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))
F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))
λ = Sigmoid(F_local ⊕ F_global)
F'_b = λ·F'_t ⊕ (1 − λ)·F'_p
where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) denote the four convolution layers with 1 × 1 kernels, ⊕ denotes matrix addition, BN(·) denotes the batch normalization operation, GAP(·) denotes the global average pooling operation, ReLU(·) denotes the ReLU activation function, Sigmoid(·) denotes the Sigmoid activation function, and λ is a learnable weight output by the network;
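In one possible implementation, the attention feature fusion sub-module can be sketched as follows; the bottleneck width of the two branches and the λ / (1 − λ) form of the final weighted fusion are assumptions consistent with the description above:

import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    # Local and global attention branches over F_b = F'_t + F'_p produce the weight lambda,
    # which blends the text feature F'_t and the image feature F'_p.
    def __init__(self, channels, reduction=4):   # reduction ratio is an assumption
        super().__init__()
        mid = channels // reduction
        self.local_branch = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
    def forward(self, f_t, f_p):
        f_b = f_t + f_p                                             # pixel-wise addition
        lam = torch.sigmoid(self.local_branch(f_b) + self.global_branch(f_b))
        return lam * f_t + (1 - lam) * f_p                          # weighted fusion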
Step S24, designing a channel attention sub-module for enhancing the feature representation and capturing the key feature channel information of the input feature; the module consists of two convolution layers with 1 × 1 kernels, a ReLU activation function and a Sigmoid activation function; the feature map input to the channel attention sub-module is denoted F_c, of size H_c × W_c × C_c, where H_c, W_c and C_c respectively represent the height, width and number of channels of the input feature map F_c; specifically, the input feature F_c is first aggregated by a global average pooling operation, then passed through a convolution layer with a 1 × 1 kernel for dimension reduction, a ReLU activation function, and a convolution layer with a 1 × 1 kernel for dimension restoration, after which a Sigmoid function gives the corresponding channel attention weights; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F'_c of the channel attention sub-module, of size H_c × W_c × C_c, the same as the input feature map F_c; the specific calculation formula is as follows:
F'_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c
where GAP(·) denotes the global average pooling operation, Conv_1b(·) and Conv_2b(·) denote the two convolution layers with 1 × 1 kernels, ⊙ denotes element-wise multiplication, Sigmoid(·) denotes the Sigmoid activation function, and ReLU(·) denotes the ReLU activation function;
Step S25, designing an adaptive feature extraction module, which comprises four text feature extraction sub-modules of step S21, four image feature extraction sub-modules of step S22, four attention feature fusion sub-modules of step S23, one channel attention sub-module of step S24 and eight spatial average pooling layers with a stride of 2; the adaptive feature extraction module performs adaptive multi-scale extraction of the text features and the image features of the input distorted screen content image block through a text feature extraction branch and an image feature extraction branch respectively, and fuses the different types of image features through an attention mechanism; specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction sub-module and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction sub-module and a spatial pooling layer;
The size of the input distorted screen content image block is denoted H × W × 3; the distorted screen content image block is first input into the text feature extraction branch and the image feature extraction branch respectively to extract multi-scale text features and image features; the multi-level text features output after the input distorted screen content image block passes through the text feature extraction branch are denoted {F_t^1, F_t^2, F_t^3, F_t^4}, where feature map F_t^1 has size H/2 × W/2 × C', feature map F_t^2 has size H/4 × W/4 × 2C', feature map F_t^3 has size H/8 × W/8 × 4C', feature map F_t^4 has size H/16 × W/16 × 8C', and C' = 64; the multi-level image features output after the input distorted screen content image block passes through the image feature extraction branch are denoted {F_p^1, F_p^2, F_p^3, F_p^4}, where feature map F_p^1 has size H/2 × W/2 × C', feature map F_p^2 has size H/4 × W/4 × 2C', feature map F_p^3 has size H/8 × W/8 × 4C', feature map F_p^4 has size H/16 × W/16 × 8C', and C' = 64; then the multi-level text features {F_t^1, F_t^2, F_t^3, F_t^4} and the corresponding multi-level image features {F_p^1, F_p^2, F_p^3, F_p^4} are respectively input into the four attention feature fusion sub-modules, and the text features and image features are fused to obtain the fused multi-level trunk features {F_b^1, F_b^2, F_b^3, F_b^4}, where feature map F_b^1 has size H/2 × W/2 × C', feature map F_b^2 has size H/4 × W/4 × 2C', feature map F_b^3 has size H/8 × W/8 × 4C', feature map F_b^4 has size H/16 × W/16 × 8C', and C' = 64; then global average pooling is performed on {F_b^1, F_b^2, F_b^3, F_b^4} respectively, and the pooled features are concatenated along the channel direction to obtain the multi-scale image-text feature representation F'_tp of the input image block, of size 1 × 15C'; the specific calculation formula is as follows:
F'_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))
where Concat(·) denotes the feature concatenation operation and GAP(·) denotes the global average pooling operation; finally, the fused multi-scale image-text feature F'_tp is input into the channel attention sub-module to capture the key information among different channels, giving the final output feature F_tp of the adaptive feature extraction module; the feature F_tp is then flattened into a one-dimensional vector of size 1 × D, where D represents the dimension of each image block feature, D = 960.
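Reusing the FeatureExtractionBlock, AttentionFeatureFusion and ChannelAttention classes sketched earlier, one possible composition of the whole adaptive feature extraction module of step S25 is the following; the channel widths follow C' = 64 and the stride-2 average pooling of the description, while the class name and the 128 × 128 demo input are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFeatureExtraction(nn.Module):
    # Four stages, each: text block (3x3) / image block (5x5) -> stride-2 average pooling,
    # followed by attention feature fusion; the four fused trunk maps are pooled,
    # concatenated, passed through channel attention, and flattened to a 1 x 960 vector.
    def __init__(self, cprime=64):
        super().__init__()
        chans = [3, cprime, 2 * cprime, 4 * cprime, 8 * cprime]
        self.text_blocks = nn.ModuleList(
            [FeatureExtractionBlock(chans[i], chans[i + 1], 3) for i in range(4)])
        self.image_blocks = nn.ModuleList(
            [FeatureExtractionBlock(chans[i], chans[i + 1], 5) for i in range(4)])
        self.fusions = nn.ModuleList([AttentionFeatureFusion(chans[i + 1]) for i in range(4)])
        self.pool = nn.AvgPool2d(2)
        self.channel_attn = ChannelAttention(15 * cprime)
    def forward(self, x):                      # x: 1 x 3 x H x W distorted image block
        f_t, f_p, fused = x, x, []
        for tb, ib, fu in zip(self.text_blocks, self.image_blocks, self.fusions):
            f_t = self.pool(tb(f_t))           # text feature branch
            f_p = self.pool(ib(f_p))           # image feature branch
            fused.append(fu(f_t, f_p))         # multi-level trunk feature
        pooled = [F.adaptive_avg_pool2d(f, 1) for f in fused]
        f_tp = self.channel_attn(torch.cat(pooled, dim=1))
        return torch.flatten(f_tp, start_dim=1)          # 1 x D, D = 960

afe = AdaptiveFeatureExtraction().eval()   # eval() so BatchNorm works with a single sample here
print(afe(torch.randn(1, 3, 128, 128)).shape)            # torch.Size([1, 960])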
4. The multi-region feature fusion-based referenceless screen content image quality assessment method according to claim 3, wherein step S3 is specifically implemented as follows:
designing a local image information interaction module, which consists of four fully connected layers and a Softmax function; the local image information interaction module adopts a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block is given a different degree of attention and the local features of each image block are better aggregated; specifically, the input feature of the local image information interaction module is F_l, of size N × D, where N represents the number of image blocks, N = 4, and D represents the dimension of each image block feature, D = 960; first, the input feature F_l is input into three fully connected layers to generate three new intermediate features F_Q, F_K and F_V, all of size N × D', where D' represents the size of the second dimension of the intermediate features F_Q, F_K and F_V, D' = 480; then, a matrix multiplication is performed between F_Q and the transpose of F_K, and a Softmax function is applied to generate an attention map A of size N × N; next, a matrix multiplication is performed between the attention map A and the intermediate feature F_V to obtain a two-dimensional feature matrix S of size N × D'; the two-dimensional feature matrix S is then input into a fully connected layer to obtain the feature matrix F_s of size N × D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection, giving the final output F'_l of the local image information interaction module; the specific calculation formula is as follows:
F_Q = Linear1(F_l)
F_K = Linear2(F_l)
F_V = Linear3(F_l)
A = Softmax(F_Q ⊗ Transpose(F_K))
S = A ⊗ F_V
F_s = Linear4(S)
F'_l = α·F_s ⊕ F_l
where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) denote the four fully connected layers, Softmax(·) denotes the Softmax function, Transpose(·) denotes the transpose operation on a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α denotes a learnable scale parameter used for fusion, and F'_l denotes the output feature of the local image information interaction module, whose size N × D is the same as that of the input feature F_l.
5. The multi-region feature fusion-based referenceless screen content image quality assessment method according to claim 4, wherein step S4 is specifically implemented as follows:
Step S41, designing a no-reference image quality evaluation network based on multi-region feature fusion, where the network comprises four adaptive feature extraction modules of step S25, the local image information interaction module of step S31 and a fully connected layer; the four image blocks I_1^train, I_2^train, I_3^train and I_4^train corresponding to each distorted screen content image in the training set obtained in step S13 are taken as the input of the network, each of size H × W × 3; specifically, the four image blocks are first input into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block; the output feature of the i-th input image block I_i^train after passing through the i-th adaptive feature extraction module is denoted F_i, of size 1 × D; the four one-dimensional output features F_i are then spliced into a two-dimensional feature vector to obtain the initial fusion feature F, of size N × D; the initial fusion feature F is then input into the local image information interaction module to strengthen the information interaction among the image blocks, giving the final output feature F_out of the network, of size N × D, the same as the initial fusion feature F;
Step S42, a dimension transformation operation is performed on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose size changes from N × D to 1 × C, where C = N × D; the flattened one-dimensional feature vector is then input into a fully connected layer to obtain the quality evaluation score F_score of the distorted screen content image; the specific calculation formula is as follows:
F score =Linear(Reshape(F out ))
wherein Linear represents a fully connected layer and Reshape represents dimension transformation operation;
Step S43, designing the loss function of the no-reference image quality evaluation network based on multi-region feature fusion, specifically as follows:
Figure FDA0004178390720000071 (loss function formula)
where n is the number of samples in the training set, y_i represents the true quality score of the i-th distorted screen content image, and ŷ_i represents the predicted quality score of the i-th distorted screen content image output by the network;
Step S44, repeating steps S41 to S43 batch by batch until the loss value calculated in step S43 converges and becomes stable, then saving the network parameters to complete the training process of the no-reference image quality evaluation network based on multi-region feature fusion.
6. The multi-region feature fusion-based referenceless screen content image quality assessment method according to claim 2, wherein step S5 is specifically implemented as follows:
The four image blocks I_1^test, I_2^test, I_3^test and I_4^test corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation score is output.
CN202310398032.0A 2023-04-14 2023-04-14 No-reference screen content image quality assessment method based on multi-region feature fusion Pending CN116403063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310398032.0A CN116403063A (en) 2023-04-14 2023-04-14 No-reference screen content image quality assessment method based on multi-region feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310398032.0A CN116403063A (en) 2023-04-14 2023-04-14 No-reference screen content image quality assessment method based on multi-region feature fusion

Publications (1)

Publication Number Publication Date
CN116403063A true CN116403063A (en) 2023-07-07

Family

ID=87010188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310398032.0A Pending CN116403063A (en) 2023-04-14 2023-04-14 No-reference screen content image quality assessment method based on multi-region feature fusion

Country Status (1)

Country Link
CN (1) CN116403063A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116738325A (en) * 2023-08-16 2023-09-12 湖北工业大学 Method and system for identifying lower limb exoskeleton movement pattern based on DenseNet-LSTM network

Similar Documents

Publication Publication Date Title
CN113240580B (en) Lightweight image super-resolution reconstruction method based on multi-dimensional knowledge distillation
CN108510485B (en) Non-reference image quality evaluation method based on convolutional neural network
Zhang et al. Hierarchical feature fusion with mixed convolution attention for single image dehazing
CN108171701B (en) Significance detection method based on U network and counterstudy
CN103996192B (en) Non-reference image quality evaluation method based on high-quality natural image statistical magnitude model
CN107977932A (en) It is a kind of based on can differentiate attribute constraint generation confrontation network face image super-resolution reconstruction method
Yan et al. Deep objective quality assessment driven single image super-resolution
Chen et al. Locally GAN-generated face detection based on an improved Xception
CN110728209A (en) Gesture recognition method and device, electronic equipment and storage medium
CN111582044A (en) Face recognition method based on convolutional neural network and attention model
CN113284100A (en) Image quality evaluation method based on recovery image to mixed domain attention mechanism
Chen et al. Naturalization module in neural networks for screen content image quality assessment
CN112232325B (en) Sample data processing method and device, storage medium and electronic equipment
CN112257741B (en) Method for detecting generative anti-false picture based on complex neural network
CN111882516B (en) Image quality evaluation method based on visual saliency and deep neural network
CN116403063A (en) No-reference screen content image quality assessment method based on multi-region feature fusion
US20220301106A1 (en) Training method and apparatus for image processing model, and image processing method and apparatus
CN108492275B (en) No-reference stereo image quality evaluation method based on deep neural network
CN111652238B (en) Multi-model integration method and system
CN115936980B (en) Image processing method and device, electronic equipment and storage medium
CN112785498B (en) Pathological image superscore modeling method based on deep learning
CN110427892B (en) CNN face expression feature point positioning method based on depth-layer autocorrelation fusion
Fan et al. Image inpainting based on structural constraint and multi-scale feature fusion
CN111539420B (en) Panoramic image saliency prediction method and system based on attention perception features
KR102340387B1 (en) Method of learning brain connectivity and system threrfor

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination