CN116403063A - No-reference screen content image quality assessment method based on multi-region feature fusion - Google Patents
No-reference screen content image quality assessment method based on multi-region feature fusion
- Publication number: CN116403063A
- Application number: CN202310398032.0A
- Authority: CN
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space; mappings, e.g. subspace methods
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition or understanding using neural networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/08 — Learning methods
- Y02P90/30 — Computing systems specially adapted for manufacturing
Abstract
The invention relates to a non-reference screen content image quality assessment method based on multi-region feature fusion, comprising the following steps: preprocessing the data in a distorted screen content image dataset; designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of distorted screen content image blocks and fuses the text-region features with the image-region features based on an attention mechanism; designing a local image information interaction module that enhances information interaction between any two image blocks of a distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block; designing a non-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a non-reference screen content image quality evaluation model; and inputting the distorted screen content image to be evaluated into the trained non-reference screen content image quality evaluation model to output the corresponding quality evaluation score.
Description
Technical Field
The invention relates to the field of image processing and computer vision, in particular to a non-reference screen content image quality assessment method based on multi-region feature fusion.
Background
In recent years, with the rapid development of mobile devices, multimedia applications and information dissemination technologies, remote computing and communication have become widespread, and the real-time screen content images generated by various electronic terminal devices appear ever more frequently in daily life. A screen content image is a new type of image that, compared with a traditional natural image, contains more text, icons and graphics and has special structural layout information and statistical characteristics; the number of sharp edges in a screen content image far exceeds that of a natural image, while its color information is sparser. During the encoding, compression and transmission of screen content images, distortions of various degrees are inevitably introduced by technical or hardware limitations, degrading image quality and in turn harming user experience and interactive performance. Given the demand for clear, high-quality images, an image quality evaluation method that can effectively evaluate the perceived quality of screen content images is needed; such a method can serve as auxiliary reference information for image restoration and enhancement techniques, and also provides a feasible basis for designing and optimizing advanced image/video processing algorithms.
Image quality evaluation methods can be divided into subjective and objective evaluation according to the evaluating subject. Subjective evaluation means that image quality is scored by people, the final recipients of the image information; however, it is easily affected by the subjects' own biases, and in an era of explosive data growth, subjectively evaluating tens of thousands of images is unrealistic, so it can hardly meet the needs of real-time applications. Objective evaluation extracts and analyzes relevant features of the distorted image, constructs a corresponding mathematical model, and computes a quality evaluation score for the distorted image. Compared with subjective evaluation, this process needs no large panels of human scorers; a computer stands in for the human visual system to achieve automatic and efficient quality evaluation, and can therefore meet the demands of big-data applications. According to how much reference image information the quality evaluation process requires, objective methods are divided into full-reference, reduced-reference and non-reference methods, whose dependence on reference image information decreases in that order. In practice, a distortion-free reference image is often hard to obtain, so non-reference image quality evaluation is the most practical and has the broadest development prospects.
With the continuous development of deep learning, many screen content image quality evaluation models based on convolutional neural networks have appeared in recent years. Deep learning is a data-driven approach, yet existing screen content image datasets usually contain only a small number of distorted images, so data augmentation is mainly achieved by partitioning each screen content image into blocks. A single image block, however, cannot fully characterize the quality of the whole distorted image; and because screen content images contain large text regions and image regions simultaneously, blocks with different content can show large quality differences even under identical distortion. It is therefore necessary to fully account for how differently the image regions and text regions of a screen content image affect its overall visual quality, to design a weighting strategy reflecting that difference, and to fuse the image features of both region types effectively, thereby improving the accuracy of non-reference screen content image quality evaluation models.
Disclosure of Invention
The invention aims to provide a non-reference screen content image quality assessment method based on multi-region feature fusion. Following the idea of multi-region local feature fusion, the features of several local image blocks are fused to represent the overall quality of the image, reducing the quality-score deviation caused by training on a single image block. Meanwhile, in the convolution layers of the network, convolution kernels of different sizes extract the different characteristics of the text regions and image regions of a screen content image more effectively, an attention mechanism fuses the multi-scale text and image features, and each image block can be given a different degree of attention. Overall, the method achieves higher consistency with subjective visual perception than other methods.
In order to achieve the above purpose, the technical scheme of the invention is as follows: a non-reference screen content image quality assessment method based on multi-region feature fusion, comprising the following steps:
step S1, preprocessing the data in the distorted screen content image dataset: first cropping image blocks from each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data augmentation on the training set;
step S2, designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text-region features with the image-region features based on an attention mechanism;
step S3, designing a local image information interaction module that enhances information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block;
step S4, designing a non-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a non-reference screen content image quality evaluation model based on multi-region feature fusion;
and step S5, inputting the distorted screen content image to be evaluated into the trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
In an embodiment of the present invention, the step S1 is specifically implemented as follows:
step S11, first cropping image blocks from each distorted screen content image I in the distorted screen content image dataset; specifically, each distorted screen content image I is divided into four regions (upper-left, upper-right, lower-left and lower-right), and an image block of size H×W is randomly cropped from each region, the blocks being denoted I_1, I_2, I_3 and I_4, where H and W denote the height and width of an image block, respectively;
step S12, dividing the distorted screen content images in the distorted screen content image data set into a training set and a testing set according to a preset proportion;
step S13, for each distorted screen content image I_train in the training set, applying a shared horizontal random flip and normalization to its four cropped image blocks to complete the data augmentation and obtain the distorted screen content image blocks for training, denoted I_train^1, I_train^2, I_train^3 and I_train^4; for each distorted screen content image I_test in the test set, applying the same normalization to its four cropped image blocks to obtain the distorted screen content image blocks for testing, denoted I_test^1, I_test^2, I_test^3 and I_test^4.
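The following is a minimal sketch of this preprocessing step. The patent names no framework; PyTorch/torchvision is assumed here, the CxHxW tensor layout, helper names and the requirement that each quadrant be at least H×W are illustrative assumptions:

```python
import random
import torch
import torchvision.transforms.functional as TF

def crop_four_patches(img, patch_h, patch_w):
    """Randomly crop one patch_h x patch_w block from each quadrant
    (upper-left, upper-right, lower-left, lower-right) of a CxHxW image.
    Assumes each quadrant is at least patch_h x patch_w."""
    _, height, width = img.shape
    patches = []
    for top0, left0 in [(0, 0), (0, width // 2),
                        (height // 2, 0), (height // 2, width // 2)]:
        top = top0 + random.randint(0, height // 2 - patch_h)
        left = left0 + random.randint(0, width // 2 - patch_w)
        patches.append(img[:, top:top + patch_h, left:left + patch_w])
    return patches  # [I_1, I_2, I_3, I_4]

def augment_train(patches, mean, std):
    """Apply one shared horizontal flip to all four patches, then normalize;
    test patches would receive only the normalization."""
    if random.random() < 0.5:
        patches = [TF.hflip(p) for p in patches]
    return [TF.normalize(p, mean, std) for p in patches]
```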
In an embodiment of the present invention, the step S2 is specifically implemented as follows:
step S21, designing a text feature extraction submodule consisting of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; the 3×3 convolution layer performs feature extraction on the text regions of a distorted screen content image block; the feature map input to the text feature extraction submodule is denoted F_t, of size H_t × W_t × C_t, where H_t, W_t and C_t denote the height, width and number of channels of the input feature map F_t, respectively; specifically, the feature map F_t is first passed through a 3×3 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_t1 of size H_t × W_t × C′_t, where H_t, W_t and C′_t denote the height, width and number of channels of F′_t1, respectively; the input feature map F_t is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_t2 of size H_t × W_t × C′_t, the same size as F′_t1; finally, the intermediate feature maps F′_t1 and F′_t2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_t of the text feature extraction submodule, of size H_t × W_t × C′_t; the specific calculation formulas are as follows:

F′_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))

F′_t2 = BN(Conv3(F_t))

F′_t = LeakyReLU(F′_t1 ⊕ F′_t2)

where Conv1(·) denotes the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
step S22, designing an image feature extraction submodule consisting of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; the 5×5 convolution layer performs feature extraction on the image regions of a distorted screen content image block; the feature map input to the image feature extraction submodule is denoted F_p, of size H_p × W_p × C_p, where H_p, W_p and C_p denote the height, width and number of channels of the input feature map F_p, respectively; specifically, the feature map F_p is first passed through a 5×5 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_p1 of size H_p × W_p × C′_p, where H_p, W_p and C′_p denote the height, width and number of channels of F′_p1, respectively; the input feature map F_p is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_p2 of size H_p × W_p × C′_p, the same size as F′_p1; finally, the intermediate feature maps F′_p1 and F′_p2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_p of the image feature extraction submodule, of size H_p × W_p × C′_p; the specific calculation formulas are as follows:

F′_p1 = BN(Conv2(LeakyReLU(BN(Conv1′(F_p)))))

F′_p2 = BN(Conv3(F_p))

F′_p = LeakyReLU(F′_p1 ⊕ F′_p2)

where Conv1′(·) denotes the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
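A minimal PyTorch-style sketch of the submodules of steps S21 and S22 (framework assumed): a single parameterized module, where kernel_size = 3 reproduces the text feature extraction submodule and kernel_size = 5 the image feature extraction submodule; the padding choice is an assumption made to preserve spatial size:

```python
import torch.nn as nn

class FeatureExtractionSubmodule(nn.Module):
    """Main branch: kxk conv -> BN -> LeakyReLU -> 1x1 conv -> BN;
    residual branch: 1x1 conv -> BN; output is LeakyReLU(main + residual)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad),  # Conv1 / Conv1'
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),                        # Conv2
            nn.BatchNorm2d(out_ch),
        )
        self.residual = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1),                         # Conv3
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.residual(x))
```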
step S23, designing an attention feature fusion submodule consisting of four convolution layers with kernel size 1×1, one global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers; through learning, the attention feature fusion submodule fuses text features and image features of different scales; the two features input to the attention feature fusion submodule are denoted F′_t and F′_p, both of size H_a × W_a × C_a, where H_a, W_a and C_a denote the height, width and number of channels of the input features F′_t and F′_p, respectively; specifically, the two input features are first added pixel by pixel to obtain the intermediate feature map F_b, of size H_a × W_a × C_a; the intermediate feature map F_b is then fed into a local attention extraction branch and a global attention extraction branch to extract different attention features, where the local attention extraction branch is a series connection of a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer, and the global attention extraction branch is a series connection of a global average pooling layer, a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer; the output of the local attention extraction branch applied to F_b is denoted F_local and the output of the global attention extraction branch is denoted F_global, both of size H_a × W_a × C_a; the features F_local and F_global are then added pixel by pixel and passed through a Sigmoid function to obtain the corresponding learnable weight λ; finally, the learnable weight λ is used to weight and fuse the input features F′_t and F′_p, giving the final output F′_b of the attention feature fusion submodule, of size H_a × W_a × C_a; the specific calculation formulas are as follows:

F_b = F′_t ⊕ F′_p

F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))

F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))

λ = Sigmoid(F_local ⊕ F_global)

F′_b = λ ⊙ F′_t ⊕ (1−λ) ⊙ F′_p

where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) denote the four convolution layers with kernel size 1×1, ⊕ denotes matrix addition, ⊙ denotes element-wise multiplication, BN(·) denotes the batch normalization operation, GAP(·) denotes the global average pooling operation, ReLU(·) denotes the ReLU activation function, Sigmoid(·) denotes the Sigmoid activation function, and λ is a learnable weight output by the network;
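A sketch of the attention feature fusion submodule under the same PyTorch assumption; the channel reduction ratio inside the two branches and the complementary weighting λ·F′_t + (1−λ)·F′_p follow the reconstruction above and are assumptions where the original formulas are unavailable:

```python
import torch
import torch.nn as nn

class AttentionFeatureFusion(nn.Module):
    """Fuse text features f_t and image features f_p with a learned
    weight lambda = Sigmoid(F_local + F_global)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.local_branch = nn.Sequential(               # per-pixel attention
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )
        self.global_branch = nn.Sequential(              # GAP-based attention
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
        )

    def forward(self, f_t, f_p):
        f_b = f_t + f_p                                  # pixel-wise sum of inputs
        lam = torch.sigmoid(self.local_branch(f_b) + self.global_branch(f_b))
        return lam * f_t + (1.0 - lam) * f_p             # weighted fusion
```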
step S24, designing a channel attention submodule to enhance the feature representation and capture the key channel information of the input features; the submodule consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function; the feature map input to the channel attention submodule is denoted F_c, of size H_c × W_c × C_c, where H_c, W_c and C_c denote the height, width and number of channels of the input feature map F_c, respectively; specifically, the input feature F_c is first aggregated by a global average pooling operation, then passed through a 1×1 convolution layer for dimension reduction and another 1×1 convolution layer for dimension restoration, after which a Sigmoid function yields the corresponding channel attention weights; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F′_c of the channel attention submodule, of size H_c × W_c × C_c, the same size as the input feature map F_c; the specific calculation formula is as follows:

F′_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c

where GAP(·) denotes the global average pooling operation, Conv_1b(·) and Conv_2b(·) denote the two convolution layers with kernel size 1×1, ⊙ denotes element-wise multiplication, Sigmoid(·) denotes the Sigmoid activation function, and ReLU(·) denotes the ReLU activation function;
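A sketch of the channel attention submodule, again assuming PyTorch; the reduction ratio is an assumption, since the patent does not state the intermediate channel count:

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: GAP -> 1x1 conv (reduce) -> ReLU ->
    1x1 conv (restore) -> Sigmoid, then scale the input element-wise."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.gap(x))   # weights broadcast over H and W
```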
step S25, designing an adaptive feature extraction module comprising four text feature extraction submodules as in step S21, four image feature extraction submodules as in step S22, four attention feature fusion submodules as in step S23, one channel attention submodule as in step S24 and eight spatial average pooling layers with stride 2; through a text feature extraction branch and an image feature extraction branch, the adaptive feature extraction module adaptively extracts multi-scale text features and image features from the input distorted screen content image block, and fuses the different types of image features through an attention mechanism; specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer;

the size of the input distorted screen content image block is denoted H×W×3; the block is first fed into the text feature extraction branch and the image feature extraction branch to extract multi-scale text features and image features; the multi-level text features output by the text feature extraction branch are denoted {F_t^1, F_t^2, F_t^3, F_t^4}, where F_t^1 has size H/2 × W/2 × C′, F_t^2 has size H/4 × W/4 × 2C′, F_t^3 has size H/8 × W/8 × 4C′, F_t^4 has size H/16 × W/16 × 8C′, and C′ = 64; the multi-level image features output by the image feature extraction branch are denoted {F_p^1, F_p^2, F_p^3, F_p^4}, with F_p^i having the same size as F_t^i at each level; the multi-level text features F_t^i and the corresponding multi-level image features F_p^i are then fed into the four attention feature fusion submodules, which fuse the text and image features to give the fused multi-level trunk features {F_b^1, F_b^2, F_b^3, F_b^4}, with F_b^i having the same size as F_t^i at each level; a global average pooling operation is then applied to each F_b^i, and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F′_tp of the input image block, of size 1 × 1 × 15C′; the specific calculation formula is as follows:

F′_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))

where Concat(·) denotes feature concatenation and GAP(·) denotes the global average pooling operation; finally, the fused multi-scale image-text feature F′_tp is fed into the channel attention submodule to capture the key information across channels, giving the final output feature F_tp of the adaptive feature extraction module, which is then flattened into a one-dimensional vector of size 1 × D, where D denotes the feature dimension of each image block and D = 960.
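Putting the pieces together, a sketch of the adaptive feature extraction module reusing the three module sketches above; the per-stage channel progression C′, 2C′, 4C′, 8C′ is inferred from the concatenated size 15C′ (15 × 64 = 960 = D) and is otherwise an assumption:

```python
import torch
import torch.nn as nn

class AdaptiveFeatureExtraction(nn.Module):
    """Four stages of (text submodule + avg-pool) and (image submodule +
    avg-pool) run in parallel branches; each stage pair is fused by an
    attention feature fusion submodule, the fused trunk features are
    globally pooled and concatenated (C'+2C'+4C'+8C' = 15C'), then
    refined by channel attention and flattened to a 1-D vector of size D."""
    def __init__(self, base_ch=64):
        super().__init__()
        chans = [3, base_ch, 2 * base_ch, 4 * base_ch, 8 * base_ch]
        self.text_stages = nn.ModuleList()
        self.image_stages = nn.ModuleList()
        self.fusions = nn.ModuleList()
        for i in range(4):
            self.text_stages.append(nn.Sequential(
                FeatureExtractionSubmodule(chans[i], chans[i + 1], 3),
                nn.AvgPool2d(2)))                        # stride-2 pooling
            self.image_stages.append(nn.Sequential(
                FeatureExtractionSubmodule(chans[i], chans[i + 1], 5),
                nn.AvgPool2d(2)))
            self.fusions.append(AttentionFeatureFusion(chans[i + 1]))
        self.channel_attn = ChannelAttention(15 * base_ch)

    def forward(self, x):
        f_t, f_p, pooled = x, x, []
        for text, image, fuse in zip(self.text_stages, self.image_stages,
                                     self.fusions):
            f_t, f_p = text(f_t), image(f_p)             # per-branch streams
            f_b = fuse(f_t, f_p)                         # fused trunk feature
            pooled.append(f_b.mean(dim=(2, 3), keepdim=True))  # GAP per level
        f_tp = self.channel_attn(torch.cat(pooled, dim=1))     # 15C' channels
        return f_tp.flatten(1)                           # (batch, D) with D=960
```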
In an embodiment of the present invention, the step S3 is specifically implemented as follows:
designing a local image information interaction module consisting of four fully connected layers and a Softmax function; the local image information interaction module uses a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block receives a different degree of attention and the local features of the image blocks are better aggregated; specifically, the input feature of the local image information interaction module is denoted F_l, of size N × D, where N denotes the number of image blocks, N = 4, and D denotes the feature dimension of each image block, D = 960; first, the input feature F_l is fed into three fully connected layers to generate three new intermediate features F_Q, F_K and F_V, all of size N × D′, where D′ denotes the second dimension of the intermediate features F_Q, F_K and F_V, D′ = 480; then a matrix multiplication is performed between F_Q and the transpose of F_K and a Softmax function is applied to generate the attention map A, of size N × N; the intermediate feature F_V is then matrix-multiplied with the attention map A to obtain the two-dimensional feature matrix S, of size N × D′; the two-dimensional feature matrix S is then fed into a fully connected layer to obtain the feature matrix F_s, of size N × D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection to give the final output F′_l of the local image information interaction module; the specific calculation formulas are as follows:

F_Q = Linear1(F_l)

F_K = Linear2(F_l)

F_V = Linear3(F_l)

A = Softmax(F_Q ⊗ Transpose(F_K))

S = A ⊗ F_V

F_s = Linear4(S)

F′_l = α ⊙ F_s ⊕ F_l

where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) denote the four fully connected layers, Softmax(·) denotes the Softmax function, Transpose(·) denotes the transpose of a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α denotes a learnable scale parameter used for fusion, and F′_l is the output feature of the local image information interaction module, of size N × D, the same size as the input feature F_l.
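A sketch of the local image information interaction module under the same PyTorch assumption; initializing α to zero is an assumption, not stated in the patent:

```python
import torch
import torch.nn as nn

class LocalImageInteraction(nn.Module):
    """Self-attention over the N patch features:
    A = Softmax(Q K^T), S = A V, F_s = Linear4(S),
    output = alpha * F_s + input (residual connection)."""
    def __init__(self, dim=960, inner_dim=480):
        super().__init__()
        self.q = nn.Linear(dim, inner_dim)    # Linear1
        self.k = nn.Linear(dim, inner_dim)    # Linear2
        self.v = nn.Linear(dim, inner_dim)    # Linear3
        self.out = nn.Linear(inner_dim, dim)  # Linear4
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scale

    def forward(self, f_l):                   # f_l: (..., N, D) patch features
        q, k, v = self.q(f_l), self.k(f_l), self.v(f_l)
        attn = torch.softmax(q @ k.transpose(-2, -1), dim=-1)  # (..., N, N)
        s = attn @ v                          # (..., N, D')
        return self.alpha * self.out(s) + f_l
```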
In an embodiment of the present invention, the step S4 is specifically implemented as follows:
step S41, designing a non-reference image quality evaluation network based on multi-region feature fusion, comprising the four adaptive feature extraction modules of step S25, the local image information interaction module of step S3 and a fully connected layer; the four image blocks I_train^1, I_train^2, I_train^3 and I_train^4 corresponding to each distorted screen content image in the training set obtained in step S13 serve as the input of the network, each of size H×W×3; specifically, the four image blocks are first fed into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block, the output feature of the ith adaptive feature extraction module for the ith input image block I_train^i being denoted F_i, of size 1 × D; the four one-dimensional output features F_i are then stacked into a two-dimensional feature matrix to obtain the initial fused feature F, of size N × D; the initial fused feature F is then fed into the local image information interaction module to strengthen the information interaction between the image blocks, giving the final output feature F_out of the network, of size N × D, the same size as the initial fused feature F;
step S42, performing a dimension transformation on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose size changes from N × D to 1 × C, where C = N × D; the flattened one-dimensional feature vector is then fed into a fully connected layer to obtain the quality evaluation score F_score of the distorted screen content image; the specific calculation formula is as follows:

F_score = Linear(Reshape(F_out))

where Linear(·) denotes a fully connected layer and Reshape(·) denotes the dimension transformation operation;
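A sketch of the overall network of steps S41 and S42, reusing the module sketches above; the batched input layout is an assumption:

```python
import torch
import torch.nn as nn

class MultiRegionIQANetwork(nn.Module):
    """Four adaptive feature extraction modules (one per patch), the local
    image information interaction module, then a fully connected regressor."""
    def __init__(self, n_patches=4, dim=960):
        super().__init__()
        self.extractors = nn.ModuleList(
            [AdaptiveFeatureExtraction() for _ in range(n_patches)])
        self.interaction = LocalImageInteraction(dim=dim)
        self.regressor = nn.Linear(n_patches * dim, 1)   # quality score head

    def forward(self, patches):               # list of four (B, 3, H, W) patches
        feats = [ext(p) for ext, p in zip(self.extractors, patches)]
        f = torch.stack(feats, dim=1)          # initial fused feature (B, N, D)
        f_out = self.interaction(f)            # (B, N, D)
        return self.regressor(f_out.flatten(1)).squeeze(-1)  # F_score, (B,)
```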
step S43, designing a loss function of the non-reference image quality evaluation network based on multi-region feature fusion, which is specifically as follows:
where n is the number of samples in the training set, y_i denotes the true quality score of the ith distorted screen content image, and ŷ_i denotes the predicted quality score output by the network for the ith distorted screen content image;
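The explicit loss formula appears only as an image in the source text; assuming an ordinary mean-squared-error regression objective over the variables defined above, it would take the form

L = (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)²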
and S44, repeating the steps S41 to S43 by taking a batch as a unit until the loss value calculated in the step S43 converges and tends to be stable, saving network parameters, and completing the training process of the non-reference image quality evaluation network based on multi-region feature fusion.
In an embodiment of the present invention, the step S5 is specifically implemented as follows:
The four image blocks I_test^1, I_test^2, I_test^3 and I_test^4 corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation score is output.
Compared with the prior art, the invention has the following beneficial effects: the invention addresses the quality-score deviation caused when convolutional-neural-network-based image quality evaluation models are trained on individual blocks of a distorted screen content image, as well as the problem of extracting and fusing features from the different types of regions in a screen content image. To this end, the invention provides a non-reference screen content image quality evaluation method based on multi-region feature fusion, which represents the overall quality of an image by fusing the local features of several regions of the distorted image and performs adaptive feature extraction with convolution kernels of different sizes for the different region types. In addition, the method uses an attention mechanism to adaptively fuse the different types of image features and enhances the information interaction between the local image blocks, thereby effectively improving the accuracy of the non-reference screen content image quality evaluation model.
Drawings
FIG. 1 is a flow chart of a method according to an embodiment of the invention.
Fig. 2 is a diagram illustrating a network model structure according to an embodiment of the present invention.
Fig. 3 is a block diagram of a text feature extraction submodule according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of an image feature extraction submodule according to an embodiment of the present invention.
FIG. 5 is a diagram of an attention feature fusion submodule architecture according to an embodiment of the present invention.
Fig. 6 is a block diagram of an adaptive feature extraction module according to an embodiment of the invention.
Fig. 7 is a block diagram of a local image information interaction module according to an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is specifically described below with reference to the accompanying drawings.
The invention provides a multi-region feature fusion-based reference-free screen content image quality assessment method, which is shown in fig. 1 and comprises the following steps:
step S1, preprocessing the data in the distorted screen content image dataset: first cropping image blocks from each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data augmentation on the training set;
step S2, designing an adaptive feature extraction module that adaptively extracts features of different scales from the text regions and image regions of a distorted screen content image block and fuses the text-region features with the image-region features based on an attention mechanism;
step S3, designing a local image information interaction module that enhances information interaction between any two image blocks of the distorted screen content image by introducing a self-attention mechanism and assigns a different attention weight to each image block;
step S4, designing a non-reference image quality evaluation network based on multi-region feature fusion and training it to obtain a non-reference screen content image quality evaluation model;
and step S5, inputting the distorted screen content image to be evaluated into the trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
FIG. 2 is a diagram of a network model constructed by the method of the present invention.
Further, step S1 includes the steps of:
step S11, first cropping image blocks from each distorted image I in the distorted screen content image dataset. Specifically, each distorted image I is divided into four regions (upper-left, upper-right, lower-left and lower-right), and an image block of size H×W is randomly cropped from each region, the blocks being denoted I_1, I_2, I_3 and I_4, where H and W denote the height and width of an image block, respectively;
s12, dividing images in the distorted screen content image dataset into a training set and a testing set according to a certain proportion;
Step S13, for each distorted screen content image I_train in the training set, a shared horizontal random flip and normalization are applied to its four cropped image blocks, completing the data augmentation and yielding the distorted screen content image blocks for training, denoted I_train^1, I_train^2, I_train^3 and I_train^4; for each distorted screen content image I_test in the test set, the same normalization is applied to its four cropped image blocks, yielding the distorted screen content image blocks for testing, denoted I_test^1, I_test^2, I_test^3 and I_test^4.
further, step S2 includes the steps of:
step S21, designing a text feature extraction submodule, as shown in fig. 3, consisting of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. Since smaller convolution kernels better capture the local detail information of an image, such as characters and lines, a convolution layer with kernel size 3×3 is used to extract features from the text regions of the image. The feature map input to this submodule is denoted F_t, of size H_t × W_t × C_t, where H_t, W_t and C_t denote the height, width and number of channels of the input feature map F_t, respectively. Specifically, the feature map F_t is first passed through a 3×3 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_t1 of size H_t × W_t × C′_t, where H_t, W_t and C′_t denote the height, width and number of channels of F′_t1, respectively; the input feature map F_t is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_t2 of size H_t × W_t × C′_t, the same size as F′_t1; finally, the intermediate feature maps F′_t1 and F′_t2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_t of the text feature extraction submodule, of size H_t × W_t × C′_t. The specific calculation formulas are as follows:

F′_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))

F′_t2 = BN(Conv3(F_t))

F′_t = LeakyReLU(F′_t1 ⊕ F′_t2)

where Conv1(·) denotes the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
in step S22, an image feature extraction submodule is designed, as shown in fig. 4, consisting of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers. Since a larger convolution kernel is better suited to capturing the visual features of an image under a larger receptive field, such as its overall texture and color, a convolution layer with kernel size 5×5 is used to extract features from the image regions. The feature map input to this submodule is denoted F_p, of size H_p × W_p × C_p, where H_p, W_p and C_p denote the height, width and number of channels of the input feature map F_p, respectively. Specifically, the feature map F_p is first passed through a 5×5 convolution layer, a batch normalization layer, a LeakyReLU activation function, a 1×1 convolution layer and a batch normalization layer in sequence for preliminary feature extraction, yielding the intermediate feature map F′_p1 of size H_p × W_p × C′_p, where H_p, W_p and C′_p denote the height, width and number of channels of F′_p1, respectively; the input feature map F_p is then passed through a 1×1 convolution layer and a batch normalization layer in sequence for residual feature extraction, yielding the intermediate feature map F′_p2 of size H_p × W_p × C′_p, the same size as F′_p1; finally, the intermediate feature maps F′_p1 and F′_p2 are added through a residual connection and passed through a LeakyReLU activation function to give the output feature F′_p of the image feature extraction submodule, of size H_p × W_p × C′_p. The specific calculation formulas are as follows:

F′_p1 = BN(Conv2(LeakyReLU(BN(Conv1′(F_p)))))

F′_p2 = BN(Conv3(F_p))

F′_p = LeakyReLU(F′_p1 ⊕ F′_p2)

where Conv1′(·) denotes the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) denote the two convolution layers with kernel size 1×1, ⊕ denotes matrix addition, LeakyReLU(·) denotes the LeakyReLU activation function, and BN(·) denotes the batch normalization operation;
in step S23, an attention feature fusion submodule is designed, as shown in fig. 5, consisting of four convolution layers with kernel size 1×1, one global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers. Through learning, the attention feature fusion submodule effectively fuses text features and image features of different scales, so that the model focuses on the key information of the feature maps and its generalization improves. The two feature maps input to this submodule are denoted F′_t and F′_p, both of size H_a × W_a × C_a, where H_a, W_a and C_a denote the height, width and number of channels of the input features F′_t and F′_p, respectively. Specifically, the two input feature maps are first added pixel by pixel to obtain the intermediate feature map F_b, of size H_a × W_a × C_a; the feature map F_b is then fed into a local attention extraction branch and a global attention extraction branch to extract different attention features, where the local attention extraction branch is a series connection of a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer, and the global attention extraction branch is a series connection of a global average pooling layer, a 1×1 convolution layer, a batch normalization layer, a ReLU activation function, a 1×1 convolution layer and a batch normalization layer. The output of the local attention extraction branch applied to F_b is denoted F_local and the output of the global attention extraction branch is denoted F_global, both of size H_a × W_a × C_a; the feature maps F_local and F_global are then added pixel by pixel and passed through a Sigmoid function to obtain the corresponding learnable weight λ; finally, the weight λ is used to weight and fuse the input feature maps F′_t and F′_p, giving the final output F′_b of the attention feature fusion submodule, of size H_a × W_a × C_a. The specific calculation formulas are as follows:

F_b = F′_t ⊕ F′_p

F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))

F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))

λ = Sigmoid(F_local ⊕ F_global)

F′_b = λ ⊙ F′_t ⊕ (1−λ) ⊙ F′_p

where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) denote the four convolution layers with kernel size 1×1, ⊕ denotes matrix addition, ⊙ denotes element-wise multiplication, BN(·) denotes the batch normalization operation, GAP(·) denotes the global average pooling operation, ReLU(·) denotes the ReLU activation function, Sigmoid(·) denotes the Sigmoid activation function, and λ is a learnable weight output by the network;
step S24, designing a channel attention submodule to enhance the feature representation and capture the key channel information of the input features. The submodule consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function. The feature map input to this submodule is denoted F_c, of size H_c × W_c × C_c, where H_c, W_c and C_c denote the height, width and number of channels of the input feature map F_c, respectively. Specifically, the input feature F_c is first aggregated by a global average pooling operation, then passed through a 1×1 convolution layer for dimension reduction and another 1×1 convolution layer for dimension restoration, after which a Sigmoid function yields the corresponding channel attention weights; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F′_c of the channel attention submodule, of size H_c × W_c × C_c, the same size as the input feature map F_c. The specific calculation formula is as follows:

F′_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c

where GAP(·) denotes the global average pooling operation, Conv_1b(·) and Conv_2b(·) denote the two convolution layers with kernel size 1×1, ⊙ denotes element-wise multiplication, Sigmoid(·) denotes the Sigmoid activation function, and ReLU(·) denotes the ReLU activation function;
step S25, designing an adaptive feature extraction module, as shown in fig. 6, comprising four text feature extraction submodules as described in step S21, four image feature extraction submodules as described in step S22, four attention feature fusion submodules as described in step S23, one channel attention submodule as described in step S24 and eight spatial average pooling layers with stride 2. Through a text feature extraction branch and an image feature extraction branch, the adaptive feature extraction module adaptively extracts multi-scale text features and image features from the input screen content image block, and effectively fuses the different types of image features through an attention mechanism to improve the modeling capability of the model. Specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer.

The size of the input screen content image block is H×W×3; the block is first fed into the text feature extraction branch and the image feature extraction branch to extract multi-scale text features and image features. The multi-level text features output by the text feature extraction branch are denoted {F_t^1, F_t^2, F_t^3, F_t^4}, where F_t^1 has size H/2 × W/2 × C′, F_t^2 has size H/4 × W/4 × 2C′, F_t^3 has size H/8 × W/8 × 4C′, F_t^4 has size H/16 × W/16 × 8C′, and C′ = 64; the multi-level image features output by the image feature extraction branch are denoted {F_p^1, F_p^2, F_p^3, F_p^4}, with F_p^i having the same size as F_t^i at each level. The text features F_t^i and the corresponding image features F_p^i are then fed into the four attention feature fusion submodules, which fuse the text and image features to give the fused multi-level trunk features {F_b^1, F_b^2, F_b^3, F_b^4}, with F_b^i having the same size as F_t^i at each level. A global average pooling operation is then applied to each F_b^i, and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F′_tp of the input image block, of size 1 × 1 × 15C′. The specific calculation formula is as follows:

F′_tp = Concat(GAP(F_b^1), GAP(F_b^2), GAP(F_b^3), GAP(F_b^4))

where Concat(·) denotes feature concatenation and GAP(·) denotes the global average pooling operation. Finally, the fused multi-scale image-text feature F′_tp is fed into the channel attention submodule to capture the key information across channels, giving the final output feature F_tp of the adaptive feature extraction module, which is then flattened into a one-dimensional vector of size 1 × D (where D = 960).
Further, step S3 includes the steps of:
a local image information interaction module is designed, as shown in fig. 7, consisting of four fully connected layers and a Softmax function. The local image information interaction module uses a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block receives a different degree of attention and the local features of the image blocks are better aggregated. Specifically, the input feature of this module is denoted F_l, of size N × D (where N denotes the number of image blocks, N = 4, and D denotes the feature dimension of each image block, D = 960). First, the input feature F_l is fed into three fully connected layers to generate three new intermediate features F_Q, F_K and F_V, all of size N × D′ (where D′ denotes the second dimension of the intermediate features F_Q, F_K and F_V, D′ = 480); then a matrix multiplication is performed between F_Q and the transpose of F_K and a Softmax function is applied to generate the attention map A, of size N × N; the intermediate feature F_V is then matrix-multiplied with the attention map A to obtain the two-dimensional feature matrix S, of size N × D′; the feature matrix S is then fed into a fully connected layer to obtain the feature matrix F_s, of size N × D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection to give the final output F′_l of the local image information interaction module. The specific calculation formulas are as follows:

F_Q = Linear1(F_l)

F_K = Linear2(F_l)

F_V = Linear3(F_l)

A = Softmax(F_Q ⊗ Transpose(F_K))

S = A ⊗ F_V

F_s = Linear4(S)

F′_l = α ⊙ F_s ⊕ F_l

where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) denote the four fully connected layers, Softmax(·) denotes the Softmax function, Transpose(·) denotes the transpose of a two-dimensional matrix, ⊗ denotes matrix multiplication, ⊕ denotes matrix addition, α denotes a learnable scale parameter used for fusion, and F′_l is the output feature of the local image information interaction module, of size N × D, the same size as the input feature F_l.
Further, step S4 includes the steps of:
step S41, designing a non-reference image quality evaluation network based on multi-region feature fusion, comprising the four adaptive feature extraction modules of step S25, the local image information interaction module of step S3 and a fully connected layer. The four image blocks I_train^1, I_train^2, I_train^3 and I_train^4 corresponding to each distorted screen content image in the training set obtained in step S13 serve as the input of the network, each of size H×W×3. Specifically, the four distorted image blocks are first fed into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block, the output feature of the ith adaptive feature extraction module for the ith input image block I_train^i being denoted F_i (i = 1, 2, 3, 4), of size 1 × D; the four one-dimensional output features F_i are then stacked into a two-dimensional feature matrix to obtain the initial fused feature F, of size N × D (N denotes the number of image blocks, N = 4); the initial fused feature F is then fed into the local image information interaction module to strengthen the information interaction between the image blocks, giving the final output feature F_out of the network, of size N × D, the same size as the initial fused feature F.
Step S42, a dimension transformation operation is performed on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose dimension changes from N×D to 1×C (where C = N×D). The flattened one-dimensional feature vector is then input into a fully connected layer, obtaining the quality evaluation score F_score of the distorted screen content image. The specific calculation formula is as follows:
F_score = Linear(Reshape(F_out))
where Linear represents a fully connected layer and Reshape represents a dimension transformation operation.
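Under the same assumptions, the whole network of steps S41–S42 can be sketched as below. LocalInteraction is the sketch shown earlier; AdaptiveFeatureExtractor is assumed to map one H×W×3 image block to a D-dimensional feature (a sketch of such a module accompanies claim 3 further below).

```python
import torch
import torch.nn as nn

class MRFFQualityNet(nn.Module):
    """Sketch of the no-reference quality network: four adaptive feature
    extractors, the local interaction module, and one scoring layer."""
    def __init__(self, dim=960, n_blocks=4):
        super().__init__()
        self.extractors = nn.ModuleList(AdaptiveFeatureExtractor() for _ in range(n_blocks))
        self.interaction = LocalInteraction(dim=dim)
        self.head = nn.Linear(n_blocks * dim, 1)      # Linear over the flattened 1 x C feature

    def forward(self, blocks):                        # blocks: (B, 4, 3, H, W)
        feats = torch.stack([ext(blocks[:, i]) for i, ext in enumerate(self.extractors)],
                            dim=1)                    # initial fusion feature F: (B, N, D)
        f_out = self.interaction(feats)               # F_out: (B, N, D)
        return self.head(f_out.flatten(1))            # Reshape + Linear -> F_score: (B, 1)
```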
Step S43, designing the loss function of the no-reference image quality evaluation network based on multi-region feature fusion, specifically as follows:
where n is the number of samples in the training set, y_i represents the true quality score of the i-th distorted screen content image, and ŷ_i represents the predicted quality score of the i-th distorted screen content image output by the network.
Step S44, repeating steps S41 to S43 in units of batches until the loss value calculated in step S43 converges and becomes stable; the network parameters are then saved, completing the training process of the no-reference image quality evaluation network based on multi-region feature fusion.
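The loss formula itself does not survive in this text, so the training sketch below assumes a plain L1 regression loss between y_i and ŷ_i, a common choice for quality-score regression; the optimizer, learning rate, epoch count and checkpoint name are likewise assumptions.

```python
import torch

def train(model, loader, epochs=50, lr=1e-4, device="cuda"):
    """Batch training loop for step S4 (hedged sketch)."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()                      # assumed loss: mean |y_i - y_hat_i|
    for _ in range(epochs):                          # repeat until the loss converges
        for blocks, y in loader:                     # blocks: (B, 4, 3, H, W); y: (B,) true scores
            y_hat = model(blocks.to(device)).squeeze(-1)
            loss = loss_fn(y_hat, y.to(device).float())
            opt.zero_grad()
            loss.backward()
            opt.step()
    torch.save(model.state_dict(), "mrff_sciqa.pt")  # save the network parameters
```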
Further, step S5 is implemented as follows:
four image blocks corresponding to each distorted screen content image in the test set obtained in step S13And->And inputting the quality evaluation scores into a trained non-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting corresponding quality evaluation scores.
The above is a preferred embodiment of the present invention; all changes made according to the technical solution of the present invention fall within the protection scope of the present invention, provided that the functional effects produced do not exceed the scope of the technical solution of the present invention.
Claims (6)
1. A no-reference screen content image quality assessment method based on multi-region feature fusion, characterized by comprising the following steps:
step S1, performing data preprocessing on the data in a distorted screen content image dataset: first cutting image blocks out of each distorted screen content image, then dividing the dataset into a training set and a test set, and finally performing data enhancement on the data in the training set;
step S2, designing an adaptive feature extraction module, which can adaptively extract features of different scales from the text region and the image region in a distorted screen content image block, and fuses the text region features and the image region features based on an attention mechanism;
step S3, designing a local image information interaction module, which enhances the information interaction between any two image blocks in the distorted screen content image by introducing a self-attention mechanism, and assigns a different attention weight to each image block;
step S4, designing a no-reference image quality evaluation network based on multi-region feature fusion, and training it to obtain a no-reference screen content image quality evaluation model based on multi-region feature fusion;
step S5, inputting the distorted screen content image to be evaluated into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and outputting the corresponding quality evaluation score.
2. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 1, wherein step S1 is specifically implemented as follows:
step S11, first cutting image blocks from each distorted screen content image I in the distorted screen content image dataset; specifically, each distorted screen content image I is divided into four areas (upper left, upper right, lower left and lower right), and an image block of size H×W is then randomly cut from each area, denoted I_1, I_2, I_3 and I_4 respectively, where H and W represent the height and width of the image block;
step S12, dividing the distorted screen content images in the distorted screen content image data set into a training set and a testing set according to a preset proportion;
step S13, applying a unified random horizontal flip and normalization to the four image blocks cut from each distorted screen content image I_train in the training set to complete the data enhancement operation, obtaining the distorted screen content image blocks I_1^train, I_2^train, I_3^train and I_4^train for training; applying the same normalization to the four image blocks cut from each distorted screen content image I_test in the test set, obtaining the distorted screen content image blocks I_1^test, I_2^test, I_3^test and I_4^test for testing;
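A minimal sketch of the cropping and enhancement of steps S11–S13, assuming PIL image inputs and torchvision; the normalization mean/std are left as parameters because the claim does not specify them.

```python
import random
import torchvision.transforms.functional as TF

def four_corner_crops(img, h, w):
    """Randomly crop one h x w patch from each quadrant of a PIL image (step S11);
    assumes every quadrant is at least h x w."""
    W_img, H_img = img.size
    quads = [((0, H_img // 2), (0, W_img // 2)),          # upper left
             ((0, H_img // 2), (W_img // 2, W_img)),      # upper right
             ((H_img // 2, H_img), (0, W_img // 2)),      # lower left
             ((H_img // 2, H_img), (W_img // 2, W_img))]  # lower right
    crops = []
    for (y0, y1), (x0, x1) in quads:
        top = random.randint(y0, y1 - h)
        left = random.randint(x0, x1 - w)
        crops.append(TF.crop(img, top, left, h, w))
    return crops

def enhance_for_training(crops, mean, std):
    """Unified horizontal flip plus normalization (step S13)."""
    if random.random() < 0.5:                  # one flip decision shared by all four crops
        crops = [TF.hflip(c) for c in crops]
    return [TF.normalize(TF.to_tensor(c), mean, std) for c in crops]
```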
3. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 2, wherein step S2 is specifically implemented as follows:
step S21, designing a text feature extraction submodule, which consists of one convolution layer with kernel size 3×3, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; a convolution layer with kernel size 3×3 is used to extract features from the text region in the distorted screen content image block. The feature map input to the text feature extraction submodule is denoted F_t, of size H_t×W_t×C_t, where H_t, W_t and C_t represent the height, width and number of channels of the input feature map F_t respectively. Specifically, the feature map F_t is first input sequentially into a convolution layer with kernel size 3×3, a batch normalization layer, a LeakyReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer for preliminary feature extraction, obtaining an intermediate feature map F'_t1 of dimension H_t×W_t×C'_t, where H_t, W_t and C'_t represent the height, width and number of channels of F'_t1 respectively; the input feature map F_t is then input sequentially into a convolution layer with kernel size 1×1 and a batch normalization layer for residual feature extraction, obtaining an intermediate feature map F'_t2 of dimension H_t×W_t×C'_t, the same dimension as F'_t1; finally, the intermediate feature maps F'_t1 and F'_t2 are added through a residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_t of the text feature extraction submodule, of dimension H_t×W_t×C'_t; the specific calculation formulas are as follows:
F'_t1 = BN(Conv2(LeakyReLU(BN(Conv1(F_t)))))
F'_t2 = BN(Conv3(F_t))
F'_t = LeakyReLU(F'_t1 ⊕ F'_t2)
where Conv1(·) represents the convolution layer with kernel size 3×3, Conv2(·) and Conv3(·) represent the two convolution layers with kernel size 1×1, ⊕ represents the matrix addition operation, LeakyReLU(·) represents the LeakyReLU activation function, and BN(·) represents the batch normalization operation;
step S22, designing an image feature extraction submodule, which consists of one convolution layer with kernel size 5×5, two convolution layers with kernel size 1×1, two LeakyReLU activation functions and three batch normalization layers; a convolution layer with kernel size 5×5 is used to extract features from the image region in the distorted screen content image block. The feature map input to the image feature extraction submodule is denoted F_p, of size H_p×W_p×C_p, where H_p, W_p and C_p represent the height, width and number of channels of the input feature map F_p respectively. Specifically, the feature map F_p is first input sequentially into a convolution layer with kernel size 5×5, a batch normalization layer, a LeakyReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer for preliminary feature extraction, obtaining an intermediate feature map F'_p1 of dimension H_p×W_p×C'_p, where H_p, W_p and C'_p represent the height, width and number of channels of F'_p1 respectively; the input feature map F_p is then input sequentially into a convolution layer with kernel size 1×1 and a batch normalization layer for residual feature extraction, obtaining an intermediate feature map F'_p2 of dimension H_p×W_p×C'_p, the same dimension as F'_p1; finally, the intermediate feature maps F'_p1 and F'_p2 are added through a residual connection and passed through a LeakyReLU activation function to obtain the output feature F'_p of the image feature extraction submodule, of dimension H_p×W_p×C'_p; the specific calculation formulas are as follows:
F'_p1 = BN(Conv2(LeakyReLU(BN(Conv1'(F_p)))))
F'_p2 = BN(Conv3(F_p))
F'_p = LeakyReLU(F'_p1 ⊕ F'_p2)
where Conv1'(·) represents the convolution layer with kernel size 5×5, Conv2(·) and Conv3(·) represent the two convolution layers with kernel size 1×1, ⊕ represents the matrix addition operation, LeakyReLU(·) represents the LeakyReLU activation function, and BN(·) represents the batch normalization operation;
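Since steps S21 and S22 share the same residual structure and differ only in the first kernel size, a single hedged sketch can cover both; the class name and channel parameters are illustrative.

```python
import torch.nn as nn

class LocalFeatureBlock(nn.Module):
    """Residual block of steps S21/S22: kernel_size=3 gives the text feature
    extraction submodule, kernel_size=5 the image feature extraction submodule."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),  # Conv1 / Conv1'
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),                                     # Conv2
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1),                                      # Conv3
            nn.BatchNorm2d(out_ch),
        )
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):
        return self.act(self.main(x) + self.shortcut(x))  # residual addition, then LeakyReLU
```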
step S23, designing an attention feature fusion submodule, which consists of four convolution layers with kernel size 1×1, a global average pooling layer, two ReLU activation functions, one Sigmoid activation function and four batch normalization layers; the attention feature fusion submodule learns to fuse text features and image features of different scales. The two features input to the attention feature fusion submodule are F'_t and F'_p, both of size H_a×W_a×C_a, where H_a, W_a and C_a represent the height, width and number of channels of the input feature maps F'_t and F'_p respectively. Specifically, the two input features are first added pixel by pixel to obtain an intermediate feature map F_b of size H_a×W_a×C_a. The intermediate feature map F_b is then input into a local attention extraction branch and a global attention extraction branch respectively for different attention feature extraction, where the local attention extraction branch is formed by sequentially connecting in series a convolution layer with kernel size 1×1, a batch normalization layer, a ReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer, and the global attention extraction branch is formed by sequentially connecting in series a global average pooling layer, a convolution layer with kernel size 1×1, a batch normalization layer, a ReLU activation function, a convolution layer with kernel size 1×1 and a batch normalization layer. Denote the output feature of F_b after the local attention extraction branch F_local and the output feature after the global attention extraction branch F_global, both of size H_a×W_a×C_a. The features F_local and F_global are then added pixel by pixel, and the corresponding learnable weight λ is obtained through a Sigmoid function; finally, the learnable weight λ is used to weight and fuse the input features F'_t and F'_p, obtaining the final output F'_b of the attention feature fusion submodule, of size H_a×W_a×C_a; the specific calculation formulas are as follows:
F_b = F'_t ⊕ F'_p
F_local = BN(Conv_2a(ReLU(BN(Conv_1a(F_b)))))
F_global = BN(Conv_4a(ReLU(BN(Conv_3a(GAP(F_b))))))
λ = Sigmoid(F_local ⊕ F_global)
where Conv_1a(·), Conv_2a(·), Conv_3a(·) and Conv_4a(·) represent the four convolution layers with kernel size 1×1, ⊕ represents the matrix addition operation, BN(·) represents the batch normalization operation, GAP(·) represents the global average pooling operation, ReLU(·) represents the ReLU activation function, Sigmoid(·) represents the Sigmoid activation function, and λ is the learnable weight output by the network;
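A sketch of this fusion step under two stated assumptions: the 1×1 convolutions use a bottleneck width (the claim does not give the intermediate channel count), and the learned weight λ fuses the inputs complementarily as λ·F'_t + (1−λ)·F'_p, which the prose implies but does not spell out.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention feature fusion submodule of step S23 (hedged sketch)."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        mid = ch // reduction  # bottleneck width: an assumption, not stated in the claim
        def make_branch(use_gap):
            layers = [nn.AdaptiveAvgPool2d(1)] if use_gap else []
            layers += [nn.Conv2d(ch, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                       nn.Conv2d(mid, ch, 1), nn.BatchNorm2d(ch)]
            return nn.Sequential(*layers)
        self.local_branch = make_branch(use_gap=False)   # Conv_1a, Conv_2a path
        self.global_branch = make_branch(use_gap=True)   # GAP, Conv_3a, Conv_4a path

    def forward(self, f_t, f_p):
        f_b = f_t + f_p                                  # pixel-wise sum of the two inputs
        lam = torch.sigmoid(self.local_branch(f_b) + self.global_branch(f_b))  # weight lambda
        return lam * f_t + (1.0 - lam) * f_p             # assumed complementary weighted fusion
```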
step S24, designing a channel attention submodule for enhancing the feature representation and capturing the key feature channel information of the input features, which consists of two convolution layers with kernel size 1×1, a ReLU activation function and a Sigmoid activation function; the feature map input to the channel attention submodule is denoted F_c, of size H_c×W_c×C_c, where H_c, W_c and C_c represent the height, width and number of channels of the input feature map F_c respectively. Specifically, the input feature F_c is first aggregated using a global average pooling operation; a dimension reduction is then performed through a convolution layer with kernel size 1×1, followed by a dimension restoration through another convolution layer with kernel size 1×1; the corresponding channel attention weights are then obtained through a Sigmoid function; finally, the channel attention weights are multiplied element-wise with the input feature F_c to obtain the final output F'_c of the channel attention submodule, of size H_c×W_c×C_c, the same dimension as the input feature map F_c; the specific calculation formula is as follows:
F'_c = Sigmoid(Conv_2b(ReLU(Conv_1b(GAP(F_c))))) ⊙ F_c
where GAP(·) represents the global average pooling operation, Conv_1b(·) and Conv_2b(·) represent the two convolution layers with kernel size 1×1, ⊙ represents element-wise multiplication, Sigmoid(·) represents the Sigmoid activation function, and ReLU(·) represents the ReLU activation function;
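A squeeze-and-excitation-style sketch of this channel attention submodule; the channel reduction ratio is an assumption.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention submodule of step S24 (SE-style sketch);
    the reduction ratio is an assumption."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)               # aggregate spatial information
        self.fc = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1),           # Conv_1b: dimension reduction
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1),           # Conv_2b: dimension restoration
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.gap(x))                  # element-wise channel reweighting
```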
step S25, designing an adaptive feature extraction module, which comprises four of the text feature extraction submodules of step S21, four of the image feature extraction submodules of step S22, four of the attention feature fusion submodules of step S23, one of the channel attention submodules of step S24 and eight spatial average pooling layers with stride 2; the adaptive feature extraction module performs adaptive multi-scale feature extraction on the text features and image features of the input distorted screen content image block through a text feature extraction branch and an image feature extraction branch respectively, and fuses the different types of image features through an attention mechanism. Specifically, the text feature extraction branch consists of four repeated serial combinations of a text feature extraction submodule and a spatial pooling layer, and the image feature extraction branch consists of four repeated serial combinations of an image feature extraction submodule and a spatial pooling layer;
denote the size of the input distorted screen content image block H×W×3; the distorted screen content image block is first input into the text feature extraction branch and the image feature extraction branch respectively to extract multi-scale text features and image features. Denote the multi-level text features output after the input image block passes through the text feature extraction branch F_t^(1), F_t^(2), F_t^(3), F_t^(4), where feature map F_t^(i) has size (H/2^i)×(W/2^i)×(2^(i-1)·C') with C'=64; the multi-level image features output after the image feature extraction branch are denoted F_p^(1), F_p^(2), F_p^(3), F_p^(4), where feature map F_p^(i) likewise has size (H/2^i)×(W/2^i)×(2^(i-1)·C') with C'=64. The multi-level text features F_t^(i) and the corresponding multi-level image features F_p^(i) are then input into the four attention feature fusion submodules respectively, fusing text features and image features to obtain the fused multi-level trunk features F_b^(1), F_b^(2), F_b^(3), F_b^(4), where feature map F_b^(i) has size (H/2^i)×(W/2^i)×(2^(i-1)·C') with C'=64. A global average pooling operation is then performed on each F_b^(i), and the results are concatenated along the channel direction to obtain the multi-scale image-text feature representation F'_tp of the input image block, of size 1×15C'; the specific calculation formula is as follows:
F'_tp = Concat(GAP(F_b^(1)), GAP(F_b^(2)), GAP(F_b^(3)), GAP(F_b^(4)))
where Concat(·) represents the feature concatenation operation and GAP(·) represents the global average pooling operation. Finally, the fused multi-scale image-text feature F'_tp is input into the channel attention submodule to capture the key information among the different channels, obtaining the final output feature F_tp of the adaptive feature extraction module; the feature F_tp is then flattened into a one-dimensional vector of dimension 1×D, where D represents the dimension of each image block, D=960.
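Pulling the pieces together, here is a hedged sketch of the adaptive feature extraction module, reusing LocalFeatureBlock, AttentionFusion and ChannelAttention from the sketches above; the per-level channel widths C', 2C', 4C', 8C' are an assumption chosen so that their sum 15C' matches the stated 1×15C' output (C'=64, D=960).

```python
import torch
import torch.nn as nn

class AdaptiveFeatureExtractor(nn.Module):
    """Adaptive feature extraction module of step S25 (hedged sketch)."""
    def __init__(self, c=64):
        super().__init__()
        chans = [c, 2 * c, 4 * c, 8 * c]                  # assumed widths; sum is 15C' = 960
        ins = [3] + chans[:-1]
        self.text_blocks = nn.ModuleList(
            LocalFeatureBlock(i, o, kernel_size=3) for i, o in zip(ins, chans))
        self.image_blocks = nn.ModuleList(
            LocalFeatureBlock(i, o, kernel_size=5) for i, o in zip(ins, chans))
        self.pool = nn.AvgPool2d(2)                       # stride-2 spatial average pooling
        self.fusions = nn.ModuleList(AttentionFusion(ch) for ch in chans)
        self.channel_attn = ChannelAttention(sum(chans))  # applied to the 1 x 1 x 15C' feature

    def forward(self, x):                                 # x: (B, 3, H, W) image block
        t, p, trunk = x, x, []
        for tb, ib, fuse in zip(self.text_blocks, self.image_blocks, self.fusions):
            t = self.pool(tb(t))                          # next text-branch level
            p = self.pool(ib(p))                          # next image-branch level
            trunk.append(fuse(t, p))                      # fused multi-level trunk feature
        f = torch.cat([g.mean(dim=(2, 3)) for g in trunk], dim=1)  # GAP + concat: (B, 15C')
        return self.channel_attn(f[..., None, None]).flatten(1)    # F_tp flattened to (B, D)
```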
4. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 3, wherein step S3 is specifically implemented as follows:
designing a local image information interaction module, which consists of four fully connected layers and a Softmax function; the local image information interaction module adopts a self-attention mechanism to enhance the information interaction between the features of different image blocks, so that each image block is given a different degree of attention and the local features of each image block are better aggregated. Specifically, the input feature of the local image information interaction module is F_l, of size N×D, where N represents the number of image blocks, N=4, and D represents the dimension of each image block, D=960. First, the input feature F_l is input into three fully connected layers, generating three new intermediate features F_Q, F_K and F_V, each of dimension N×D', where D' represents the size of the second dimension of F_Q, F_K and F_V, D'=480; then, a matrix multiplication is performed between F_Q and the transpose of F_K and a Softmax function is applied, generating an attention map A with dimension N×N; the intermediate feature F_V is then matrix-multiplied with the attention map A to obtain a two-dimensional feature matrix S of dimension N×D'; the two-dimensional feature matrix S is then input into a fully connected layer to obtain a feature matrix F_s of dimension N×D; finally, the feature matrix F_s is multiplied by a scale parameter α and added to the input feature F_l through a residual connection, obtaining the final output F'_l of the local image information interaction module; the specific calculation formulas are as follows:
F_Q = Linear1(F_l)
F_K = Linear2(F_l)
F_V = Linear3(F_l)
A = Softmax(F_Q ⊗ Transpose(F_K))
S = A ⊗ F_V
F_s = Linear4(S)
F'_l = α·F_s ⊕ F_l
where Linear1(·), Linear2(·), Linear3(·) and Linear4(·) represent the four fully connected layers, Softmax(·) represents the Softmax function, Transpose(·) represents the transpose operation on a two-dimensional matrix, ⊗ represents matrix multiplication, ⊕ represents the matrix addition operation, α represents a learnable scale parameter for fusion, and F'_l represents the output feature of the local image information interaction module, whose size N×D is the same as that of the input feature F_l.
5. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 4, wherein step S4 is specifically implemented as follows:
step S41, designing a no-reference image quality evaluation network based on multi-region feature fusion, wherein the network comprises the four adaptive feature extraction modules of step S25, the local image information interaction module of step S31 and a fully connected layer; the four image blocks I_1^train, I_2^train, I_3^train and I_4^train corresponding to each distorted screen content image in the training set obtained in step S13 are taken as the input of the network, with dimensions all H×W×3; specifically, the four image blocks are first input into the four adaptive feature extraction modules respectively to extract the multi-scale image-text features of each image block; denoting the i-th input image block I_i^train, its output feature after passing through the i-th adaptive feature extraction module is F_i, with dimension 1×D; the four one-dimensional output features F_i are then concatenated into a two-dimensional feature vector to obtain an initial fusion feature F, whose dimension is N×D; the initial fusion feature F is then input into the local image information interaction module to strengthen the information interaction among the image blocks, obtaining the final output feature F_out of the network, of size N×D, the same dimension as the initial fusion feature F;
step S42, performing a dimension transformation operation on the network output feature F_out obtained in step S41 to flatten it into a one-dimensional feature vector, whose dimension changes from N×D to 1×C, where C = N×D; the flattened one-dimensional feature vector is then input into a fully connected layer to obtain the quality evaluation score F_score of the distorted screen content image; the specific calculation formula is as follows:
F_score = Linear(Reshape(F_out))
where Linear represents a fully connected layer and Reshape represents the dimension transformation operation;
step S43, designing the loss function of the no-reference image quality evaluation network based on multi-region feature fusion, specifically as follows:
where n is the number of samples in the training set, y_i represents the true quality score of the i-th distorted screen content image, and ŷ_i represents the predicted quality score of the i-th distorted screen content image output by the network;
and S44, repeating the steps S41 to S43 by taking a batch as a unit until the loss value calculated in the step S43 converges and tends to be stable, saving network parameters, and completing the training process of the non-reference image quality evaluation network based on multi-region feature fusion.
6. The no-reference screen content image quality assessment method based on multi-region feature fusion according to claim 2, wherein step S5 is specifically implemented as follows:
the four image blocks I_1^test, I_2^test, I_3^test and I_4^test corresponding to each distorted screen content image in the test set obtained in step S13 are input into the trained no-reference screen content image quality evaluation model based on multi-region feature fusion, and the corresponding quality evaluation scores are output.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310398032.0A CN116403063A (en) | 2023-04-14 | 2023-04-14 | No-reference screen content image quality assessment method based on multi-region feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116403063A true CN116403063A (en) | 2023-07-07 |
Family
ID=87010188
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116738325A (en) * | 2023-08-16 | 2023-09-12 | Hubei University of Technology | Method and system for identifying lower limb exoskeleton movement pattern based on DenseNet-LSTM network |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 