CN110636278A - Stereo image quality assessment method based on sparse binocular fusion convolutional neural network


Info

Publication number: CN110636278A
Application number: CN201910568580.7A
Authority: CN (China)
Prior art keywords: fusion, layer, branch, convolution layer, network
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 李素梅 (Li Sumei), 韩旭 (Han Xu)
Current Assignee: Tianjin University
Original Assignee: Tianjin University
Priority date / filing date: 2019-06-27
Publication date: 2019-12-31
Application filed by Tianjin University; priority to CN201910568580.7A

Classifications

    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/04 Architecture, e.g. interconnection topology › G06N 3/045 Combinations of networks
    • G PHYSICS › G06 COMPUTING; CALCULATING OR COUNTING › G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS › G06N 3/00 Computing arrangements based on biological models › G06N 3/02 Neural networks › G06N 3/08 Learning methods › G06N 3/084 Backpropagation, e.g. using gradient descent
    • H ELECTRICITY › H04 ELECTRIC COMMUNICATION TECHNIQUE › H04N PICTORIAL COMMUNICATION, e.g. TELEVISION › H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof › H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals › H04N 13/106 Processing image signals
    • H ELECTRICITY › H04 ELECTRIC COMMUNICATION TECHNIQUE › H04N PICTORIAL COMMUNICATION, e.g. TELEVISION › H04N 17/00 Diagnosis, testing or measuring for television systems or their details
    • H ELECTRICITY › H04 ELECTRIC COMMUNICATION TECHNIQUE › H04N PICTORIAL COMMUNICATION, e.g. TELEVISION › H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof › H04N 2013/0074 Stereoscopic image analysis


Abstract

The invention discloses a stereo image quality evaluation method based on a sparse binocular fusion convolutional neural network, which comprises the following steps: S1, constructing a stereo image quality evaluation network based on a binocular fusion convolutional neural network, the network comprising a left branch, a right branch and a fusion branch; S2, applying a structured sparsity constraint to each layer of the binocular fusion convolutional neural network, the objective function of network optimization being given by formula (1). The resulting stereo image quality evaluation method is more accurate and efficient, agrees better with quality as perceived by the human eye, runs faster, and to a certain extent promotes the development of stereo imaging technology.

Description

Stereo image quality assessment method based on sparse binocular fusion convolutional neural network

Technical Field

The invention belongs to the field of image processing and relates to the improvement and optimization of stereoscopic image quality evaluation methods and of the computational speed of convolutional neural networks for stereoscopic image quality evaluation, and in particular to a stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network.

Background Art

Since viewing degraded stereoscopic images causes visual fatigue and dizziness, stereoscopic image quality evaluation has become an urgent problem [1]. Stereoscopic image quality evaluation must consider factors such as depth information, disparity information, and binocular rivalry, which makes it more challenging than planar image quality evaluation. Stereoscopic image quality evaluation is generally divided into subjective and objective methods. Because subjective evaluation is time-consuming and laborious, objective quality evaluation of stereoscopic images has become a hot research topic [2].

In general, objective stereoscopic image quality evaluation methods can be divided into traditional feature-extraction-based methods [3-4], sparse-representation-based methods [5-9], and deep-learning-based methods [10-13]. Sparse representation simulates the perception mechanism of the human visual system: it can represent most of the pixels in an image as zero and thereby remove redundant information. Several works therefore use sparse representation to evaluate stereoscopic image quality. For example, [5] sparsely represents the structure and texture features of the left and right views of stereoscopic images, computes sparse-feature similarity indices for the left and right views respectively, and combines them to obtain a final quality score. [6] jointly sparsely represents DOG, HOG, and LBP features and uses support vector regression to obtain the quality score. In [7], Lin et al. sparsely represent the cyclopean amplitude map and the cyclopean phase map and apply support vector machine regression. Karimi et al. sparsely represent the fused-image contrast map and phase map and use support vector machine regression to obtain the stereoscopic image quality score [8]. In [9], Yang et al. propose a no-reference stereoscopic image quality assessment method based on color visual features from a learned gradient dictionary, feeding the features into a trained support vector machine to predict the quality score. Because deep learning networks simulate the brain's hierarchical processing of images, many recent works use deep learning models to evaluate stereoscopic image quality. For example, [10] extracts natural scene statistics from stereoscopic images and trains a DBN on them to obtain quality scores. In [11], Ding et al. propose a no-reference method based on a convolutional neural network (CNN), regressing the CNN features and disparity features with a support vector machine to obtain the objective quality score. In [12], Lv et al. propose a method based on binocular self-similarity and a deep neural network (DNN). In [13], Sang et al. fuse the left and right views of stereoscopic images by principal component analysis (PCA) and then train a CNN on the fused images to obtain the quality score.

Among the above methods, sparse representation can find the key information of an image but requires hand-crafted features: [5] extracts structure and texture features; [6] manually extracts DOG, HOG, and LBP features; [7-8] extract cyclopean amplitude and phase features; [9] extracts gradient features. Deep-learning-based methods can learn comprehensive features through the network itself, which makes the extracted features more complete and appropriate, but deep learning networks usually have high computational complexity and large storage requirements. Because neural networks are highly non-convex, over-parameterization and random initialization are necessary in training to overcome the negative effects of local minima [14]; in other words, deep learning networks carry a great deal of redundancy. Some works therefore use sparse regularization to compress DNNs. For example, in [14], Liu et al. propose a sparse convolutional neural network based on sparse decomposition that can zero out more than 90% of the parameters with an accuracy drop of less than 1% on the ILSVRC2012 dataset. In [15], Wen et al. propose structured sparsity learning (SSL) to regularize DNNs; SSL yields a hardware-friendly structured-sparse DNN and thereby effectively accelerates DNN computation. However, SSL has scarcely been applied to deep learning networks for stereoscopic image quality evaluation. Inspired by [15], we propose a sparse binocular fusion convolutional neural network to evaluate stereoscopic image quality: using a CNN avoids the hand-crafted feature extraction of sparse-representation methods, while applying SSL to the convolutional neural network reduces its computational load and speeds up its operation.

How to handle the relationship between the left and right views is the key to stereoscopic image quality evaluation, and with respect to this, the above works fall roughly into two categories. [5-6][10-12] first process the left and right views separately and then fuse the features of the two views, taking the binocular fusion and binocular rivalry mechanisms into account. [7-9][13] first fuse the two views into a single fused image and then process that image. In fact, in the human visual cortex the fusion of the left and right views is a long-term process: fusion and processing occur simultaneously, and the two views are processed and fused hierarchically [16]. We therefore adopt a binocular fusion convolutional neural network that fuses the two views four times through four concat operations, simulating the visual cortex's long-term fusion and concurrent information processing.

Summary of the Invention

To address the problems of the prior art, the present invention proposes a sparse binocular fusion convolutional neural network for stereoscopic image quality evaluation. Using a convolutional neural network for the evaluation avoids manual feature extraction. Constraining the convolutional neural network with structured sparse regularization reduces its computational complexity, speeds up computation, and improves network performance. Taking into account the binocular fusion and binocular rivalry mechanisms of the human eye and simulating the long-term fusion process of the visual cortex, the left and right views of the stereoscopic image are fused four times through four concat operations while information is processed concurrently. The stereoscopic image quality evaluation method of this patent is more accurate and efficient, agrees better with human perceptual quality, runs faster, and to a certain extent promotes the development of stereoscopic imaging technology.

To solve the problems of the prior art, the present invention is implemented with the following technical solutions:

1. A stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network, characterized by comprising the following steps:

S1. Construct a stereoscopic image quality evaluation network based on a binocular fusion convolutional neural network; the network comprises a left branch, a right branch and a fusion branch.

S2. Apply a structured sparsity constraint to each layer of the binocular fusion convolutional neural network; the objective function of network optimization is given by formula (1):

E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g(W^{(l)}) \qquad (1)

where W denotes all the weights in the network; E_D(W) is the loss function of the network; R(W) is an unstructured regularization constraint applied to all weights; and R_g(W^{(l)}) is the structured sparsity regularization constraint applied to each layer.

2. The stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network according to claim 1, characterized in that in S1 the left and right branches are constructed from the left and right views in the neural network:

2.1. Each of the left and right branches is divided into a first convolutional layer and a first pooling layer, a second convolutional layer and a second pooling layer, a third convolutional layer, and a fourth convolutional layer.

2.2. The first convolutional layer in each of the left and right branches is subjected to the structured sparsity constraint, and its output is fed into the first pooling layer.

2.3. The output of the first pooling layer is connected to the second convolutional layer; the second convolutional layer is subjected to the structured sparsity constraint, and its output is fed into the second pooling layer.

2.4. The output of the second pooling layer is connected to the third convolutional layer; the third convolutional layer is subjected to the structured sparsity constraint, and its output is fed into the fourth convolutional layer. The output of the fourth convolutional layer is connected to the fusion branch for fusion processing.

3. The stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network according to claim 1, characterized in that in S1 the fusion branch is constructed from the left and right views in the neural network:

3.1. The fusion branch is divided into a first pooling layer and a first convolutional layer, a second pooling layer and a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a third pooling layer, and three fully connected layers; four fusion operations are performed in total.

3.2. The feature maps from the sparsity-constrained first convolutional layers of the left and right branches undergo the first fusion operation via a 'concat' operation; the fused feature map is input to the first pooling layer of the fusion branch and then sent to the first convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the first convolutional layer of the fusion branch.

3.3. The feature maps from the sparsity-constrained second convolutional layers of the left and right branches, together with the feature map of the first convolutional layer of the fusion branch after the first fusion, undergo the second fusion operation via a 'concat' operation; the fused feature map is input to the second pooling layer of the fusion branch and then sent to the second convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the second convolutional layer of the fusion branch.

3.4. The feature maps from the sparsity-constrained third convolutional layers of the left and right branches, together with the feature map of the second convolutional layer of the fusion branch after the second fusion, undergo the third fusion operation via a 'concat' operation; the fused feature map is input to the third convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the third convolutional layer of the fusion branch.

3.5. The feature maps from the sparsity-constrained fourth convolutional layers of the left and right branches, together with the feature map of the third convolutional layer of the fusion branch after the third fusion, undergo the fourth fusion operation via a 'concat' operation; the fused feature map is input to the fourth convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the fourth convolutional layer of the fusion branch. The output of the fused fourth convolutional layer is sent to the third pooling layer, and the resulting feature map is fed into the three fully connected layers to judge the quality of the stereoscopic image.

Beneficial Effects

This patent uses structured sparsity learning (SSL) to optimize the adopted convolutional neural network, making the network weights structurally sparse, which reduces the computational complexity of the network, accelerates its operation, and improves its evaluation performance, opening up the possibility of real-time stereoscopic image quality evaluation. Experimental results show that the network achieves a speedup of more than 2x while its performance improves. Four fusions within the convolutional neural network simulate the long-term binocular fusion process in the human brain, and both theory and experiment show that the proposed model is suitable for symmetrically and asymmetrically distorted stereoscopic images.

Brief Description of the Drawings

Figure 1 is a structural diagram of the sparse binocular fusion convolutional neural network of the present invention.

Figure 2: (a) the relationship between the column sparsity of each convolutional layer and the overall network speedup on LIVE I; (b) the relationship between the row sparsity of each convolutional layer and the overall network speedup on LIVE I.

Detailed Description

The present invention conducts experiments on the public stereoscopic image databases LIVE 3D Phase I and LIVE 3D Phase II. The LIVE 3D Phase I database contains 20 original stereoscopic image pairs and 365 symmetrically distorted stereoscopic image pairs; the distortion types are JPEG compression, JPEG 2000 compression, Gaussian blur (Gblur), Gaussian white noise (WN), and fast fading (FF), and the DMOS values range from -10 to 60. The LIVE 3D Phase II database contains 8 original stereoscopic image pairs and 360 symmetrically and asymmetrically distorted stereoscopic image pairs, of which 120 pairs are symmetrically distorted and 240 pairs are asymmetrically distorted; the distortion types are the same five, and the DMOS values range from 0 to 100.

The method is described in detail below in conjunction with the technical solution.

The quality evaluation method of the present invention simulates the process by which the human brain handles stereoscopic images: four concat operations in the convolutional neural network simulate the long-term fusion and processing of the left and right views, making the network suitable for symmetrically and asymmetrically distorted stereoscopic images. SSL is applied to every convolutional layer of the network, structurally constraining the number of filters and the filter shapes, which reduces the network's computational complexity, speeds up its operation, and improves its evaluation performance.

The specific steps are as follows:

1. Structured Sparsity Learning (SSL) Implementation

Let W^{(l)} \in \mathbb{R}^{N_l \times C_l \times M_l \times K_l} denote all the weights of the l-th (1 \le l \le L) convolutional layer, where N_l, C_l, M_l and K_l are the number of filters, the number of channels, and the filter height and width of the l-th layer, and L is the number of convolutional layers in the network. The objective function of a convolutional neural network with structured sparsity constraints can be written as formula (1):

E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g(W^{(l)}) \qquad (1)

where W denotes all the weights in the network; E_D(W) is the loss function of the network; R(W) is an unstructured regularization constraint applied to all weights, for which the l2 norm is used in this application; and R_g(W^{(l)}) is the structured sparsity regularization constraint applied to each layer. SSL uses Group Lasso to realize structured sparsity; Group Lasso regularization can drive entire groups of weights to zero. Applied to weights w, it can be written as

R_g(w) = \sum_{g=1}^{G} \| w^{(g)} \|_g, \quad \| w^{(g)} \|_g = \sqrt{ \sum_{i=1}^{|w^{(g)}|} \left( w_i^{(g)} \right)^2 }

where G is the number of groups, w^{(g)} is the g-th group of weights in w, and |w^{(g)}| is the number of weights in group w^{(g)}.

In the SSL method, the groups w^{(g)} can be formed by filter, by channel, by filter shape, or by network depth, i.e. filter-wise W^{(l)}_{n_l,:,:,:}, channel-wise W^{(l)}_{:,c_l,:,:}, filter-shape-wise W^{(l)}_{:,c_l,m_l,k_l} and depth-wise W^{(l)}, with 1 \le n_l \le N_l, 1 \le c_l \le C_l, 1 \le m_l \le M_l, 1 \le k_l \le K_l, where n_l, c_l, m_l and k_l index the n_l-th filter of the l-th layer, the c_l-th channel of the l-th layer, the m_l-th row of a filter, and the k_l-th column of a filter. In this application we adopt filter-wise and filter-shape-wise sparsity to penalize the unimportant filters in each convolutional layer and to learn filters of arbitrary shape. In Caffe, all the filters of a layer are reshaped into a matrix in which each row is one filter W^{(l)}_{n_l,:,:,:}, so the number of rows of the matrix equals the number of filters. Combining filter-wise and shape-wise sparse regularization therefore reduces the dimensions of the weight matrix directly by driving its rows or columns to zero; filter-wise and shape-wise sparsity in SSL can thus be called row-wise and column-wise. After adding row-wise and column-wise sparse regularization, the objective function of the network can be written as formula (2), where \lambda, \lambda_n and \lambda_s are the penalty coefficients of the l2 norm, the row-wise term and the column-wise term:

E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_n \cdot \sum_{l=1}^{L} \sum_{n_l=1}^{N_l} \left\| W^{(l)}_{n_l,:,:,:} \right\|_g + \lambda_s \cdot \sum_{l=1}^{L} \sum_{c_l=1}^{C_l} \sum_{m_l=1}^{M_l} \sum_{k_l=1}^{K_l} \left\| W^{(l)}_{:,c_l,m_l,k_l} \right\|_g \qquad (2)
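To make the row-wise (filter-wise) and column-wise (shape-wise) terms of formula (2) concrete, here is a minimal PyTorch sketch; the patent's own implementation is in Caffe, and the function name `ssl_penalty` and the example usage are our own assumptions:

```python
import torch

def ssl_penalty(conv_weight: torch.Tensor, lambda_n: float, lambda_s: float) -> torch.Tensor:
    """Row-wise + column-wise Group Lasso for one conv layer, as in formula (2).

    conv_weight has shape (N_l, C_l, M_l, K_l). Reshaped into an
    (N_l, C_l*M_l*K_l) matrix, each row is one filter; zeroing a row removes a
    whole filter, and zeroing a column removes one (channel, row, col) position
    from every filter of the layer.
    """
    w = conv_weight.flatten(start_dim=1)   # (N_l, C_l*M_l*K_l)
    row_term = w.norm(dim=1).sum()         # sum of l2 norms of the rows (filters)
    col_term = w.norm(dim=0).sum()         # sum of l2 norms of the columns (shape positions)
    return lambda_n * row_term + lambda_s * col_term
```

The unstructured l2 term R(W) of formula (2) would typically be handled by ordinary weight decay in the optimizer.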

2. Construction of the Binocular Fusion Convolutional Neural Network

The binocular fusion convolutional neural network adopted in this application is shown in Figure 1. The fusion network imitates the stereoscopic vision processing mechanism of the human brain, fusing the left and right views over the long term. It consists of three parts: a left branch, a right branch, and a fusion branch. The left and right branches each contain four convolutional layers and two pooling layers; the fusion branch contains four convolutional layers, three pooling layers, and three fully connected layers. The filter sizes and filter counts are shown in Figure 1. To simulate the long-term fusion and processing of the left and right views in the visual cortex, the two views are fused four times through four concat operations in the network (marked ①②③④ in Figure 1) while information is simultaneously processed by the convolution operations; the image is thus processed while being fused, simulating the visual mechanism of the human eye. Considering the binocular combination and binocular rivalry mechanisms, different weights must be assigned to the left and right views to obtain the final fused representation [17]; in this application these weights are learned autonomously by the fusion network. At the same time, SSL is used on every convolutional layer to impose structured sparsity constraints on the filters and filter shapes.
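The topology described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the patent's implementation (which uses Caffe): the channel width `ch`, the 3x3 kernels, the single-channel input, the fully connected width 512, and the non-shared branch weights are placeholder assumptions, since the actual filter sizes and counts appear only in Figure 1.

```python
import torch
import torch.nn as nn

def conv_relu(c_in: int, c_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU())

class Branch(nn.Module):
    """One view branch: conv1 -> pool1 -> conv2 -> pool2 -> conv3 -> conv4."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv1, self.conv2 = conv_relu(1, ch), conv_relu(ch, ch)
        self.conv3, self.conv4 = conv_relu(ch, ch), conv_relu(ch, ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        f1 = self.conv1(x)              # tapped for fusion (1)
        f2 = self.conv2(self.pool(f1))  # tapped for fusion (2)
        f3 = self.conv3(self.pool(f2))  # tapped for fusion (3)
        f4 = self.conv4(f3)             # tapped for fusion (4)
        return f1, f2, f3, f4

class BinocularFusionNet(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        self.left, self.right = Branch(ch), Branch(ch)
        self.pool = nn.MaxPool2d(2)
        self.fuse1 = conv_relu(2 * ch, ch)   # after concat of l1, r1
        self.fuse2 = conv_relu(3 * ch, ch)   # after concat of l2, r2, g1
        self.fuse3 = conv_relu(3 * ch, ch)   # after concat of l3, r3, g2
        self.fuse4 = conv_relu(3 * ch, ch)   # after concat of l4, r4, g3
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(512), nn.ReLU(),
                                  nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1))

    def forward(self, left_img, right_img):
        l1, l2, l3, l4 = self.left(left_img)
        r1, r2, r3, r4 = self.right(right_img)
        g1 = self.fuse1(self.pool(torch.cat([l1, r1], dim=1)))      # fusion (1): concat -> pool1 -> conv1
        g2 = self.fuse2(self.pool(torch.cat([l2, r2, g1], dim=1)))  # fusion (2): concat -> pool2 -> conv2
        g3 = self.fuse3(torch.cat([l3, r3, g2], dim=1))             # fusion (3): concat -> conv3
        g4 = self.fuse4(torch.cat([l4, r4, g3], dim=1))             # fusion (4): concat -> conv4
        return self.head(self.pool(g4))                             # pool3 + three FC layers
```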

The convolution operation in the binocular fusion network is defined as formula (3):

F_l = \mathrm{ReLU}(W_l * F_{l\_input} + B_l) \qquad (3)

where W_l and B_l denote the weights and bias of the l-th convolutional layer, F_l denotes the feature map output by the l-th convolutional layer, F_{l\_input} denotes the input of the l-th convolutional layer, ReLU is the activation function, and * denotes the convolution operation.

All pooling layers in the binocular fusion network use max pooling. When the network is trained with the backpropagation algorithm, its parameters are learned by minimizing the loss function, which is the Euclidean loss shown in formula (4):

\mathrm{Loss} = \frac{1}{2n} \sum_{i=1}^{n} \left\| Y_i - y_i \right\|_2^2 \qquad (4)

where Y_i and y_i denote the expected output and the actual output for sample i, respectively, and n is the batch size.
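Putting formulas (2) and (4) together, one training step can be sketched as follows, reusing the `ssl_penalty` helper from the earlier snippet; the batch tensors, names, and hyperparameter values are illustrative assumptions, and the 1/2n scaling follows the Euclidean loss of formula (4):

```python
import torch

def train_step(model, optimizer, left, right, dmos, lambda_n=1e-4, lambda_s=1e-4):
    """One backpropagation step: Euclidean loss (formula (4)) plus SSL penalties."""
    optimizer.zero_grad()
    pred = model(left, right).squeeze(1)         # predicted quality scores, shape (n,)
    n = dmos.shape[0]                            # batch size
    loss = ((dmos - pred) ** 2).sum() / (2 * n)  # formula (4)
    for m in model.modules():                    # add the Group Lasso terms of formula (2)
        if isinstance(m, torch.nn.Conv2d):
            loss = loss + ssl_penalty(m.weight, lambda_n, lambda_s)
    loss.backward()
    optimizer.step()
    return loss.item()
```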

3. Stereoscopic Image Quality Evaluation Results and Analysis

The experiments of this patent are conducted on the public LIVE 3D Phase I and LIVE 3D Phase II databases. Both contain five distortion types: JPEG compression, JPEG 2000 compression, Gaussian blur (Gblur), Gaussian white noise (WN), and fast fading (FF). The LIVE 3D Phase I database contains 20 original stereoscopic image pairs and 365 symmetrically distorted stereoscopic image pairs. The LIVE 3D Phase II database contains 8 original stereoscopic image pairs and 360 symmetrically and asymmetrically distorted pairs, of which 120 pairs are symmetrically distorted and 240 pairs are asymmetrically distorted. This patent adopts the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC) as measures of the consistency between objective and subjective evaluation results; the closer PLCC and SROCC are to 1, the better the evaluation.
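For reference, PLCC and SROCC can be computed directly with SciPy, as in the sketch below; this is a plain computation on predicted scores versus DMOS, and any nonlinear mapping that may precede PLCC in practice is not specified in the source:

```python
import numpy as np
from scipy import stats

def plcc_srocc(objective: np.ndarray, dmos: np.ndarray) -> tuple[float, float]:
    """Consistency between objective predictions and subjective DMOS scores."""
    plcc, _ = stats.pearsonr(objective, dmos)    # Pearson linear correlation
    srocc, _ = stats.spearmanr(objective, dmos)  # Spearman rank-order correlation
    return float(plcc), float(srocc)
```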

In Table 1 we compare the proposed method with eight stereoscopic image quality evaluation methods, with the best results highlighted in bold. Among them, [6-9] are sparse-representation-based methods and [10-13] are deep-learning-based methods, while our method combines sparsity with a CNN. Regarding the relationship between the left and right views, [5-6][11-12] first process the two views and then fuse the features of the left and right views; [7-9][13] first fuse the two views and then process the fused image as a planar image; our method fuses the two views over the long term, processing while fusing. As Table 1 shows, the evaluation performance of the patented network is substantially better than the other methods. Only the PLCC on LIVE II is slightly lower than that of Karimi et al. [8]; however, both SROCC and RMSE surpass [8] on LIVE II. Our PLCC and SROCC both exceed 0.96 on LIVE I and 0.95 on LIVE II. Our method outperforms both the sparse-representation methods and the deep-learning methods, and our sparse fusion network surpasses both the fuse-then-process and the process-then-fuse approaches. The patented network handles both symmetrically and asymmetrically distorted stereoscopic images well.

To demonstrate the effect of SSL on the proposed network, we compare the network under different structured-sparsity strengths. net0 (baseline) is the network without structured sparse regularization. Figure 2 shows the relationship between row sparsity, column sparsity, and network speedup on LIVE I; the relationship between sparsity and speedup on LIVE II follows the same trend. Setting the speedup of the baseline net0 to 1, the sparsity increases progressively across net1, net2 (the proposed method), and net3, and we can see that the sparser the network, the greater the speedup. In Table 2 we compare network performance at the different structured-sparsity strengths. At low sparsity (net1), the speed increases slightly and the performance drops slightly. At high sparsity (net3), the performance drop is larger than for net1, but the speedup is much greater. At a suitable sparsity (net2, the proposed method), performance actually improves; this may be because unimportant redundant weights in the network are constrained to zero, that is, structured sparse regularization helps improve network performance. In addition, the speed of our proposed method is greatly improved: the speedup is 2.0x on LIVE I and 2.3x on LIVE II. When the row and column sparsity are high (net3), the evaluation metrics drop by only about 0.01 while the network is accelerated roughly 3x, and the evaluation performance of net3 still exceeds that of most methods.
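As an illustration, the row and column sparsity plotted in Figure 2 can be measured on a trained convolutional layer as in the sketch below; the zero tolerance `tol` is our own assumption:

```python
import torch

def structured_sparsity(conv_weight: torch.Tensor, tol: float = 1e-8):
    """Fraction of all-zero rows (filters) and columns (shape positions)."""
    w = conv_weight.flatten(start_dim=1).abs()  # (N_l, C_l*M_l*K_l)
    row_sparsity = (w.max(dim=1).values <= tol).float().mean().item()
    col_sparsity = (w.max(dim=0).values <= tol).float().mean().item()
    return row_sparsity, col_sparsity
```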

It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

References

[1] L. Xing, J. You, T. Ebrahimi and A. Perkis, "Assessment of Stereoscopic Crosstalk Perception," IEEE Transactions on Multimedia, vol. 14, no. 2, pp. 326-337, April 2012.

[2] M. Chen, L. K. Cormack and A. C. Bovik, "No-Reference Quality Assessment of Natural Stereopairs," IEEE Transactions on Image Processing, vol. 22, no. 9, pp. 3379-3391, Sept. 2013.

[3] X. Xu, Y. Zhao and Y. Ding, "No-reference stereoscopic image quality assessment based on saliency-guided binocular feature consolidation," Electronics Letters, vol. 53, no. 22, pp. 1468-1470, 2017.

[4] J. Ma, P. An, L. Shen and K. Li, "Reduced-Reference Stereoscopic Image Quality Assessment Using Natural Scene Statistics and Structural Degradation," IEEE Access, vol. 6, pp. 2768-2780, 2018.

[5] K. Li, F. Shao, G. Jiang and M. Yu, "Joint structure-texture sparse coding for quality prediction of stereoscopic images," Electronics Letters, vol. 51, no. 24, pp. 1994-1995, 2015.

[6] F. Shao, K. Li, W. Lin, G. Jiang and Q. Dai, "Learning Blind Quality Evaluator for Stereoscopic Images Using Joint Sparse Representation," IEEE Transactions on Multimedia, vol. 18, no. 10, pp. 2104-2114, Oct. 2016.

[7] Y. Lin, J. Yang, W. Lu, Q. Meng, Z. Lv and H. Song, "Quality Index for Stereoscopic Images by Jointly Evaluating Cyclopean Amplitude and Cyclopean Phase," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 1, pp. 89-101, Feb. 2017.

[8] M. Karimi, M. Nejati, S. M. R. Soroushmehr, S. Samavi, N. Karimi and K. Najarian, "Blind Stereo Quality Assessment Based on Learned Features From Binocular Combined Images," IEEE Transactions on Multimedia, vol. 19, no. 11, pp. 2475-2489, Nov. 2017.

[9] J. Yang, P. An, J. Ma, K. Li and L. Shen, "No-reference stereo image quality assessment by learning gradient dictionary-based color visual characteristics," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, 2018, pp. 1-5.

[10] J. Yang, B. Jiang, H. Song, X. Yang, W. Lu and H. Liu, "No-Reference Stereoimage Quality Assessment for Multimedia Analysis Towards Internet-of-Things," IEEE Access, vol. 6, pp. 7631-7640, 2018.

[11] Y. Ding et al., "No-Reference Stereoscopic Image Quality Assessment Using Convolutional Neural Network for Adaptive Feature Extraction," IEEE Access, vol. 6, pp. 37595-37603, 2018.

[12] Y. Lv, M. Yu, G. Jiang et al., "No-reference Stereoscopic Image Quality Assessment Using Binocular Self-similarity and Deep Neural Network," Signal Processing: Image Communication, vol. 47, pp. 346-357, 2016.

[13] Q. Sang, T. Gu, C. Li and X. Wu, "Stereoscopic image quality assessment via convolutional neural networks," 2017 International Smart Cities Conference (ISC2), Wuxi, 2017, pp. 1-2.

[14] B. Liu, M. Wang, H. Foroosh, M. Tappen and M. Penksy, "Sparse Convolutional Neural Networks," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 806-814.

[15] W. Wen, C. Wu, Y. Wang, Y. Chen and H. Li, "Learning Structured Sparsity in Deep Neural Networks," Advances in Neural Information Processing Systems 29 (NIPS 2016), 2016.

[16] D. H. Hubel and T. N. Wiesel, "Receptive fields of single neurones in the cat's striate cortex," The Journal of Physiology, vol. 148, no. 3, pp. 574-591, 1959.

Claims (3)

1. A stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network, characterized by comprising the following steps:
S1, constructing a stereoscopic image quality evaluation network based on a binocular fusion convolutional neural network, the network comprising a left branch, a right branch and a fusion branch;
S2, applying a structured sparsity constraint to each layer of the binocular fusion convolutional neural network, the objective function of network optimization being given by formula (1):
E(W) = E_D(W) + \lambda \cdot R(W) + \lambda_g \cdot \sum_{l=1}^{L} R_g(W^{(l)}) \qquad (1)
wherein W represents all the weights in the network; E_D(W) is the loss function of the network; R(W) is an unstructured regularization constraint applied to all weights; and R_g(W^{(l)}) is the structured sparsity regularization constraint applied to each layer.
2. The stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network according to claim 1, wherein in S1 the left and right branches are constructed from the left and right views in the neural network:
2.1, each of the left and right branches is divided into a first convolutional layer and a first pooling layer, a second convolutional layer and a second pooling layer, a third convolutional layer, and a fourth convolutional layer;
2.2, the first convolutional layer in the left and right branches is subjected to the structured sparsity constraint, and its output is fed into the first pooling layer;
2.3, the output of the first pooling layer is connected to the second convolutional layer, which is subjected to the structured sparsity constraint and whose output is fed into the second pooling layer;
2.4, the output of the second pooling layer is connected to the third convolutional layer, which is subjected to the structured sparsity constraint and whose output is fed into the fourth convolutional layer; the output of the fourth convolutional layer is connected to the fusion branch for fusion processing.
3. The stereoscopic image quality evaluation method based on a sparse binocular fusion convolutional neural network according to claim 1, wherein in S1 the fusion branch is constructed from the left and right views in the neural network:
3.1, the fusion branch is divided into a first pooling layer and a first convolutional layer, a second pooling layer and a second convolutional layer, a third convolutional layer, a fourth convolutional layer and a third pooling layer, and three fully connected layers, four fusion operations being performed in total;
3.2, the feature maps from the sparsity-constrained first convolutional layers of the left and right branches undergo a first fusion operation via a 'concat' operation; the fused feature map is input to the first pooling layer of the fusion branch and then sent to the first convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the first convolutional layer of the fusion branch;
3.3, the feature maps from the sparsity-constrained second convolutional layers of the left and right branches, together with the feature map of the first convolutional layer of the fusion branch after the first fusion, undergo a second fusion operation via a 'concat' operation; the fused feature map is input to the second pooling layer of the fusion branch and then sent to the second convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the second convolutional layer of the fusion branch;
3.4, the feature maps from the sparsity-constrained third convolutional layers of the left and right branches, together with the feature map of the second convolutional layer of the fusion branch after the second fusion, undergo a third fusion operation via a 'concat' operation; the fused feature map is input to the third convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the third convolutional layer of the fusion branch;
3.5, the feature maps from the sparsity-constrained fourth convolutional layers of the left and right branches, together with the feature map of the third convolutional layer of the fusion branch after the third fusion, undergo a fourth fusion operation via a 'concat' operation; the fused feature map is input to the fourth convolutional layer of the fusion branch for information processing, while the structured sparsity constraint is applied to the fourth convolutional layer of the fusion branch; the output of the fused fourth convolutional layer is sent to the third pooling layer, and the resulting feature map is fed into the three fully connected layers to judge the quality of the stereoscopic image.
CN201910568580.7A (priority date 2019-06-27, filing date 2019-06-27): Stereo image quality assessment method based on sparse binocular fusion convolutional neural network. Status: Pending. Publication: CN110636278A (en).

Priority Applications (1)

CN201910568580.7A (priority date 2019-06-27, filing date 2019-06-27): Stereo image quality assessment method based on sparse binocular fusion convolutional neural network

Publications (1)

CN110636278A, published 2019-12-31

Family

ID=68968903

Family Applications (1)

CN201910568580.7A (priority date 2019-06-27, filing date 2019-06-27): Stereo image quality assessment method based on sparse binocular fusion convolutional neural network, pending

Country Status (1)

CN: CN110636278A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN111696090A * (priority 2020-06-08, published 2020-09-22, University of Electronic Science and Technology of China): Method for evaluating quality of face image in unconstrained environment
CN114444657A * (priority 2021-12-30, published 2022-05-06, Inspur Electronic Information Industry Co., Ltd.): Image processing method, system, equipment and readable storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party

CN105959684A * (priority 2016-05-26, published 2016-09-21, Tianjin University): Stereo image quality evaluation method based on binocular fusion
CN109167996A * (priority 2018-09-21, published 2019-01-08, Zhejiang University of Science and Technology): No-reference stereo image quality evaluation method based on convolutional neural networks
CN109714592A * (priority 2019-01-31, published 2019-05-03, Tianjin University): Stereo image quality evaluation method based on binocular fusion network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party

W. Wen, C. Wu, Y. Wang, Y. Chen and H. Li, "Learning Structured Sparsity in Deep Neural Networks," Advances in Neural Information Processing Systems 29 (NIPS 2016). *


Similar Documents

Publication Publication Date Title
CN111523546B (en) Image semantic segmentation method, system and computer storage medium
CN108876737B (en) An Image Denoising Method Using Joint Residual Learning and Structural Similarity
CN109903228B (en) Image super-resolution reconstruction method based on convolutional neural network
CN108765296B (en) An Image Super-Resolution Reconstruction Method Based on Recurrent Residual Attention Network
CN110188239B (en) A dual-stream video classification method and device based on cross-modal attention mechanism
CN110060236B (en) Stereo image quality assessment method based on deep convolutional neural network
CN109685819B (en) A 3D Medical Image Segmentation Method Based on Feature Enhancement
CN109410261A (en) Monocular image depth estimation method based on pyramid pond module
CN110458085B (en) Video behavior identification method based on attention-enhanced three-dimensional space-time representation learning
CN108391121B (en) A reference-free stereo image quality assessment method based on deep neural network
CN113421237B (en) No-reference image quality evaluation method based on depth feature transfer learning
CN111161360B (en) Image defogging method of end-to-end network based on Retinex theory
CN112862689A (en) Image super-resolution reconstruction method and system
CN110544213A (en) An image defogging method based on global and local feature fusion
CN110189260B (en) An Image Noise Reduction Method Based on Multi-scale Parallel Gated Neural Network
CN109859166B (en) Multi-column convolutional neural network-based parameter-free 3D image quality evaluation method
CN111915589A (en) Stereo image quality assessment method based on atrous convolution
CN109949219A (en) A method, device and device for reconstructing super-resolution images
CN112634238A (en) Image quality evaluation method based on attention module
CN115984111A (en) Image super-resolution method and device based on knowledge distillation compression model
CN116485741A (en) No-reference image quality evaluation method, system, electronic equipment and storage medium
CN108377387A (en) Virtual reality method for evaluating video quality based on 3D convolutional neural networks
CN112184555B (en) A Stereo Image Super-Resolution Reconstruction Method Based on Deep Interactive Learning
CN116403063A (en) No-reference screen content image quality assessment method based on multi-region feature fusion
CN110636278A (en) Stereo image quality assessment method based on sparse binocular fusion convolutional neural network

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191231)