WO2020252746A1 - Method for image classification using common-base capsule projection - Google Patents

Method for image classification using common-base capsule projection Download PDF

Info

Publication number
WO2020252746A1
WO2020252746A1 (PCT/CN2019/092109)
Authority
WO
WIPO (PCT)
Prior art keywords
projection
capsule
vector
subspace
feature
Prior art date
Application number
PCT/CN2019/092109
Other languages
English (en)
French (fr)
Inventor
邹文斌
彭文韬
向灿群
徐晨
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学 filed Critical 深圳大学
Priority to PCT/CN2019/092109 priority Critical patent/WO2020252746A1/zh
Publication of WO2020252746A1 publication Critical patent/WO2020252746A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology

Definitions

  • The present invention belongs to the technical field of image classification and, more specifically, relates to a method for image classification using common-base capsule projection.
  • CNN: Convolutional Neural Network
  • However, the convolutional neural network has a fundamental flaw.
  • When the test images closely resemble the training images, a convolutional neural network performs very well, but when an image is flipped, tilted, or otherwise reoriented, its performance degrades. This is because a convolutional neural network cannot account for the spatial relationships among underlying objects.
  • What a neuron in one layer passes to a neuron in the next layer is a scalar.
  • A scalar has only a magnitude and no direction, so it cannot express the pose relationship between high-level and low-level features.
  • Although the pooling layers in a convolutional neural network keep features invariant under translation and rotation, they also discard a great deal of valuable information and reduce spatial resolution, so the output is almost unchanged under small changes to the input; convolutional neural networks therefore have significant limitations.
  • The purpose of the present invention is to provide a method for image classification using common-base capsule projection, aiming to solve the problem of inaccurate classification caused by the loss of a large amount of valuable information in the convolutional neural networks used in the prior art.
  • The present invention provides a method for image classification using common-base capsule projection, which includes the following steps:
  • The convolved features are scalars, and a scalar has only a magnitude and no direction, which means the features lack spatial information.
  • The capsule projection network in this application classifies in vector form.
  • A feature processed by the capsule projection network is a vector, which has both a magnitude and a direction and can preserve spatial information to a certain extent; it is therefore more conducive to classification and can improve classification accuracy.
  • When L categories are to be predicted, the number of capsule subspaces is L.
  • Each group of vectors in the feature matrix [x_1, x_2, …, x_N] is projected using the same set of basis vectors.
  • Addressing the defects of prior-art capsule networks (large parameter counts, slow training and prediction, and difficulty in generalizing to deep structures), the present invention proposes the idea of "common-base" capsule projection: features are projected onto multiple capsule subspaces using a shared basis before the classification task is predicted. The method is therefore not easily disturbed by overlapping objects of multiple categories, can effectively handle crowded scenes with overlapping objects, and can improve the accuracy of the classification task.
  • In step (3), the dimension of the feature vector X is d, and the dimension of each group of vectors in the feature matrix is d/N.
  • In step (4), a set of projection basis matrices W_l ∈ R^{(d/N)×c} is learned, and the basis matrices are used to project the feature vectors into the capsule subspace S corresponding to each class, yielding the projected capsule subspace vectors {v_1, v_2, …, v_L}, each of dimension c.
  • The capsule subspace search model is: min_{v_l ∈ span(W_l)} ‖x − v_l‖_2.
  • The formula means finding an optimal projection vector v_l in the subspace span(W_l) such that the error between v_l and the input vector x is minimized.
  • v_y is the projection of the input vector x onto the subspace S_y of the correct category.
  • The gradient of the basis in the subspace is calculated as ∂‖v_l‖_2/∂W_l = x_⊥ x^T W_l^{+T} / ‖v_l‖_2, where x_⊥ = (I − P_l)x is the component of x orthogonal to the subspace.
  • The capsule networks of the prior art have large parameter counts and slow processing speeds, are difficult to generalize to very deep structures, and do not perform well enough on large-scale images.
  • The present invention avoids directly aggregating channels of the feature layer into several capsules (the approach of current capsule networks). Instead, it uses the common-base capsule projection idea to project features onto multiple capsule subspaces before predicting the classification task. Experiments show that the network adapts to images of all scales and achieves very good results even when trained on a relatively small dataset.
  • The method of grouping the feature vector and then performing common-base projection also reduces the complexity of the network, reduces the number of network parameters, and increases the speed of network training and prediction.
  • Figure 1 is a flow chart of the implementation of a method for image classification using common-base capsule projection provided by the present invention;
  • Figure 2 is a schematic diagram of the implementation of a method for image classification using common-base capsule projection provided by an embodiment of the present invention;
  • Figure 3 is a schematic diagram of the projection of a capsule space provided by an embodiment of the present invention;
  • Figure 4 is a schematic diagram of the orthogonal-component-guided gradient update provided by an embodiment of the present invention.
  • Capsule Common-base Projection Network
  • The network allows detailed attribute information of the input object (position, rotation, size, and so on) to be retained in the network, so the same object can still be correctly recognized even after translation, rotation, or scaling.
  • Because the vectorized features of the capsule projection network are strongly correlated and contain spatial information such as the pose and deformation of the extracted features, the network is not easily disturbed by overlapping objects of multiple categories and can effectively handle crowded scenes with overlapping objects.
  • The network can also be extended to text classification tasks.
  • On multi-label classification tasks, the performance of capsule networks far exceeds that of convolutional neural networks (CNN) and long short-term memory networks (LSTM); Alipay found that applying a capsule network to its complaint-text model gave better overall performance than previous networks (such as LSTM, Bi-LSTM, and CNN-rand).
  • The network adopts the common-base idea: it divides the feature vector into several groups and projects them into multiple subspaces using the same set of basis vectors, so there is no need for huge amounts of training data to learn how to recognize target objects effectively in various situations. Good generalization can be obtained by training with only a small amount of data.
  • A network based on common-base capsule projection can accurately reconstruct objects even when the scene contains multiple occlusions.
  • The deployment of capsule networks in real-world scenarios is still in its infancy, but given their unparalleled characteristics, capsule networks will have broad application prospects in fields such as computer vision and natural language processing.
  • Current deep learning methods use convolutional layers to extract features, map the feature map generated by the convolutional layers into a fixed-length feature vector, and then attach several fully connected layers for classification.
  • For example, AlexNet's ImageNet model outputs a 1000-dimensional vector representing the probability that the input image belongs to each category (softmax normalization).
  • However, the features extracted by a convolutional neural network lack spatial relevance.
  • The present invention does not pass the convolved image features through a fully connected network, and it avoids directly aggregating channels of the feature layer into several capsules (the approach taken by current capsule networks). Instead, using the common-base capsule projection idea, it divides the features into several groups of vectors and then performs common-base capsule projection, so that the features are projected into multiple capsule subspaces before the classification task is predicted.
  • Experiments show that the network can further improve the accuracy of classification tasks.
  • The classification accuracy of the capsule common-base projection network of the present invention can exceed that of other mainstream network structures, which also points out a new direction for improving the performance of deep networks.
  • Figures 1 and 2 respectively show the implementation process of a method for image classification using common-base capsule projection provided by an embodiment of the present invention. For ease of description, only the parts related to the embodiment are shown; they are detailed below in conjunction with the drawings.
  • The features are the feature maps extracted by the convolutional and pooling layers of a convolutional neural network.
  • Basic convolutional architectures include VGG, GoogLeNet, ResNet, DenseNet, and so on; the specific network framework can be selected as needed.
  • The feature map extracted from the image by the convolutional neural network is a four-dimensional tensor (B, C, W, H), where B is the batch size, C is the number of channels, W is the width of the image, and H is the height of the image.
  • The feature map holds detailed feature information of the image, which helps with the prediction of classification tasks.
  • A CNN uses convolutional layers to extract rich semantic features of the image, then uses pooling layers to reduce network parameters, and finally uses fully connected layers to interpret the features.
  • Other methods can also be used to extract feature maps, such as traditional machine learning methods (decision tree classification, random forest classification, K-nearest-neighbor classifiers, multi-layer perceptrons (MLP), and so on) and RNNs (recurrent neural networks), but CNNs are the usual deep learning method for image classification.
  • For classification, the four-dimensional tensor is first flattened into a one-dimensional vector and then passed through a fully connected network for class prediction.
  • The feature matrix is orthogonally projected onto multiple capsule subspaces (if L categories are predicted, the number of subspaces is L). The projection process loses no information, and the capsule subspaces contain more new feature information, so the network structure can be trained more effectively.
  • Each group of vectors in the feature matrix [x_1, x_2, …, x_N] is projected using the same set of basis vectors, which reduces the parameters, thereby lowering the complexity of the network and speeding up network training and convergence.
  • Using a common-base capsule projection network not only increases prediction accuracy but also reduces the parameter count, thereby speeding up recognition.
  • In an embodiment, the feature matrix is orthogonally projected onto multiple capsule subspaces (if L categories are predicted, the number of subspaces is L). Only a very small amount of information is lost during the projection, and the capsule subspaces contain more new feature information, so the network structure can be trained more effectively.
  • The same set of basis vectors is used for projection, which reduces the parameters (reflected in a projection basis matrix with fewer entries), thereby lowering the complexity of the network and speeding up network training and convergence. Since the capsule network retains detailed spatial information of the image, it has application prospects in various computer vision fields such as localization, object detection, semantic segmentation, and instance segmentation.
  • The "base" here refers to basis vectors.
  • In any space, a set of basis vectors can be found to express all the vectors in that space.
  • The network is used to optimize and reduce this loss, so that the final projection result retains the original information as much as possible.
  • Fig. 3 shows a schematic diagram of the projection of a capsule space provided by an embodiment of the present invention.
  • In the figure, N is 4, which means the feature vector is divided into 4 groups before common-base capsule projection is performed.
  • X is the feature vector obtained by transforming the feature map.
  • Its dimension is d.
  • The feature vector is divided into N groups to form the feature matrix {x_1, x_2, …, x_N}; each group of vectors in the matrix has dimension d/N.
  • N is a parameter, usually an integer greater than 1; how many groups the features are divided into can be set as desired.
  • The network will finally learn a set of capsule subspaces {S_1, S_2, …, S_L}, where L is the predefined number of categories.
  • Through constrained optimization, the orthogonal basis of each capsule subspace retains the original feature information as much as possible.
  • The length of the projected subspace vector v_l represents the probability of the category, and its direction represents the attributes of the category.
  • The capsule subspace search model is min_{v_l ∈ span(W_l)} ‖x − v_l‖_2.
  • Σ_l = (W_l^T W_l)^{-1}, which can be regarded as a weight regularization term.
  • v_y is the projection of the input vector x onto the subspace S_y of the correct category.
  • The gradient of the basis in the subspace is calculated as ∂‖v_l‖_2/∂W_l = x_⊥ x^T W_l^{+T} / ‖v_l‖_2, with x_⊥ = (I − P_l)x.
  • Figure 4 shows a schematic diagram of the orthogonal-component-guided gradient update provided by an embodiment of the present invention; when searching for the optimal basis of a capsule subspace, the update of the basis vectors is guided by the orthogonal component.
  • As the orthogonal component tends to 0, the network converges to the optimal basis.
  • The sum of the moduli of the vectors projected with the optimal basis is then calculated, and this number indicates the probability of the final classification.
  • Table 1 shows the experimental results on the CIFAR10 and CIFAR100 datasets.
  • Experimental analysis shows that the capsule common-base projection network of the present invention not only improves the accuracy of classification predictions but also reduces the number of network parameters and improves the speed of network training and inference.

Landscapes

  • Image Analysis (AREA)

Abstract

The present invention belongs to the technical field of image classification and discloses a method for image classification using common-base capsule projection, comprising the following steps: (1) extracting the features of an input image with a multi-layer convolutional network to obtain a feature map; (2) mapping the feature map into a one-dimensional feature vector X; (3) applying a feature transformation to the feature vector X, dividing X into N groups, and combining the groups into a feature matrix; (4) performing common-base capsule projection of the feature matrix onto multiple capsule subspaces, computing the sum of the moduli of the projected vectors in each subspace, and predicting the image class according to these sums. The invention uses the common-base capsule projection idea to project features onto multiple capsule subspaces before predicting the image classification task; experiments show that the network adapts to images of all scales and achieves very good classification results even when trained on a relatively small dataset.

Description

Method for image classification using common-base capsule projection

Technical field
The present invention belongs to the technical field of image classification and, more specifically, relates to a method for image classification using common-base capsule projection.
Background
In recent years, convolutional neural networks in deep learning have been widely applied in many fields, such as computer vision, natural language processing, and big-data analysis, with results far exceeding expectations. In computer vision in particular, convolutional neural networks (CNNs) are favored by many researchers and practitioners for their excellent performance in tasks such as object recognition and object classification.
However, research has revealed a fundamental flaw in convolutional neural networks: when test images closely resemble the training images, a CNN performs very well, but when an image is flipped, tilted, or otherwise reoriented, its performance degrades. This is because a CNN cannot account for the spatial relationships among underlying objects: what a neuron in one layer passes to a neuron in the next layer is a scalar, which has only a magnitude and no direction and therefore cannot express the pose relationship between high-level and low-level features. Meanwhile, although the pooling layers in a CNN keep features invariant under translation and rotation, they also discard a great deal of valuable information and reduce spatial resolution, so the output is almost unchanged under small changes to the input. Convolutional neural networks therefore have significant limitations.
To address this limitation, Hinton published the paper "Dynamic routing between capsules" at the end of 2017, proposing a more profound algorithm and the capsule network architecture. A capsule network uses neural capsule units, so that what one layer of capsules outputs to the next layer is a vector. A vector has not only a magnitude but also a direction, which can express the orientation of a feature and thereby establish spatial correspondences between features; this greatly compensates for the deficiencies of convolutional neural networks. Compared with the weak spatial correlation of CNN features, the vectorized features of a capsule network are considered to express the spatial relations among features well.
Technical problem
In view of the defects of the prior art, the purpose of the present invention is to provide a method for image classification using common-base capsule projection, aiming to solve the problem of inaccurate classification caused by the loss of a large amount of valuable information in the convolutional neural networks used in the prior art.
Technical solution
The present invention provides a method for image classification using common-base capsule projection, comprising the following steps (a code sketch of the pipeline follows the steps):
(1) extracting the features of an input image with a multi-layer convolutional network to obtain a feature map;
(2) mapping the feature map into a one-dimensional feature vector X;
(3) applying a feature transformation to the feature vector X, dividing the feature vector X into N groups, and combining the groups into a feature matrix [x_1, x_2, …, x_N];
(4) performing common-base capsule projection of the feature matrix onto multiple capsule subspaces, computing the sum of the moduli of the projected vectors in each subspace, and predicting the image class according to these sums.
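The following is a minimal PyTorch sketch of steps (1) to (4); the toy backbone, the module name CommonBaseCapsuleProjection, and the sizes (d = 1024, N = 4, L = 10, c = 8) are illustrative assumptions of this sketch, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class CommonBaseCapsuleProjection(nn.Module):
    """Projects N groups of a feature vector onto L capsule subspaces,
    using one shared basis matrix W_l per class (the 'common base')."""
    def __init__(self, d, N, L, c):
        super().__init__()
        assert d % N == 0
        self.N = N
        # One basis per class, shared across all N groups: W_l in R^{(d/N) x c}
        self.W = nn.Parameter(0.01 * torch.randn(L, d // N, c))

    def forward(self, x):                               # x: (B, d)
        xg = x.view(x.shape[0], self.N, -1)             # step (3): N groups of size d/N
        # Orthogonal projection v_l = W_l (W_l^T W_l)^{-1} W_l^T x for every class l
        WtW_inv = torch.linalg.inv(self.W.transpose(1, 2) @ self.W)  # (L, c, c)
        P = self.W @ WtW_inv @ self.W.transpose(1, 2)                # (L, d/N, d/N)
        v = torch.einsum('lij,bnj->blni', P, xg)                     # (B, L, N, d/N)
        return v.norm(dim=-1).sum(dim=-1)               # step (4): sum of moduli -> (B, L)

backbone = nn.Sequential(                               # step (1): feature map (B, C, W, H)
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
)
images = torch.randn(8, 3, 32, 32)
X = torch.flatten(backbone(images), 1)                  # step (2): (B, d), d = 64*4*4 = 1024
head = CommonBaseCapsuleProjection(d=1024, N=4, L=10, c=8)
scores = head(X)                                        # class scores, shape (8, 10)
print(scores.shape)
```

During training, the class scores produced this way can be fed to the cross-entropy loss described in formula (4) below.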
At present, most networks for image classification use a convolutional neural network to extract image features and then make classification predictions through fully connected layers. But the convolved features are scalars, and a scalar has only a magnitude and no direction; in other words, the features lack spatial information. The capsule projection network in this application classifies in vector form: a feature processed by the capsule projection network is a vector, which has both a magnitude and a direction and can preserve spatial information to a certain extent, making it more conducive to classification and able to improve classification accuracy.
Here, when predictions over L categories are required, the number of capsule subspaces is L.
Further, each group of vectors in the feature matrix [x_1, x_2, …, x_N] is projected using the same set of basis vectors.
Addressing the defects of prior-art capsule networks (large parameter counts, slow training and prediction, and difficulty in generalizing to deep networks), the present invention proposes the "common-base" capsule projection idea: features are projected onto multiple capsule subspaces using a shared basis before the classification task is predicted. The method is therefore not easily disturbed by overlapping objects of multiple categories, can effectively handle crowded scenes with overlapping objects, and can improve the accuracy of the classification task.
Further, in step (3), the dimension of the feature vector X is d, and the dimension of each group of vectors in the feature matrix is d/N.
Further, in step (4), a set of projection basis matrices W_l ∈ R^{(d/N)×c} is learned, and the basis matrices are used to project the feature vectors into the capsule subspace S corresponding to each class, yielding the projected capsule subspace vectors {v_1, v_2, …, v_L}, each of dimension c.
The capsule subspace search model is:

$$\min_{v_l \in \operatorname{span}(W_l)} \left\| x - v_l \right\|_2 \qquad (1)$$

The formula means finding an optimal projection vector v_l in the subspace span(W_l) such that the error between v_l and the input vector x is minimized.
To find a suitable set of bases W_l, the following constraint is adopted:

$$v_l = P_l x, \qquad P_l = W_l W_l^{+} \qquad (2)$$

where P_l is the projection matrix of the capsule subspace S_l (S_l = span(W_l)) and W_l^+ is the generalized inverse of W_l. When the columns of W_l are linearly independent, W_l^+ = (W_l^T W_l)^{-1} W_l^T.
The length of the projected capsule v_l is computed by the following formula:

$$\|v_l\|_2 = \sqrt{x^{\top} P_l\, x} = \left\| \Sigma_l^{1/2} W_l^{\top} x \right\|_2 \qquad (3)$$

where Σ_l = (W_l^T W_l)^{-1}, which can be regarded as a weight regularization term.
After the length ‖v_l‖_2 of the projected vector in each subspace is obtained, a cross-entropy loss is used to find the optimal subspace for each category:

$$\mathcal{L} = -\log \frac{\exp(\|v_y\|_2)}{\sum_{l=1}^{L} \exp(\|v_l\|_2)} \qquad (4)$$

where v_y is the projection of the input vector x onto the subspace S_y of the correct category.
The gradient of the basis in the subspace is calculated with the following formula:

$$\frac{\partial \|v_l\|_2}{\partial W_l} = \frac{x_{\perp} x^{\top} W_l^{+\top}}{\|v_l\|_2} \qquad (5)$$

where x_⊥ = x − v_l = x − P_l x = (I − P_l) x, and therefore

$$\frac{\partial \|v_l\|_2}{\partial W_l} = \frac{(I - P_l)\, x\, x^{\top} W_l^{+\top}}{\|v_l\|_2}$$

The update of the subspace basis is guided by the component of the projected vector orthogonal to the subspace: when the orthogonal component x_⊥ is 0, the gradient of the basis is 0, at which point the basis W_l is optimal and preserves all the information of the original input x.
Beneficial effects
Prior-art capsule networks have large parameter counts and slow processing speeds, are difficult to generalize to very deep structures, and do not perform well enough on large-scale images. The present invention avoids directly aggregating channels of the feature layer into several capsules (the approach of current capsule networks). Instead, it uses the common-base capsule projection idea to project features onto multiple capsule subspaces before predicting the classification task. Experiments show that the network adapts to images of all scales and achieves very good results even when trained on a relatively small dataset. Moreover, the method of grouping the feature vector and then performing common-base projection also lowers the complexity of the network, reduces the number of network parameters, and speeds up network training and prediction.
Brief description of the drawings
Figure 1 is a flow chart of the implementation of the method for image classification using common-base capsule projection provided by the present invention;
Figure 2 is a schematic diagram of the implementation of the method for image classification using common-base capsule projection provided by an embodiment of the present invention;
Figure 3 is a schematic diagram of the projection of a capsule space provided by an embodiment of the present invention;
Figure 4 is a schematic diagram of the orthogonal-component-guided gradient update provided by an embodiment of the present invention.
Embodiments of the present invention
In order to make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
Addressing the shortcomings of existing capsule networks, namely large parameter counts, slow processing, and difficulty in generalizing to very deep structures, the present invention proposes a Capsule Common-base Projection Network. The network allows detailed attribute information of the input object (position, rotation, size, and so on) to be retained in the network, so the same object can still be correctly recognized even after translation, rotation, or scaling. Moreover, because the vectorized features of the capsule projection network are strongly correlated and contain spatial information such as the pose and deformation of the extracted features, the network is not easily disturbed by overlapping objects of multiple categories and can effectively handle crowded scenes with overlapping objects.
The network can also be extended to text classification tasks. On multi-label classification tasks, the performance of capsule networks far exceeds that of convolutional neural networks (CNN) and long short-term memory networks (LSTM); Alipay found that applying a capsule network to its complaint-text model gave better overall performance than previous networks (such as LSTM, Bi-LSTM, and CNN-rand).
In addition, the network adopts the common-base idea: it divides the feature vector into several groups and projects them onto multiple subspaces using the same set of basis vectors, so there is no need for huge amounts of training data to learn how to recognize target objects effectively in various situations. Good generalization can be obtained by training with only a small amount of data.
In terms of visual reconstruction, a network based on common-base capsule projection can accurately reconstruct objects even when the scene contains multiple occlusions.
At present, the deployment of capsule networks in real-world scenarios is still at an early stage, but given their unparalleled characteristics, capsule networks will have broad application prospects in fields such as computer vision and natural language processing.
For image classification tasks, current deep learning methods use convolutional layers to extract features, map the feature map generated by the convolutional layers into a fixed-length feature vector, and then attach several fully connected layers for classification. For example, AlexNet's ImageNet model outputs a 1000-dimensional vector representing the probability that the input image belongs to each category (softmax normalization). However, the features extracted by a convolutional neural network lack spatial relevance. The present invention does not pass the convolved image features through a fully connected network, and avoids directly aggregating channels of the feature layer into several capsules (the approach taken by current capsule networks); instead, using the common-base capsule projection idea, it divides the features into several groups of vectors and then performs common-base capsule projection, so that the features are projected onto multiple capsule subspaces before the classification task is predicted. Experiments show that the network can further improve the accuracy of classification tasks.
Moreover, the classification accuracy of the capsule common-base projection network of the present invention can exceed that of other mainstream network structures, which also points out a new direction for improving the performance of deep networks.
Figures 1 and 2 respectively show the implementation process of the method for image classification using common-base capsule projection provided by an embodiment of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown; they are detailed below in conjunction with the drawings:
The method for image classification using common-base capsule projection provided by the embodiment of the present invention comprises the following steps:
(1) Extract the features of the input image with a multi-layer convolutional network to obtain a feature map.
The features are the feature maps extracted by the convolutional and pooling layers of a convolutional neural network. In the embodiment of the present invention, basic convolutional architectures include VGG, GoogLeNet, ResNet, DenseNet, and so on; the specific network framework can be selected as needed.
The feature map extracted from the image by the convolutional neural network is a four-dimensional tensor (B, C, W, H), where B is the batch size, C is the number of channels, W is the width of the image, and H is the height of the image. The feature map holds detailed feature information of the image, which helps with the prediction of classification tasks.
Using a CNN to extract features has inherent advantages: it uses convolutional layers to extract rich semantic features of the image, then uses pooling layers to reduce network parameters, and finally uses fully connected layers to interpret the features.
In the embodiment of the present invention, other methods can also be used to extract feature maps, such as traditional machine learning methods (decision tree classification, random forest classification, K-nearest-neighbor classifiers, multi-layer perceptrons (MLP), and so on) and RNNs (recurrent neural networks), but CNNs are the usual deep learning method for image classification.
(2) Map the feature map generated by the convolutional layers into a fixed-length feature vector X.
The feature map produced by the convolutional neural network is a four-dimensional tensor (B, C, W, H), where B is the batch size, C is the number of channels, W is the width of the image, and H is the height of the image. For classification, this four-dimensional tensor is usually first flattened into a one-dimensional vector and then passed through a fully connected network for class prediction.
(3) Apply a feature transformation to the feature vector X, divide X into N groups, and combine the groups into the feature matrix [x_1, x_2, …, x_N].
(4) Perform common-base capsule projection of the feature matrix onto multiple capsule subspaces, compute the sum of the moduli of the projected vectors in each subspace, and predict the image class according to these sums.
The feature matrix is orthogonally projected onto multiple capsule subspaces (if L categories are to be predicted, the number of subspaces is L). The projection process loses no information, and the capsule subspaces contain more new feature information, so the network structure can be trained more effectively. During projection, each group of vectors in the feature matrix [x_1, x_2, …, x_N] is projected using the same set of basis vectors, which reduces the parameters, thereby lowering the complexity of the network and speeding up network training and convergence.
For image classification tasks, using the common-base capsule projection network of the present invention not only increases prediction accuracy but also reduces the parameter count, thereby speeding up recognition.
In the embodiment of the present invention, the feature matrix is orthogonally projected onto multiple capsule subspaces (if L categories are to be predicted, the number of subspaces is L). Only a very small amount of information is lost during the projection, and the capsule subspaces contain more new feature information, so the network structure can be trained more effectively. During projection, each group of vectors in the feature matrix [x_1, x_2, …, x_N] is projected using the same set of basis vectors, which reduces the parameters (reflected in a projection basis matrix with fewer entries), thereby lowering the complexity of the network and speeding up training and convergence. Since the capsule network retains detailed spatial information of the image, it has application prospects in various computer vision fields such as localization, object detection, semantic segmentation, and instance segmentation.
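To make the parameter saving concrete, the following back-of-the-envelope sketch compares a per-group basis scheme with the shared common basis, under assumed sizes d = 1024, N = 8, L = 10, c = 4; these numbers are illustrative only and are not fixed by the patent.

```python
d, N, L, c = 1024, 8, 10, 4            # illustrative sizes, not fixed by the patent

fc_head = d * L                        # plain fully connected classification head
per_group_bases = L * N * (d // N) * c # a separate basis for every (class, group) pair
common_base = L * (d // N) * c         # one basis per class, shared by all N groups

print(fc_head)                         # 10240
print(per_group_bases)                 # 40960
print(common_base)                     # 5120 -> N times fewer than per-group bases
```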
The "base" here refers to basis vectors; in any space, a set of basis vectors can be found to express all the vectors in that space. In the present invention, the network is used to optimize and reduce the projection loss, so that the final projection result retains the original information as much as possible.
Figure 3 shows a schematic diagram of the projection of a capsule space provided by an embodiment of the present invention. In the figure, N is 4, which means the feature vector is divided into 4 groups before common-base capsule projection is performed. The details are described below.
In the embodiment of the present invention, the specific projection process is as follows:
X is the feature vector obtained by transforming the feature map; its dimension is d. The feature vector is divided into N groups to form the feature matrix {x_1, x_2, …, x_N}; each group of vectors in the matrix has dimension d/N. N is a parameter, usually an integer greater than 1; how many groups the features are divided into can be set as desired.
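As a concrete illustration of this grouping step, a minimal sketch assuming B = 8 samples, d = 1024, and N = 4 (illustrative values, not fixed by the patent):

```python
import torch

B, d, N = 8, 1024, 4                     # illustrative sizes
X = torch.randn(B, d)                    # feature vector from step (2)
feature_matrix = X.view(B, N, d // N)    # N groups x_1..x_N, each of dimension d/N
print(feature_matrix.shape)              # torch.Size([8, 4, 256])
```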
To learn the features of each category, the network will finally learn a set of capsule subspaces {S_1, S_2, …, S_L}, where L is the predefined number of categories. By learning a set of projection basis matrices W_l ∈ R^{(d/N)×c}, the feature vectors are projected into the capsule subspace S corresponding to each class, finally yielding the projected capsule subspace vectors {v_1, v_2, …, v_L}, each of dimension c. To learn discriminative features, constrained optimization ensures that the orthogonal basis of each capsule subspace retains the original feature information as much as possible. The length of the projected subspace vector v_l represents the probability that the category is present, and its direction represents the attributes of the category. The capsule subspace search model is as follows:

$$\min_{v_l \in \operatorname{span}(W_l)} \left\| x - v_l \right\|_2 \qquad (1)$$
The formula means finding an optimal projection vector v_l in the subspace span(W_l) such that the error between v_l and the input vector x is minimized; in other words, the vector projected into the subspace should preserve the information of the original input as much as possible. To find a suitable set of bases W_l satisfying the above formula, the following constraint is adopted:

$$v_l = P_l x, \qquad P_l = W_l W_l^{+} \qquad (2)$$

where P_l is the projection matrix of the capsule subspace S_l (S_l = span(W_l)) and W_l^+ is the generalized inverse of W_l. When the columns of W_l are linearly independent, W_l^+ = (W_l^T W_l)^{-1} W_l^T.
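A small numerical sanity check of formula (2), under randomly generated data (the sizes d/N = 256 and c = 4 are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 4))      # basis W_l with independent columns (d/N=256, c=4)
x = rng.standard_normal(256)

P_pinv = W @ np.linalg.pinv(W)                      # P_l = W_l W_l^+
P_normal = W @ np.linalg.inv(W.T @ W) @ W.T         # (W^T W)^{-1} W^T form of W_l^+
assert np.allclose(P_pinv, P_normal)

v = P_pinv @ x                                      # v_l = P_l x
assert np.allclose(P_pinv @ v, v)                   # projecting twice changes nothing
assert abs((x - v) @ v) < 1e-8                      # residual x_perp is orthogonal to v_l
```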
Therefore, the length of the projected capsule v_l can be computed directly as:

$$\|v_l\|_2 = \sqrt{x^{\top} P_l\, x} = \left\| \Sigma_l^{1/2} W_l^{\top} x \right\|_2 \qquad (3)$$

where Σ_l = (W_l^T W_l)^{-1}, which can be regarded as a weight regularization term. After the length ‖v_l‖_2 of the projected vector in each subspace is obtained, a cross-entropy loss is used to find the optimal subspace for each category:

$$\mathcal{L} = -\log \frac{\exp(\|v_y\|_2)}{\sum_{l=1}^{L} \exp(\|v_l\|_2)} \qquad (4)$$

where v_y is the projection of the input vector x onto the subspace S_y of the correct category.
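Note that formula (4) coincides with the standard softmax cross-entropy applied to the capsule lengths used as logits; a minimal sketch, assuming a batch of 8 samples and L = 10 classes:

```python
import torch
import torch.nn.functional as F

lengths = torch.randn(8, 10).abs()        # stand-in for ||v_l||_2, batch of 8, L = 10
labels = torch.randint(0, 10, (8,))       # correct category y for each sample

loss = F.cross_entropy(lengths, labels)   # -log( e^{||v_y||} / sum_l e^{||v_l||} ), averaged
manual = (torch.logsumexp(lengths, dim=1) - lengths[torch.arange(8), labels]).mean()
assert torch.allclose(loss, manual)
```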
The gradient of the basis in the subspace is computed as follows:

$$\frac{\partial \|v_l\|_2}{\partial W_l} = \frac{x_{\perp} x^{\top} W_l^{+\top}}{\|v_l\|_2} \qquad (5)$$

As shown in Figure 4, x_⊥ = x − v_l = x − P_l x = (I − P_l) x, and therefore

$$\frac{\partial \|v_l\|_2}{\partial W_l} = \frac{(I - P_l)\, x\, x^{\top} W_l^{+\top}}{\|v_l\|_2}$$

which means that the update of the subspace basis is guided by the component of the projected vector orthogonal to the subspace: when the orthogonal component x_⊥ is 0, the gradient of the basis is 0, at which point the basis W_l is optimal and preserves all the information of the original input x.
Figure 4 shows a schematic diagram of the orthogonal-component-guided gradient update provided by an embodiment of the present invention. When searching for the optimal basis of a capsule subspace, the update of the basis vectors is guided by the orthogonal component; as the orthogonal component tends to 0, the network learns the optimal basis. For each subspace capsule, once the optimal basis is obtained, the sum of the moduli of the vectors projected with the optimal basis is calculated, and this number indicates the probability of the final classification.
The present invention avoids directly aggregating channels of the feature layer into several capsules (the approach of current capsule networks). Instead, it uses the common-base capsule projection idea to project features onto multiple capsule subspaces before predicting the classification task. Experiments show that the network adapts to images of all scales and achieves very good results even when trained on a relatively small dataset. Moreover, the method of grouping the feature vector and then performing common-base projection also lowers the complexity of the network, reduces the number of network parameters, and speeds up network training and prediction.
Table 1: Selected experimental results (reproduced as an image in the original publication)
Table 1 shows the experimental results on the CIFAR10 and CIFAR100 datasets. Experimental analysis shows that the capsule common-base projection network of the present invention not only improves the accuracy of classification predictions but also reduces the number of network parameters and improves the speed of network training and prediction.
In summary, for image classification tasks, current deep learning methods use convolutional layers to extract features, map the feature map generated by the convolutional layers into a fixed-length feature vector, and then attach several fully connected layers for classification. For example, AlexNet's ImageNet model outputs a 1000-dimensional vector representing the probability that the input image belongs to each category (softmax normalization). However, the features extracted by a convolutional neural network lack spatial relevance. The present invention does not pass the convolved image features through a fully connected network, and avoids directly aggregating channels of the feature layer into several capsules (the approach taken by current capsule networks); instead, using the common-base capsule projection idea, it divides the features into several groups of vectors and then performs common-base capsule projection, so that the features are projected onto multiple capsule subspaces before the classification task is predicted. Experiments show that the network can further improve the accuracy of classification tasks.
Those skilled in the art will readily understand that the above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

  1. A method for image classification using common-base capsule projection, characterized by comprising the following steps:
    (1) extracting the features of an input image with a multi-layer convolutional network to obtain a feature map;
    (2) mapping the feature map into a one-dimensional feature vector X;
    (3) applying a feature transformation to the feature vector X, dividing the feature vector X into N groups, and combining the groups into a feature matrix [x_1, x_2, …, x_N];
    (4) performing common-base capsule projection of the feature matrix onto multiple capsule subspaces, computing the sum of the moduli of the projected vectors in each subspace, and predicting the image class according to these sums.
  2. The method according to claim 1, characterized in that, when predictions over L categories are required, the number of capsule subspaces is L.
  3. The method according to claim 1 or 2, characterized in that each group of vectors in the feature matrix [x_1, x_2, …, x_N] is projected using the same set of basis vectors.
  4. The method according to any one of claims 1 to 3, characterized in that, in step (3), the dimension of the feature vector X is d and the dimension of each group of vectors in the feature matrix is d/N.
  5. The method according to any one of claims 1 to 4, characterized in that, in step (4), a set of projection basis matrices W_l ∈ R^{(d/N)×c} is learned and used to project the feature vectors into the capsule subspace S corresponding to each class, yielding the projected capsule subspace vectors {v_1, v_2, …, v_L}, each of dimension c.
  6. The method according to claim 5, characterized in that, in step (4), the capsule subspace search model is:

    $$\min_{v_l \in \operatorname{span}(W_l)} \left\| x - v_l \right\|_2 \qquad (1)$$

    the formula meaning that an optimal projection vector v_l is found in the subspace span(W_l) such that the error between v_l and the input vector x is minimized.
  7. The method according to claim 5 or 6, characterized in that, in step (4), in order to find a suitable set of bases W_l, the following constraint is adopted:

    $$v_l = P_l x, \qquad P_l = W_l W_l^{+} \qquad (2)$$

    where P_l is the projection matrix of the capsule subspace S_l (S_l = span(W_l)) and W_l^+ is the generalized inverse of W_l; when the columns of W_l are linearly independent, W_l^+ = (W_l^T W_l)^{-1} W_l^T.
  8. The method according to any one of claims 5 to 7, characterized in that, in step (4), the length of the projected capsule v_l is computed by the following formula:

    $$\|v_l\|_2 = \sqrt{x^{\top} P_l\, x} = \left\| \Sigma_l^{1/2} W_l^{\top} x \right\|_2 \qquad (3)$$

    where Σ_l = (W_l^T W_l)^{-1}, which can be regarded as a weight regularization term.
  9. The method according to any one of claims 5 to 8, characterized in that, in step (4), after the length ‖v_l‖_2 of the projected vector in each subspace is obtained, a cross-entropy loss is used to find the optimal subspace for each category:

    $$\mathcal{L} = -\log \frac{\exp(\|v_y\|_2)}{\sum_{l=1}^{L} \exp(\|v_l\|_2)} \qquad (4)$$

    where v_y is the projection of the input vector x onto the subspace S_y of the correct category.
  10. The method according to any one of claims 5 to 9, characterized in that, in step (4), the gradient of the basis in the subspace is computed with the following formula:

    $$\frac{\partial \|v_l\|_2}{\partial W_l} = \frac{x_{\perp} x^{\top} W_l^{+\top}}{\|v_l\|_2} \qquad (5)$$

    where x_⊥ = x − v_l = x − P_l x = (I − P_l) x, and therefore

    $$\frac{\partial \|v_l\|_2}{\partial W_l} = \frac{(I - P_l)\, x\, x^{\top} W_l^{+\top}}{\|v_l\|_2}$$

    the update of the subspace basis being guided by the component of the projected vector orthogonal to the subspace: when the orthogonal component x_⊥ is 0, the gradient of the basis is 0, at which point the basis W_l is optimal and preserves all the information of the original input x.
PCT/CN2019/092109 2019-06-20 2019-06-20 Method for image classification using common-base capsule projection WO2020252746A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/092109 WO2020252746A1 (zh) 2019-06-20 2019-06-20 Method for image classification using common-base capsule projection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/092109 WO2020252746A1 (zh) 2019-06-20 2019-06-20 Method for image classification using common-base capsule projection

Publications (1)

Publication Number Publication Date
WO2020252746A1 true WO2020252746A1 (zh) 2020-12-24

Family

ID=74037611

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/092109 WO2020252746A1 (zh) 2019-06-20 2019-06-20 Method for image classification using common-base capsule projection

Country Status (1)

Country Link
WO (1) WO2020252746A1 (zh)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345575A (zh) * 2018-09-17 2019-02-15 中国科学院深圳先进技术研究院 Deep-learning-based image registration method and device
CN109376636A (zh) * 2018-10-15 2019-02-22 电子科技大学 Fundus retina image classification method based on capsule networks
CN109840560A (zh) * 2019-01-25 2019-06-04 西安电子科技大学 Image classification method based on capsule networks incorporating clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QI, GUOJUN: "From Capsule Projection Network to High-dimensional Extension of Weight Normalization", HTTPS://ZHUANLAN.ZHIHU.COM/P/53224814, 7 January 2019 (2019-01-07), DOI: 20200226145709X *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205137A (zh) * 2021-04-30 2021-08-03 中国人民大学 Image recognition method and system based on capsule parameter optimization
CN114187506A (zh) * 2021-11-22 2022-03-15 武汉科技大学 Remote sensing image scene classification method using a viewpoint-aware dynamic-routing capsule network
CN114187506B (zh) * 2021-11-22 2024-08-06 武汉科技大学 Remote sensing image scene classification method using a viewpoint-aware dynamic-routing capsule network
CN114528407A (zh) * 2022-02-23 2022-05-24 安徽理工大学 Orthogonal-projection-based Bi-LSTM-CNN emotional feature extraction method

Similar Documents

Publication Publication Date Title
WO2023273290A1 Article image re-identification method based on multi-feature information capture and correlation analysis
WO2021227726A1 Facial detection and image detection neural network training method, apparatus, and device
WO2021022521A1 Data processing method, and method and device for training a neural network model
CN110263855B Method for image classification using common-base capsule projection
WO2020252746A1 Method for image classification using common-base capsule projection
CN112348036A Adaptive object detection method based on lightweight residual learning and deconvolution cascading
CN110751027B Pedestrian re-identification method based on deep multiple-instance learning
WO2021169160A1 Image normalization processing method and apparatus, and storage medium
CN111738355A Image classification method and apparatus fusing attention and mutual information, and storage medium
CN114170410A PointNet-based point cloud part-level segmentation method using graph convolution and KNN search
Deng A survey of convolutional neural networks for image classification: Models and datasets
CN108537109B OpenPose-based monocular camera sign language recognition method
Sahu et al. Dynamic routing using inter capsule routing protocol between capsules
CN111368733B Three-dimensional hand pose estimation method based on label distribution learning, storage medium, and terminal
CN115457332A Image multi-label classification method based on graph convolutional neural networks and class activation mapping
Liu et al. Bilaterally normalized scale-consistent sinkhorn distance for few-shot image classification
US20230072445A1 Self-supervised video representation learning by exploring spatiotemporal continuity
CN111144469B End-to-end multi-sequence text recognition method based on multi-dimensional correlated temporal classification neural networks
CN117671666A Target recognition method based on adaptive graph convolutional neural networks
CN116977265A Training method and apparatus for a defect detection model, computer device, and storage medium
CN113688864B Human-object interaction relation classification method based on split attention
CN113705731A End-to-end image template matching method based on Siamese networks
Liu et al. Application of object detection algorithm in identification of rice weevils and maize weevils
CN111666956A Multi-scale feature extraction and fusion method and apparatus
Yue et al. Study on the deep neural network of intelligent image detection and the improvement of elastic momentum on image recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933610

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19933610

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.03.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19933610

Country of ref document: EP

Kind code of ref document: A1