CN111507410B - Construction method of convolutional capsule layer and classification method and device of multi-view images - Google Patents

Construction method of convolutional capsule layer and classification method and device of multi-view images

Info

Publication number
CN111507410B
Authority
CN
China
Prior art keywords
capsule
layer
input
output layer
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010309310.7A
Other languages
Chinese (zh)
Other versions
CN111507410A (en)
Inventor
宁欣
李卫军
田伟娟
孙琳钧
李爽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Shangyi Health Technology Beijing Co ltd
Original Assignee
Institute of Semiconductors of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Semiconductors of CAS
Priority to CN202010309310.7A
Publication of CN111507410A
Application granted
Publication of CN111507410B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application provides a construction method of a convolutional capsule layer. The convolutional capsule layer comprises at least an input layer and an output layer, each having a plurality of capsules, and the method comprises the following steps: S1, taking the inner product of a Gabor filter and a convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules; S2, convolving the convolutional Gabor filter with the feature map input to the input layer to obtain prediction vectors; S3, constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules; S4, obtaining the input of the output-layer capsules according to the assignment probabilities; and S5, activating the input of the output-layer capsules through a Squash activation function to obtain the output of the output-layer capsules. In addition, a construction apparatus for the convolutional capsule layer, a multi-view image classification method and apparatus, and an electronic device are also provided.

Description

Construction method of convolutional capsule layer and classification method and device of multi-view images
Technical Field
The application relates to the technical field of pattern recognition, in particular to a construction method of a convolutional capsule layer and a classification method and apparatus for multi-view images.
Background
Convolutional neural networks (CNNs) have made breakthroughs in many computer vision tasks in recent years and significantly outperform many traditional hand-crafted feature-driven models. Two common themes for improving CNN performance are increasing the depth and width of the network (e.g., the number of levels and the number of units per level) and using as much training data as possible. Despite this success, CNNs have several limitations, such as the invariance caused by pooling and an inability to model the spatial relationships between parts. To address these limitations, the dynamic-routing-based CapsNet was proposed; comprising only one convolutional layer and one fully connected capsule layer, it has shown results comparable to CNNs on several standard datasets. Beyond dynamic routing, the matrix capsule approach, which represents each entity by a pose matrix and routes with EM routing, has many extensions, such as data augmentation using mixed hit-and-miss layers. Existing attempts to create a deep CapsNet by simply stacking fully connected capsule layers yield an architecture similar to an MLP model, with several limitations. First, the dynamic routing used in capsule networks is extremely computationally expensive, and multiple routing layers increase training and inference times. Second, it has recently been shown that stacking fully connected capsule layers leads to poor learning in the middle layers: when there are too many capsules, the coupling coefficients become too small, attenuating the gradient flow and inhibiting learning. Third, it has been shown that, particularly in the lower layers, related units tend to concentrate in local regions. Although local routing could exploit this observation explicitly, such local routing cannot be implemented with fully connected capsule layers.
Disclosure of Invention
Technical problem to be solved
The application provides a construction method for a convolutional capsule layer and a classification method and apparatus for multi-view images, which at least partially solve the technical problems identified above.
(II) technical scheme
In a first aspect, the present application provides a method of constructing a convolutional capsule layer, the convolutional capsule layer comprising at least an input layer and an output layer, each having a plurality of capsules, the method comprising: S1, taking the inner product of a Gabor filter and a convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules; S2, convolving the convolutional Gabor filter with the feature map input to the input layer to obtain prediction vectors; S3, constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules; S4, obtaining the input of the output-layer capsules according to the assignment probabilities; and S5, activating the input of the output-layer capsules through a Squash activation function to obtain the output of the output-layer capsules.
Optionally, the self-attention route is constructed as follows: the prediction vector [w_j, h_j, n_i, n_j, d_j] is transposed to obtain [w_j, h_j, n_j, n_i, d_j], so that the number of elements n_j of the j-th output-layer capsule serves as the heads of a multi-head attention mechanism, and the correlation between the affine-transformed initial prediction vectors of the i-th input-layer capsule is computed along the n_i dimension, where w_j is the width of the convolved feature map, h_j is its height, n_i is the number of elements of the i-th input-layer capsule, n_j is the number of elements of the j-th output-layer capsule, and d_j is the capsule dimension.
Optionally, the assignment probability is calculated as follows:

The attention value head_h of the prediction vector is obtained as

head_h = softmax(X·Y^T / √dim)·Z

where X is the query vector, Y is the key vector, and Z is the value vector, each obtained from the prediction vector û_{j|i} by a linear mapping with a parameter matrix, and dim is the dimension of the query and key vectors.

The attention value is taken as the weight coefficient from input-layer capsules to output-layer capsules.

The weight coefficients are concatenated to obtain the probability value from input-layer capsule i to output-layer capsule j: c_ij = Concat(head_1, ..., head_h, ..., head_H), H = n_j.
Optionally, the input of an output-layer capsule is calculated as:

s_j = Σ_i c_ij · û_{j|i}

where s_j is the input of the j-th output-layer capsule, c_ij is the probability value from input-layer capsule i to output-layer capsule j, û_{j|i} is the prediction vector, i indexes the capsules of the input layer, and j indexes the capsules of the output layer.
Optionally, the output of an output-layer capsule is calculated as:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)

where v_j is the output vector of the j-th output-layer capsule.
In a second aspect, the present application provides a method for classifying multi-view images based on the above convolutional capsule layer, comprising: inputting an image into a convolutional neural network to obtain a main feature image; and inputting the main feature image into two of the convolutional capsule layers to obtain a classification result for the multi-view image.
Optionally, the convolutional neural network comprises an input layer, a plurality of convolutional layers, a ReLU layer that sets part of the neuron outputs to 0 to induce sparsity, and a max-pooling layer that compresses the feature image to obtain the main feature image.
In a third aspect, the present application provides an apparatus for constructing a convolutional capsule layer, the convolutional capsule layer comprising at least an input layer and an output layer, each having a plurality of capsules, the apparatus comprising: an inner product module for taking the inner product of the Gabor filter and the convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules; a convolution module for convolving the convolutional Gabor filter with the feature map input to the input layer to obtain prediction vectors; a construction module for constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules; an obtaining module for obtaining the input of the output-layer capsules according to the assignment probabilities; and an activation module for activating the input of the output-layer capsules through a Squash activation function to obtain the output of the output-layer capsules.
In a fourth aspect, the present application provides an apparatus for classifying multi-view images, comprising: a first input module for inputting an image into a convolutional neural network to obtain a main feature image; and a second input module for inputting the main feature image into two of the convolutional capsule layers to obtain a classification result for the multi-view image.
In a fifth aspect, the present application provides an electronic device, comprising: a processor; and a memory having computer readable instructions stored thereon, which when executed by the processor, cause the processor to perform the above-described method.
(III) advantageous effects
The application provides a construction method for a convolutional capsule layer and a classification method and apparatus for multi-view images. Replacing the traditional capsule construction based on ordinary convolution with a 3D convolution method based on Gabor convolution greatly reduces the complexity of the algorithm and makes the construction of a deep capsule network feasible, and the modulation by the Gabor filter used for convolution guides the learning of the convolutional features. Finally, a deep capsule network based on sausage capsule learning is constructed, which overcomes the vanishing gradients caused by deep stacking and the excessive coupling of capsules in traditional capsule network construction. The method can be applied to multi-view image classification scenarios such as image retrieval, intelligent monitoring, intelligent transportation, and security surveillance.
Drawings
FIG. 1 schematically illustrates a step diagram of a method of constructing a convolutional capsule layer according to an embodiment of the disclosure;
FIG. 2 schematically illustrates a flow chart of a method of constructing a convolutional capsule layer according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a step diagram of a classification method of multi-view images according to an embodiment of the present disclosure;
FIG. 4 schematically illustrates a block diagram of a construction apparatus for a convolutional capsule layer according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a block diagram of a classification apparatus of multi-view images according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Embodiments of the present disclosure will be described below with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.
Embodiments of the present disclosure provide a method of constructing a convolutional capsule layer, the convolutional capsule layer comprising at least an input layer and an output layer, each having a plurality of capsules. As shown in fig. 1, the method comprises: S1, taking the inner product of a Gabor filter and a convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules; S2, convolving the convolutional Gabor filter with the feature map input to the input layer to obtain prediction vectors; S3, constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules; S4, obtaining the input of the output-layer capsules according to the assignment probabilities; and S5, activating the input of the output-layer capsules through a Squash activation function to obtain the output of the output-layer capsules.
The construction method of the convolutional capsule layer in the present disclosure will be described in detail below with reference to the accompanying drawings. The input layer of the convolutional capsule layer of the disclosed embodiments includes a plurality of capsules, and the output layer also includes a plurality of capsules.
S1, taking the inner product of the Gabor filter and the convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules;
as shown in FIG. 2, firstly, a gabor filter with 4 directions can be initialized, the parameters are fixed, secondly, a convolution kernel with fixed size and learnable parameters is initialized, and the two are subjected to inner product to obtain a convolution gabor filter wijWherein i is inputThe ith capsule of the layer, j being the jth capsule of the output layer.
S2, convolving the convolutional Gabor filter with the feature map input to the input layer to obtain the prediction vector;
The convolutional Gabor filter is convolved with the i-th input feature map u_i to obtain the prediction vector û_{j|i}. The calculation formula is:

û_{j|i} = w_ij * u_i
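Continuing the sketch above (the filter and feature-map sizes are again assumptions), the prediction maps then reduce to an ordinary 2-D convolution:

```python
import torch
import torch.nn.functional as F

w_ij = torch.randn(64, 8, 3, 3)         # convolutional Gabor filters from S1 (assumed shape)
u_i = torch.randn(1, 8, 14, 14)         # input feature map u_i of capsule i (assumed size)
# û_{j|i} = w_ij * u_i: a plain 2-D convolution yields the prediction maps.
u_hat = F.conv2d(u_i, w_ij, padding=1)  # -> (1, 64, 14, 14)
```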
S3, constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules;
the self-attention route can be constructed by using a prediction vector [ wj,hj,ni,nj,dj]Transpose to obtain [ wj,hj,nj,ni,dj]So as to output the number n of capsules corresponding to the jth capsule of the output layerjAs heads of a multi-headed attention mechanism, along niCalculating the correlation between initial prediction vectors of the ith capsule of the input layer after affine transformation in the dimension where wjFor the width of the convolved feature map, hjFor the height of the convolved feature map, niFor the number of elements of the i-th capsule of the input layer, njNumber of elements of jth capsule of output layer, djIs the dimension of the capsule.
The attention value head_h of the prediction vector is obtained as

head_h = softmax(X·Y^T / √dim)·Z

where X is the query vector, Y is the key vector, and Z is the value vector. X, Y, and Z are obtained from the prediction vector by linear mappings with parameter matrices. The similarity between the query vector X and the key vector Y is first computed as an inner product; the scale factor 1/√dim then keeps the inner-product values from becoming too large, where dim is the dimension of the query and key vectors.
The attention value head_h is used as the weight coefficient from input-layer capsules to output-layer capsules.

The weight coefficients are concatenated to obtain the probability value from input-layer capsule i to output-layer capsule j: c_ij = Concat(head_1, ..., head_h, ..., head_H), H = n_j. That is, the assignment probabilities from input-layer capsules to output-layer capsules are obtained by concatenating the weight coefficients of the individual attention heads.
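A hedged sketch of this routing step, assuming the five-axis layout [w_j, h_j, n_i, n_j, d_j] described above; the projection matrices and all sizes are illustrative, not the patent's exact parameterization:

```python
import torch
import torch.nn.functional as F

def self_attention_route(u_hat, dim=8):
    """u_hat: prediction vectors laid out as (w_j, h_j, n_i, n_j, d_j)."""
    wj, hj, ni, nj, dj = u_hat.shape
    u = u_hat.permute(0, 1, 3, 2, 4)                 # -> (w_j, h_j, n_j, n_i, d_j)
    # X, Y, Z as linear maps of the prediction vectors (random here; learned in practice).
    Wq, Wk, Wv = torch.randn(dj, dim), torch.randn(dj, dim), torch.randn(dj, 1)
    X, Y, Z = u @ Wq, u @ Wk, u @ Wv
    # head_h = softmax(X·Y^T / sqrt(dim))·Z, one head per output capsule (H = n_j).
    head = F.softmax(X @ Y.transpose(-2, -1) / dim ** 0.5, dim=-1) @ Z
    # Concatenating the n_j heads recovers c_ij with shape (w_j, h_j, n_i, n_j).
    return head.squeeze(-1).permute(0, 1, 3, 2)

c_ij = self_attention_route(torch.randn(14, 14, 32, 10, 8))  # toy sizes
```

Note how each of the n_j output capsules acts as one attention head, and stacking the heads back along the capsule axis matches the Concat formulation above.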
S4, obtaining the input of the capsule of the output layer according to the distribution probability;
The input of an output-layer capsule is calculated as:

s_j = Σ_i c_ij · û_{j|i}

where s_j is the input of the j-th output-layer capsule, c_ij is the probability value from input-layer capsule i to output-layer capsule j, and û_{j|i} is the prediction vector.
And S5, activating the input of the capsules of the output layer through a Squash activation function to obtain the output of the capsules of the output layer.
The formula for the output of an output-layer capsule is:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)

where v_j is the output vector of the j-th output-layer capsule.
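Putting S4 and S5 together, a small sketch with the same assumed shapes as above (the eps guard against division by zero is an implementation detail, not from the text):

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """v = (||s||^2 / (1 + ||s||^2)) · s / ||s||, applied along `dim`."""
    sq = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq / (1.0 + sq)) * s / torch.sqrt(sq + eps)

def output_capsules(u_hat, c_ij):
    """u_hat: (w_j, h_j, n_i, n_j, d_j); c_ij: (w_j, h_j, n_i, n_j)."""
    s_j = (c_ij.unsqueeze(-1) * u_hat).sum(dim=2)  # s_j = Σ_i c_ij · û_{j|i}
    return squash(s_j)                             # v_j, shape (w_j, h_j, n_j, d_j)

v_j = output_capsules(torch.randn(14, 14, 32, 10, 8),
                      torch.randn(14, 14, 32, 10).softmax(dim=-1))
```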
Building on the self-attention mechanism and on the study of Gabor convolution, the method provides a novel way to construct a Gabor convolutional capsule based on self-attention routing. On this basis, a convolutional capsule network with attention routing is constructed, which reduces parameters, enables local routing, and brings a dynamic-routing-style mechanism into convolutional neural networks so that deeper network structures can be built. The method addresses the vanishing gradients caused by deep stacking and the excessive coupling of capsules in traditional capsule network construction, and preserves the accuracy of the feature representation in multi-view image classification.
The present disclosure further discloses a method for classifying multi-view images based on the above convolutional capsule layer. As shown in fig. 3, the method comprises:
S31, inputting the image into a convolutional neural network to obtain a main feature image;
the convolutional neural network includes an input layer, a plurality of convolutional layers, a ReLU layer, and a max-firing layer. Converting the multi-view image X into (X)1,x2,……xm) And inputting an input layer, wherein the ReLU layer is used for enabling partial neuron output to be 0, sparseness is caused, and the max-posing layer is used for compressing the characteristic image to obtain a main characteristic image.
S32, inputting the main feature image into two of the convolutional capsule layers to obtain the classification result for the multi-view image.
Based on the same inventive concept, the embodiments of the present disclosure further provide an apparatus for constructing a convolutional capsule layer, which is introduced below with reference to fig. 4.
Fig. 4 schematically illustrates a block diagram of a construction apparatus 400 for a convolutional capsule layer, in accordance with an embodiment of the disclosure.

As shown in fig. 4, the construction apparatus 400 for the convolutional capsule layer includes an inner product module 410, a convolution module 420, a construction module 430, an obtaining module 440, and an activation module 450. The construction apparatus 400 may perform the various methods described above with reference to figs. 1 and 2.
The convolutional capsule layer includes at least an input layer having a plurality of capsules and an output layer, and the apparatus comprises:

the inner product module 410, which performs, for example, operation S1 described above with reference to fig. 1, for taking the inner product of the Gabor filter and the convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules;

the convolution module 420, which performs, for example, operation S2 described above with reference to fig. 1, for convolving the convolutional Gabor filter with the feature map input to the input layer to obtain the prediction vector;

the construction module 430, which performs, for example, operation S3 described above with reference to fig. 1, for constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules;

the obtaining module 440, which performs, for example, operation S4 described above with reference to fig. 1, for obtaining the input of the output-layer capsules according to the assignment probabilities;

the activation module 450, which performs, for example, operation S5 described above with reference to fig. 1, for activating the input of the output-layer capsules via the Squash activation function to obtain the output of the output-layer capsules.
The embodiment of the present disclosure further provides a device for classifying multi-view images, and the device 500 for classifying multi-view images according to the embodiment of the present disclosure is described below with reference to fig. 5.
Fig. 5 schematically shows a block diagram of a classification apparatus 500 of a multi-view image according to an embodiment of the present disclosure.
As shown in fig. 5, the apparatus 500 for classifying multi-view images includes a first input module 510 and a second input module 520. The apparatus 500 for classifying a multi-view image may perform various methods described above with reference to fig. 3.
The first input module 510 performs, for example, operation S31 described above with reference to fig. 3, for inputting the image into a convolutional neural network to obtain a main feature image;

the second input module 520 performs, for example, operation S32 described above with reference to fig. 3, for inputting the main feature image into two of the convolutional capsule layers to obtain the classification result for the multi-view image.
Fig. 6 schematically shows a block diagram of an electronic device adapted to implement the methods of the present disclosure, in accordance with an embodiment of the present disclosure. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 includes a processor 610 and a computer-readable storage medium 620. The electronic device 600 may perform a method according to an embodiment of the disclosure.
In particular, the processor 610 may comprise, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 610 may also include onboard memory for caching purposes. The processor 610 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
Computer-readable storage medium 620 may be, for example, any medium that can contain, store, communicate, propagate, or transport the instructions. For example, a readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of the readable storage medium include: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 620 may include a computer program 621, which computer program 621 may include code/computer-executable instructions that, when executed by the processor 610, cause the processor 610 to perform a method according to an embodiment of the disclosure, or any variation thereof.
The computer program 621 may be configured with, for example, computer program code comprising computer program modules. For example, the code in the computer program 621 may include one or more program modules, such as modules 621A, 621B, and so on. It should be noted that the division and number of the modules are not fixed; those skilled in the art may use suitable program modules or combinations thereof according to the actual situation, so that when these program modules are executed by the processor 610, the processor 610 can carry out the method according to the embodiments of the present disclosure or any variation thereof.
Those skilled in the art will appreciate that various combinations and/or combinations of features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or combinations are not expressly recited in the present disclosure.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method of constructing a convolutional capsule layer, the convolutional capsule layer comprising at least an input layer having a plurality of capsules and an output layer, the method comprising:
S1, taking the inner product of the Gabor filter and the convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules;

S2, convolving the convolutional Gabor filter with the feature map input to the input layer to obtain a prediction vector;
S3, constructing a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules, the self-attention route being constructed as follows: the prediction vector [w_j, h_j, n_i, n_j, d_j] is transposed to obtain [w_j, h_j, n_j, n_i, d_j], so that the number of elements n_j of the j-th output-layer capsule serves as the heads of a multi-head attention mechanism, and the correlation between the affine-transformed initial prediction vectors of the i-th input-layer capsule is computed along the n_i dimension, where w_j is the width of the convolved feature map, h_j is its height, n_i is the number of elements of the i-th input-layer capsule, n_j is the number of elements of the j-th output-layer capsule, and d_j is the capsule dimension; and the assignment probability being calculated as follows: the attention value head_h of the prediction vector is obtained as

head_h = softmax(X·Y^T / √dim)·Z

where X is the query vector, Y is the key vector, and Z is the value vector, each a linear mapping of the prediction vector û_{j|i}; the attention value is taken as the weight coefficient from input-layer capsules to output-layer capsules; and the weight coefficients are concatenated to obtain the probability value from input-layer capsule i to output-layer capsule j: c_ij = Concat(head_1, ..., head_h, ..., head_H), H = n_j;
S4, obtaining the input of the output-layer capsules according to the assignment probabilities;

and S5, activating the input of the output-layer capsules through a Squash activation function to obtain the output of the output-layer capsules.
2. The construction method according to claim 1, wherein the input of an output-layer capsule is calculated by:

s_j = Σ_i c_ij · û_{j|i}

wherein s_j is the input of the output-layer capsule, c_ij is the probability value from input-layer capsule i to output-layer capsule j, û_{j|i} is the prediction vector, i is the i-th capsule of the input layer, and j is the j-th capsule of the output layer.
3. The construction method according to claim 2, wherein the output of an output-layer capsule is calculated by:

v_j = (||s_j||^2 / (1 + ||s_j||^2)) · (s_j / ||s_j||)

wherein v_j is the output vector of the j-th capsule of the output layer.
4. A method for classifying multi-view images based on the method for constructing a convolutional capsule layer according to any one of claims 1 to 3, comprising:

inputting the image into a convolutional neural network to obtain a main feature image;

and inputting the main feature image into two of the convolutional capsule layers to obtain a classification result for the multi-view images.
5. The method according to claim 4, wherein the convolutional neural network comprises an input layer, a plurality of convolutional layers, a ReLU layer for setting part of the neuron outputs to 0 to induce sparsity, and a max-pooling layer for compressing the feature image to obtain the main feature image.
6. An apparatus for constructing a convolutional capsule layer, the convolutional capsule layer comprising at least an input layer having a plurality of capsules and an output layer, the apparatus comprising:

an inner product module for taking the inner product of the Gabor filter and the convolution kernel to obtain a convolutional Gabor filter from input-layer capsules to output-layer capsules;

a convolution module for convolving the convolutional Gabor filter with the feature map input to the input layer to obtain a prediction vector;
a construction module, configured to construct a self-attention route to obtain the assignment probabilities from input-layer capsules to output-layer capsules, the self-attention route being constructed as follows: the prediction vector [w_j, h_j, n_i, n_j, d_j] is transposed to obtain [w_j, h_j, n_j, n_i, d_j], so that the number of elements n_j of the j-th output-layer capsule serves as the heads of a multi-head attention mechanism, and the correlation between the affine-transformed initial prediction vectors of the i-th input-layer capsule is computed along the n_i dimension, where w_j is the width of the convolved feature map, h_j is its height, n_i is the number of elements of the i-th input-layer capsule, n_j is the number of elements of the j-th output-layer capsule, and d_j is the capsule dimension; and the assignment probability being calculated as follows: the attention value head_h of the prediction vector is obtained as

head_h = softmax(X·Y^T / √dim)·Z

where X is the query vector, Y is the key vector, and Z is the value vector, each a linear mapping of the prediction vector û_{j|i}; the attention value is taken as the weight coefficient from input-layer capsules to output-layer capsules; and the weight coefficients are concatenated to obtain the probability value from input-layer capsule i to output-layer capsule j: c_ij = Concat(head_1, ..., head_h, ..., head_H), H = n_j;
an obtaining module for obtaining the input of the output-layer capsules according to the assignment probabilities;

and an activation module for activating the input of the output-layer capsules through a Squash activation function to obtain the output of the output-layer capsules.
7. A multi-view image classification apparatus based on the method for constructing a convolutional capsule layer according to any one of claims 1 to 3, comprising:

a first input module for inputting the image into a convolutional neural network to obtain a main feature image;

and a second input module for inputting the main feature image into two of the convolutional capsule layers to obtain a classification result for the multi-view image.
8. An electronic device, comprising:
a processor; and
a memory having computer-readable instructions stored thereon that, when executed by the processor, cause the processor to perform the method of any of claims 1-5.
CN202010309310.7A 2020-04-17 2020-04-17 Construction method of convolutional capsule layer and classification method and device of multi-view images Active CN111507410B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010309310.7A CN111507410B (en) 2020-04-17 2020-04-17 Construction method of convolutional capsule layer and classification method and device of multi-view images

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010309310.7A CN111507410B (en) 2020-04-17 2020-04-17 Construction method of convolutional capsule layer and classification method and device of multi-view images

Publications (2)

Publication Number Publication Date
CN111507410A CN111507410A (en) 2020-08-07
CN111507410B true CN111507410B (en) 2021-02-12

Family

ID=71869444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010309310.7A Active CN111507410B (en) 2020-04-17 2020-04-17 Construction method of convolutional capsule layer and classification method and device of multi-view images

Country Status (1)

Country Link
CN (1) CN111507410B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205137B (en) * 2021-04-30 2023-06-20 中国人民大学 Image recognition method and system based on capsule parameter optimization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456014A (en) * 2013-09-04 2013-12-18 西北工业大学 Scene matching suitability analyzing method based on multiple-feature integrating visual attention model
CN106097335A (en) * 2016-06-08 2016-11-09 安翰光电技术(武汉)有限公司 Digestive tract focus image identification system and recognition methods
CN107909059A (en) * 2017-11-30 2018-04-13 中南大学 It is a kind of towards cooperateing with complicated City scenarios the traffic mark board of bionical vision to detect and recognition methods
CN109063724A (en) * 2018-06-12 2018-12-21 中国科学院深圳先进技术研究院 A kind of enhanced production confrontation network and target sample recognition methods

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100046816A1 (en) * 2008-08-19 2010-02-25 Igual-Munoz Laura Method for automatic classification of in vivo images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456014A (en) * 2013-09-04 2013-12-18 西北工业大学 Scene matching suitability analyzing method based on multiple-feature integrating visual attention model
CN106097335A (en) * 2016-06-08 2016-11-09 安翰光电技术(武汉)有限公司 Digestive tract focus image identification system and recognition methods
CN107909059A (en) * 2017-11-30 2018-04-13 中南大学 It is a kind of towards cooperateing with complicated City scenarios the traffic mark board of bionical vision to detect and recognition methods
CN109063724A (en) * 2018-06-12 2018-12-21 中国科学院深圳先进技术研究院 A kind of enhanced production confrontation network and target sample recognition methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BDARS_CapsNet: Bi-Directional Attention Routing Sausage Capsule Network; Xin Ning et al.; IEEE Access; 2020-03-23; pp. 59059-59068 *

Also Published As

Publication number Publication date
CN111507410A (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN110175671B (en) Neural network construction method, image processing method and device
US20220108546A1 (en) Object detection method and apparatus, and computer storage medium
CN110188795B (en) Image classification method, data processing method and device
WO2019228317A1 (en) Face recognition method and device, and computer readable medium
US10037457B2 (en) Methods and systems for verifying face images based on canonical images
CN111914997B (en) Method for training neural network, image processing method and device
US20230153615A1 (en) Neural network distillation method and apparatus
CN111192270A (en) Point cloud semantic segmentation method based on point global context reasoning
WO2022052601A1 (en) Neural network model training method, and image processing method and device
CN113326930B (en) Data processing method, neural network training method, related device and equipment
US20220157046A1 (en) Image Classification Method And Apparatus
CN113065645B (en) Twin attention network, image processing method and device
CN110222718B (en) Image processing method and device
CN111695673B (en) Method for training neural network predictor, image processing method and device
WO2021018245A1 (en) Image classification method and apparatus
CN112215332A (en) Searching method of neural network structure, image processing method and device
EP3965071A2 (en) Method and apparatus for pose identification
CN111797970B (en) Method and device for training neural network
WO2022156475A1 (en) Neural network model training method and apparatus, and data processing method and apparatus
WO2023280113A1 (en) Data processing method, training method for neural network model, and apparatus
US20220222934A1 (en) Neural network construction method and apparatus, and image processing method and apparatus
CN111507410B (en) Construction method of convolutional capsule layer and classification method and device of multi-view images
CN116888605A (en) Operation method, training method and device of neural network model
CN113065575A (en) Image processing method and related device
CN110826726B (en) Target processing method, target processing device, target processing apparatus, and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20230117

Address after: Room 302, Floor 3, Building 20, No. 2, Jingyuan North Street, Daxing Economic and Technological Development Zone, Beijing, 100176 (Yizhuang Cluster, High-end Industrial Zone, Beijing Pilot Free Trade Zone)

Patentee after: Zhongke Shangyi Health Technology (Beijing) Co.,Ltd.

Address before: 100083 No. 35, Qinghua East Road, Beijing, Haidian District

Patentee before: INSTITUTE OF SEMICONDUCTORS, CHINESE ACADEMY OF SCIENCES