CN111967522A - Image sequence classification method based on funnel convolution structure

Info

Publication number: CN111967522A
Application number: CN202010834656.9A
Authority: CN
Inventors: 黄新俊; 陈建炜; 陈阳
Original assignee: Nanjing Tuge Medical Technology Co ltd
Current assignee: Nanjing Tuge Medical Technology Co ltd
Priority date: 2020-08-19
Filing date: 2020-08-19
Publication date: 2020-11-20
Anticipated expiration: 2040-08-19
Also published as: CN111967522B

Abstract

The invention discloses an image sequence classification method based on a funnel convolution structure, comprising the following steps: step 1: extracting spatial features of the image sequence with a 1 × n × n convolution kernel; step 2: extracting short-term temporal features of the image sequence with a funnel convolution kernel; step 3: extracting long-term temporal features of the image sequence with ConvLSTM; step 4: concatenating the features obtained in steps 1-3 along the channel dimension and then weighting them. The method improves on the 3D convolution kernel by replacing the 3D convolution kernels in the original network with the funnel convolution structure. Because the funnel convolution structure completely separates temporal feature extraction from spatial feature extraction, the decoupling is better and the physical meaning is clearer; the number of training parameters is reduced, the features are extracted independently, the parameters influence one another less, and the results improve.

Description

Image sequence classification method based on funnel convolution structure

Technical Field

The invention belongs to the technical field of computer image processing, and particularly relates to an image sequence classification method based on a funnel convolution structure.

Background

Deep learning grew out of stacking perceptrons in machine learning. Convolutional neural networks, recurrent neural networks, and other deep learning models can be used to solve problems including, but not limited to, classification, object detection, and segmentation. In video classification, it is common to sample some frames, extract temporal and spatial features from those frames, and then classify them, i.e., to classify an image sequence. Image sequence classification methods fall into three general categories: 3D convolutional neural networks, convolutional neural networks combined with LSTM, and two-stream networks based on optical flow. A 3D convolutional neural network usually uses a 3 × 3 × 3 convolution kernel, which extracts temporal and spatial features simultaneously and therefore performs better than extracting spatial features from single frames or using traditional methods. The drawback of the 3D convolution kernel is equally apparent: the number of parameters grows rapidly, leading to severe overfitting. The classical approach used in recent years to reduce the parameter count of 3D convolution kernels is to decompose the 3 × 3 × 3 kernel into a 1 × 3 × 3 kernel and a 3 × 1 × 1 kernel, which extract spatial and temporal features respectively, thereby alleviating the overfitting problem.

However, temporal features cannot be extracted with a 3 × 1 × 1 convolution kernel alone, because elements at the same location in different frames do not necessarily have the same semantics. For example, during a swinging motion the moving object is unlikely to be at the same position in the next frame, so extracting a temporal feature at one fixed position loses its meaning. The 3 × 1 × 1 kernel can therefore take the pixels surrounding the target position in the preceding and following frames into account only after the 1 × 3 × 3 kernel has extracted features. As a result, spatial feature extraction affects temporal feature extraction, and the parameters become harder to train.
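The parameter argument above can be made concrete with a small count. The sketch below compares a full 3 × 3 × 3 kernel with its factorization into a 1 × 3 × 3 spatial kernel followed by a 3 × 1 × 1 temporal kernel. The helper name conv3d_params and the channel width of 64 are illustrative assumptions, not values taken from the patent.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=False):
    """Count the learnable parameters of one 3D convolution layer.

    kt, kh, kw are the kernel sizes along time, height, and width.
    """
    weights = c_in * c_out * kt * kh * kw
    return weights + (c_out if bias else 0)


# Hypothetical layer width, chosen only for illustration.
c_in = c_out = 64

# One full spatio-temporal kernel.
full = conv3d_params(c_in, c_out, 3, 3, 3)

# Factorized form: spatial 1x3x3 followed by temporal 3x1x1.
spatial = conv3d_params(c_in, c_out, 1, 3, 3)
temporal = conv3d_params(c_out, c_out, 3, 1, 1)
factorized = spatial + temporal

print(full, factorized)
```

With these assumed widths the factorized form needs fewer than half the weights of the full kernel, which is why the decomposition eases overfitting.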

Disclosure of Invention

The technical problem to be solved by the present invention is to provide an image sequence classification method based on a funnel convolution structure. The method adopts a convolution structure that completely separates temporal features from spatial features: it uses a funnel convolution kernel to extract short-term temporal features, a 1 × 3 × 3 convolution kernel to extract spatial features, and ConvLSTM to extract long-term temporal features, and it uses a channel attention mechanism to assign weights to the different feature channels.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

An image sequence classification method based on a funnel convolution structure,

which performs image sequence classification by replacing the 3D convolution kernels in a 3D convolutional neural network with a funnel convolution structure, the method comprising:

Step 1: passing the output of the previous network layer through a convolution layer with a kernel size of 1 × n × n to extract the spatial features of the image sequence;

Step 2: passing the output of the previous network layer through a convolution layer with a funnel convolution kernel to extract the short-term temporal features of the image sequence, i.e., the relationship features between a given frame and its neighboring frames;

Step 3: passing the output of the previous network layer through the ConvLSTM network structure proposed by Xingjian Shi, Zhourong Chen et al. to extract the long-term temporal features of the image sequence, i.e., the relationship features from the first frame up to the current frame;

Step 4: concatenating the features obtained in steps 1-3 along the channel dimension, weighting them, and then classifying through a fully connected layer.
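For reference, the long-term branch in step 3 relies on ConvLSTM. The patent does not restate its equations; the block below is a sketch of the update rule as it is usually written following Shi et al., and the symbols are that paper's notation rather than this patent's. Here * denotes convolution, ∘ the element-wise (Hadamard) product, σ the sigmoid function, X_t the input at frame t, H_t the hidden state, and C_t the cell state.

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o) \\
H_t &= o_t \circ \tanh(C_t)
\end{aligned}
```

Because H_t is computed from H_{t-1} and C_{t-1}, it accumulates information from the first frame up to frame t, which matches the long-term relationship described in step 3.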

To further optimize the technical scheme, the following specific measures are also adopted:

The total number of feature channels extracted in steps 1 to 3 equals the number of channels of the corresponding convolution layer in the original 3D convolutional neural network.

The funnel convolution kernel described in step 2 is obtained by modifying an n × n × n 3D convolution kernel as follows: the spatial convolution size at the center of the 3D convolution kernel is changed to 1 × 1, and the other positions are left unchanged.

The 3D convolution kernel is a 3 × 3 × 3 kernel, and the funnel convolution kernel derived from it is a 3D convolution kernel formed by stacking three 2D convolution kernels of sizes 3 × 3, 1 × 1, and 3 × 3.
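The shape of the funnel kernel can be sketched as a 0/1 mask over a cubic kernel. The sketch below assumes that "center" refers to the middle temporal slice, which keeps only its central spatial tap, while all other slices keep their full spatial extent. The function names funnel_mask and active_taps are hypothetical, and the axis order (time, height, width) is an assumption.

```python
def funnel_mask(n=3):
    """Return an n x n x n nested list of 0/1 flags, indexed [time][row][col].

    Assumption: the middle temporal slice is reduced to a single 1 x 1 tap at
    its spatial center; every other temporal slice keeps all n x n taps.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that the kernel has a center")
    c = n // 2
    mask = []
    for t in range(n):
        plane = []
        for y in range(n):
            row = []
            for x in range(n):
                # Keep the tap unless it lies on the middle slice away from the center.
                keep = 1 if (t != c or (y == c and x == c)) else 0
                row.append(keep)
            plane.append(row)
        mask.append(plane)
    return mask


def active_taps(mask):
    """Count the kernel positions that remain trainable."""
    return sum(v for plane in mask for row in plane for v in row)


mask = funnel_mask(3)
print(active_taps(mask))  # 9 + 1 + 9 taps remain out of 27
```

Multiplying a dense 3D kernel by such a mask is one possible way to realize the funnel shape in an existing framework; the patent itself does not prescribe an implementation.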

Step 4 is specifically as follows:

A channel attention mechanism is used to weight the features obtained in steps 1 to 3. The features obtained in steps 1 to 3 are concatenated along the channel dimension; global pooling is performed over all dimensions other than the channel dimension; the pooled vector is passed through a fully connected layer; and the result is multiplied element-wise with the concatenated features, which are then used for image sequence classification.
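The weighting in step 4 can be sketched in plain Python as follows. The patent specifies only concatenation, global pooling, a fully connected layer, and an element-wise product; the use of average pooling and of a sigmoid after the fully connected layer are assumptions borrowed from common channel attention designs, and the function name channel_attention is hypothetical.

```python
import math


def channel_attention(branches, fc_weight, fc_bias):
    """Weight concatenated branch features channel by channel.

    branches:  list of feature groups; each group is a list of channels and
               each channel is a flat list of values over (time, height, width).
    fc_weight: square matrix (list of rows), one row per output channel.
    fc_bias:   list with one bias per output channel.
    """
    # Concatenate the branch outputs along the channel dimension.
    channels = [ch for branch in branches for ch in branch]

    # Global pooling over every non-channel position (assumed: average pooling).
    pooled = [sum(ch) / len(ch) for ch in channels]

    # Fully connected layer followed by a sigmoid (the sigmoid is an assumption).
    weights = []
    for row, b in zip(fc_weight, fc_bias):
        z = sum(w * p for w, p in zip(row, pooled)) + b
        weights.append(1.0 / (1.0 + math.exp(-z)))

    # Scale each channel by its weight, element by element.
    scaled = [[w * v for v in ch] for w, ch in zip(weights, channels)]
    return scaled, weights


# Toy input: three branches (spatial, short-term, long-term), one channel each.
branches = [[[1.0, 3.0]], [[2.0, 2.0]], [[0.0, 4.0]]]
zero_fc = [[0.0] * 3 for _ in range(3)]
scaled, weights = channel_attention(branches, zero_fc, [0.0] * 3)
```

With an all-zero fully connected layer every channel receives the neutral weight 0.5; trained weights would instead emphasize whichever branch is most informative.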

The invention has the following beneficial effects:

A 3D convolutional neural network is a network used to extract features from time-series images. It typically uses an n × n × n convolution kernel, i.e., the kernel size is n along time, image height, and image width, so that temporal and spatial features are extracted simultaneously. To extract temporal and spatial features independently within 3D convolution, the invention improves the convolution kernel and replaces the 3D convolution kernels in the original network with a funnel convolution structure. The funnel convolution structure extracts spatial features, short-term temporal features, and long-term temporal features separately and weighs them with an attention mechanism, so the network can process temporal and spatial features independently. Because the extraction of temporal features is completely separated from the extraction of spatial features, the decoupling is better and the physical meaning is clearer; the number of training parameters is reduced, the features are extracted independently, the parameters influence one another less, and the results improve.

Drawings

FIG. 1 is a schematic diagram of a funnel convolution structure;

FIG. 2 is a schematic structural diagram of the I3D convolutional neural network.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The image sequence classification method based on a funnel convolution structure of the present invention comprises the following steps:

Step 1: extracting spatial features of the image sequence with a 1 × n × n convolution kernel;

Step 2: extracting short-term temporal features of the image sequence with a funnel convolution kernel;

The funnel convolution kernel is obtained by modifying an n × n × n 3D convolution kernel as follows: the spatial convolution size at the center of the 3D convolution kernel is changed to 1 × 1, and the other positions are left unchanged.

Referring to FIG. 1, the 3D convolution kernel is a 3 × 3 × 3 kernel, and the funnel convolution kernel derived from it is a 3D convolution kernel formed by stacking three 2D convolution kernels of sizes 3 × 3, 1 × 1, and 3 × 3.

The left diagram of FIG. 1 shows the funnel convolution structure, which can replace a 3D convolution layer in the original network; in that case the sum of N1, N2, and N3 should equal the number of channels of the 3D convolution layer in the original network. The right diagram shows the funnel convolution kernel, i.e., the original 3D convolution kernel with the size at its convolution center changed to 1 × 1.

Except for the pixel at the convolution center, a change in any other pixel affects only one of the two features, either the short-term temporal feature or the spatial feature.
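The channel constraint on the three branches (their widths together equal the channel count of the replaced 3D layer) can be sketched as a small helper. The patent does not state how the channels are divided among the branches, so the ratio parameter and the function name split_channels below are purely hypothetical.

```python
def split_channels(total, ratios=(2, 1, 1)):
    """Split `total` channels into three branch widths (N1, N2, N3).

    The ratios are a hypothetical choice; the only requirement taken from the
    text is that N1 + N2 + N3 equals the original layer's channel count.
    """
    if len(ratios) != 3 or min(ratios) <= 0:
        raise ValueError("ratios must be three positive numbers")
    s = sum(ratios)
    n1 = total * ratios[0] // s
    n2 = total * ratios[1] // s
    n3 = total - n1 - n2  # remainder goes to the last branch so the sum is exact
    if min(n1, n2, n3) <= 0:
        raise ValueError("total is too small for the requested ratios")
    return n1, n2, n3


n1, n2, n3 = split_channels(64)
print(n1, n2, n3)
```

Assigning the remainder to the last branch keeps the sum exact even when the total is not divisible by the ratio sum.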

Step 3: extracting long-term temporal features of the image sequence with ConvLSTM, and concatenating the features of steps 1-3 along the channel dimension;

Referring to FIG. 1 and FIG. 2, in this embodiment all 3 × 3 × 3 convolution kernels of the I3D network are replaced with the funnel convolution structure proposed by the present invention. The left diagram of FIG. 2 shows the I3D network structure, which contains several Inception modules; the right diagram shows the network structure of an Inception module.

Step 4: weighting the features obtained in steps 1-3 with a channel attention mechanism, and then classifying through a fully connected layer.

In one example, the I3D network and the improved I3D network were both trained from scratch on the UCF101 data set, and their accuracy on the test set was compared. The results are shown in Table 1.

Table 1. Accuracy and parameter count of I3D and improved I3D

Network	Accuracy	Parameter count
I3D	42.59%	12.4M
Improved I3D	45.02%	10.99M

The improved I3D has fewer parameters and higher accuracy, showing that the decoupling has a clear effect.
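The size of the change reported in Table 1 can be worked out directly from its figures. The short sketch below only restates the table values and derives the differences; the variable names are illustrative.

```python
# Values copied from Table 1 (accuracy in percent, parameters in millions).
baseline = {"accuracy": 42.59, "params_m": 12.4}
improved = {"accuracy": 45.02, "params_m": 10.99}

# Absolute accuracy gain in percentage points.
acc_gain = improved["accuracy"] - baseline["accuracy"]

# Relative reduction in parameter count, in percent.
param_cut = (baseline["params_m"] - improved["params_m"]) / baseline["params_m"] * 100

print(f"{acc_gain:.2f} points, {param_cut:.1f}% fewer parameters")
```

That is, accuracy rises by about 2.4 percentage points while the parameter count falls by roughly 11%.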

The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment; all technical solutions that fall under the idea of the present invention belong to its protection scope. It should be noted that modifications and refinements made by those skilled in the art without departing from the principle of the invention are also regarded as falling within the protection scope of the invention.

Claims

1. An image sequence classification method based on a funnel convolution structure is characterized by comprising the following steps:

step 3: passing the output of the previous network layer through a ConvLSTM network structure to extract the long-term temporal features of the image sequence, i.e., the relationship features from the first frame up to the current frame.

2. The method for classifying image sequences based on the funnel convolution structure as claimed in claim 1, wherein the sum of the number of the extracted feature channels from step 1 to step 3 is equal to the number of convolution layer channels of the original 3D convolution neural network.

3. The method for classifying image sequences based on the funnel convolution structure according to claim 1, wherein the funnel convolution kernel of step 2 is obtained by modifying an n × n × n 3D convolution kernel as follows: the spatial convolution size at the center of the 3D convolution kernel is changed to 1 × 1, and the other positions are left unchanged.

4. The method for classifying image sequences based on the funnel convolution structure according to claim 1, wherein the 3D convolution kernel is a 3 × 3 × 3 kernel, and the funnel convolution kernel derived from it is a 3D convolution kernel formed by stacking three 2D convolution kernels of sizes 3 × 3, 1 × 1, and 3 × 3.

5. The method for classifying image sequences based on the funnel convolution structure according to claim 1, wherein the step 4 specifically comprises: